Mailing List Archive

Python too slow for real world
Hi All,

first of all: Sorry for that slightly provoking subject ;-) ...

I just switched from perl to python because I think python makes life
easier in bigger software projects. However, I found out that perl is
more than 10 times faster than python at solving the following problem:

I've got a file (130 MB) with ~ 300000 datasets of the form:

>px0034 hypothetical protein or whatever description
LSADQISTVQASFDKVKGDPVGILYAVFKADPSIMAKFTQFAGKDLESIKGTAPFETHAN
RIVGFFSKIIGELPNIEADVNTFVASHKPRGVTHDQLNNFRAGFVSYMKAHTDFAGAEAA
WGATLDTFFGMIFSKM

The word following the '>' is an identifier, the uppercase letters in the
lines following the identifier are the data. Now I want to read and
write the contents of that file, excluding some entries (given by a
dictionary with identifiers, e.g. 'px0034').

The following python code does the job:

from re import *
from sys import *

def read_write(i, o, exclude):
    name = compile('^>(\S+)')  # regex to fetch the identifier
    l = i.readline()
    while l:
        if l[0] == '>':  # are we in a new dataset?
            m = name.search(l)
            if m and exclude.has_key(m.group(1)):  # excluding current dataset?
                l = i.readline()
                while l and l[0] != '>':  # skip this dataset
                    l = i.readline()
        o.write(l)
        l = i.readline()

f = open('my_very_big_data_file','r') # datafile with ~300000 records
read_write(f, stdout, {}) # for a simple test I don't exclude anything!

It took 503.90 sec on an SGI Power Challenge (R10000 CPU). An appropriate
perl script does the same job in 32 sec (same method, same loop
structure)!

Since I have to call this routine about 1500 times, it's a very big
difference in time and not really acceptable.

I'd really like to know why python is so slow (or perl is so fast?) and
what I can do to improve the speed of that routine.

I don't want to switch back to perl - but honestly, is python the right
language to process such a huge amount of data?

If you want to generate a test set you could use the following lines to
print 10000 datasets to stdout:

for i in xrange(1, 10001):
    print '>px%05d\nLSADQISTVQASFDKVKGDPVGILYAVFKADPSIMAKFTQFAGKDLESIKGTAPFETHAN\n\
RIVGFFSKIIGELPNIEADVNTFVASHKPRGVTHDQLNNFRAGFVSYMKAHTDFAGAEAA\n\
WGATLDTFFGMIFSKM\n' % i

And if you don't believe me that perl does the job quicker you can try
the perl code below:

#!/usr/local/bin/perl -w
open(IN,"test.dat");
my %ex = ();
read_write(%ex);

sub read_write{

    $l = <IN>;
    OUTER: while( defined $l ){
        if( (($x) = $l =~ /^>(\S+)/) ){
            if( exists $ex{$x} ){
                $l = <IN>;
                while( defined $l && !($l =~ /^>(\S+)/) ){
                    $l = <IN>;
                }
                next OUTER;
            }
        }
        print $l;
        $l = <IN>;
    }
}

Please do convince me that being a python programmer does not mean being
slow ;-)

Thanks very much for any help,

Arne
Python too slow for real world [ In reply to ]
Hi,

For what you state here, you don't even really need to read the 'data' at
all. Just read your descriptors, and store the offset and length of the data
in a dictionary (i.e. index it).

readline
if first char == '>'
    get id
    get current position using seek method
    store id, pos in dict
# for each id, we now have its byte position in the file

Then have a filter method which keeps or discards the records by criteria.

for each key in dict
    if key passes filter test
        store key in filtered dict

Then only at the time you really need that data do you go get it.

for each in filtered_dict
    use seek to position
    read data until next line with '>' at 0

This way you can create views on your data without actually trying to load it
all. The tradeoff of course is memory for file-access time, but I found
file access to be faster than doing all the work 'up front'. Besides, my
project reached the point where we ran out of memory often; some datasets are
on 8+ CD-ROMs!
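
A minimal sketch of the indexing idea (the function names and record-reading
details are illustrative only; it assumes tell() for recording positions and
the '>' convention from Arne's file):

import string

def build_index(f):
    # map each identifier to the byte offset of its '>' line
    index = {}
    pos = f.tell()
    line = f.readline()
    while line:
        if line[0] == '>':
            id = string.split(line)[0][1:]   # first word, minus the '>'
            index[id] = pos
        pos = f.tell()
        line = f.readline()
    return index

def fetch(f, index, id):
    # seek back to a record and read lines until the next '>' header
    f.seek(index[id])
    record = [f.readline()]
    line = f.readline()
    while line and line[0] != '>':
        record.append(line)
        line = f.readline()
    return string.join(record, '')

Whether this wins depends on how many records you actually need later;
building the index still touches every line once.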

Hope that was relevant, but maybe I misunderstood the question.
Joe Robertson,
jmrobert@ro.com




Arne Mueller wrote:

> Hi All,
>
> first of all: Sorry for that slightly provoking subject ;-) ...
>
> I just switched from perl to python because I think python makes life
> easier in bigger software projects. However, I found out that perl is
> more than 10 times faster than python at solving the following problem:
[...]
> Thanks very much for any help,
>
> Arne
Python too slow for real world [ In reply to ]
Arne Mueller <a.mueller@icrf.icnet.uk> writes:

> Hi All,
>
> first off all: Sorry for that slightly provoking subject ;-) ...
[...]
>
> The following python code does the job:
[...]
> f = open('my_very_big_data_file','r') # datafile with ~300000 records
> read_write(f, stdout, {}) # for a simple test I don't exclude
> anything!

Well -- re is known to be slow. If you have to be fast, maybe you
should try not to use regular expressions; you could perhaps use
something from the string module (several options there) or maybe even
consider fixed-length fields for the identifiers, which should speed
things up a bit.

> It took 503.90 sec on an SGI Power Challenge (R10000 CPU). An appropriate
> perl script does the same job in 32 sec (same method, same loop
> structure)!

Hm. Perl probably has a more efficient implementation of Perl regexes
than Python, naturally enough...

> I'd really like to know why python is so slow (or perl is so fast?) and
> what I can do to improve the speed of that routine.

Well -- at least I have made one suggestion... Though it may not
explain it all...

>
> I don't want to switch back to perl - but honestly, is python the right
> language to process such a huge amount of data?
>
> If you want to generate a test set you could use the following lines to
> print 10000 datasets to stdout:
>
[...]

OK. Using your test set, I tried the following program (it may not work
exactly like your script...)

I have made the assumption that all the ids have a constant length of
7.

----------

import fileinput

exclude = {'px00003': 1}
skip = 0

for line in fileinput.input():
    if line[0] == '>':
        id = line[1:8]
        if exclude.has_key(id):
            skip = 1
        else:
            skip = 0
    if not skip:
        print line,

-----------

It took about 12 seconds.

>
> Please do convince me being a python programmer does not mean being slow
> ;-)
>

At least I tried...

> Thanks very much for any help,
>
> Arne

--
Magnus        > Hi! I'm the signature virus 99!
Lie           > Copy me into your signature and join the fun!
Hetland       http://arcadia.laiv.org <arcadia@laiv.org>
Python too slow for real world [ In reply to ]
Arne Mueller wrote:

> def read_write(i, o, exclude):
>     name = compile('^>(\S+)') # regex to fetch the identifier
[...]
> It took 503.90 sec on an SGI Power Challenge (R10000 CPU). An appropriate
> perl script does the same job in 32 sec (same method, same loop
> structure)!

Without changing the program structure, I could make it run about 3 or 4
times faster by using string.split instead of a regex here.
To get more speed, one would have to do more.
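
For illustration, a minimal sketch of that change, keeping the original loop
and only replacing the regex with string.split (assuming the identifier is
simply the first whitespace-delimited word after the '>'):

import string

def read_write(i, o, exclude):
    l = i.readline()
    while l:
        if l[0] == '>':                      # are we in a new dataset?
            key = string.split(l)[0][1:]     # first word, minus the '>'
            if exclude.has_key(key):
                l = i.readline()
                while l and l[0] != '>':     # skip this dataset
                    l = i.readline()
                continue
        o.write(l)
        l = i.readline()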

Summarizing: Stay with the Perl code, if you need it so fast.
Perl is made for this real world low-level stuff.
Python is for the real world high level stuff.

ciao - chris

--
Christian Tismer :^) <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaiserin-Augusta-Allee 101 : *Starship* http://starship.python.net
10553 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
we're tired of banana software - shipped green, ripens at home
Python too slow for real world [ In reply to ]
Joseph Robertson <jmrober1@ingr.com> writes:

> Hi,
>
> For what you state here, you don't even really need to read the 'data' at
> all.
> Just read your descriptors, and store the offsets and len of the data in a
> dictionary (i.e. index it).
>
> readline
> if first char == '>'
>     get id
>     get current position using seek method
>     store id, pos in dict
> # for each id, we now have its byte position in the file

Well... You have to read all the lines to find all the descriptors,
don't you? Is there really any great speedup here?

Of course, you would get some speedup later, when using the same
structure again...

>
> Then have a filter method which keeps or discards the records by criteria.
>
[...]

If the number of excluded elements isn't very high, this method will
only add to the burden of processing, won't it?

(By seek -- do you mean os.lseek? Or is there another one... Just curious.)

>
> This way you can create views on your data without actually trying to load it
> all. The tradeoff of course is memory for file-access time, but I found
> file access to be faster than doing all the work 'up front'.

Hm. Yes.

If the size (in lines) of the records is constant, then you could, of
course, use seek to skip all the data while processing as well...

> Besides, my
> project reached the point where we ran out of memory often; some datasets are
> on 8+ CD-ROMs!
>
> Hope that was relevant, but maybe I misunderstood the question.
> Joe Robertson,
> jmrobert@ro.com
>
>
>
>
[...]
--
Magnus        > Hi! I'm the signature virus 99!
Lie           > Copy me into your signature and join the fun!
Hetland       http://arcadia.laiv.org <arcadia@laiv.org>
Python too slow for real world [ In reply to ]
> Hi All,
>
> first of all: Sorry for that slightly provoking subject ;-) ...
>
> I just switched from perl to python because I think python makes life
> easier in bigger software projects. However, I found out that perl is
> more than 10 times faster than python at solving the following problem:
[...]
> Please do convince me that being a python programmer does not mean
> being slow ;-)
>
> Thanks very much for any help,
>
> Arne
>

Arne,

While I'm not going to go near comparing Python to Perl, I
will comment that different languages are just that - different.
As such, the approach you would take in one language may not be
the most appropriate (or comparable in speed or efficiency) to
the approach you would take in another.

The question here (IMHO) is not Python's appropriateness for processing
large datasets (a fair number of scientist-types do this all the time),
or even the speed of Python in general, but using the most appropriate
algorithms in the context of the language in use.

For example, Perl is very regex-centric, so your example Perl
implementation is probably perfectly appropriate for Perl. Python
tends to be more optimized for the general case, and if it were
_me_, I wouldn't bother with using regular expressions in this
case,. Since you have a predictable file format, there are more
specific (and efficient) Python tools that you could use here.

There are also some general optimizations that can be used in
places where speed is an issue, such as avoiding repeated
attribute lookups (esp. in loops). This version of your read_write
function uses the same basic algorithm, but forgoes re for more
specific tools (slicing, string.split) and has some examples of
optimizations to minimize attribute lookups. I haven't timed it
or anything, but I'd be surprised if it wasn't noticeably
faster.

Hope this helps!


import sys, string

def read_write(input, output, exclude):

    # These assignments will save us a lot of attribute
    # lookups over the course of the big loop...

    ignore=exclude.has_key
    readline=input.readline
    write=output.write
    split=string.split

    line=readline()
    while line:
        if line[0]=='>':
            # knowing that the first char is a '>' and that
            # the rest of the chars up to the first space are
            # the id, we can avoid using re here...
            key=split(line)[0][1:]
            if ignore(key):
                line=readline()
                while line and line[0] != '>':
                    # skip this record
                    line=readline()
                continue
        write(line)
        line=readline()


file=open('my_very_big_data_file','r') # datafile with ~300000 records
read_write(file, sys.stdout, {})



Brian Lloyd brian@digicool.com
Software Engineer 540.371.6909
Digital Creations http://www.digicool.com
Python too slow for real world [ In reply to ]
"Magnus L. Hetland" wrote:

> Joseph Robertson <jmrober1@ingr.com> writes:
> > This way you can create views on your data without actually trying to load it
> > all. The tradeoff of course is memory for file-access time, but I found
> > file access to be faster than doing all the work 'up front'.
>
> Hm. Yes.
>
> If the size (in lines) of the records is constant, then you could, of
> course, use seek to skip all the data while processing as well...

Of course, if you were really SERIOUS about finishing the project quickly, you
could ignore "seek" and just skip processing the data at all...

*shy grin*
-Jim
Python too slow for real world [ In reply to ]
I find python syntax less taxing than perl's (i.e. fewer lines). You may need
to check your python code and see how you can optimize it further...

Tony Johnson
System Administrator
Demand Publishing Inc.


-----Original Message-----
From: python-list-request@cwi.nl [mailto:python-list-request@cwi.nl]On
Behalf Of Arne Mueller
Sent: Friday, April 23, 1999 7:35 AM
To: python-list@cwi.nl
Subject: Python too slow for real world


Hi All,

first of all: Sorry for that slightly provoking subject ;-) ...

I just switched from perl to python because I think python makes life
easier in bigger software projects. However, I found out that perl is
more than 10 times faster than python at solving the following problem:

[...]

Please do convince me that being a python programmer does not mean being
slow ;-)

Thanks very much for any help,

Arne
Python too slow for real world [ In reply to ]
On Fri, Apr 23, 1999 at 04:05:20PM +0200, Christian Tismer wrote:
>
> Summarizing: Stay with the Perl code, if you need it so fast.
> Perl is made for this real world low-level stuff.
> Python is for the real world high level stuff.

There's nothing more to add - just some remarks.
We are running several production processes that are mainly
based on Python in several ways - we use Python
as a middleware component for combining databases like Oracle,
workflow systems like Staffware, CORBA components ....
Our systems consist of several thousand lines of code and
the code is still manageable. Have you ever seen a Perl
script with a thousand lines that has been readable and
understandable? And speed has never been a real problem
for Python. OK - Perl's regex engine seems to be faster,
but not the whole world consists of regular expressions.
Python is in every case more open and flexible for building
large systems - take Perl to hack your scripts and build
real systems with Python :-)

Cheers,
Andreas
Python too slow for real world [ In reply to ]
Hi All,

thanks very much for all the suggestions on how to speed up things and how
to THINK about programming in python. I got a lot of inspiration from
your replies. However, the problem of reading/writing large files line by
line is what slows down the whole process.

from sys import stdout

def rw(input, output):
    while 1:
        line = input.readline()
        if not line: break
        output.write(line)

f = open('very_large_file','r')
rw(f, stdout)

The file I read in contains 2053927 lines and it takes 382 sec to
read/write it, whereas perl does it in 15 sec. These simple read/write
functions use the functions from the C standard library, don't they? So,
readline/write don't seem to be implemented very efficiently ... (?)

I can't read in the whole file as a single block, it's too big; if
readline/write is slow the program will never get really fast :-(

thanks a lot for discussion,

Arne
Python too slow for real world [ In reply to ]
Arne Mueller <a.mueller@icrf.icnet.uk> writes:

| I can't read in the whole file as a single block, it's too big; if
| readline/write is slow the program will never get really fast :-(

You can use the 'sizehint' parameter to readlines() to get some of the
efficiency of readlines() without reading in the whole file. The
following code isn't optimized, but it shows the idea:

class BufferedFileReader:
    def __init__ (self, file):
        self.file = file
        self.lines = []
        self.numlines = 0
        self.index = 0

    def readline (self):
        if (self.index >= self.numlines):
            self.lines = self.file.readlines(65536)
            self.numlines = len(self.lines)
            self.index = 0
            if (self.numlines == 0):
                return ""
        str = self.lines[self.index]
        self.index = self.index + 1
        return str
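
For example, it can be dropped into the rw() function from the earlier post,
since that function only ever calls readline() (the wiring below is an
assumed usage, not part of the original message):

import sys

f = BufferedFileReader(open('very_large_file', 'r'))
rw(f, sys.stdout)   # rw() only needs readline(), so the wrapper drops in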

--
Dan Schmidt -> dfan@harmonixmusic.com, dfan@alum.mit.edu
Honest Bob & the http://www2.thecia.net/users/dfan/
Factory-to-Dealer Incentives -> http://www2.thecia.net/users/dfan/hbob/
Gamelan Galak Tika -> http://web.mit.edu/galak-tika/www/
Python too slow for real world [ In reply to ]
On Fri, 23 Apr 1999, Arne Mueller wrote:

> Hi All,
>
> thanks very much for all the suggestions on how to speed up things and how
> to THINK about programming in python.
[...]
> The file I read in contains 2053927 lines and it takes 382 sec to
> read/write it, whereas perl does it in 15 sec. These simple read/write
> functions use the functions from the C standard library, don't they? So,
> readline/write don't seem to be implemented very efficiently ... (?)
>
> I can't read in the whole file as a single block, it's too big; if
> readline/write is slow the program will never get really fast :-(
>


My guess would be that a difference this big is due to the file
buffering mode.

See 'open' in the library reference docs:
<http://www.python.org/doc/lib/built-in-funcs.html>

| open (filename[, mode[, bufsize]])

[...]

| ... The optional bufsize argument
| specifies the file's desired buffer size: 0 means unbuffered, 1 means
| line buffered, any other positive value means use a buffer of
| (approximately) that size. A negative bufsize means to use the system
| default, which is usually line buffered for tty devices and fully
| buffered for other files. If omitted, the system default is used.[2.10]


Note that last sentence.
If you're really testing this by writing to the standard output, it may
be using line-buffered I/O. (On a related note, I think it was AIX that
had a horribly misfeatured /dev/null implementation that caused I/O
tests dumped to /dev/null to be slower than if you used an actual device!)


Using the following wrapper around your 'rw' function, you
can test the effect of different buffer sizes or options.

from time import clock

def test1( filename, buf=None ):
    if buf == None:
        inp = open( filename, 'r' )
    else:
        inp = open( filename, 'r', buf )
    out = open( 'junk', 'w' )
    c0 = clock()
    rw( inp, out )
    c1 = clock()
    return c1 - c0


On the Mac, this makes about a factor-of-37 difference.
(I got tired of waiting for it to finish on 'big.file', so
I cut down the size.)

>>> iotest.makebigfile( 'not.so.big.file', 4001 )
>>> iotest.test1( 'not.so.big.file' )
1.18333333333
>>> iotest.test1( 'not.so.big.file', buf=1 )
1.88333333333
>>> iotest.test1( 'not.so.big.file', buf=0 )
68.3833333333


I surely HOPE this is your problem!


---| Steven D. Majewski (804-982-0831) <sdm7g@Virginia.EDU> |---
---| Department of Molecular Physiology and Biological Physics |---
---| University of Virginia Health Sciences Center |---
---| P.O. Box 10011 Charlottesville, VA 22906-0011 |---

Caldera Open Linux: "Powerful and easy to use!" -- Microsoft(*)
(*) <http://www.pathfinder.com/fortune/1999/03/01/mic.html>
Python too slow for real world [ In reply to ]
Arne Mueller wrote:
>
> Hi All,
>
> I can't read in the whole file as a single block, it's too big; if
> readline/write is slow the program will never get really fast :-(

Please try this one.
For me, it was about 6-7 times faster than the first one.
I don't read by line, and I don't read it all at once.
Let me know how it performs on your machine.
I think I'm down to measuring I/O time.
Well, the code is a bit long.

But fast :-)

I believe nothing more can be done, except to
use P2C to get rid of the interpreter overhead.

import string

def read_write_bulk(input, output, exclude):

    bufsize = 1 << 16
    splitter = ">"

    ignore=exclude.has_key
    split=string.split
    No = None

    buffer = input.read(bufsize)
    got = len(buffer)
    while len(buffer)>1 :
        pieces = split(buffer, splitter)
        idx = 0
        inner = pieces[1:-1]
        for piece in inner:
            idx = idx+1 ; key = split(piece, No, 1)[0]
            if ignore(key):
                del inner[idx] ; idx = idx-1
        output.write(splitter)
        output.write(string.join(inner, splitter))
        if got==0:
            break
        chunk = input.read(bufsize)
        buffer = splitter+pieces[-1] + chunk
        got = len(chunk)
        if got==0:
            buffer = buffer+splitter # spill last one

#:-) end of hack

--
Christian Tismer :^) <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaiserin-Augusta-Allee 101 : *Starship* http://starship.python.net
10553 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
we're tired of banana software - shipped green, ripens at home
Python too slow for real world [ In reply to ]
Just did a little more cleanup of the code.
Here it is:

import string

def read_write_bulk(input, output, exclude):
    bufsize = 1 << 16 ; splitter = ">"
    ignore=exclude.has_key ; split=string.split ; No = None

    buffer = input.read(bufsize)
    got = len(buffer)
    while 1 :
        pieces = split(buffer, splitter)
        idx = 0
        inner = pieces[1:-1]
        for piece in inner:
            idx = idx+1 ; key = split(piece, No, 1)[0]
            if ignore(key):
                del inner[idx] ; idx = idx-1
        output.write(splitter)
        output.write(string.join(inner, splitter))
        if got==0:
            break
        chunk = input.read(bufsize)
        buffer = splitter+pieces[-1] + chunk
        got = len(chunk)
        if got==0:
            buffer = buffer+splitter # spill last one

Also, I think with this I/O layout, buffering of the
files doesn't count any more at all.
Let me know if it is still much slower than Perl.

ciao - chris

--
Christian Tismer :^) <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaiserin-Augusta-Allee 101 : *Starship* http://starship.python.net
10553 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
we're tired of banana software - shipped green, ripens at home
Python too slow for real world [ In reply to ]
"Steven D. Majewski" wrote:
...
> My guess would be that a difference this big is due to the file
> buffering mode.

This seems to be true.
After removing line I/O altogether and doing some optimization,
the Python program now runs more than twice as fast as
the Perl version on my Linux box.
Although that's unfair, since I would have to optimize the latter
as well :-)

ciao - chris

--
Christian Tismer :^) <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaiserin-Augusta-Allee 101 : *Starship* http://starship.python.net
10553 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
we're tired of banana software - shipped green, ripens at home
Python too slow for real world [ In reply to ]
Christian Tismer <tismer@appliedbiometrics.com> writes:

> Just did a little more cleanup to the code.
> This it is:

Hm. This code is nice enough (although not very intuitive...) But
isn't it a bit troublesome that this sort of thing (which in many ways
is a natural application for Python) is so much simpler to implement
(in an efficient enough way) in Perl?

Can something be done about it? Perhaps a buffering parameter to
fileinput? In that case, a lot of the code could be put in that
module, as part of the standard distribution... Even so -- you would
somehow have to be able to treat the buffers as blocks... Hm.

(And... How about builtin regexes in P2?)

> ciao - chris

--
Magnus        > Hi! I'm the signature virus 99!
Lie           > Copy me into your signature and join the fun!
Hetland       http://arcadia.laiv.org <arcadia@laiv.org>
Python too slow for real world [ In reply to ]
Arne Mueller wrote:
> However the problem of reading/writing larges files line by
> line is the source of slowing down the whole process.
>
> def rw(input, output):
>     while 1:
>         line = input.readline()
>         if not line: break
>         output.write(line)
>
> f = open('very_large_file','r')
> rw(f, stdout)
>
> The file I read in contains 2053927 lines and it takes 382 sec to
> read/write it where perl does it in 15 sec.

I saw a mention of using readlines with a buffer size to get the
benefits of large reads without requiring that you read the entire file
into memory at once. Here's a concrete example. I use this idiom
(while loop over readlines() and a nested for loop processing each line)
all the time for processing large files that I don't need to have in
memory all at once.

The input file, /tmp/words2, was generated from /usr/dict/words:

sed -e 's/\(.*\)/\1 \1 \1 \1 \1/' < /usr/dict/words > /tmp/words
cat /tmp/words /tmp/words /tmp/words /tmp/words /tmp/words > /tmp/words2

It's not as big as your input file (10.2MB, 227k lines), but still big
enough to measure differences. The script below prints (on the second
of two runs to make sure the file is in memory)

68.9596179724
7.96663999557

suggesting about an 8x speedup between your original function and my
readlines version. It's still not going to be as fast as Perl, but it's
probably close enough that some other bottleneck will pop up
now...

import sys, time

def rw(input, output):
    while 1:
        line = input.readline()
        if not line: break
        output.write(line)

f = open('/tmp/words2','r')
devnull = open('/dev/null','w')

t = time.time()
rw(f, devnull)
print time.time() - t

def rw2(input, output):
    lines = input.readlines(100000)
    while lines:
        output.writelines(lines)
        lines = input.readlines(100000)

f = open('/tmp/words2','r')

t = time.time()
rw2(f, devnull)
print time.time() - t



Cheers,

--
Skip Montanaro | Mojam: "Uniting the World of Music" http://www.mojam.com/
skip@mojam.com | Musi-Cal: http://www.musi-cal.com/ | 518-372-5583
Python too slow for real world [ In reply to ]
mlh@idt.ntnu.no (Magnus L. Hetland) writes:

> (And... How about builtin regexes in P2?)

Um, why? I don't see any need at all for them to move from
module-status to core-language-status.

The only way that I could understand the desire for it would be if one
wanted to write little scripts that were basically just some control
flow around regexes and string substitution. That is, something that
looked like most of the programs written in that other P language. ;-)

In all seriousness, what reason do you have for making that
suggestion? I am willing to believe that there might be a good reason
to do so, but it certainly isn't immediately obvious.

-Justin
Python too slow for real world [ In reply to ]
Justin Sheehy wrote:
>
> > (And... How about builtin regexes in P2?)
>
> In all seriousness, what reason do you have for making that
> suggestion? I am willing to believe that there might be a good reason
> to do so, but it certainly isn't immediately obvious.

One benefit would be that the compiler could compile regexps at the same
time everything else is being compiled.

--
Paul Prescod - ISOGEN Consulting Engineer speaking for only himself
http://itrc.uwaterloo.ca/~papresco

Company spokeswoman Lana Simon stressed that Interactive
Yoda is not a Furby. Well, not exactly.

"This is an interactive toy that utilizes Furby technology,"
Simon said. "It will react to its surroundings and will talk."
- http://www.wired.com/news/news/culture/story/19222.html
Python too slow for real world [ In reply to ]
"Magnus L. Hetland" wrote:
>
> Christian Tismer <tismer@appliedbiometrics.com> writes:
>
> > Just did a little more cleanup to the code.
> > This it is:
>
> Hm. This code is nice enough (although not very intuitive...) But
> isn't it a bit troublesome that this sort of thing (which in many ways
> is a natural application for Python) is so much simpler to implement
> (in an efficient enough way) in Perl?

Well, Python has its trouble with its generalism, all the
object protocols, the stack machine, the name lookups, which
all apply even for the simplest problems like Arne's.
This leads to non-intuitive optimization tricks like the ones I showed,
although my buffering technique applies to other languages as
well. The brain-damaging part is running over big, partial
chunks of memory, trying to process them effectively without
much object creation, and making sure that the parts glue
together correctly, the last record isn't missing and so on.
The real work is hidden somewhere in between, like a side effect.

> Can something be done about it? Perhaps a buffering parameter to
> fileinput? In that case, a lot of the code could be put in that
> module, as part of the standard distribution... Even so -- you would
> somehow have to be able to treat the buffers as blocks... Hm.

I think something can be done.
First, I think I can set up a framework for this class of
problems, which takes a line-oriented algorithm and spits
out such a convoluted thing which does the same.

Another thing which appears worthwhile is generalizing the
readline function. I used that in my own buffered files,
but this would be twice as fast if readline/readlines could do
this alone.

What I need is a variable line delimiter which can be set
as a property of a file object. In this case, I would
use ">" as the delimiter. For a fast XML scanner (which just
does the right partitioning into XML pieces, nothing else),
I would use "<" as the delimiter, read such chunks and break
them on ">", with a little repair code for comments,
">" appearing in attributes, etc.

Conclusion:
My readline would be parameterized by a delimiter string.
I would *not* leave it attached to the line (like the CRs);
instead, I would return the delimiter itself as the EOF indicator.
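
As a rough sketch, here is what such a delimiter-driven reader could look
like when prototyped in pure Python (the class name and interface are
invented for illustration, and for simplicity this sketch signals end of
file with an empty string rather than with the delimiter):

import string

class RecordReader:
    def __init__(self, file, delimiter, bufsize=1 << 16):
        self.file = file
        self.delimiter = delimiter
        self.bufsize = bufsize
        self.pending = ''     # tail of the last chunk, not yet terminated
        self.records = []     # complete records waiting to be handed out
        self.eof = 0

    def readrecord(self):
        # return the next delimiter-separated record (delimiter stripped),
        # or '' once the file is exhausted
        while not self.records:
            if self.eof:
                if self.pending:
                    last, self.pending = self.pending, ''
                    return last
                return ''
            chunk = self.file.read(self.bufsize)
            if not chunk:
                self.eof = 1
                continue
            parts = string.split(self.pending + chunk, self.delimiter)
            self.pending = parts[-1]
            # drop empty pieces, e.g. before a leading delimiter
            self.records = filter(None, parts[:-1])
        record = self.records[0]
        del self.records[0]
        return record

# e.g. r = RecordReader(open('test.dat', 'r'), '>')
#      rec = r.readrecord()   # one dataset per call, '>' stripped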

> (And... How about builtin regexes in P2?)

No. Noo! Please never! :-)
I really hate them from a design point of view, and they shouldn't influence
Python in any way. What I like much better is Marc Lemburg's
tagging engine, which could have been used for this problem.
One should think of a nicer interface which allows one to
build readable, efficient tagging engines from Python code,
since at the moment this is a little at the assembly level :-)
All in all, I'd like to express little engines in Python,
but not these ugly, undebuggable, unreadable fly-dirt strings
which they call "regexen".
But that's my private opinion, which should not be an attack
on anybody. I just prefer little machines which can interact
with Python directly.

ciao - chris

--
Christian Tismer :^) <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaiserin-Augusta-Allee 101 : *Starship* http://starship.python.net
10553 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
we're tired of banana software - shipped green, ripens at home
Python too slow for real world [ In reply to ]
In article <37215EFB.433AFCA6@prescod.net>,
Paul Prescod <paul@prescod.net> wrote:
>Justin Sheehy wrote:
>>
>>> (And... How about builtin regexes in P2?)
>>
>> In all seriousness, what reason do you have for making that
>> suggestion? I am willing to believe that there might be a good reason
>> to do so, but it certainly isn't immediately obvious.
>
>One benefit would be that the compiler could compile regexps at the same
>time everything else is being compiled.

<shrug> If you really care and if you're going to run the same program
multiple times, just use pickle.
--
--- Aahz (@netcom.com)

Hugs and backrubs -- I break Rule 6 <*> http://www.rahul.net/aahz/
Androgynous poly kinky vanilla queer het

Hi! I'm a beta for SigVirus 2000! Copy me into your .sigfile!
Python too slow for real world [ In reply to ]
Paul Prescod <paul@prescod.net> wrote:
> One benefit would be that the compiler could compile regexps at the same
> time everything else is being compiled.

It is extremely rare that regex compilation time is a major issue. If
you're using a regex inside your inner loop and the amount of data you
feed through it is large enough to matter, you should be compiling it
yourself outside the loop. It's still done at run-time, but it's only
done once so it's almost certainly a trivial amount of time devoted to
that.
Python too slow for real world [ In reply to ]
Justin Sheehy <justin@linus.mitre.org> writes:

> mlh@idt.ntnu.no (Magnus L. Hetland) writes:
>
> > (And... How about builtin regexes in P2?)
>
> Um, why? I don't see any need at all for them to move from
> module-status to core-language-status.
[...]
>
> In all seriousness, what reason do you have for making that
> suggestion? I am willing to believe that there might be a good reason
> to do so, but it certainly isn't immediately obvious.

Now, that's really simple -- because re.py is slow. I thought maybe
some of the slowness might be improved by a C implementation, that's
all. Not too important to me...
>
> -Justin
>

--
Magnus        > Hi! I'm the signature virus 99!
Lie           > Copy me into your signature and join the fun!
Hetland       http://arcadia.laiv.org <arcadia@laiv.org>
Python too slow for real world [ In reply to ]
"Magnus L. Hetland" wrote:
>
> Justin Sheehy <justin@linus.mitre.org> writes:
>
> > mlh@idt.ntnu.no (Magnus L. Hetland) writes:
> >
> > > (And... How about builtin regexes in P2?)
> >
> > Um, why? I don't see any need at all for them to move from
> > module-status to core-language-status.
> [...]
> >
> > In all seriousness, what reason do you have for making that
> > suggestion? I am willing to believe that there might be a good reason
> > to do so, but it certainly isn't immediately obvious.
>
> Now, that's really simple -- because re.py is slow. I thought maybe
> some of the slowness might be improved by a c-implementation, that's
> all. Not too important to me...

Now, true for re.py, but if you just care about speed,
you can abuse the underlying pcre module, which is compiled.
For simple cases where I just needed to match something,
the "code" attribute of a compiled re object can be used
instead of the re object.
I admit this is a hack, but it gave me a speedup of a factor of 3
for a simple matching case. pcre's speed seems to be
of the same order of magnitude as regex was.

In the long term, I think it makes sense to build the rest of
re.py into pcre as well. Then I still would not see any reason
to embed its functionality into the language.

ciao - chris

--
Christian Tismer :^) <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaiserin-Augusta-Allee 101 : *Starship* http://starship.python.net
10553 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
we're tired of banana software - shipped green, ripens at home
Python too slow for real world [ In reply to ]
Christian Tismer <tismer@appliedbiometrics.com> writes:

> "Magnus L. Hetland" wrote:
> >
[...]
>
> In the long term, I think it makes sense to build the rest of
> re.py also into pcre. Then I still would not see any reason
> to embed its functionaliy into the language.

Agreed. I guess this is what I really wanted. :)

--
Magnus        > Hi! I'm the signature virus 99!
Lie           > Copy me into your signature and join the fun!
Hetland       http://arcadia.laiv.org <arcadia@laiv.org>
