Mailing List Archive

Python too slow for real world [ In reply to ]
On 25 Apr 1999, Magnus L. Hetland wrote:

> > Um, why? I don't see any need at all for them to move from
> > module-status to core-language-status.
> [...]
> >
> > In all seriousness, what reason do you have for making that
> > suggestion? I am willing to believe that there might be a good reason
> > to do so, but it certainly isn't immediately obvious.
>
> Now, that's really simple -- because re.py is slow. I thought maybe
> some of the slowness might be improved by a c-implementation, that's
> all. Not too important to me...

Um....two wrong assumptions here:
1. C implementation is /not/ the same as core status: C extension modules
are numerous and wonderful, for example...
2. re.py is a (very thin) wrapper around pcre, a C extension module for
Perl Compatible Regular Expressions.

Which just goes to say that while pcre can certainly be optimized, it
can't be done by simply rewriting it in C.
<0.5 wink>

but-we-can-always-call-Perl-on-the-fly-to-evaluate-regular-expression-ly
y'rs,
Z.
--
Moshe Zadka <mzadka@geocities.com>.
QOTD: What fun to me! I'm not signing permanent.
Python too slow for real world [ In reply to ]
Moshe Zadka wrote in message ...
>Um....two wrong assumptions here:
>1. C implementation is /not/ the same as core status: C extension modules
>are numerous and wonderful, for example...

Why not? I can only think of differences for types, and even these aren't
significant. I can't see any distinction between core modules and extension
modules, other than the fact that you need an extra file hanging around.

Mark.
Python too slow for real world [ In reply to ]
Moshe Zadka wrote:
>
> On 25 Apr 1999, Magnus L. Hetland wrote:
...
> > Now, that's really simple -- because re.py is slow. I thought maybe
> > some of the slowness might be improved by a c-implementation, that's
> > all. Not too important to me...
>
> Um....two wrong assumptions here:
> 1. C implementation is /not/ the same as core status: C extension modules
> are numerous and wonderful, for example...

As Mark already pointed out, what's the difference?
You will not see any performance change, whether a module
is in the core or in an extra DLL. The calling mechanism
is always the same (well, the call instruction on x86 takes one byte more
into the DLL :-) and not very fast. So the fewer calls, the better.

> 2. re.py is a (very thin) wrapper around pcre, a C extension module for
> Perl Compatible Regular Expressions.

Right, but it carries the class protocol overhead all the time.
Returning matches always involves creating an instance of
a match object, and a number of tuple-building operations
are involved. This is where unnecessary interpreter overhead
can be saved, and results could be created more efficiently
from a C extension, since it is not forced to hold every
intermediate result in a Python object, which involves memory
allocation, and so on.

> Which just goes to say that while pcre can certainly be optimized, it
> can't be done by simply rewriting it in C.
> <0.5 wink>

Surely not since it is written in C <1.5 wink>.
If you are referring to re.py, a (nearly) direct translation into
C would indeed not help too much. P2C does that, but since it can
only remove the interpreter overhead, you will not save more
than 30-40 percent. A hand-coded C version would try to avoid
as much overhead as possible. The main difference is that you
know the data types which you are dealing with, so you will optimize
this case, instead of having to take care of the general case
as Python does.

But if Python already had an optional strong typing concept, plus
some sealing option for modules and classes that would allow
compiled method tables to be used instead of attribute lookups,
things could change dramatically. Given that, re.py could
be made fast enough without involving C, I believe.

ciao - chris

--
Christian Tismer :^) <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaiserin-Augusta-Allee 101 : *Starship* http://starship.python.net
10553 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
we're tired of banana software - shipped green, ripens at home
Python too slow for real world [ In reply to ]
Moshe Zadka writes:
>2. re.py is a (very thin) wrapper around pcre, a C extension module for
>Perl Compatible Regular Expressions.
>
>Which just goes to say that while pcre can certainly be optimized, it
>can't be done by simply rewriting it in C.

Things like search-and-replace, though, take substantial hits
from being implemented in Python code; rewriting re.py in C is one of
the top items on my TODO list for the regex code.

I also want some sort of code generation interface so that
custom syntaxes can be implemented. The problem there is how to do it
safely so that it's not possible to crash the interpreter by
generating buggy expressions. (We'd want the re module usable by code
in a restricted execution environment, after all.)

>but-we-can-always-call-Perl-on-the-fly-to-evaluate-regular-expression-ly
>y'rs,

In one early experiment, there was a version of the regex module that
used Perl's pregcomp() and pregexec() functions. It worked fine in
some admittedly limited testing, but the compiled module was something
like 500K because the regex code pulled in the rest of Perl. I
couldn't figure out how to extract the regex code to fix this, so that
experiment came to a quick end. Pity, really.

--
A.M. Kuchling http://starship.python.net/crew/amk/
You see, you're all wrong. My name isn't Brain Eater. I am the Head. And I
don't simply eat brains. It's something deeper than that. Something magical.
Something transubstantiational. Excuse me one moment while I ... <shlurppp>
-- The first villain, in ENIGMA #1: "The Lizard, The Head, The Enigma"
Python too slow for real world [ In reply to ]
Roy Smith wrote:
>
> It is extremely rare that regex compilation time is a major issue.

I didn't mean to imply it was. Justin asked about the benefits
of "first class regexps" so I pointed out one.

> If
> you're using a regex inside your inner loop and the amount of data you
> feed through it is large enough to matter, you should be compiling it
> yourself outside the loop. It's still done at run-time, but it's only
> done once so it's almost certainly a trivial amount of time devoted to
> that.

It is a performance issue if you don't know that regexps are supposed to
be compiled. Plus it is a code complexity issue. Having the language
handle it for you automatically is in keeping with other aspects of
Python's design. Heck, I don't even have to compile Python *programs*.

--
Paul Prescod - ISOGEN Consulting Engineer speaking for only himself
http://itrc.uwaterloo.ca/~papresco

Company spokeswoman Lana Simon stressed that Interactive
Yoda is not a Furby. Well, not exactly.

"This is an interactive toy that utilizes Furby technology,"
Simon said. "It will react to its surroundings and will talk."
- http://www.wired.com/news/news/culture/story/19222.html
Python too slow for real world [ In reply to ]
Paul Prescod <paul@prescod.net> wrote:
> It is a performance issue if you don't know that regexps are supposed to
> be compiled.

I hope this doesn't sound as bad as I fear it might, but part of being a
good programmer (or at least a good computer scientist) is to understand
performance issues like this.

Regular expression theory hasn't changed a whole bunch in the last 20
years; it's the same stuff in C, Perl, and any other language that has RE
functionality (either built-in or through some library). The idea of
factoring constant operations out of loops is the same today as it was 10,
20, 30 years ago.

If you don't know that RE's get compiled (and that the compilation stage
can be expensive), you don't understand the tool you're using. If you
don't understand that factoring the expensive constant compilation process
out of a loop is important to make your program run fast, you aren't a
good programmer. No programming language can help that.
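[To make the factoring concrete, here is a minimal sketch (editor's
illustration with invented data, not code from the thread); both loops
produce the same count, the difference is what each iteration pays:]

```python
import re

lines = ["spam 1", "eggs", "spam 2"] * 1000

# Unfactored: the pattern string is handed to re.match() on every
# iteration, so every pass pays at least a cache lookup (and a full
# recompilation on a cache miss).
count = 0
for line in lines:
    if re.match(r"spam \d+", line):
        count = count + 1

# Factored: compile once, outside the loop, and reuse the compiled
# pattern object on each iteration.
spam = re.compile(r"spam \d+")
count = 0
for line in lines:
    if spam.match(line):
        count = count + 1
```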

--
Roy Smith <roy@popmail.med.nyu.edu>
New York University School of Medicine
Python too slow for real world [ In reply to ]
In article <613145F79272D211914B0020AFF6401914DAD8@gandalf.digicool.com>,
Brian Lloyd <Brian@digicool.com> wrote:
> There are also some general optimizations that can be used in
> places where speed is an issue, such as avoiding repeated
> attribute lookups (esp. in loops). This version of your read_write
> function uses the same basic algorithm, but forgoes re for more
> specific tools (slicing, string.split) and has some examples of
> optimizations to minimize attribute lookups. I haven't timed it
> or anything, but I'd be surprised if it wasn't noticeably
> faster.

Brian, just to followup on your post I profiled his original code and yours:
PII 450, 128M, WinNT
Original: 5.126 seconds
Your Ver: 1.512 seconds

Tom

Python too slow for real world [ In reply to ]
Pythonistas--

Roy Smith wrote:
>
> I hope this doesn't sound as bad as I fear it might, but part of being a
> good programmer (or at least a good computer scientist) is to understand
> performance issues like this.
>
> Regular expression theory hasn't changed a whole bunch in the last 20
> years; it's the same stuff in C, Perl, and any other language that has RE
> functionality (either built-in or through some library). The idea of
> factoring constant operations out of loops is the same today as it was 10,
> 20, 30 years ago.
>
> If you don't know that RE's get compiled (and that the compilation stage
> can be expensive), you don't understand the tool you're using. If you
> don't understand that factoring the expensive constant compilation process
> out of a loop is important to make your program run fast, you aren't a
> good programmer. No programming language can help that.
>

Roy, you're absolutely right. However, the person who originally posted
the question (and I'm sorry, but I've forgotten who) was speaking from a
newbie's viewpoint. Stuff that most of us on the list take for granted
is not always obvious to someone who is new both to the language and to
the field in general. And, things that are obvious in C or C++ or some
other language are not always obvious when moving to a new language, and
some idioms simply don't exist in some languages. Look at the questions
posted the other day by the person who wanted to know how to do things
in Python that he had to do all the time in C++ ... it was just not
obvious to him that those were neither necessary nor desirable in
Python.

In sum, I think tips on how to optimize source code need to be part of
the introductory documentation of any language; not everyone has gone to
school and learned these things in Pascal 101, and even if they have
there are idioms that won't translate.

<if-python-were-pascal-we'd-all-be-in-trouble>-ly y'rs,
Ivan
----------------------------------------------
Ivan Van Laningham
Callware Technologies, Inc.
ivanlan@callware.com
http://www.pauahtun.org
See also:
http://www.foretec.com/python/workshops/1998-11/proceedings.html
Army Signal Corps: Cu Chi, Class of '70
----------------------------------------------
Python too slow for real world [ In reply to ]
Roy Smith wrote:
>
> Paul Prescod <paul@prescod.net> wrote:
> > It is a performance issue if you don't know that regexps are supposed to
> > be compiled.
>
> I hope this doesn't sound as bad as I fear it might, but part of being a
> good programmer (or at least a good computer scientist) is to understand
> performance issues like this.

No doubt. But Python is not designed to be a programming language
exclusively for good programmers or computer scientists. As I understand
the design, it also caters to those that are new programmers and to those
that don't want to worry about any more details than they absolutely have
to.

> If you
> don't understand that factoring the expensive constant compilation process
> out of a loop is important to make your program run fast, you aren't a
> good programmer. No programming language can help that.

That is absolutely not true. It is, however, a common myth. A programming
language can have default behaviours that support beginning programmers.
Python does that in most cases.

In the last Perl/Python flame war someone used the analogy of clumsiness.
Sure, a clumsy person could kill themselves with a dull knife but that
doesn't mean that a loaded gun and a dull knife are equally dangerous.

I'm not promoting any particular language change here. I'm arguing against
the throw-your-hands-in-the-air-and-expect-programmers-to-be-experts
argument that is used to shout down any change that makes the language
easier for non-experts. That's the design philosophy that gives rise to
other languages that start with P: "You mean you aren't familiar with the
string interpolation conventions of the C-shell and the input argument
defaulting of Awk?"

--
Paul Prescod - ISOGEN Consulting Engineer speaking for only himself
http://itrc.uwaterloo.ca/~papresco

Company spokeswoman Lana Simon stressed that Interactive
Yoda is not a Furby. Well, not exactly.

"This is an interactive toy that utilizes Furby technology,"
Simon said. "It will react to its surroundings and will talk."
- http://www.wired.com/news/news/culture/story/19222.html
Python too slow for real world [ In reply to ]
Roy Smith wrote:
[elided]
> I hope this doesn't sound as bad as I fear it might, but part of being a
> good programmer (or at least a good computer scientist) is to understand
> performance issues like this.
[elided]

I disagree with the assumption behind that statement. The assumption Roy
makes is that only trained programmers or computer scientists will be using
a tool like Python. I believe the audience that would benefit most from an
easy-to-use language like Python is "Subject Matter Experts" (SMEs). An SME
knows their field (e.g. accountant, biologist, physicist, network manager,
etc.) and may find a need to automate or compute something whose scope or size
does not justify calling in a computer programmer or scientist. This is
where a simple-to-learn language such as Python finds a ready home.

When an SME finds the tool too slow, it would be nice if they could post
their problem to a group like this without fear of insult, intended or not.
Python too slow for real world [ In reply to ]
James Logajan <JamesL@Lugoj.Com> wrote:
> When an SME finds the tool too slow, it would be nice if they could post
> their problem to a group like this without fear of insult, intended or not.

There was no intent to insult.

The thread was moving in the direction of "my program using regex is too
slow, and I think the solution would be to move regex into the python core
to make it faster". I was just pointing out why that is flawed reasoning.

--
Roy Smith <roy@popmail.med.nyu.edu>
New York University School of Medicine
Python too slow for real world [ In reply to ]
roy@popmail.med.nyu.edu (Roy Smith) writes:
|James Logajan <JamesL@Lugoj.Com> wrote:
|> When an SME finds the tool too slow, it would be nice if they could post
|> their problem to a group like this without fear of insult, intended or not.
|
| There was no intent to insult.
|
| The thread was moving in the direction of "my program using regex is too
| slow, and I think the solution would be to move regex into the python core
| to make it faster". I was just pointing out why that is flawed reasoning.

I didn't read any insult. You venture into a somewhat sensitive area
when you start making distinctions like ``good programmer'', and I reckon
some criticism is inevitable. It's going to take a while to come to terms
with this issue (Python for people who don't qualify as expert programmers),
and in this process we need to feel free to speak up without too much fear
of saying something politically incorrect.

Donn Cave, University Computing Services, University of Washington
donn@u.washington.edu
Python too slow for real world [ In reply to ]
Roy Smith wrote:

> Paul Prescod <paul@prescod.net> wrote:
> > It is a performance issue if you don't know that regexps are supposed to
> > be compiled.
>

Hi,

Sorry if this is obvious, but does the Regex class cache compilations? I
couldn't find any info on this. If it does, it would support subject matter
experts nicely, like someone else mentioned. But even as a CS person, I like
how the Java "ORO" Regex Library has automatic Least Recently Used caching.
This lets application code be simpler, as well as making it easier to share
compiled expressions between objects.

- Robb
Python too slow for real world [ In reply to ]
Robb Shecter wrote:
>
> Roy Smith wrote:
>
> > Paul Prescod <paul@prescod.net> wrote:
> > > It is a performance issue if you don't know that regexps are supposed to
> > > be compiled.
> >
>
> Hi,
>
> Sorry if this is obvious, but does the Regex class cache compilations? I

It does. You can see for yourself simply by opening re.py.

> couldn't find any info on this. If it does, it would support subject matter
> experts nicely, like someone else mentioned. But even as a CS person, I like
> how the Java "ORO" Regex Library has automatic Least Recently Used caching.
> This lets application code be simpler, as well as making it easier to share
> compiled expressions between objects.

Python does what it can to compile and cache regexen.
It just loses a lot of time with postprocessing of
results, calls to the internal regex machine in pcre
and so on. This part of the regex system will be
coded in C as well in some future, which should
minimize the overhead.

ciao - chris

--
Christian Tismer :^) <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaiserin-Augusta-Allee 101 : *Starship* http://starship.python.net
10553 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
we're tired of banana software - shipped green, ripens at home
Python too slow for real world [ In reply to ]
Robb Shecter wrote:
> Sorry if this is obvious, but does the Regex class cache compilations? I
> couldn't find any info on this. If it does, it would support subject matter
> experts nicely, like someone else mentioned. But even as a CS person, I like
> how the Java "ORO" Regex Library has automatic Least Recently Used caching.
> This lets application code be simpler, as well as making it easier to share
> compiled expressions between objects.

re.py keeps up to 20 patterns in a cache
(see the _cachecompile function at the
top of that file).

(however, the cache cleaning code looks
a bit odd:

    if len(_cache) >= _MAXCACHE:
        _cache.clear()

maybe an LRU or a
    del _cache[random.choice(_cache.keys())]
would be better?)

</F>
Python too slow for real world [ In reply to ]
Concerning regular expression slow downs...

I am interested in running a lot of text through dozens of different
regexen (precompiled, of course). However, I am interested in only
whether or not each chunk of text passed which regexen. I don't care
about what matched and retrieving it--just a truth value. Is there a way
to get this without the overhead of the features I am not using?




----------------------------------
Nathan Clegg
nathan@islanddata.com
Python too slow for real world [ In reply to ]
Nathan Clegg <nathan@islanddata.com> wrote:
> I am interested in running a lot of text through dozens of different
> regexen (precompiled, of course). However, I am interested in only
> whether or not each chunk of text passed which regexen. I don't care
> about what matched and retrieving it--just a truth value. Is there a way
> to get this without the overhead of the features I am not using?

here's one way to do it:

import re, string

patterns = [
    r"\d+",
    r"abc\d{2,4}",
    r"p\w+"
]

def combined_pattern(patterns):
    p = re.compile(
        string.join(map(lambda x: "("+x+")", patterns), "|")
    )
    def fixup(v, m=p.code.match, r=range(1,len(patterns)+1)):
        regs = m(v)
        try:
            for i in r:
                if regs[i] != (-1, -1):
                    return i-1
        except:
            return None # no match
    return fixup

p = combined_pattern(patterns)

# p returns the index of the matching
# pattern, or None

print p("129391")
print p("abc800")
print p("abc1600")
print p("python")
print p("perl")
print p("tcl")

</F>
Python too slow for real world [ In reply to ]
> def fixup(v, m=p.code.match, r=range(1,len(patterns)+1)):

That is *perfect*. Why isn't the 'code' attribute documented??
Thanks!

----------------------------------
Nathan Clegg
nathan@islanddata.com
Python too slow for real world [ In reply to ]
Nathan Clegg writes:
>> def fixup(v, m=p.code.match, r=range(1,len(patterns)+1)):
>That is *perfect*. Why isn't the 'code' attribute documented??
>Thanks!

Because it's an internal thing that may go away in the future;
it should really be _code. For example, when re.py is translated to
C, there will be little reason to expose a public code attribute.

--
A.M. Kuchling http://starship.python.net/crew/amk/
I was offered a lot of money for my story, _John Parlabane as I Knew Him_, and
the services of a ghost to write it up from my verbal confession. (It was
assumed that, as a student, I would not be capable of coherent expression.)
-- Robertson Davies, _The Rebel Angels_
Python too slow for real world [ In reply to ]
Then I hope there will be some new facility for circumventing the
MatchObject creation when it is not necessary. It really is a shame for
an unwanted feature to slow down code. I think the MatchObject is
implemented quite well, but sometimes I don't need it--why pay for it?


On 03-May-99 Andrew M. Kuchling wrote:
> Nathan Clegg writes:
>>> def fixup(v, m=p.code.match, r=range(1,len(patterns)+1)):
>>That is *perfect*. Why isn't the 'code' attribute documented??
>>Thanks!
>
> Because it's an internal thing that may go away in the future;
> it should really be _code. For example, when re.py is translated to
> C, there will be little reason to expose a public code attribute.


----------------------------------
Nathan Clegg
nathan@islanddata.com
Python too slow for real world [ In reply to ]
[/F]
> re.py keeps up to 20 patterns in a cache
> (see the _cachecompile function at the
> top of that file).
>
> (however, the cache cleaning code looks
> a bit odd:
>
>     if len(_cache) >= _MAXCACHE:
>         _cache.clear()
>
> maybe an LRU or a
>     del _cache[random.choice(_cache.keys())]
> would be better?)

The (deeply!) hidden puzzle here is how to make the cache thread-safe. You
don't want the overhead of fussing with a lock, and the current mass
deletion doesn't need one. Wrapping your "del" in a "try: del ... except:
pass" block would work (there may be no keys remaining by the time .keys()
is executed, so random.choice([]) may gripe; and some other thread may have
deleted the same key before you get a chance, so the "del" may gripe too).

If/when re.py is recoded in C, the cache thread-safety problem goes away
(thanks to the global lock).

memory-like-a-threaded-octopus's-ly y'rs - tim
