Mailing List Archive

Re: regexp performance
[.Cc'ed to the String-SIG; sheesh, what's the point of having SIGs
otherwise?]

Fredrik Lundh writes:
>any special pattern constructs that are in need of per-
>formance improvements? (compared to Perl, that is).

In the 1.5 source tree, I think one major slowdown is coming from the
malloc'ed failure stack. This was introduced in order to prevent an
expression like (x)* from filling the stack when applied to a string
contained 50,000 'x' characters (hence 50,000 recursive function
calls). I'd like to get rid of this stack because it's slow and
requires much tedious patching of the upstream PCRE.

>or maybe anyone has an extensive performance test
>suite for perlish regular expressions? (preferrably based
>on how real people use regular expressions, not only on
>things that are known to be slow if not optimized)

Friedl's book describes several optimizations which aren't implemented
in PCRE. The problem is that PCRE never builds a parse tree, and
parse trees are easy to analyse recursively. Instead, PCRE's
functions actually look at the compiled byte codes (for example, look
at find_firstchar or is_anchored in pypcre.c), but this makes analysis
functions hard to write, and rearranging the code near-impossible.

--
A.M. Kuchling http://starship.python.net/crew/amk/
I didn't say it was my fault. I said it was my responsibility. I know the
difference.
-- Rose Walker, in SANDMAN #60: "The Kindly Ones:4"
RE: [String-SIG] Re: regexp performance [ In reply to ]
[Andrew M. Kuchling]
> ...
> Friedl's book describes several optimizations which aren't implemented
> in PCRE. The problem is that PCRE never builds a parse tree, and
> parse trees are easy to analyse recursively. Instead, PCRE's
> functions actually look at the compiled byte codes (for example, look
> at find_firstchar or is_anchored in pypcre.c), but this makes analysis
> functions hard to write, and rearranging the code near-impossible.

This is wonderfully & ironically Pythonic. That is, the Python compiler
itself goes straight to byte code, and the optimization that's done works at
the latter low level. Luckily <wink>, very little optimization is
attempted, and what's there only replaces one bytecode with another of the
same length. If it tried to do more, it would have to rearrange the code
...

the-more-things-differ-the-more-things-don't-ly y'rs - tim
Re: RE: [String-SIG] Re: regexp performance [ In reply to ]
Tim Peters <tim_one@email.msn.com> wrote:
> > The problem is that PCRE never builds a parse tree, and
> > parse trees are easy to analyse recursively. Instead, PCRE's
> > functions actually look at the compiled byte codes (for example, look
> > at find_firstchar or is_anchored in pypcre.c), but this makes analysis
> > functions hard to write, and rearranging the code near-impossible.
>
> This is wonderfully & ironically Pythonic. That is, the Python compiler
> itself goes straight to byte code, and the optimization that's done works at
> the latter low level.

yeah, but by some reason, people (including GvR) expect a
regular expression machinery to be more optimized than the
language interpreter ;-)

</F>