Mailing List Archive

Tcl 8.1's regexp code (was RE: [Python-Dev] Why Foo is better than Baz)
I've consistently found that the best way to kill a thread is to rename it
accurately <wink>.

Agree w/ Guido that few people really care about the differing semantics.

Agree w/ Andrew that it's bad to pull a semantic switcheroo at this stage
anyway: code will definitely break. Like

\b(?:
(?P<keyword>and|if|else|...) |
(?P<identifier>[a-zA-Z_]\w*)
)\b

The (special)|(general) idiom relies on left-to-right match-and-out
searching of alternatives to do its job correctly. Not to mention that \b
is not a word-boundary assertion in the new pkg (talk about pointlessly
irritating differences! at least this one could be easily hidden via
brainless preprocessing).

Over the long run, moving to a DFA locks Python out of the directions Perl
is *moving*, namely embedding all sorts of runtime gimmicks in regexps that
exploit knowing the "state of the match so far". DFAs don't work that way.
I don't mind losing those possibilities, because I think the regexp
sublanguage is strained beyond its limits already. But that's a decision
with Big Consequences, so deserves some thought.

I'd definitely like the (sometimes dramatically) increased speed a DFA can
offer (btw, this code appears to use a lazily-generated DFA, to avoid the
exponential *compile*-time a straightforward DFA implementation can
suffer -- the code is very complex and lacks any high-level internal docs,
so we better hope Henry stays in love with it <0.5 wink>).

> ...
> My guess is that people don't so much translate long Perl regexp's
> to Python but simply transport their (always incomplete -- Larry Wall
> *wants* it that way :-) knowledge of Perl regexps to Python.

This is directly proportional to the number of feeble CGI programmers Python
attracts <wink>. The good news is that they wouldn't know an NFA from a DFA
if Larry bit Henry on the ass ...

> My meta-guess is that this is also Henry Spencer's and John
> Ousterhout's guess.

I think Spencer strongly favors DFA semantics regardless of fashion, and
Ousterhout is a pragmatist. So I trust JO's judgment more <0.9 wink>.

> As for Larry Wall, I guess he really doesn't care :-)

I expect he cares a lot! Because a DFA would prevent Perl from going even
more insane in its present direction.


About the age of the code, postings to comp.lang.tcl have Henry saying he
was working on the alpha version intensely as recently as Decemeber ('98).
A few complaints about the alpha release trickled in, about regexp compile
speed and regexp matching speed in specific cases. Perhaps paradoxically,
the latter were about especially simple regexps with long fixed substrings
(where this mountain of sophisticated machinery is likely to get beat cold
by an NFA with some fixed-substring lookahead smarts -- which latter Henry
intended to graft into this pkg too).

[Andrew]
> BTW, there's an interesting reference, I assume to this code, in
> _Mastering Regular Expressions_; Spencer is quoted on page 121 as
> saying it's "at worst quadratic in text size.".

[Guido]
> Not sure if that was the same code -- this is *new* code, not
> Spencer's old code. I think Friedl's book is older than the current
> code.

I expect this is an invariant, though: it's not natural for a DFA to know
where subexpression matches begin and end, and there's a pile of xxx_dissect
functions in regexec.c that use what strongly appear to be worst-case
quadratic-time algorithms for figuring that out after it's known that the
overall expression has *a* match. Expect too, but don't know, that only
pathological cases are actually expensive.


Question: has this package been released in any other context, or is it
unique to Tcl? I searched in vain for an announcement (let alone code) from
Henry, or any discussion of this code outside the Tcl world.

whatever-happens-i-vote-we-let-them-debug-it<wink>-ly y'rs - tim
Re: Tcl 8.1's regexp code [ In reply to ]
On Wed, 5 May 1999, Tim Peters wrote:
>...
> Question: has this package been released in any other context, or is it
> unique to Tcl? I searched in vain for an announcement (let alone code) from
> Henry, or any discussion of this code outside the Tcl world.

Apache uses it.

However, the Apache guys have considered possibility updating the thing. I
gather that they have a pretty old snapshot. Another guy mentioned PCRE
and I pointed out that Python uses it for its regex support. In other
words, if Apache *does* update the code, then it may be that Apache will
drop the HS engine in favor of PCRE.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
Re: Tcl 8.1's regexp code (was RE: [Python-Dev] Why Foo is better than Baz) [ In reply to ]
>>>>> "TP" == Tim Peters <tim_one@email.msn.com> writes:

TP> Over the long run, moving to a DFA locks Python out of the
TP> directions Perl is *moving*, namely embedding all sorts of
TP> runtime gimmicks in regexps that exploit knowing the "state of
TP> the match so far". DFAs don't work that way. I don't mind
TP> losing those possibilities, because I think the regexp
TP> sublanguage is strained beyond its limits already. But that's
TP> a decision with Big Consequences, so deserves some thought.

I know zip about the internals of the various regexp package. But as
far as the Python level interface, would it be feasible to support
both as underlying regexp engines underneath re.py? The idea would be
that you'd add an extra flag (re.PERL / re.TCL ? re.DFA / re.NFA ?
re.POSIX / re.USEFUL ? :-) that would select the engine and compiler.
Then all the rest of the magic happens behind the scenes, with
appropriate exceptions thrown if there are syntax mismatches in the
regexp that can't be worked around by preprocessors, etc.

Or would that be more confusing than yet another different regexp
module?

-Barry
RE: Tcl 8.1's regexp code [ In reply to ]
[Tim]
> Question: has this package [Tcl's 8.1 regexp support] been released in
> any other context, or is it unique to Tcl? I searched in vain for an
> announcement (let alone code) from Henry, or any discussion of this code
> outside the Tcl world.

[Greg Stein]
> Apache uses it.
>
> However, the Apache guys have considered possibility updating the thing. I
> gather that they have a pretty old snapshot. Another guy mentioned PCRE
> and I pointed out that Python uses it for its regex support. In other
> words, if Apache *does* update the code, then it may be that Apache will
> drop the HS engine in favor of PCRE.

Hmm. I just downloaded the Apache 1.3.4 source to check on this, and it
appears to be using a lightly massaged version of Spencer's old (circa
'92-'94) just-POSIX regexp package. Henry has been distributing regexp pkgs
for a loooong time <wink>.

The Tcl 8.1 regexp pkg is much hairier. If the Apache folk want to switch
in order to get the Perl regexp syntax extensions, this Tcl version is worth
looking at too. If they want to switch for some other reason, it would be
good to know what that is!

The base pkg Apache uses is easily available all over the web; the pkg Tcl
8.1 is using I haven't found anywhere except in the Tcl download (which is
why I'm wondering about it -- so far, it doesn't appear to be distributed by
Spencer himself, in a non-Tcl-customized form).

looks-like-an-entirely-new-pkg-to-me-ly y'rs - tim
RE: Tcl 8.1's regexp code (was RE: [Python-Dev] Why Foo is better than Baz) [ In reply to ]
[.Tim notes that moving to a DFA regexp engine would rule out some future
aping of Perl mistakes <wink>]

[Barry "The Great Compromiser" Warsaw]
> I know zip about the internals of the various regexp package. But as
> far as the Python level interface, would it be feasible to support
> both as underlying regexp engines underneath re.py? The idea would be
> that you'd add an extra flag (re.PERL / re.TCL ? re.DFA / re.NFA ?
> re.POSIX / re.USEFUL ? :-) that would select the engine and compiler.
> Then all the rest of the magic happens behind the scenes, with
> appropriate exceptions thrown if there are syntax mismatches in the
> regexp that can't be worked around by preprocessors, etc.
>
> Or would that be more confusing than yet another different regexp
> module?

It depends some on what percentage of the Python distribution Guido wants to
devote to regexp code <0.6 wink>; the Tcl pkg would be the largest block of
code in Modules/, where regexp packages already consume more than anything
else.

It's a lot of delicate, difficult code. Someone would need to step up and
champion each alternative package. I haven't asked Andrew lately, but I'd
bet half a buck the thrill of supporting pcre has waned.

If there were competing packages, your suggested interface is fine. I just
doubt the Python developers will support more than one (Andrew may still be
young, but he can't possibly still be naive enough to sign up for two of
these nightmares <wink>).

i'm-so-old-i-never-signed-up-for-one-ly y'rs - tim