Mailing List Archive: Why Foo is better than Baz

Why Foo is better than Baz

May 3, 1999, 7:01 AM

Post #1 of 12 (3779 views)

scriptics is positioning tcl as a perl killer:

http://www.scriptics.com/scripting/perl.html

afaict, unicode and event handling are the two
main thingies missing from python 1.5.

-- unicode: is on its way.

-- event handling: asynclib/asynchat provides an
awesome framework for event-driven socket pro-
gramming. however, Python still lacks good cross-
platform support for event-driven access to files
and pipes. are threads good enough, or would it
be cool to have something similar to Tcl's fileevent
stuff in Python?

-- regexps: has anyone compared the new uni-
code-aware regexp package in Tcl with pcre?

comments?

</F>

btw, the rebol folks have reached 2.0:
http://www.rebol.com/

maybe 1.6 should be renamed to Python 6.0?

Re: Why Foo is better than Baz [ In reply to ]

akuchlin at cnri

May 3, 1999, 8:14 AM

Post #2 of 12 (3739 views)

Permalink

Fredrik Lundh writes:
>-- regexps: has anyone compared the new uni-
>code-aware regexp package in Tcl with pcre?

I looked at it a bit when Tcl 8.1 was in beta; it derives from
Henry Spencer's 1998-vintage code, which seems to try to do a lot of
optimization and analysis. It may even compile DFAs instead of NFAs
when possible, though it's hard for me to be sure. This might give it
a substantial speed advantage over engines that do less analysis, but
I haven't benchmarked it. The code is easy to read, but difficult to
understand because the theory underlying the analysis isn't explained
in the comments; one feels there should be an accompanying paper to
explain how everything works, and it's why I'm not sure if it really
is producing DFAs for some expressions.

Tcl seems to represent everything as UTF-8 internally, so
there's only one regex engine; there's . The code is scattered over
more files:

amarok generic>ls re*.[ch]
regc_color.c regc_locale.c regcustom.h regerrs.h regfree.c
regc_cvec.c regc_nfa.c rege_dfa.c regex.h regfronts.c
regc_lex.c regcomp.c regerror.c regexec.c regguts.h
amarok generic>wc -l re*.[ch]
742 regc_color.c
170 regc_cvec.c
1010 regc_lex.c
781 regc_locale.c
1528 regc_nfa.c
2124 regcomp.c
85 regcustom.h
627 rege_dfa.c
82 regerror.c
18 regerrs.h
308 regex.h
952 regexec.c
25 regfree.c
56 regfronts.c
388 regguts.h
8896 total
amarok generic>

This would be an issue for using it with Python, since all
these files would wind up scattered around the Modules directory. For
comparison, pypcre.c is around 4700 lines of code.

--
A.M. Kuchling http://starship.python.net/crew/amk/
Things need not have happened to be true. Tales and dreams are the
shadow-truths that will endure when mere facts are dust and ashes, and forgot.
-- Neil Gaiman, _Sandman_ #19: _A Midsummer Night's Dream_

Re: Why Foo is better than Baz [ In reply to ]

guido at CNRI

May 3, 1999, 8:32 AM

Post #3 of 12 (3744 views)

Permalink

> I looked at it a bit when Tcl 8.1 was in beta; it derives from
> Henry Spencer's 1998-vintage code, which seems to try to do a lot of
> optimization and analysis. It may even compile DFAs instead of NFAs
> when possible, though it's hard for me to be sure. This might give it
> a substantial speed advantage over engines that do less analysis, but
> I haven't benchmarked it. The code is easy to read, but difficult to
> understand because the theory underlying the analysis isn't explained
> in the comments; one feels there should be an accompanying paper to
> explain how everything works, and it's why I'm not sure if it really
> is producing DFAs for some expressions.
>
> Tcl seems to represent everything as UTF-8 internally, so
> there's only one regex engine; there's .

Hmm... I looked when Tcl 8.1 was in alpha, and I *think* that at that
point the regex engine was compiled twice, once for 8-bit chars and
once for 16-bit chars. But this may have changed.

I've noticed that Perl is taking the same position (everything is
UTF-8 internally). On the other hand, Java distinguishes 16-bit chars
from 8-bit bytes. Python is currently in the Java camp. This might
be a good time to make sure that we're still convinced that this is
the right thing to do!

> The code is scattered over
> more files:
>
> amarok generic>ls re*.[ch]
> regc_color.c regc_locale.c regcustom.h regerrs.h regfree.c
> regc_cvec.c regc_nfa.c rege_dfa.c regex.h regfronts.c
> regc_lex.c regcomp.c regerror.c regexec.c regguts.h
> amarok generic>wc -l re*.[ch]
> 742 regc_color.c
> 170 regc_cvec.c
> 1010 regc_lex.c
> 781 regc_locale.c
> 1528 regc_nfa.c
> 2124 regcomp.c
> 85 regcustom.h
> 627 rege_dfa.c
> 82 regerror.c
> 18 regerrs.h
> 308 regex.h
> 952 regexec.c
> 25 regfree.c
> 56 regfronts.c
> 388 regguts.h
> 8896 total
> amarok generic>
>
> This would be an issue for using it with Python, since all
> these files would wind up scattered around the Modules directory. For
> comparison, pypcre.c is around 4700 lines of code.

I'm sure that if it's good code, we'll find a way. Perhaps a more
interesting question is whether it is Perl5 compatible. I contacted
Henry Spencer at the time and he was willing to let us use his code.

--Guido van Rossum (home page: http://www.python.org/~guido/)

Re: Why Foo is better than Baz [ In reply to ]

akuchlin at cnri

May 3, 1999, 8:56 AM

Post #4 of 12 (3728 views)

Permalink

Guido van Rossum writes:
>Hmm... I looked when Tcl 8.1 was in alpha, and I *think* that at that
>point the regex engine was compiled twice, once for 8-bit chars and
>once for 16-bit chars. But this may have changed.

It doesn't seem to currently; the code in tclRegexp.c looks
like this:

/* Remember the UTF-8 string so Tcl_RegExpRange() can convert the
* matches from character to byte offsets.
*/
regexpPtr->string = string;
Tcl_DStringInit(&stringBuffer);
uniString = Tcl_UtfToUniCharDString(string, -1, &stringBuffer);
numChars = Tcl_DStringLength(&stringBuffer) / sizeof(Tcl_UniChar);
/* Perform the regexp match. */
result = TclRegExpExecUniChar(interp, re, uniString, numChars, -1,
((string > start) ? REG_NOTBOL : 0));

ISTR the Spencer engine does, however, define a small and
large representation for NFAs and have two versions of the engine, one
for each representation. Perhaps that's what you're thinking of.

>I've noticed that Perl is taking the same position (everything is
>UTF-8 internally). On the other hand, Java distinguishes 16-bit chars
>from 8-bit bytes. Python is currently in the Java camp. This might
>be a good time to make sure that we're still convinced that this is
>the right thing to do!

I don't know. There's certainly the fundamental dichotomy
that strings are sometimes used to represent characters, where
changing encodings on input and output is reasonably, and sometimes
used to hold chunks of binary data, where any changes are incorrect.
Perhaps Paul Prescod is right, and we should try to get some other
data type (array.array()) for holding binary data, as distinct from
strings.

>I'm sure that if it's good code, we'll find a way. Perhaps a more
>interesting question is whether it is Perl5 compatible. I contacted
>Henry Spencer at the time and he was willing to let us use his code.

Mostly Perl-compatible, though it doesn't look like the 5.005
features are there, and I haven't checked for every single 5.004
feature. Adding missing features might be problematic, because I
don't really understand what the code is doing at a high level. Also,
is there a user community for this code? Do any other projects use
it? Philip Hazel has been quite helpful with PCRE, an important thing
when making modifications to the code.

Should I make a point of looking at what using the Spencer
engine would entail? It might not be too difficult (an evening or
two, maybe?) to write a re.py that sat on top of the Spencer code;
that would at least let us do some benchmarking.

--
A.M. Kuchling http://starship.python.net/crew/amk/
In Einstein's theory of relativity the observer is a man who sets out in quest
of truth armed with a measuring-rod. In quantum theory he sets out with a
sieve.
-- Sir Arthur Eddington

Re: Why Foo is better than Baz [ In reply to ]

guido at CNRI

May 3, 1999, 9:02 AM

Post #5 of 12 (3737 views)

Permalink

> Should I make a point of looking at what using the Spencer
> engine would entail? It might not be too difficult (an evening or
> two, maybe?) to write a re.py that sat on top of the Spencer code;
> that would at least let us do some benchmarking.

Surely this would be more helpful than weeks of specilative emails --
go for it!

--Guido van Rossum (home page: http://www.python.org/~guido/)

Re: Why Foo is better than Baz [ In reply to ]

fredrik at pythonware

May 3, 1999, 10:10 AM

Post #6 of 12 (3744 views)

Permalink

> Also, is there a user community for this code?

how about comp.lang.tcl ;-)

</F>

Re: Why Foo is better than Baz [ In reply to ]

fredrik at pythonware

May 3, 1999, 10:15 AM

Post #7 of 12 (3745 views)

Permalink

talking about regexps, here's another thing that
would be quite nice to have in 1.6 (available from
the Python level, that is). or is it already in there
somewhere?

</F>

...

http://www.dejanews.com/[ST_rn=qs]/getdoc.xp?AN=464362873

Tcl 8.1b3 Request: Generated by Scriptics' bug entry form at

Submitted by: Frederic BONNET
OperatingSystem: Windows 98
CustomShell: Applied patch to the regexp engine (the exec part)
Synopsis: regexp improvements

DesiredBehavior:
As previously requested by Don Libes:

> I see no way for Tcl_RegExpExec to indicate "could match" meaning
> "could match if more characters arrive that were suitable for a
> match". This is required for a class of applications involving
> matching on a stream required by Expect's interact command. Henry
> assured me that this facility would be in the engine (I'm not the only
> one that needs it). Note that it is not sufficient to add one more
> return value to Tcl_RegExpExec (i.e., 2) because one needs to know
> both if something matches now and can match later. I recommend
> another argument (canMatch *int) be added to Tcl_RegExpExec.

/patch info follows/

...

RE: Why Foo is better than Baz [ In reply to ]

tim_one at email

May 3, 1999, 10:16 PM

Post #8 of 12 (3774 views)

Permalink

[Guido & Andrew on Tcl's new regexp code]
> I'm sure that if it's good code, we'll find a way. Perhaps a more
> interesting question is whether it is Perl5 compatible. I contacted
> Henry Spencer at the time and he was willing to let us use his code.

Haven't looked at the code, but did read the manpage just now:

http://www.scriptics.com/man/tcl8.1/TclCmd/regexp.htm

WRT Perl5 compatibility, it sez:

Incompatibilities of note include `\b', `\B', the lack of special
treatment for a trailing newline, the addition of complemented
bracket expressions to the things affected by newline-sensitive
matching, the restrictions on parentheses and back references in
lookahead constraints, and the longest/shortest-match (rather than
first-match) matching semantics.

So some gratuitous differences, and maybe a killer: Guido hasn't had much
kind to say about "longest" (aka POSIX) matching semantics. An example from
the page:

(week|wee)(night|knights)
matches all ten characters of `weeknights'

which means it matched 'wee' and 'knights'; Python/Perl match 'week' and
'night'.

It's the *natural* semantics if Andrew's suspicion that it's compiling a DFA
is correct; indeed, it's a pain to get that behavior any other way!

otoh-it's-potentially-very-much-faster-ly y'rs - tim

RE: Why Foo is better than Baz [ In reply to ]

tim_one at email

May 3, 1999, 10:51 PM

Post #9 of 12 (3738 views)

Permalink

[Tim]
> ...
> It's the *natural* semantics if Andrew's suspicion that it's
> compiling a DFA is correct ...

More from the man page:

AREs report the longest/shortest match for the RE, rather than
the first found in a specified search order. This may affect some
RREs which were written in the expectation that the first match
would be reported. (The careful crafting of RREs to optimize the
search order for fast matching is obsolete (AREs examine all possible
matches in parallel, and their performance is largely insensitive to
their complexity) but cases where the search order was exploited to
deliberately find a match which was not the longest/shortest will
need rewriting.)

Nails it, yes? Now, in 10 seconds, try to remember a regexp where this
really matters <wink>.

Note in passing that IDLE's colorizer regexp *needs* to search for
triple-quoted strings before single-quoted ones, else the P/P semantics
would consider """ to be an empty single-quoted string followed by a double
quote. This isn't a case where it matters in a bad way, though! The
"longest" rule picks the correct alternative regardless of the order in
which they're written.

at-least-in-that-specific-regex<0.1-wink>-ly y'rs - tim

Re: Why Foo is better than Baz [ In reply to ]

guido at CNRI

May 4, 1999, 5:26 AM

Post #10 of 12 (3728 views)

Permalink

[Tim]
> So some gratuitous differences, and maybe a killer: Guido hasn't had much
> kind to say about "longest" (aka POSIX) matching semantics.
>
> An example from the page:
>
> (week|wee)(night|knights)
> matches all ten characters of `weeknights'
>
> which means it matched 'wee' and 'knights'; Python/Perl match 'week' and
> 'night'.
>
> It's the *natural* semantics if Andrew's suspicion that it's compiling a DFA
> is correct; indeed, it's a pain to get that behavior any other way!

Possibly contradicting what I once said about DFAs (I have no idea
what I said any more :-): I think we shouldn't be hung up about the
subtleties of DFA vs. NFA; for most people, the Perl-compatibility
simply means that they can use the same metacharacters. My guess is
that people don'y so much translate long Perl regexp's to Python but
simply transport their (always incomplete -- Larry Wall *wants* it
that way :-) knowledge of Perl regexps to Python. My meta-guess is
that this is also Henry Spencer's and John Ousterhout's guess. As for
Larry Wall, I guess he really doesn't care :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)

Re: Why Foo is better than Baz [ In reply to ]

akuchlin at cnri

May 4, 1999, 9:14 AM

Post #11 of 12 (3739 views)

Permalink

Guido van Rossum writes:
>Possibly contradicting what I once said about DFAs (I have no idea
>what I said any more :-): I think we shouldn't be hung up about the
>subtleties of DFA vs. NFA; for most people, the Perl-compatibility
>simply means that they can use the same metacharacters. My guess is

I don't like slipping in such a change to the semantics with
no visible change to the module name or interface. On the other hand,
if it's not NFA-based, then it can provide POSIX semantics without
danger of taking exponential time to determine the longest match.
BTW, there's an interesting reference, I assume to this code, in
_Mastering Regular Expressions_; Spencer is quoted on page 121 as
saying it's "at worst quadratic in text size.".

Anyway, we can let it slide until a Python interface gets written.

--
A.M. Kuchling http://starship.python.net/crew/amk/
In the black shadow of the Baba Yaga babies screamed and mothers miscarried;
milk soured and men went mad.
-- In SANDMAN #38: "The Hunt"

Re: Why Foo is better than Baz [ In reply to ]

guido at CNRI

May 4, 1999, 9:19 AM

Post #12 of 12 (3745 views)

Permalink

> BTW, there's an interesting reference, I assume to this code, in
> _Mastering Regular Expressions_; Spencer is quoted on page 121 as
> saying it's "at worst quadratic in text size.".

Not sure if that was the same code -- this is *new* code, not
Spencer's old code. I think Friedl's book is older than the current
code.

--Guido van Rossum (home page: http://www.python.org/~guido/)