Mailing List Archive

Better text processing support in py2k?
It just occurred to me as I was replying to a request on the main list that
Python's text handling capabilities could be a bit better than they are.
This will probably not come as a revelation to many of you, but I finally
put it together with the standard argument against beefing things up

One fix would be to add regular expressions to the language core and
have special syntax for them, as Perl has done. However, I don't like
this solution because Python is a general-purpose language, and regular
expressions are used for the single application domain of text
processing. For other application domains, regular expressions may be of
no interest, and you might want to remove them to save memory and code
size.

and the observation that Python does support some builtin objects and syntax
that are fairly specific to some much more restricted application domains
than text processing.

I stole the above quote from Andrew Kuchling's Python Warts page, which I
also happened to read earlier today.

What AMK says makes perfect sense until you examine some of the other things
that are in the language, like the Ellipsis object and complex numbers. If
I recall correctly both were added as a result of the NumPy package
development.

I have nothing against ellipses or complex numbers. They are fine first
class objects that should remain in the language. But I have never used
either one in my day-to-day work. On the other hand, I read files and
manipulate them with regular expressions all the time. I rather suspect
that more people use Python for some sort of text processing than any other
single application domain. Python should be good at it.

While I don't want to turn Python into Perl, I would like to see it do a
better job of what most people probably use the language for. Here is a
very short list of things I think need attention:

1. When using something like the simple file i/o idiom

    for line in f.readlines():
        dofunstuff(line)

the programmer should not have to care how big the file is. It
should just work in a reasonably efficient manner without gobbling up
all of memory. I realize this may require some change to the syntax
of the common idiom.

2. The re module needs to be sped up, if not to catch up with Perl, then
to catch up with the deprecated regex module. Depending how far
people want to go with things, adding some language syntax to support
regular expressions might be in order. I don't see that as being as
compelling as adding complex numbers was, however. Another possibility,
now that Barry Warsaw has opened the floodgates, is to add regular
expression methods to strings.

3. I've not yet used it, but I am told the pattern matching in
Marc-Andre Lemburg's mxTextTools
(http://starship.python.net/crew/lemburg/) is both powerful and
efficient (though it certainly appears complex). Perhaps it deserves
consideration for incorporation into the core Python distribution.

I'm sure other people will come up with other suggestions.

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098 | Python: Programming the way Guido indented...
Re: Better text processing support in py2k? [ In reply to ]
Skip Montanaro writes:
>What AMK says makes perfect sense until you examine some of the other things
>that are in the language, like the Ellipsis object and complex numbers. If
>I recall correctly both were added as a result of the NumPy package
>development.

True, but note that you can compile Python with WITHOUT_COMPLEX
defined to remove complex numbers.

> 1. When using something like the simple file i/o idiom
>     for line in f.readlines():
>         dofunstuff(line)
> the programmer should not have to care how big the file is.

What about 'for line in fileinput.input()', which already exists?
(Hmmm... if you have an already open file object, I don't think you
can pass it to fileinput.input(); maybe that should be fixed.)

On a vaguely related note, since there are many things like parser
generators and XML stuff and mxTextTools, I've been speculating about
a text processing topic guide. If you know of Python packages related
to text processing, please send me a private e-mail with a link.

--
A.M. Kuchling http://starship.python.net/crew/amk/
Constraints often boost creativity.
-- Jim Hugunin, 11 Feb 1999
Re: Better text processing support in py2k? [ In reply to ]
Andrew> True, but note that you can compile Python with WITHOUT_COMPLEX
Andrew> defined to remove complex numbers.

That's true, but that wasn't my point. I'm not arguing for or against space
efficiency, just that the rather timeworn argument about not doing
anything special to support text processing because Python is a general
purpose language is a red herring.

>> 1. When using something like the simple file i/o idiom
>>     for line in f.readlines():
>>         dofunstuff(line)
>> the programmer should not have to care how big the file is.

Andrew> What about 'for line in fileinput.input()', which already
Andrew> exists? (Hmmm... if you have an already open file object, I
Andrew> don't think you can pass it to fileinput.input(); maybe that
Andrew> should be fixed.)

Well, a couple reasons jump to mind:

1. fileinput.FileInput isn't particularly efficient. At its heart, its
__getitem__ method makes a simple readline() call instead of buffering
some amount of readlines(sizehint) bytes. This can be fixed, but I'm
not sure what would happen to its semantics.

2. As you pointed out, it's not all that general.

My point, not at all well stated, is that the programmer shouldn't have to
worry (much?) about the conditions under which he does file i/o. Right
now, if I know the file is small(ish), I can do

    for line in f.readlines():
        dofunstuff(line)

but I have to know that the file won't be big, because readlines() will
behave badly (perhaps even generate a MemoryError exception) if the file is
large. In that case, I have to fall back to the safer (and slower)

    line = f.readline()
    while line:
        dofunstuff(line)
        line = f.readline()

or the more efficient, but more cumbersome

    lines = f.readlines(sizehint)
    while lines:
        for line in lines:
            dofunstuff(line)
        lines = f.readlines(sizehint)

That's three separate idioms the programmer has to be aware of when writing
code to read a text file based upon the perceived need for speed, memory
usage and desired clarity:

fast/memory-intensive/clear
slow/memory-conserving/not-as-clear
fast/memory-conserving/fairly-muddy

Any particular reason that the readline method can't return an iterator that
supports __getitem__ and buffers input? (Again, remember this is for py2k,
so the potential breakage such a change might cause is a consideration, but
not a showstopper.)
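
For concreteness, here is a rough sketch of the kind of object I have in
mind (the class name, the default sizehint and the details are all made
up -- this is an illustration, not a worked-out proposal). It buffers
readlines(sizehint) chunks and hands lines out through __getitem__, so a
plain for loop works on it:

class BufferedLines:
    def __init__(self, file, sizehint=65536):
        self.file = file
        self.sizehint = sizehint
        self.lines = []         # current chunk of lines
        self.base = 0           # index of self.lines[0] within the file

    def __getitem__(self, i):
        # the for-loop protocol calls this with i = 0, 1, 2, ...
        while i - self.base >= len(self.lines):
            chunk = self.file.readlines(self.sizehint)
            if not chunk:
                raise IndexError        # signals end of iteration
            self.base = self.base + len(self.lines)
            self.lines = chunk
        return self.lines[i - self.base]

# f = open("somefile")
# for line in BufferedLines(f):
#     dofunstuff(line)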

Andrew> On a vaguely related note, since there are many things like
Andrew> parser generators and XML stuff and mxTextTools, I've been
Andrew> speculating about a text processing topic guide. If you know of
Andrew> Python packages related to text processing, please send me a
Andrew> private e-mail with a link.

This sounds like a good idea to me.

Skip Montanaro | http://www.mojam.com/
skip@mojam.com | http://www.musi-cal.com/
847-971-7098 | Python: Programming the way Guido indented...
Re: Better text processing support in py2k? [ In reply to ]
--- Skip Montanaro <skip@mojam.com> wrote:
> fast/memory-intensive/clear
> slow/memory-conserving/not-as-clear
> fast/memory-conserving/fairly-muddy
>
> Any particular reason that the readline method can't
> return an iterator that
> supports __getitem__ and buffers input? (Again,
> remember this is for py2k,
> so the potential breakage such a change might cause
> is a consideration, but
> not a showstopper.)

Why not generalize fileinput to do buffering instead?

More generally, Java has the notion of 'stackable
streams' - e.g. construct a 'BufferedFile' around a
'File', maybe construct a 'Line-oriented file' around
that etc. Each one takes a file-like object as an
argument to the constructor. Things you might want to
do:
- buffering
- international encoding conversions
- line delimiters other than CR/LF/CRLF
- read/write Python objects (i.e. use pickle/marshal)
- easy interfaces to parsers

This took me a couple of hours to get used to (and at
the time I thought 'Yuk!' when I first saw four
nested constructors), but gives you very precise
control and a lot of versatility when handling files.
It's an idiom Python does not use much but maybe it
should.
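
To make that concrete, here is a rough Python
sketch of two such layers (all the names are
invented for illustration; each layer only assumes
that the thing underneath it has a read() method):

class BufferedStream:
    "Chunked read-ahead on top of any object with a read() method."
    def __init__(self, stream, bufsize=8192):
        self.stream = stream
        self.bufsize = bufsize
        self.buf = ""

    def read(self, n):
        while len(self.buf) < n:
            chunk = self.stream.read(self.bufsize)
            if not chunk:
                break                       # underlying EOF
            self.buf = self.buf + chunk
        result, self.buf = self.buf[:n], self.buf[n:]
        return result

class LineStream:
    "Line-oriented view of any read()-able stream."
    def __init__(self, stream, delimiter="\n"):  # single-char delimiter
        self.stream = stream
        self.delimiter = delimiter

    def readline(self):
        chars = []
        while 1:
            ch = self.stream.read(1)
            if not ch:
                break                       # EOF: return what we have
            chars.append(ch)
            if ch == self.delimiter:
                break
        return "".join(chars)

# The nested-constructor idiom:
#     lines = LineStream(BufferedStream(open("data.txt")), "\n")
#     while 1:
#         line = lines.readline()
#         if not line: break
#         process(line)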

I'd argue that maybe some enhancements to fileinput.py
- adding some streams to provide building blocks for
these operations - would get us the power you want and
a lot more versatility besides.





=====
Andy Robinson
Robinson Analytics Ltd.
------------------
My opinions are the official policy of Robinson Analytics Ltd.
They just vary from day to day.

Re: Better text processing support in py2k? [ In reply to ]
Andy Robinson wrote:
>
> --- Skip Montanaro <skip@mojam.com> wrote:
> > fast/memory-intensive/clear
> > slow/memory-conserving/not-as-clear
> > fast/memory-conserving/fairly-muddy
> >
> > Any particular reason that the readline method can't
> > return an iterator that
> > supports __getitem__ and buffers input? (Again,
> > remember this is for py2k,
> > so the potential breakage such a change might cause
> > is a consideration, but
> > not a showstopper.)
>
> Why not generalize fileinput to do buffering instead?
>
> More generally, Java has the notion of 'stackable
> streams' - e.g. construct a 'BufferedFile' around a
> 'File', maybe construct a 'Line-oriented file' around
> that etc. Each one takes a file-like object as an
> argument to the constructor. Things you might want to
> do:
> - buffering
> - international encoding conversions
> - line delimiters other than CR/LF/CRLF
> - read/write Python objects (i.e. use pickle/marshal)
> - easy interfaces to parsers

If all goes well we'll have something like this
in Python 1.6, at least for the encoding/decoding
part of file reading and writing. You basically take
a file object and then wrap some StreamCodecs around
it to get the functionality you need. Very simple
and very intuitive.
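
Roughly like this (a sketch only -- the exact names are
still in flux, so take getreader() and friends below as
one possible spelling of that API):

    import codecs

    f = open("data.txt", "rb")              # raw byte stream
    reader = codecs.getreader("utf-8")(f)   # wrap a decoding StreamReader around it
    for line in reader.readlines():
        pass                                # each line comes back as Unicode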

> This took me a couple of hours to get used to (and at
> the time I thought 'Yuk!' when I saw first saw four
> nested constructors), but gives you very precise
> control and a lot of versatility when handling files.
> It's an idiom Python does not use much but maybe it
> should.
>
> I'd argue that maybe some enhancements to fileinput.py
> - adding some streams to provide building blocks for
> these operations - would get us the power you want and
> a lot more versatility besides.

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: Get ready to party !
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
RE: Better text processing support in py2k? [ In reply to ]
[Skip Montanaro, wants nicer text facilities]
> ...
> I rather suspect that more people use Python for some sort of
> text processing than any other single application domain.

Hmm. You're probably right, but I'm an exception.

> Python should be good at it.

And I guess I'm an exception mostly *because* Perl is better at easy text
crunching and Icon is better at hard text-crunching -- that is, I use the
right tool for the job <wink>.

> While I don't want to turn Python into Perl, I would like to see
> it do a better job of what most people probably use the language
> for. Here is a very short list of things I think need attention:
>
> 1. [*A* clear way to do memory- and time-efficient textfile
> input]

I agree, but unsure how to fix it. The best way to write this now is

    # f is some open file object.
    while 1:
        lines = f.readlines(BUFSIZE)
        if not lines:
            break
        for line in lines:
            process(line)

and it's not something anyone figures out on their own -- or enjoys typing
or explaining afterwards.

Perl gets its line-at-a-time speed by peeking and poking C FILE structs
directly in compiler- and platform-specific ways -- ways that vendors
*should* have done in their own fgets implementations, but almost never do.
I have no idea whether it works well with Perl's nascent notions of
threading, but in the absence of that "the system" doesn't know Perl is
cheating (i.e., as far as libc+friends are concerned, Perl *is* reading one
line at a time -- even mixing in C-level ungetc calls works (well, sometimes
<0.1 wink -- they don't always peek and poke enough fields>)).

The Python QIO extension module is much easier to port but less compatible
(it doesn't use stdio, so QIO-opened files don't play well with others) and
slower (although that's likely repairable -- he's got two passes over the
buffer where one hairier pass should suffice).

> 2. The re module needs to be sped up, if not to catch up with
> Perl, then to catch up with the deprecated regex module.

The irony here is that the re engine is very often unboundedly faster than
the regex engine -- provided you're chewing over large strings. Some tests
/F ran showed that the length-independent *overhead* of invoking re is about
10x higher than for regex. Presumably the bulk of that is due to re.py,
i.e. that you get to the re engine via going thru Python layers on your way
in and out, while regex was pure C.

In any case, /F is working on a new engine (for Unicode), and I believe he
has this all well in hand.

> Depending how far people want to go with things, adding some
> language syntax to support regular expressions might be in order.
> ...
> 3. I've not yet used it, but I am told the pattern matching in
> Marc-Andre Lemburg's mxTextTools
> (http://starship.python.net/crew/lemburg/)
> is both powerful and efficient (though it certainly appears
> complex). Perhaps it deserves consideration for
> incorporation into the core Python distribution.

It's not complex, it's complicated -- and *that's* what makes it un-Pythonic
<wink>. Tony Ibbs has written a friendly wrapper around mxTextTools that
suppresses much of the non-essential complication. OTOH, if you go into
this with a regexp mindset, it will run much slower than a real regexp
package, because the bulk of the latter is devoted to doing optimization;
mxTextTools is WYSIWYG (it screams if you code to its strengths, but crawls
if you e.g. try to implement naive backtracking).

You should go to the REBOL site and look at the description of REBOL's PARSE
verb in the FAQ ... mumble, mumble ... at

http://www.rebol.com/faq.html#11550948

Here's an example pulled from that page (this is a REBOL code fragment):

digit: charset "0123456789"
expr: [term ["+" | "-"] expr | term]
term: [factor ["*" | "/"] term | factor]
factor: [primary "**" factor | primary]
primary: [value | "(" expr ")"]
value: [digit value | digit]

parse "1 + 2 ** 9" expr

There hasn't been a pattern scheme this clean, convenient or powerful since
SNOBOL4. It exploits REBOL's Forth-like (lack of!) syntax, and
Smalltalk-like penchant for passing around thunks (anonymous closures --
"[...]" in REBOL builds a lexically-scoped entity called "a block", which
can be treated as code (executed) or data (manipulated like a Python list)
at will).

Now the example doesn't show this, but you can freely mix computations into
the middle of the patterns; only *some* of the words in the blocks have
special meaning to PARSE. The fragment above is already way beyond what can
be accomplished with regexps, but that's just the start of it. Perl too is
slamming in more & more ways to get user code to interact with its regexp
engine.

So REBOL has a *very* nice approach to this; I believe it's unreasonably
clumsy to mimic in Python primarily because of forward references (note e.g.
that the block attached to "expr" above refers to "term" before the latter
has been bound -- but the stuff inside [...] is just a closure so that
doesn't matter -- it only matters that term gets bound before expr is
*executed*). I hit a similar snag years ago when trying to mimic SNOBOL4's
approach in Python.

Perl's endless abuse of regexps is making that language more absurd by the
month.

The other major approach to mixing patterns with computation is due to Icon,
another language where a regexp mindset is fatal. On a whim, I whipped up
the attached, which illustrates a bit of the Icon approach in Pythonic terms
(but without language support for generators, the *heart* of it can't really
be captured). Here's an example of how this could be used to implement (the
simplest form of) string.split:

def mysplit(str):
    s = Searcher(str)
    white = CharSet(" \t\n")
    result = []
    s.many(white)                   # consume initial whitespace
    while s.notmany(white):         # consume non-whitespace
        result.append(s.get_match())
        s.many(white)
    return result

>>> mysplit(" \t Hey, that's\tpretty\n\n neat! ")
['Hey,', "that's", 'pretty', 'neat!']
>>>

The primary thing to note is that there's no seam between analyzing the
string and doing computation on the partial results -- "the program is the
pattern". This is what Icon does to perfection, Perl is moving toward, and
REBOL is arriving at from a different direction. It's The Future <0.9
wink>.

Without generators it's difficult to work backtracking into the Searcher
class, but, as above, in my experience the backtracking feature of regexps
is rarely *needed*! For example, at various points "split" wants to suck up
all the whitespace characters, and that's *it* -- the backtracking
possibility in the regexp \s+ is often a bug just waiting for unexpected
*context* to trigger it. A hairy regexp is pure hell; but what simpler
regexps can do don't require all that funky regexp machinery.

BTW, the mxTextTools engine could be used to get blazing implementations of
the primary Searcher methods (it excels at simple analysis). OTOH, making
lots of calls to analyze short strings is slow. The only clean solutions to
that are Perl's and Icon's (build everything into one language so the
compiler can optimize stuff away), and REBOL's (make no distinction between
code and data, so that code can be analyzed & optimized at runtime -- and
build the entire implementation around making closures and calls
supernaturally fast).

the-less-you-use-regexps-the-less-you-miss-'em<wink>-ly y'rs - tim

class CharSet:
    def __init__(self, seq):
        self.seq = seq
        d = {}
        for ch in seq:
            d[ch] = 1
        self.haskey = d.has_key

    def __call__(self, ch):
        return self.haskey(ch)

    def __add__(self, other):
        if isinstance(other, CharSet):
            other = other.seq
        return CharSet(self.seq + other)

def _normalize_index(i, n):
    assert n >= 0
    if i >= 0:
        return min(i, n)
    elif n == 0:
        return 0
    # want smallest q s.t. i + q*n >= 0
    # <-> q*n >= -i
    # <-> q >= -i/n
    # so q = ceiling(-i/n) = -floor(i/n)
    return i - (i/n)*n

class Searcher:
    def __init__(self, str, lo=0, hi=None):
        """Create object to search in str[lo:hi].

        lo defaults to 0.
        hi defaults to len(str).
        len(str) is repeatedly added to negative lo or hi until
        reaching a number >= 0.
        If lo > hi, a uselessly empty slice will be searched.
        The search cursor is initialized to lo.
        """

        self.s = str
        self.lo = _normalize_index(lo, len(str))
        if hi is None:
            self.hi = len(str)
        else:
            self.hi = _normalize_index(hi, len(str))
        if self.lo > self.hi:
            self.hi = self.lo
        self.i = self.lo
        self.lastmatch = None, None

    def any(self, charset, consume=1):
        """Try to match single character in charset.

        Return true iff match succeeded.
        Advance cursor iff success and optional arg "consume" is true.
        """

        i = self.i
        if i < self.hi and charset(self.s[i]):
            if consume:
                self.__consume(i+1)
            return 1
        return 0

    def notany(self, charset, consume=1):
        """Try to match single character not in charset.

        Return true iff match succeeded.
        Advance cursor iff success and optional arg "consume" is true.
        """

        i = self.i
        if i < self.hi and not charset(self.s[i]):
            if consume:
                self.__consume(i+1)
            return 1
        return 0

    def many(self, charset, consume=1):
        """Try to match one or more characters in charset.

        Return true iff match succeeded.
        Advance cursor iff success and optional arg "consume" is true.
        """

        i, n, s = self.i, self.hi, self.s
        j = i
        while j < n and charset(s[j]):
            j = j+1
        if i < j:
            if consume:
                self.__consume(j)
            return 1
        return 0

    def notmany(self, charset, consume=1):
        """Try to match one or more characters not in charset.

        Return true iff match succeeded.
        Advance cursor iff success and optional arg "consume" is true.
        """

        i, n, s = self.i, self.hi, self.s
        j = i
        while j < n and not charset(s[j]):
            j = j+1
        if i < j:
            if consume:
                self.__consume(j)
            return 1
        return 0

    def match(self, str, consume=1):
        """Try to match string "str".

        Return true iff match succeeded.
        Advance cursor iff success and optional arg "consume" is true.
        """

        i = self.i
        j = i + len(str)
        if self.s[i:j] == str:
            if consume:
                self.__consume(j)
            return 1
        return 0

    def get_str(self):
        """Return subject string."""
        return self.s

    def get_lo(self):
        """Return low slice bound."""
        return self.lo

    def get_hi(self):
        """Return high slice bound."""
        return self.hi

    def get_pos(self):
        """Return current value of search cursor."""
        return self.i

    def get_match_indices(self):
        """Return slice indices of last "consumed" match."""
        return self.lastmatch

    def get_match(self):
        """Return last "consumed" matching substring."""
        i, j = self.lastmatch
        if i is None:
            raise ValueError("no match to return!")
        return self.s[i:j]

    def set_pos(self, pos, consume=1):
        """Set search cursor to new value. No return value.

        If optional arg "consume" is true, the last match is set to
        the slice between pos and the current cursor position.
        """

        p = _normalize_index(pos, len(self.s))
        if not self.lo <= p <= self.hi:
            raise ValueError("pos out of bounds: " + `pos`)
        if consume:
            self.__consume(p)
        else:
            self.i = p

    def move_pos(self, incr, consume=1):
        """Move the cursor by incr characters. No return value.

        If the new value is outside the slice bounds, it's clipped.
        If optional arg "consume" is true, the last match is set to
        the slice between the old and new cursor positions.
        """

        newi = self.i + incr
        if newi < self.lo:
            newi = self.lo
        elif newi > self.hi:
            newi = self.hi
        if consume:
            self.__consume(newi)
        else:
            self.i = newi

    def __consume(self, newi):
        i, j = self.i, newi
        if i > j:
            i, j = j, i
        self.lastmatch = i, j
        self.i = newi
Re: Better text processing support in py2k? [ In reply to ]
Tim Peters is back from his vacation:
> > While I don't want to turn Python into Perl, I would like to see
> > it do a better job of what most people probably use the language
> > for. Here is a very short list of things I think need attention:
> >
> > 1. [*A* clear way to do memory- and time-efficient textfile
> > input]
>
> I agree, but unsure how to fix it. The best way to write this now is
>
>     # f is some open file object.
>     while 1:
>         lines = f.readlines(BUFSIZE)
>         if not lines:
>             break
>         for line in lines:
>             process(line)
>
> and it's not something anyone figures out on their own -- or enjoys typing
> or explaining afterwards.
>
> Perl gets its line-at-a-time speed by peeking and poking C FILE structs
> directly in compiler- and platform-specific ways -- ways that vendors
> *should* have done in their own fgets implementations, but almost never do.
> I have no idea whether it works well with Perl's nascent notions of
> threading, but in the absence of that "the system" doesn't know Perl is
> cheating (i.e., as far as libc+friends are concerned, Perl *is* reading one
> line at a time -- even mixing in C-level ungetc calls works (well, sometimes
> <0.1 wink -- they don't always peek and poke enough fields>)).
>
> The Python QIO extension module is much easier to port but less compatible
> (it doesn't use stdio, so QIO-opened files don't play well with others) and
> slower (although that's likely repairable -- he's got two passes over the
> buffer where one hairier pass should suffice).

we have something called SIO which uses memory mapping
where possible, and just a more aggressive read-ahead for
other cases. on a windows box, a traditional while/readline
loop runs 3-5 times faster than before. with SRE instead of
re, a while/readline/match loop runs up to 10 times faster
than before.

note that this is without *any* changes to the Python
source code...

> > 2. The re module needs to be sped up, if not to catch up with
> > Perl, then to catch up with the deprecated regex module.
>
> The irony here is that the re engine is very often unboundedly faster than
> the regex engine -- provided you're chewing over large strings. Some tests
> /F ran showed that the length-independent *overhead* of invoking re is about
> 10x higher than for regex. Presumably the bulk of that is due to re.py,
> i.e. that you get to the re engine via going thru Python layers on your way
> in and out, while regex was pure C.

I've attached some old benchmarks. I think the current code
base is a bit faster, but you get the idea.

> In any case, /F is working on a new engine (for Unicode), and I believe he
> has this all well in hand.

with a little luck, the new module will replace both pcre
and regex...

not to mention that it's fairly easy to write your own front-
end to the matching engine -- the expression parser and the
compiler are both written in good old python.

</F>

$ python sre_bench.py
0 5 50 250 1000 5000 25000
----- ----- ----- ----- ----- ----- ----- -----
search for Python|Perl in Perl ->
sre8 0.007 0.008 0.010 0.010 0.020 0.073 0.349
sre16 0.007 0.007 0.008 0.010 0.020 0.075 0.353
re 0.097 0.097 0.101 0.103 0.118 0.175 0.480
regex 0.007 0.007 0.009 0.020 0.059 0.271 1.320

search for (Python|Perl) in Perl ->
sre8 0.007 0.007 0.007 0.010 0.020 0.074 0.344
sre16 0.007 0.007 0.008 0.010 0.020 0.074 0.347
re 0.110 0.104 0.111 0.115 0.125 0.184 0.559
regex 0.006 0.006 0.009 0.019 0.057 0.285 1.432

search for Python in Python ->
sre8 0.007 0.007 0.007 0.011 0.021 0.072 0.387
sre16 0.007 0.007 0.008 0.010 0.022 0.082 0.365
re 0.107 0.097 0.105 0.102 0.118 0.175 0.511
regex 0.009 0.008 0.010 0.018 0.036 0.139 0.708

search for .*Python in Python ->
sre8 0.008 0.007 0.008 0.011 0.021 0.079 0.379
sre16 0.008 0.008 0.008 0.011 0.022 0.075 0.402
re 0.102 0.108 0.119 0.183 0.400 1.545 7.284
regex 0.013 0.019 0.072 0.318 1.231 8.035 45.366

search for .*Python.* in Python ->
sre8 0.008 0.008 0.008 0.011 0.021 0.080 0.383
sre16 0.008 0.008 0.008 0.011 0.021 0.079 0.395
re 0.103 0.108 0.119 0.184 0.418 1.685 8.378
regex 0.013 0.020 0.073 0.326 1.264 9.961 46.511

search for .*(Python) in Python ->
sre8 0.007 0.008 0.008 0.011 0.021 0.077 0.378
sre16 0.007 0.008 0.008 0.011 0.021 0.077 0.444
re 0.108 0.107 0.134 0.240 0.637 2.765 13.395
regex 0.026 0.112 3.820 87.322 (skipped)

search for .*P.*y.*t.*h.*o.*n.* in Python ->
sre8 0.010 0.010 0.014 0.031 0.093 0.419 2.212
sre16 0.010 0.011 0.014 0.030 0.093 0.419 2.292
re 0.112 0.121 0.195 0.521 1.747 8.298 40.877
regex 0.026 0.048 0.248 1.148 4.550 24.720 ...

(searching for patterns in padded strings; sre8
is the sre engine compiled for 8-bit characters,
sre16 is the same engine compiled for 16-bit
characters)
Re: Better text processing support in py2k? [ In reply to ]
Tim Peters wrote:
>
> [Skip Montanaro, wants nicer text facilities]
> > While I don't want to turn Python into Perl, I would like to see
> > it do a better job of what most people probably use the language
> > for. Here is a very short list of things I think need attention:
> >
> > 1. [*A* clear way to do memory- and time-efficient textfile
> > input]
>
> ...
>
> The Python QIO extension module is much easier to port but less compatible
> (it doesn't use stdio, so QIO-opened files don't play well with others) and
> slower (although that's likely repairable -- he's got two passes over the
> buffer where one hairier pass should suffice).

What is QIO ?

> > Depending how far people want to go with things, adding some
> > language syntax to support regular expressions might be in order.
> > ...
> > 3. I've not yet used it, but I am told the pattern matching in
> > Marc-Andre Lemburg's mxTextTools
> > (http://starship.python.net/crew/lemburg/)
> > is both powerful and efficient (though it certainly appears
> > complex). Perhaps it deserves consideration for
> > incorporation into the core Python distribution.
>
> It's not complex, it's complicated -- and *that's* what makes it un-Pythonic
> <wink>. Tony Ibbs has written a friendly wrapper around mxTextTools that
> suppresses much of the non-essential complication. OTOH, if you go into
> this with a regexp mindset, it will run much slower than a real regexp
> package, because the bulk of the latter is devoted to doing optimization;
> mxTextTools is WYSIWYG (it screams if you code to its strengths, but crawls
> if you e.g. try to implement naive backtracking).

All true. mxTextTools provides the tools, not the magic. But this
is also its strength: you can optimize the hell out of your particular
parsing requirement without having to think about how the RE optimizer
works.

> You should go to the REBOL site and look at the description of REBOL's PARSE
> verb in the FAQ ... mumble, mumble ... at
>
> http://www.rebol.com/faq.html#11550948
>
> Here's an example pulled from that page (this is a REBOL code fragment):
>
> digit: charset "0123456789"
> expr: [term ["+" | "-"] expr | term]
> term: [factor ["*" | "/"] term | factor]
> factor: [primary "**" factor | primary]
> primary: [value | "(" expr ")"]
> value: [digit value | digit]
>
> parse "1 + 2 ** 9" expr
>
> There hasn't been a pattern scheme this clean, convenient or powerful since
> SNOBOL4. It exploits REBOL's Forth-like (lack of!) syntax, and
> Smalltalk-like penchant for passing around thunks (anonymous closures --
> "[...]" in REBOL builds a lexically-scoped entity called "a block", which
> can be treated as code (executed) or data (manipulated like a Python list)
> at will).

Looks nice indeed, but how does executable code fit into
that definition ? (mxTextTools allows you to write your own
parsing elements in Python, BTW; it should be possible to
use those mechanisms to achieve a similar integration.)

> ...
>
> BTW, the mxTextTools engine could be used to get blazing implementations of
> the primary Searcher methods (it excels at simple analysis). OTOH, making
> lots of calls to analyze short strings is slow.

That's why mxTextTools converts these search idioms into byte codes
which it executes at C level. Some future version will even "precompile"
the tuple input and then omit the type checks during the search...
that should give another noticeable speedup. Note that recursion
etc. can be done at C level too -- Python function calls are not
needed.

> The only clean solutions to
> that are Perl's and Icon's (build everyting into one language so the
> compiler can optimize stuff away), and REBOL's (make no distinction between
> code and data, so that code can be analyzed & optimized at runtime -- and
> build the entire implementation around making closures and calls
> supernaturally fast).

Just for kicks, here is the mysplit() function using mxTextTools:

from mx.TextTools import *

table = (
    # Match all whitespace
    (None,AllInSet,whitespace_set,+1),
    # Match and tag all non-whitespace
    ('text',AllInSet + AppendMatch,nonwhitespace_set,+1),
    # Loop until EOF
    (None,EOF,Here,-2),
)

def mysplit(text):

    return tag(text,table)[1]

The timings:
mysplit: 5.84 sec.
string.split: 3.62 sec.

Note that you can customize the above to split text at any
character set you like, not just whitespace... without
compiling or writing C code. The function mx.TextTools.setsplit()
provides this functionality as a pure C function.
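
For example (just a sketch, going by the documented
setsplit(text, set[, start, stop]) signature and the predefined
whitespace_set used above):

    from mx.TextTools import setsplit, whitespace_set

    words = setsplit(" \t Hey, that's\tpretty\n\n neat! ", whitespace_set)
    # should give ['Hey,', "that's", 'pretty', 'neat!']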

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: Get ready to party !
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
RE: Better text processing support in py2k? [ In reply to ]
[M.-A. Lemburg]
> What is QIO ?

See DejaNews (I don't save URLs). "Quick" line-oriented text input adapted
from INN. Someone rewrote that as a Python extension module.

>> http://www.rebol.com/faq.html#11550948

> Looks nice indeed, but how does executable code fit into
> that definition ?

See the URL above I didn't save <wink>. PARSE's "pattern" argument is a
block. Blocks can be (& often are) nested. Whether any given block is code
or data is all the same to REBOL, so passing nested code blocks in PARSE's
pattern argument is easy. Because blocks are lexically scoped, assignments
(etc) inside a block are (well, can be) visible to its context; etc. It's a
very Lispish approach. REBOL is essentially Scheme under the covers, but
with syntax much more like Forth's (whitespace-separated strings of
arbitrary non-whitespace characters, with few pre-assigned meanings or
restrictions -- in fact, it's impossible for a compiler to determine where a
REBOL function call begins or ends! can't be known until runtime).

> (mxTextTools allows you to write your own parsing elements
> in Python, BTW; it should be possible to use those mechanisms
> to achieve a similar integration.)

It can't capture the flavor -- although I don't know that it needs to
<wink>. There's no distinction between "the pattern language" and "the
computational language" in REBOL or Icon, and it's hard to explain what a
maddening distinction that can be once you've lived without it. mxTextTools
embedding would feel more like Icon, where the matching engine is fully
exposed to the programmer (REBOL hides it, allowing only "approved"
interactions).

>> OTOH, making lots of calls to analyze short strings is slow.

> That's why mxTextTools converts these search idioms into byte
> codes which it executes at C level. Some future version will
> even "precompile" the tuple input and then omit the type checks
> during the search...that should give another noticeable speedup.
> Note that recursion etc. can be done at C level too -- Python
> function calls are not needed.

That's also the curse of having distinct languages; e.g., Python already had
recursion, but you needed to reimplement it in a different way with
different syntax and different rules in your pattern language. In Icon etc,
there's no difference between a recursive pattern and a recursive function,
except in *what* it computes. The machinery is all the same, and both more
powerful and easier to learn because of that.

> ...
> Just for kicks, here is the mysplit() function using mxTextTools:
>
> from mx.TextTools import *
>
> table = (
>     # Match all whitespace
>     (None,AllInSet,whitespace_set,+1),
>     # Match and tag all non-whitespace
>     ('text',AllInSet + AppendMatch,nonwhitespace_set,+1),
>     # Loop until EOF
>     (None,EOF,Here,-2),
> )
>
> def mysplit(text):
>
>     return tag(text,table)[1]
>
> The timings:
> mysplit: 5.84 sec.
> string.split: 3.62 sec.
>
> Note that you can customize the above to split text at any
> character set you like, not just whitespace... without
> compiling or writing C code.

That's equally true of the example I posted <wink>. Now what if I wanted to
stop splitting right after I find a keyword, recognized as such because it's
a key in some passed-in dictionary? In my example, I make an obvious local
code change, from

    while s.notmany(white):         # consume non-whitespace
        result.append(s.get_match())
        s.many(white)

to

    while s.notmany(white):         # consume non-whitespace
        word = s.get_match()
        result.append(word)
        if dictionary.has_key(word):
            break
        s.many(white)

What does it do to your example? Or what if the target string isn't "a
string" (the code I posted only assumes the "str" object responds to
indexing and slicing -- any buffer object is fine -- so my example doesn't
change at all)? Or what if you need to pass the tokens on as they're found,
pipeline style? Etc. This is why I do complex string processing in Icon
<0.9 wink>.
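
For instance, a pipeline-style split needs only a small tweak to the
mysplit above -- hand each token to a caller-supplied callable instead of
building a list (the "sink" argument is made up for illustration):

def mysplit_pipe(str, sink):
    s = Searcher(str)
    white = CharSet(" \t\n")
    s.many(white)                   # consume initial whitespace
    while s.notmany(white):         # consume non-whitespace
        sink(s.get_match())         # pass the token on immediately
        s.many(white)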

OTOH, at what it does well, mxTextTools runs quicker than Icon. Its biggest
problem has always been that e.g. nobody knows what the hell

(None,EOF,Here,-2),

*means* at first glance -- or third <wink>.

an-extreme-on-the-transparency-vs-speed-curve-ly y'rs - tim
Re: Better text processing support in py2k? [ In reply to ]
Tim Peters wrote:
>
> [M.-A. Lemburg]
> > What is QIO ?
>
> See DejaNews (I don't save URLs). "Quick" line-oriented text input adapted
> from INN. Someone rewrote that as a Python extension module.

Ok, thanks.

> >> http://www.rebol.com/faq.html#11550948
>
> > Looks nice indeed, but how does executable code fit into
> > that definition ?
>
> See the URL above I didn't save <wink>. PARSE's "pattern" argument is a
> block. Blocks can be (& often are) nested. Whether any given block is code
> or data is all the same to REBOL, so passing nested code blocks in PARSE's
> pattern argument is easy. Because blocks are lexically scoped, assignments
> (etc) inside a block are (well, can be) visible to its context; etc. It's a
> very Lispish approach. REBOL is essentially Scheme under the covers, but
> with syntax much more like Forth's (whitespace-separated strings of
> arbitrary non-whitespace characters, with few pre-assigned meanings or
> restrictions -- in fact, it's impossible for a compiler to determine where a
> REBOL function call begins or ends! can't be known until runtime).

If I understand the concept correctly, I think Python could do
pretty much the same thing. The bummer is of course the need
for new keywords and byte codes (although these could be
split out into a separate text scanning engine). Using Python
function calls would slow down things to an extent that would
render the added functionality useless, well IMHO anyways ;-)

> > (mxTextTools allows you to write your own parsing elements
> > in Python, BTW; it should be possible to use those mechanisms
> > to achieve a similar intergration.)
>
> It can't capture the flavor -- although I don't know that it needs to
> <wink>. There's no distinction between "the pattern language" and "the
> computational language" in REBOL or Icon, and it's hard to explain what a
> maddening distinction that can be once you've lived without it. mxTextTools
> embedding would feel more like Icon, where the matching engine is fully
> exposed to the programmer (REBOL hides it, allowing only "approved"
> interactions).

Of course it's hard for a Turing Machine to capture the flavor
of any high level language :-) When you're programming
the mxTextTools Tagging Engine directly you feel like writing
assembler... but things are moving in the right direction:
Tony Ibbs has a nice meta-language and M.C. Fletcher his
SimpleParse to cover up these insufficiencies.

> >> OTOH, making lots of calls to analyze short strings is slow.
>
> > That's why mxTextTools converts these search idioms into byte
> > codes which it executes at C level. Some future version will
> > even "precompile" the tuple input and then omit the type checks
> > during the search...that should give another noticeable speedup.
> > Note that recursion etc. can be done at C level too -- Python
> > function calls are not needed.
>
> That's also the curse of having distinct languages; e.g., Python already had
> recursion, but you needed to reimplement it in a different way with
> different syntax and different rules in your pattern language. In Icon etc,
> there's no difference between a recursive pattern and a recursive function,
> except in *what* it computes. The machinery is all the same, and both more
> powerful and easier to learn because of that.

Agreed.

> > ...
> > Just for kicks, here is the mysplit() function using mxTextTools:
> >
> > from mx.TextTools import *
> >
> > table = (
> >     # Match all whitespace
> >     (None,AllInSet,whitespace_set,+1),
> >     # Match and tag all non-whitespace
> >     ('text',AllInSet + AppendMatch,nonwhitespace_set,+1),
> >     # Loop until EOF
> >     (None,EOF,Here,-2),
> > )
> >
> > def mysplit(text):
> >
> >     return tag(text,table)[1]
> >
> > The timings:
> > mysplit: 5.84 sec.
> > string.split: 3.62 sec.
> >
> > Note that you can customize the above to split text at any
> > character set you like, not just whitespace... without
> > compiling or writing C code.
>
> That's equally true of the example I posted <wink>. Now what if I wanted to
> stop splitting right after I find a keyword, recognized as such because it's
> a key in some passed-in dictionary? In my example, I make an obvious local
> code change, from
>
>     while s.notmany(white):         # consume non-whitespace
>         result.append(s.get_match())
>         s.many(white)
>
> to
>
>     while s.notmany(white):         # consume non-whitespace
>         word = s.get_match()
>         result.append(word)
>         if dictionary.has_key(word):
>             break
>         s.many(white)
>
> What does it do to your example?

You'd replace the 'text' tagobj with a callable object and
write AllInSet + CallTag as command. The Tagging Engine will
then call the object with arguments (taglist,text,l,r,subtags)
and let it decide what to do.

In your example it would check the dictionary and raise an
exception in case a keyword is found to stop any further
scanning. If it's not a keyword, it would simply append
the found string to the taglist and return None.

Here's the code:

from mx.TextTools import *

import exceptions

stoplist = {'abc':1, 'def':1}

class KeywordFound(exceptions.StandardError):
    def __init__(self, taglist):
        self.taglist = taglist

def callable(taglist,text,l,r,subtags):

    taglist.append(text[l:r])
    if stoplist.has_key(text[l:r]):
        raise KeywordFound(taglist)

table = (
    # Match all whitespace
    (None,AllInSet,whitespace_set,+1),
    # Match and tag all non-whitespace
    (callable,AllInSet + CallTag,nonwhitespace_set,+1),
    # Loop until EOF
    (None,EOF,Here,-2),
)

def mysplitex(text):

    try:
        return tag(text,table)[1]
    except KeywordFound,data:
        return data.taglist
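
(With the stoplist above, mysplitex("one two abc three four") should come
back as ['one', 'two', 'abc'] -- the scan stops as soon as a keyword is
seen.)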

> Or what if the target string isn't "a
> string" (the code I posted only assumes the "str" object responds to
> indexing and slicing -- any buffer object is fine -- so my example doesn't
> change at all)?

The current version only handles string objects, but I am
already beginning to convert all the APIs in mxTextTools to
"s#" or "t#" style (can't decide which to use... "s#" is great
for processing raw data, while "t#" more closely refers to
text processing).

> Or what if you need to pass the tokens on as they're found,
> pipeline style? Etc. This is why I do complex string processing in Icon
> <0.9 wink>.

You can have all that extra magic via callable tag objects
or callable matching functions. It's not exactly nice to
write, but I'm sure that a meta-language could do the
conversions for you.

> OTOH, at what it does well, mxTextTools runs quicker than Icon. Its biggest
> problem has always been that e.g. nobody knows what the hell
>
> (None,EOF,Here,-2),
>
> *means* at first glance -- or third <wink>.

The structure of those tag tables is very simple:

(tagobject, command, argument[, jump offset in case of failure
[, jump offset in case of success]])

Please remember that this is byte code, not some higher level
abstraction. The design is very much inverted from what you'd
usually do: design a nice language and then try to find suitable
set of byte codes to make it work as intended.
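
As a rough gloss of the entry Tim quoted, read against that layout
(the jump defaults quoted from memory):

    from mx.TextTools import EOF, Here

    entry = (None,  # tagobject: match, but don't tag anything
             EOF,   # command:   test for end-of-text...
             Here,  # argument:  ...at the current position
             -2)    # jump offset on failure: not at EOF yet, so go back
                    # two entries and scan the next whitespace/word pair;
                    # on success there is no next entry, i.e. the whole
                    # table has matched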

Anyway, I'll keep focussing on the speed aspect of mxTextTools;
others can focus on abstractions, so that eventually everybody
will be happy :-)

Happy New Year,
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: Get ready to party !
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
RE: Better text processing support in py2k? [ In reply to ]
[Fredrik Lundh, whose very nice eMatter book is on sale until
the end of the 20th century (as real people think of it),
although the eMatter distribution scheme has lots of problems
[just an editorial note from a bot who has to-- for unknown
reasons Fatbrain "is working on" --delete the Fatbrain
registry tree and reregister the book almost every time he
tries to open it <wink>
]
]

> we have something called SIO which uses memory mapping
> where possible, and just a more aggressive read-ahead for
> other cases. on a windows box, a traditional while/readline
> loop runs 3-5 times faster than before. with SRE instead of
> re, a while/readline/match loop runs up to 10 times faster
> than before.
>
> note that this is without *any* changes to the Python
> source code...

If so, there's potential for significantly more speed. Python does its
line-at-a-time input with a character-at-a-time macro-in-a-loop, the same
way naive vendors (read "almost all vendors") implement fgets. It's
replacing that inner loop with direct peeking into the FILE buffer that gets
Perl its dramatic speed -- despite that Perl has fancier input functionality
(the oft-requested automagical "input record separator"). So it sounds like
the Perl trick is orthogonal to SIO's tricks; Perl isn't doing mmaps or
read-aheads or anything else fancy under the covers -- it only optimizes the
inner loop!

> ...
> with a little luck, the new module will replace both pcre
> and regex...

If something more tangible than luck would help to make this come true, feel
free to mention it <wink>.

> not to mention that it's fairly easy to write your own front-
> end to the matching engine -- the expression parser and the
> compiler are both written in good old python.

Ah, good news / bad news. Perl refugees aren't accustomed to "precompiling"
regexp objects, so write code that will cause regexps to get recompiled over
& over. Even if you cache the results under the covers, the overhead of the
Python call to the regexp compiler will likely take as long as the engine
takes to search.
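
The difference is just this (nothing beyond the existing re module; the
toy data is made up):

    import re

    lines = ["Just another Perl hacker,", "Python bites back"]

    # The habit in question: the pattern string goes through the
    # module-level function on every iteration, so the compile step
    # (cached or not) sits in the inner loop.
    for line in lines:
        if re.search("Python|Perl", line):
            print line

    # Compile once, reuse the compiled pattern object:
    pat = re.compile("Python|Perl")
    for line in lines:
        if pat.search(line):
            print line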

Personally, in such cases, I think they should learn how to use the language
<0.5 wink>.
RE: Better text processing support in py2k? [ In reply to ]
>> This is why I do complex string processing in Icon <0.9 wink>.

[MAL]
> You can have all that extra magic via callable tag objects
> or callable matching functions. It's not exactly nice to
> write, but I'm sure that a meta-language could do the
> conversions for you.

That wasn't my point: I do it in Icon because it *is* "exactly nice to
write", and doesn't require any yet-another meta-language. It's all
straightforward, in a way that separate schemes pasted together can never be
(simply because they *are* "separate schemes pasted together" <wink>).

The point of my Python examples wasn't that they could do something
mxTextTools can't do, but that they were *Python* examples: every variation
I mentioned (or that you're likely to think of) was easy to handle for any
Python programmer because the "control flow" and "data type" etc aspects
could be handled exactly the way they always are in *non* pattern-matching
Python code too, rather than recoded in pattern-scheme-specific different
ways (e.g., where I had a vanilla "if/break", you set up a special
exception to tickle the matching engine).

I'm not attacking mxTextTools, so don't feel compelled to defend it --
people using regexps in those examples are dead in the water. mxTextTools
is very good at what it does; if we have a real disagreement, it's probably
that I'm less optimistic about the prospects for higher-level wrappers
(e.g., MikeF's SimpleParse is much slower than "a real" BNF parsing system
(ARBNFPS), in part because he isn't doing all the optimizations ARBNFPS
does, but also in part because ARBNFPS uses an underlying engine more
optimized to its specific task than mxTextTool's more-general engine *can*
be). So I don't see mxTextTools as being the answer to everything -- and if
you hadn't written it, you would agree with that on first glance <wink>.

> Anyway, I'll keep focussing on the speed aspect of mxTextTools;
> others can focus on abstractions, so that eventually everybody
> will be happy :-)

You and I will be, anyway <wink>.
Re: Better text processing support in py2k? [ In reply to ]
Tim Peters wrote:
>
> >> This is why I do complex string processing in Icon <0.9 wink>.
>
> [MAL]
> > You can have all that extra magic via callable tag objects
> > or callable matching functions. It's not exactly nice to
> > write, but I'm sure that a meta-language could do the
> > conversions for you.
>
> That wasn't my point: I do it in Icon because it *is* "exactly nice to
> write", and doesn't require any yet-another meta-language. It's all
> straightforward, in a way that separate schemes pasted together can never be
> (simply because they *are* "separate schemes pasted together" <wink>).
>
> The point of my Python examples wasn't that they could do something
> mxTextTools can't do, but that they were *Python* examples: every variation
> I mentioned (or that you're likely to think of) was easy to handle for any
> Python programmer because the "control flow" and "data type" etc aspects
> could be handled exactly the way they always are in *non* pattern-matching
> Python code too, rather than recoded in pattern-scheme-specific different
> ways (e.g., where I had a vanilla "if/break", you set up a special
> exception to tickle the matching engine).
>
> I'm not attacking mxTextTools, so don't feel compelled to defend it --

Oh, I wasn't defending it -- I know that it is cryptic and sometimes
a pain to use. But given that you don't have to invoke a C compiler
to get raw speed, I find it a rather useful alternative to code
fast utility functions which would otherwise have to be written
in C.

The other reason it exists is simply because I don't like the
recursive style of regexps too much. mxTextTools is simple
and straightforward. Backtracking is still possible, but not
recommended.

> people using regexps in those examples are dead in the water. mxTextTools
> is very good at what it does; if we have a real disagreement, it's probably
> that I'm less optimistic about the prospects for higher-level wrappers
> (e.g., MikeF's SimpleParse is much slower than "a real" BNF parsing system
> (ARBNFPS), in part because he isn't doing all the optimizations ARBNFPS
> does, but also in part because ARBNFPS uses an underlying engine more
> optimized to its specific task than mxTextTool's more-general engine *can*
> be). So I don't see mxTextTools as being the answer to everything -- and if
> you hadn't written it, you would agree with that on first glance <wink>.

Oh, I'm sure it *is* the answer to all our problems ;-) ...

def main(*dummy):
    ...

from mx.TextTools import *
tag("",((main, Skip + CallTag, 0),))

> > Anyway, I'll keep focussing on the speed aspect of mxTextTools;
> > others can focus on abstractions, so that eventually everybody
> > will be happy :-)
>
> You and I will be, anyway <wink>.

Happy New Year :-)
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: Happy New Century !
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/