Mailing List Archive

Stackless & String-processing
In article <378A40AD.B4FDB79F@appliedbiometrics.com>,
Christian Tismer <tismer@appliedbiometrics.com> wrote:
>
> Stackless Python 0.3
>
>The possibilities are for instance:
>
>Continuations, with which we will build Coroutines and Generators
>
>Coroutines are able to run at the speed of a single C function call,
>which makes them a considerable alternative in certain algorithms.
>This is no longer a speculation since I have working coroutine
>prototypes, written with continuations.

I've been looking at Icon, and it occurs to me that if coroutines and
generators were available at the Python level, it might yield a way of
doing string processing that is more "Pythonic" than regexps.

Regexps are nice because when you have a pattern that they can
directly represent, then you can simply specify a pattern and then you
don't have to worry about the tedious bookkeeping of looping over the
string and keeping track of the state variable.

However, when you need to match a pattern that a regexp can't match,
then suddenly you need to break the string into tokens with regexps
and then loop over the pieces and keep track of a bunch of state
variables that don't necessarily correspond to the pieces you are
actually interested in.

This is unpleasant, because a) clumsy bookkeeping is bad, and b)
there's two different sorts of syntax needed to do basically
similar tasks. If we could compose generators just like functions,
then the bookkeeping can be abstracted away and the same approach
will work for arbitrarily complicated parsing tasks.
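
(A sketch of the kind of composition meant here, written with the
yield-style generator syntax Python did not have when this was posted,
so treat it as an after-the-fact illustration rather than code from the
time:)

def tokens(s):
    # generate maximal runs of non-space characters
    word = ''
    for ch in s:
        if ch.isspace():
            if word:
                yield word
                word = ''
        else:
            word += ch
    if word:
        yield word

def numbers(toks):
    # keep only the tokens that parse as integers
    for t in toks:
        try:
            yield int(t)
        except ValueError:
            pass

# the two generators compose like ordinary functions; no loop-and-state
# bookkeeping leaks out of either one
print(list(numbers(tokens("a 12 bb 7 c"))))    # [12, 7]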

So I think it would be nice if these lovely toys were available at
the Python level. Is this a possibility?

(I'd love to be able to define coroutines in Python like so:

def evens(z):
    for elt in z:
        if elt % 2 == 0:
            suspend(elt)

It would probably require some rethinking of Python's iteration
protocol though.)


Neel
Stackless & String-processing [ In reply to ]
[Neel Krishnaswami]
> ...
> I've been looking at Icon, and it occurs to me that if coroutines and
> generators were available at the Python level, it might yield a way of
> doing string processing that is more "Pythonic" than regexps.

Yes, and you can find much more about this in the very early days of the
String SIG archives.

> Regexps are nice because when you have a pattern that they can
> directly represent, then you can simply specify a pattern and then you
> don't have to worry about the tedious bookkeeping of looping over the
> string and keeping track of the state variable.
>
> However, when you need to match a pattern that a regexp can't match,
> then suddenly you need to break the string into tokens with regexps
> and then loop over the pieces and keep track of a bunch of state
> variables that don't necessarily correspond to the pieces you are
> actually interested in.
>
> This is unpleasant, because a) clumsy bookkeeping is bad, and b)
> there's two different sorts of syntax needed to do basically
> similar tasks.

The latter is one of the arguments Griswold gave in the paper that
introduced Icon, contrasting Icon's uniform approach to SNOBOL4's distinct
pattern and procedural sublanguages.

I don't think he'd make the same argument today! Icon is an idiosyncratic
delight, but most string pattern matching was in fact easier (to write,
modify, or read) in SNOBOL4. Once you start using Icon for complex pattern
matching, you'll soon discover that it's very hard to go back to old code
and disentangle "the pattern" from "the processing" -- *because* everything
looks the same, and it's all intermixed. The distinct sublanguages in
SNOBOL4 enforced a clean separation; and that was more often an aid than a
burden.

Have noted before that writing string matching code in Icon feels a lot like
writing in an assembly language for SNOBOL4. Of course the latter didn't
have Icon's power in other areas (e.g. it's at least possible to program
structural pattern-matchers in Icon, using generators to match e.g. tree
patterns; SNOBOL4's patterns started & ended with strings).

> If we could compose generators just like functions, then the bookkeeping
> can be abstracted away and the same approach will work for arbitrarily
> complicated parsing tasks.

The real advantage of regexps is speed, & that's probably why they'll
always be much more popular. SNOBOL4 and Icon didn't define "bal" builtins
because you couldn't code that pattern yourself <wink>. bal is beyond a
regexp's abilities, but it's still a simple kind of pattern, and just runs
too darned slow if implemented via the recursive definition as run with the
general backtracking machinery.
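
(For readers who haven't met bal: it matches a nonempty string balanced
with respect to parentheses.  A simplified Python check along those
lines -- present-day syntax, and not SNOBOL4's full semantics -- shows
how easy the pattern is to code directly even though no regexp can
express it:)

def bal_prefixes(s):
    # return the lengths of the nonempty prefixes of s that are
    # balanced with respect to '(' and ')'
    result = []
    depth = 0
    for i, ch in enumerate(s):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                break                  # unbalanced; stop looking
        if depth == 0:
            result.append(i + 1)
    return result

print(bal_prefixes("(a)(b)c"))         # [3, 6, 7]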

> So I think it would be nice if these lovely toys were available at
> the Python level. Is this a possibility?

I've been bugging Guido about generators since '91, but for the billion and
one uses other than string matching. Anything is possible -- except that
I'll ever stop bugging him <wink>.

> (I'd love to be able to define coroutines in Python like so:
>
> def evens(z):
>     for elt in z:
>         if elt % 2 == 0:
>             suspend(elt)

How do you intend to resume it?

> It would probably require some rethinking of Python's iteration
> protocol though.)

That's not a hangup; Guido already knows how to do it cleanly. Like any Big
New Feature there are real questions about costs vs benefits, and so far
this stuff has been widely viewed as "too esoteric". I think it's natural
as breathing, to the point that a *non*-resumable function strikes me as
never inhaling <wink>.

For now, there's a working implementation of a generator protocol (somewhat
more flexible than Icon's) in the source distribution, under
Demo/threads/Generator.py. I also posted a general (thread-based) coroutine
implementation a bit over 5 years ago. Building these on Christian's
stackless Python instead should run much faster.
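
(To give the flavor of the thread-based trick -- this is not the
Demo/threads/Generator.py API, just an independent stripped-down sketch
in present-day syntax: the producer runs in its own thread and hands
values to the consumer one at a time through a one-slot queue.)

import threading
import queue

class ThreadGenerator:
    DONE = object()                    # sentinel marking exhaustion

    def __init__(self, func):
        self._q = queue.Queue(maxsize=1)
        t = threading.Thread(target=self._run, args=(func,), daemon=True)
        t.start()

    def _run(self, func):
        func(self._q.put)              # the producer "suspends" via put()
        self._q.put(self.DONE)

    def get(self):
        # return the next value, or the DONE sentinel when exhausted
        return self._q.get()

def evens(suspend):
    for i in range(10):
        if i % 2 == 0:
            suspend(i)

g = ThreadGenerator(evens)
while True:
    value = g.get()
    if value is ThreadGenerator.DONE:
        break
    print(value)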

BTW, generators are much easier to implement than coroutines -- the former
don't require threads or stacklessness or continuations under the covers,
just some relatively straightforward changes to Python's current
implementation (Steven Majewski noted this 5 years ago). This is why
generators work in all ports of Icon, but coroutines are an optional feature
that's supported only if a platform guru has taken the time to write the
platform-dependent context-switching assembly code Icon coexps require.

degeneratoredly y'rs - tim
Stackless & String-processing [ In reply to ]
[Neel Krishnaswami]
: > If we could compose generators just like functions, then the bookkeeping
: > can be abstracted away and the same approach will work for arbitrarily
: > complicated parsing tasks.
Tim Peters (tim_one@email.msn.com) wrote:
: The latter is one of the arguments Griswold gave in the paper that
: introduced Icon, contrasting Icon's uniform approach to SNOBOL4's distinct
: pattern and procedural sublanguages.
:
: I don't think he'd make the same argument today! Icon is an idiosyncratic
: delight, but most string pattern matching was in fact easier (to write,
: modify, or read) in SNOBOL4. Once you start using Icon for complex pattern
: matching, you'll soon discover that it's very hard to go back to old code
: and disentangle "the pattern" from "the processing" -- *because* everything
: looks the same, and it's all intermixed. The distinct sublanguages in
: SNOBOL4 enforced a clean separation; and that was more often an aid than a
: burden.

As far as I understand this thread I think Neel is right, and Tim and
Griswold are wrong. I have been recently doing a lot of programming with
parser combinators in Clean and Haskell. They are something like what
Neel is talking about (as the name "parser combinators" suggests you
write small "string parsers" as functions, and then have "parser
combinators" that stitch these together). Note that in this parser
combinator approach there is no distinction between the code for "the
pattern" and the code for "the processing" -- they are all written in
the same language (Clean/Haskell in my case). At first I found the
idea of having the two types of code (the pattern and the processing)
in the same language (rather than in 2 distinct languages as in the
traditional way) a bit difficult. But I persevered and I am glad I
did. I found that once you reach a certain level of familiarity with
combinators you can write new pattern and processing code very
quickly and very reliably. The trick is to get the combinators right.
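
(A rough Python transliteration of the idea, with my own names and a
deliberately naive representation -- a parser is any function taking the
remaining input and returning a list of (value, rest) pairs, where []
means failure and several entries mean ambiguity.  This is only a
sketch, nothing like a production Clean/Haskell combinator library:)

def lit(c):
    # match one literal character
    def p(s):
        return [(c, s[1:])] if s[:1] == c else []
    return p

def seq(p, q, combine=lambda a, b: (a, b)):
    # p followed by q, combining their results
    def r(s):
        return [(combine(a, b), rest2)
                for a, rest1 in p(s)
                for b, rest2 in q(rest1)]
    return r

def alt(p, q):
    # p or q; both alternatives are kept, so backtracking is automatic
    def r(s):
        return p(s) + q(s)
    return r

second = lambda a, b: b
parser = alt(seq(lit('a'), lit('b'), second),
             seq(lit('a'), lit('c'), second))
print(parser("ac"))                    # [('c', '')]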

graham
--
This ain't no technological breakdown
Oh no, this is the road to hell
Stackless & String-processing [ In reply to ]
In article <000601bece8f$b549a420$51a22299@tim>,
Tim Peters <tim_one@email.msn.com> wrote:
>[Neel Krishnaswami]
>
>> (I'd love to be able to define coroutines in Python like so:
>>
>> def evens(z):
>>     for elt in z:
>>         if elt % 2 == 0:
>>             suspend(elt)
>
>
>> It would probably require some rethinking of Python's iteration
>> protocol though.)
>
>How do you intend to resume it?

I left this unspecified because I don't know the answer. :) I figured
that it could be rolled up into the iteration protocol and I could
use it without having to completely understand how it all works.

[The iteration protocol]

>That's not a hangup; Guido already knows how to do it cleanly.

Good -- now I don't have to worry at all.

>BTW, generators are much easier to implement than coroutines -- the former
>don't require threads or stacklessness or continuations under the covers,
>just some relatively straightforward changes to Python's current
>implementation (Steven Majewski noted this 5 years ago).

What's the difference, exactly? AFAICT you need to save the execution
state when suspending both coroutines and generators, but I might be
missing something obvious....


Neel
Stackless & String-processing [ In reply to ]
In article <7mkrrt$a81$1@cronkite.cc.uga.edu>,
Graham Matthews <graham@sloth.math.uga.edu> wrote:
>
>[Parser Combinators]
>
>They are something like what Neel is talking about (as the name
>"parser combinators" suggests you write small "string parsers" as
>functions, and then have "parser combinators" that stitch these
>together). Note that in this parser combinator approach there is no
>distinction between the code for "the pattern" and the code for "the
>processing" -- they are all written in the same language
>(Clean/Haskell in my case).

Thanks for the tip -- it led me to do some digging on the web and I
found a nice description of this in a paper called "Monadic Parser
Combinators" by Graham Hutton and Erik Meijer. Its URL is:

http://www.cs.nott.ac.uk/Department/Staff/gmh/monparsing.ps

>At first I found the idea of having the two types of code (the
>pattern and the processing) in the same language (rather than in 2
>distinct languages as in the traditional way) a bit difficult. But I
>perservered and I am glad I did. I found that once you reach a
>certain level of familiarity with combinators you can write new
>pattern and processing code very quickly and very reliably. The
>trick is to get the combinators right.

I think that this weekend I'm going to try and put together a basic
version of this parsing system in Python based on the ideas in this
paper -- basically just write a simple AST class and the combinators
they describe. Then I can add proper support for backtracking and lazy
evaluation and all the stuff with serious hack value when Christian
Tismer makes continuations available to us. :)

This really seems like a clean and straightforward way of doing
things. My experience with recursive descent parsers has been pretty
bad, but now I suspect that this is because I wasn't organizing my
work in a systematic enough fashion.


Neel
Stackless & String-processing [ In reply to ]
[Tim]
> BTW, generators are much easier to implement than coroutines -- ...

[Neel]
> What's the difference, exactly? AFAICT you need to save the execution
> state when suspending both coroutines and generators, but I might be
> missing something obvious....

Just that a generator (in the Icon sense) is a semi-coroutine: it always
suspends directly to its caller, at the point it was called. A world of
difference follows from that:

1) From the caller's point of view, it's an ordinary call. So on *that* end
of it, generators are no harder than vanilla procedure calls.

2) From the generator's point of view, "the execution state" is a single
frame (its own). A coroutine can transfer to any other at any time, with
intervening calls piling up an arbitrary number of frames in the state.

3) From a recursive interpreter's point of view, the implementation language
(C, in Python's case) can't get any of its own frames stuck between caller
and callee during suspension of a generator (the generator *returns* to its
caller; it doesn't *resume* it; this asymmetry is the heart of it).

It's #3 (dealing with an intertwined C stack) that requires
platform-specific hackery to make coexps work in Icon, while the lack of C
stack pollution is why generators can work without platform-specific cruft.

Continuations and threads are harder still to implement (efficiently) -- the
less restrictive the rules, the less the implementation can exploit to make
its life easier.

a-phenomenon-you-may-observe-again-some-day<wink>-ly y'rs - tim
Stackless & String-processing [ In reply to ]
In article <7mlvju$6v3$1@brick.cswv.com>, Neel Krishnaswami
<neelk@brick.cswv.com> wrote:

> In article <7mkrrt$a81$1@cronkite.cc.uga.edu>,
> Graham Matthews <graham@sloth.math.uga.edu> wrote:
> >
> >[Parser Combinators]
> >
> >They are something like what Neel is talking about (as the name
> >"parser combinators" suggests you write small "string parsers" as
> >functions, and then have "parser combinators" that stitch these
> >together). Note that in this parser combinator approach there is no
> >distinction between the code for "the pattern" and the code for "the
> >processing" -- they are all written in the same language
> >(Clean/Haskell in my case).
<snip>
> This really seems like a clean and straightforward way of doing
> things. My experience with recursive descent parsers has been pretty
> bad, but now I suspect that this is because I wasn't organizing my
> work in a systematic enough fashion.
Clean and straightforward it may be, but not efficient, and subject to
termination problems on left-recursive grammars. Techniques for
compiling regular expressions and deterministic context-free grammars
into efficient pattern-matching/parsing automata do not extend readily
to programs in a general purpose declarative (functional or logic)
programming language. Certain clever things can be done (in particular
see "The Functional Treatment of Parsing," by Rene Leermakers,
Kluwer Academic Publishers, 1993), but, as far as I know, nothing beats
the automata-theoretic optimization techniques for pattern matching and
parsing, of which the typical modern compiler for regular expressions
is just the best-known instance.

-- F
Stackless & String-processing [ In reply to ]
Neel Krishnaswami wrote:
>
> What's the difference, exactly? AFAICT you need to save the execution
> state when suspending both coroutines and generators, but I might be
> missing something obvious....

The crucial difference is that a generator is never resumed
after its caller has returned. This means that the generator's
state can be pushed onto the same stack as the caller's.

A coroutine, on the other hand, can outlive the context in
which it was created, and therefore needs a stack all of
its own.

Another way to think about it is that a generator call is
equivalent to an ordinary call where one of the parameters
is a procedure. For example, where in "generator python"
you might write

def even_numbers_up_to(n):
    for i in range(n):
        if i % 2 == 0:
            yield(i)

for e in even_numbers_up_to(42):
    print e, "is an even number!"

you could do the same thing in ordinary python as

def for_each_even_number_up_to(n, do_it):
    for i in range(n):
        if i % 2 == 0:
            do_it(i)

def print_it(e):
    print e, "is an even number!"

for_each_even_number_up_to(42, print_it)

which clearly can be executed quite happily using
a single stack.

If Python had something akin to Smalltalk code blocks,
generators wouldn't be needed. It would be nifty to
be able to write something like

for_each_even_number_up_to(42) -> (e):
    print e, "is an even number!"

Greg
Stackless & String-processing [ In reply to ]
Greg Ewing wrote:
>
> Neel Krishnaswami wrote:
> >
> > What's the difference, exactly? AFAICT you need to save the execution
> > state when suspending both coroutines and generators, but I might be
> > missing something obvious....
>
> The crucial difference is that a generator is never resumed
> after its caller has returned. This means that the generator's
> state can be pushed onto the same stack as the caller's.

Fine.

> A coroutine, on the other hand, can outlive the context in
> which it was created, and therefore needs a stack all of
> its own.

Smell of danger...

> Another way to think about it is that a generator call is
> equivalent to an ordinary call where one of the parameters
> is a procedure. For example, where in "generator python"
> you might write

...

> If Python had something akin to Smalltalk code blocks,
> generators wouldn't be needed. It would be nifty to
> be able to write something like
>
> for_each_even_number_up_to(42) -> (e):
>     print e, "is an even number!"

Traceback (innermost last):
  File "<inbox:Re: Stackless & String-processing>", line 22, in claim
SemanticMisconceptionError:
    things should be made as simple as possible, but not any simpler
>>> #:-)

>>> # sorry did you really mean that?
>>> # Where is your generator's state?
>>> # how would you do backtracking?

hoping-that-I'm-misconcepted-and-not-you-ly y'rs - chris

--
Christian Tismer :^) <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaiserin-Augusta-Allee 101 : *Starship* http://starship.python.net
10553 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
we're tired of banana software - shipped green, ripens at home
Stackless & String-processing [ In reply to ]
Graham Matthews <graham@sloth.math.uga.edu> wrote:
: >[Parser Combinators]
Neel Krishnaswami (neelk@brick.cswv.com) wrote:
: Thanks for the tip -- it led me to do some digging on the web and I
: found a nice description of this in a paper called "Monadic Parser
: Combinators" by Graham Hutton and Erik Meijer. Its URL is:
:
: http://www.cs.nott.ac.uk/Department/Staff/gmh/monparsing.ps

Yes this is a nice place to start. Another nice place to look is in
the Clean documentation. Somewhere in their docs (I think in the old
book) there is a Clean implementation of combinators (without all the
"do" notation).

The trickiest part for me was to get the error handling to work well,
since Clean/Haskell don't have exceptions. This should be "easier"
in Python, although you are welcome to see my code if you want. Also
feel free to email me questions.

Neel Krishnaswami (neelk@brick.cswv.com) wrote:
: This really seems like a clean and straightforward way of doing
: things. My experience with recursive descent parsers has been pretty
: bad, but now I suspect that this is because I wasn't organizing my
: work in a systematic enough fashion.

I had similar experiences with rec.descent parsers. By the way this
technique is not limited to parsing. I use essentially the same
code to implement a type checker too. A parser has input a string
or a file, and output an AST. The type checker has the same structure
as a parser but it's input (the thing it parses) is an AST and it's
output an AST annotated with types. Combinators are more general
than just parsing and patterns. They are a form of sequencing
construct.

graham

--
This ain't no technological breakdown
Oh no, this is the road to hell
Stackless & String-processing [ In reply to ]
Fernando Pereira (pereira@research.att.com) wrote:
: but, as far as I know, nothing beats
: the automata-theoretic optimization techniques for pattern matching and
: parsing, of which the typical modern compiler for regular expressions
: is just the best-known instance.

In what sense are you using the words "nothing beats"? I am sure that
the automata-theoretic approach produces the fastest pattern matchers
and parsers. So in this sense "nothing beats them". But the class of
languages recognisable by automaton doesn't cover all the languages of
interest (in fact my guess is it covers very few ... context ...). So
in this sense lots of things "beat" the automata approach. And then
there are software engineering issues ...

graham

--
This ain't no technological breakdown
Oh no, this is the road to hell
Stackless & String-processing [ In reply to ]
Neel Krishnaswami (neelk@brick.cswv.com) wrote:
: I think that this weekend I'm going to try and put together a basic
: version of this parsing system in Python based on the ideas in this
: paper -- basically just write a simple AST class and the combinators
: they describe. Then I can add proper support for backtracking and lazy
: evaluation and all the stuff with serious hack value when Christian
: Tismer makes continuations available to us. :)

It seems to me that you might also be able to make the regular expression
module available through this interface. All you have to do is wrap the
regular expression up in a function/object that can be combined like the
other parsers. Indeed you will need "primitive parsers" for things like
numbers, and these are really just a special case of a "primitive"
regular expression parser.
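
(One possible shape for such a wrapper in Python, using the same naive
"function from input to list of (value, rest) pairs" convention as the
combinator sketch earlier in the thread -- my convention, not a
standard one:)

import re

def regex(pattern):
    # turn a regular expression into a primitive parser: on success it
    # returns [(matched_text, rest_of_input)], on failure []
    compiled = re.compile(pattern)
    def p(s):
        m = compiled.match(s)
        return [(m.group(0), s[m.end():])] if m else []
    return p

number = regex(r'[0-9]+')
print(number("123abc"))                # [('123', 'abc')]
print(number("abc"))                   # []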

graham

--
This ain't no technological breakdown
Oh no, this is the road to hell
Stackless & String-processing [ In reply to ]
[Neel Krishnaswami]
> ...
> I found a nice description of this in a paper called "Monadic Parser
> Combinators" by Graham Hutton and Erik Meijer. Its URL is:
> http://www.cs.nott.ac.uk/Department/Staff/gmh/monparsing.ps

[Graham Matthews]
> Yes this is a nice place to start. Another nice place to look is in
> the Clean documentation. Somewhere in their docs (I think in the old
> book) there is a Clean implementation of combinators (without all the
> "do" notation).

A nice paper on Clean parsing combinators can be downloaded as item 5 from
the "Case Studies" list at

http://www.cs.kun.nl/~clean/

Interestingly enough, the next paper on the list ("An Interpreter") declines
to use combinators for its parsing task, instead building a conventional
"by hand" recursive descent parser.

another-two-years-and-they'll-discover-lex/yacc<wink>-ly y'rs - tim
Stackless & String-processing [ In reply to ]
In article <7mnirc$t2b$2@cronkite.cc.uga.edu>, Graham Matthews
<graham@sloth.math.uga.edu> wrote:

> Fernando Pereira (pereira@research.att.com) wrote:
> : but, as far as I know, nothing beats
> : the automata-theoretic optimization techniques for pattern matching and
> : parsing, of which the typical modern compiler for regular expressions
> : is just the best-known instance.
>
> In what sense are you using the words "nothing beats"? I am sure that
> the automata-theoretic approach produces the fastest pattern matchers
> and parsers. So in this sense "nothing beats them".
I made the sense clear. For regular and deterministic CF languages,
although my main use of finite-state techniques these days is in speech
and text processing, involving some less known and some new techniques
for combining and optimizing weighted finite-state transducers
<http://www.research.att.com/sw/tools/fsm/>.
> But the class of
> languages recognisable by automaton doesn't cover all the languages of
> interest (in fact my guess is it covers very few ... context ...).
In what sense of "very few"? Most programming languages are designed to
have an LR(k) or LL(k) backbone, and regular lexical structure (those
that aren't are cursed by their unhappy implementors). {S|HT|X}ML and
other Internet protocols obey similar or even stronger restrictions as
far as I know.

Naturally occurring sequences (text, speech, proteins, DNA, ...) do not
fit neatly into the standard formal language classes, and for those the
question is that of what language class can best approximate the string
distribution, which pushes one quickly to very restrictive language
classes (eg. bounded-context regular languages, or lexicalized
stochastic CFGs) learnable automatically from data.
> So
> in this sense lots of things "beat" the automata approach.
Opportunities for optimization => efficiency. Crucial in text and
speech indexing, Web search, biological sequence processing, event log
analysis, to name just a few.
> And then
> there are software engineering issues ...
Agreed. However, more traditional lexer- and parser-generators had
sensible answers to many of those issues, that might not be as elegant
as monadic approaches, but that paid serious attention to optimization
in the regular and CF backbones. By the way, I find it faintly amusing
to see the functional programming community rediscovering and dressing
in categorial garb a simple idea that was a main motivator for the
development of logic programming in the early 70s (see
<http://akpublic.research.att.com:9000/~pereira/bib.html#Pereira+Warren-
80:DCGs> for a survey). It *is* a good idea for rapid prototyping, but
interspersing pattern matching/parsing with arbitrary computations
(even purely functional ones) effectively blocks optimization, which is
required in large-scale applications.

-- F
Stackless & String-processing [ In reply to ]
Tim Peters (tim_one@email.msn.com) wrote:
: A nice paper on Clean parsing combinators can be downloaded as item 5
: Interestingly enough, the next paper on the list ("An Interpreter") declines
: to use combinators for its parsing task, instead building a conventional
: "by hand" recursive descent parser.

Yes it's kind of strange that they don't use the combinators to build
the interpreter. One of Peyton-Jones' papers has a good (but small)
example of making an interpreter with combinators (it's probably the
first paper of his on monads).

Tim Peters (tim_one@email.msn.com) wrote:
: another-two-years-and-they'll-discover-lex/yacc<wink>-ly y'rs - tim

I'll take that as a BIG <wink>. Once you use these combinators you will
never go back to lex/yacc (unless you want LALR(n) grammars).

graham
--
The girls is all salty, the boys is all sweet,
the food ain't too shabby
an' they piss in the streets
In France, way down in France, way on down, way on down in France
Stackless & String-processing [ In reply to ]
<graham@sloth.math.uga.edu> wrote:
: > But the class of
: > languages recognisable by automaton doesn't cover all the languages of
: > interest (in fact my guess is it covers very few ... context ...).
Fernando Pereira (pereira@research.att.com) wrote:
: In what sense of "very few"? Most programming languages are designed to
: have an LR(k) or LL(k) backbone, and regular lexical structure

So the regular expressions help you with the regular lexical structure.
But they don't help with the LR(n) bit. They also don't help with context
dependencies. That leaves out a lot of languages.

<graham@sloth.math.uga.edu> wrote:
: > And then
: > there are software engineering issues ...
Fernando Pereira (pereira@research.att.com) wrote:
: Agreed. However, more traditional lexer- and parser-generators had
: sensible answers to many of those issues, that might not be as elegant
: as monadic approaches, but that paid serious attention to optimization
: in the regular and CF backbones.

I don't really see why using parser combinators means that you cannot
optimise. You are not precluded from using regular expression parsers
as primitive parsers, nor are you precluded from using LALR(n) parsers
as primitive parsers. Hence you can optimise those parts of your
parsing with automaton if you want. So I don't really understand your
criticism.

Also as I said parser combinators are really more of a sequencing
technique -- one that applies nicely to parsing.

Fernando Pereira (pereira@research.att.com) wrote:
: By the way, I find it faintly amusing
: to see the functional programming community rediscovering and dressing
: in categorial garb a simple idea that was a main motivator for the
: development of logic programming in the early 70s (see
: <http://akpublic.research.att.com:9000/~pereira/bib.html#Pereira+Warren-
: 80:DCGs> for a survey).

Yes combinators are much like definite clause grammars in prolog. But if
I remember correctly from my use of DCG's, they gave you a fixed set of
combinators? Moreover they were limited to string parsing. The combinator
approach in the fp world has neither limitation.

Fernando Pereira (pereira@research.att.com) wrote:
: It *is* a good idea for rapid prototyping, but
: interspersing pattern matching/parsing with arbitrary computations
: (even purely functional ones) effectively blocks optimization, which is
-------------------------------

Can you explain why? I don't see it, since you can have reg expr parsers
and other such fast parsers as primitive parsers that you combine ...

graham
--
The girls is all salty, the boys is all sweet,
the food ain't too shabby
an' they piss in the streets
In France, way down in France, way on down, way on down in France
Stackless & String-processing [ In reply to ]
In article <7mqaht$r27$2@cronkite.cc.uga.edu>, Graham Matthews
<graham@sloth.math.uga.edu> wrote:

> <graham@sloth.math.uga.edu> wrote:
> : > But the class of
> : > languages recognisable by automaton doesn't cover all the languages of
> : > interest (in fact my guess is it covers very few ... context ...).
> Fernando Pereira (pereira@research.att.com) wrote:
> : In what sense of "very few"? Most programming languages are designed to
> : have an LR(k) or LL(k) backbone, and regular lexical structure
>
> So the regular expressions help you with the regular lexical structure.
> But they don't help with the LR(n) bit. They also don't help with context
> dependencies. That leaves out a lot of languages.
By "automaton" you seem to have misunderstood "finite-state automaton."
I was referring more generally to determinizable classes of automata,
eg. determinizable PDAs. Hence my reference to LR and LL constructions,
which effectively attempt to determinize PDAs (and fail with conflicts
if they cannot).
>
> <graham@sloth.math.uga.edu> wrote:
> : > And then
> : > there are software engineering issues ...
> Fernando Pereira (pereira@research.att.com) wrote:
> : Agreed. However, more traditional lexer- and parser-generators had
> : sensible answers to many of those issues, that might not be as elegant
> : as monadic approaches, but that paid serious attention to optimization
> : in the regular and CF backbones.
>
> I don't really see why using parser combinators means that you cannot
> optimise. You are not precluded from using regular expression parsers
> as primitive parsers, nor are you precluded from using LALR(n) parsers
> as primitive parsers. Hence you can optimise those parts of your
> parsing with automaton if you want. So I don't really understand your
> criticism.
The monadic approach mixes up the search structure for pattern matching
or parsing (what is elsewhere represented as an automaton) with other
data processing. Unless your optimizer can solve the halting problem,
it will not be able to determinize otherwise determinizable control
structures because it cannot figure out the data dependencies.
>
> Also as I said parser combinators are really more of a sequencing
> technique -- one that applies nicely to parsing.
>
Sure, but so what? Automata are too a sequencing technique, usable for
many other things besides pattern matching and parsing. The fundamental
difference is that automata-theoretic approaches separate neatly the
representation of sequences from what one wants to do with the parts of
those sequences, allowing optimizations that are difficult to obtain
within general-purpose programs.
> Fernando Pereira (pereira@research.att.com) wrote:
> : By the way, I find it faintly amusing
> : to see the functional programming community rediscovering and dressing
> : in categorial garb a simple idea that was a main motivator for the
> : development of logic programming in the early 70s (see
> : <http://akpublic.research.att.com:9000/~pereira/bib.html#Pereira+Warren-
> : 80:DCGs> for a survey).
>
> Yes combinators are much like definite clause grammars in prolog. But if
> I remember correctly from my use of DCG's, they gave you a fixed set of
> combinators?
It's easy enough to define one's own combinators, but not as elegantly
as in a higher-order functional language.
>Moreover they were limited to string parsing.
Not true. There's nothing in DCGs that ties them to strings. The
underlying programming pattern (often called "difference lists") is in
fact a general accumulator pattern that has been used in many other
applications, eg. code generation.
> Fernando Pereira (pereira@research.att.com) wrote:
> : It *is* a good idea for rapid prototyping, but
> : interspersing pattern matching/parsing with arbitrary computations
> : (even purely functional ones) effectively blocks optimization, which is
> -------------------------------
>
> Can you explain why? I don't see it, since you can have reg expr parsers
> and other such fast parsers as primitive parsers that you combine ...
Either you put all the interesting stuff in separate primitive parsers,
in which case I'm not sure what's the point of the monadic approach, or
you combine parsing and other data processing more tightly, which takes
advantage of the monadic approach to achieve greater conciseness, but
then it is hard to optimize the parser because the other data
processing is in the way of the necessary analyses. Focusing on
determinization (a crucial part of regular and LR constructions), you
need to determine whether two alternative computation paths that
consume the same input sequence can be merged into a single path. In
general, this is undecidable since it requires figuring out whether two
arbitrary functions (those associated to those two paths) produce the
same output. Now, it might be interesting to see whether compiler
optimization and code rewriting techniques could be used in trying to
recognize possible optimizations, but the general point stands that
tangling pattern-matching or parsing with general computation makes
optimization very difficult.

-- F

Stackless & String-processing [ In reply to ]
<graham@sloth.math.uga.edu> wrote:
> I don't really see why using parser combinators means that you cannot
> optimise. You are not precluded from using regular expression parsers
> as primitive parsers, nor are you precluded from using LALR(n) parsers
> as primitive parsers. Hence you can optimise those parts of your
> parsing with automaton if you want. So I don't really understand your
> criticism.
Fernando Pereira (pereira@research.att.com) wrote:
: The monadic approach mixes up the search structure for pattern matching
: or parsing (what is elsewhere represented as an automaton) with other
: data processing. Unless your optimizer can solve the halting problem,
: it will not be able to determinize otherwise determinizable control
: structures because it cannot figure out the data dependencies.

You have totally misunderstood what I said. I said "you can have regular
expression parsers as primitive parsers". There is no need for the compiler
to "solve the halting problem". The programmer is free to use regular
expression/automaton techniques where he/she wants. Such parsers take
strings and return ASTs. They can then be combined with other parsers
as if they were primitive parsers. This is purely a programming issue
and has nothing to do with the halting problem, compiler optimisations,
etc. To make it clearer a programmer can do

    p = some_parser_that_uses_automaton :: String -> Ast
    q = some_parser_that_uses_some_other_technique :: String -> Ast
    pq = p <&> q

where <&> is the combinator that sequences parsers.

Where does the halting problem enter into this? The programmer has to
write the automaton code him/herself, but there are fp tools that do
this, much like for example Yacc.

<graham@sloth.math.uga.edu> wrote:
: > Also as I said parser combinators are really more of a sequencing
: > technique -- one that applies nicely to parsing.
Fernando Pereira (pereira@research.att.com) wrote:
: Sure, but so what? Automata are too a sequencing technique, usable for
: many other things besides pattern matching and parsing. The fundamental
: difference is that automata-theoretic approaches separate neatly the
: representation of sequences from what one wants to do with the parts of
: those sequences, allowing optimizations that are difficult to obtain
: within general-purpose programs.

I still don't understand why you think the combinator approach is so
inefficient, given that you can have automaton based parsers embedded
in it at will.

The point of saying that combinators are more of a sequencing tool than
a parsing tool is to point out that they are more general than automata.
You can program much richer sequencing with combinators than you can
with automata (not surprisingly). For those parts of your parser or
sequencing where you only need automata you use automata as outlined
above. But where you need more powerful sequencing you have it in the
combinator approach.


<graham@sloth.math.uga.edu> wrote:
: > Can you explain why? I don't see it, since you can have reg expr parsers
: > and other such fast parsers as primitive parsers that you combine ...
Fernando Pereira (pereira@research.att.com) wrote:
: Either you put all the interesting stuff in separate primitive parsers,
: in which case I'm not sure what's the point of the monadic approach,

Err isn't this what I said you do ... I said "you can have regular
expression parsers as primitive parsers". The point of the monadic
approach is that it allows you to cleanly program arbitrary sequencing
when you need it. Moreover it makes for sound software engineering,
etc. That too is what I said in the last mail ....

graham
--
The girls is all salty, the boys is all sweet,
the food ain't too shabby
an' they piss in the streets
In France, way down in France, way on down, way on down in France
Stackless & String-processing [ In reply to ]
In article <7mssne$m8r$1@cronkite.cc.uga.edu>, Graham Matthews
<graham@sloth.math.uga.edu> wrote:
> You have totally misunderstood what I said. I said "you can have regular
> expression parsers as primitive parsers". There is no need for the compiler
> to "solve the halting problem". The programmer is free to use regular
> expression/automaton techniques where he/she wants. Such parsers take
> strings and return ASTs. They can then be combined with other parsers
> as if they were primitive parsers.
One last try. I understood your point from the beginning. However, you
did not understand *mine*. If you push the complex
pattern-matching/parsing machinery into black boxes, there's not much
for the monadic glue to do, is there? And even then, some
optimizations, those involving the various black boxes, will be
blocked. Here's a simple example. Assume that there are two black boxes
A and B, and that you use monadic glue to say "A or B". A and B may be
deterministic, but the result will not be, and exhaustive search will
be required in the glue. Sure, I can recognize such cases by hand and
create a new black box "A or B" which is optimized, but that is an
error-prone hand optimization. And one may not notice subtler
inefficiencies.
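
(A toy Python illustration of the point -- my own code, and only
suggestive, since Python's re module is itself a backtracking engine
rather than a pure automaton compiler:)

import re

A = re.compile(r'ab*c$')
B = re.compile(r'ab*d$')

def a_or_b(s):
    # the "glue" version of "A or B": try A in full, then B in full, so
    # on a string ending in 'd' the long run of b's is scanned twice
    return bool(A.match(s)) or bool(B.match(s))

# the hand-merged "A or B" black box: one left-to-right pass decides
combined = re.compile(r'ab*[cd]$')

s = 'a' + 'b' * 100000 + 'd'
print(a_or_b(s), bool(combined.match(s)))   # True True -- same answer,
                                            # different amounts of work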

-- F
Stackless & String-processing [ In reply to ]
[Graham Matthews]
> You have totally misunderstood what I said.
> ...

[Fernando Pereira]
> One last try. I understood your point from the beginning. However, you
> did not understand *mine*.
> ...

Trust me on this one, fellows: you each understand the other, and have for
several (interesting) msgs already. Graham has an unusual way of expressing
his appreciation for a good discussion <wink>.

http://www.deja.com/getdoc.xp?AN=316262255

leaping-thru-hoops-ly y'rs - tim
Stackless & String-processing [ In reply to ]
<graham@sloth.math.uga.edu> wrote:
> expression parsers as primitive parsers". There is no need for the compiler
> to "solve the halting problem". The programmer is free to use regular
> expression/automaton techniques where he/she wants. Such parsers take
> strings and return ASTs. They can then be combined with other parsers
> as if they were primitive parsers.
Fernando Pereira (pereira@research.att.com) wrote:
: One last try. I understood your point from the beginning. However, you
: did not understand *mine*. If you push the complex
: pattern-matching/parsing machinery into black boxes, there's not much
: for the monadic glue to do, is there?

What your combinators do depends on your application. If all you are
doing is parsing using primitive automaton based parsers then your
combinators won't have much to do. But if you are doing more than
that then they will have more to do, and hence be more complex. For
example you can use the same combinators you define for parsing to do
type checking, model state, etc -- things I would not want to do with
an automaton. That is why I said that combinators are really sequencing
constructs.

Fernando Pereira (pereira@research.att.com) wrote:
: And even then, some
: optimizations, those involving the various black boxes, will be
: blocked. Here's a simple example. Assume that there are two black boxes
: A and B, and that you use monadic glue to say "A or B". A and B may be
: deterministic, but the result will not be, and exhaustive search will
: be required in the glue. Sure, I can recognize such cases by hand and
: create a new black box "A or B" which is optimized, but that is an
: error-prone hand optimization. And one may not notice subtler
: inefficiencies.

Your point appears to be that if you were programming only using automaton
for parsing you wouldn't miss the above optimisation, but that somehow
using combinators means you would miss the above optimisation. You
and I must program differently then. If I were using automaton, even
in the context of also using combinators, I would be aware of the
potential for such optimisations and look for them.

graham
--
The girls is all salty, the boys is all sweet,
the food ain't too shabby
an' they piss in the streets
In France, way down in France, way on down, way on down in France
Stackless & String-processing [ In reply to ]
[Graham Matthews]
: > You have totally misunderstood what I said.
[Fernando Pereira]
: > One last try. I understood your point from the beginning. However, you
: > did not understand *mine*.
Tim Peters (tim_one@email.msn.com) wrote:
: Trust me on this one, fellows: you each understand the other, and have for
: several (interesting) msgs already. Graham has an unusual way of expressing
: his appreciation for a good discussion

Tim I don't think that we are agreeing: Fernando seems to want to
view combinators as parsing constructs and hence says things like
"the combinators don't have to do much if you are using automata
based black boxes" (to paraphrase). But combinators are more than
parsing constructs -- they are sequencing constructs used for a wide
variety of tasks where automata techniques are not available.

graham

--
The girls is all salty, the boys is all sweet,
the food ain't too shabby
an' they piss in the streets
In France, way down in France, way on down, way on down in France
Stackless & String-processing [ In reply to ]
In article <7mvika$cpj$1@cronkite.cc.uga.edu>, Graham Matthews
<graham@sloth.math.uga.edu> wrote:

> <graham@sloth.math.uga.edu> wrote:
> > expression parsers as primitive parsers". There is no need for the compiler
> > to "solve the halting problem". The programmer is free to use regular
> > expression/automaton techniques where he/she wants. Such parsers take
> > strings and return ASTs. They can then be combined with other parsers
> > as if they were primitive parsers.
> Fernando Pereira (pereira@research.att.com) wrote:
> : One last try. I understood your point from the beginning. However, you
> : did not understand *mine*. If you push the complex
> : pattern-matching/parsing machinery into black boxes, there's not much
> : for the monadic glue to do, is there?
>
> What your combinators do depends on your application. If all you are
> doing is parsing using primitive automaton based parsers then your
> combinators won't have much to do. But if you are doing more than
> that then they will have more to do, and hence be more complex. For
> example you can use the same combinators you define for parsing to do
> type checking, model state, etc -- things I would not want to do with
> an automaton. That is why I said that combinators are really sequencing
> constructs.
*What* I've been saying is that there is a generality-efficiency
tradeoff, and that the monadic approach goes for generality at the
expense of efficiency, by precluding optimizations in those very
important cases in which combinators are used for parsing. After all,
the papers cited earlier in this discussion were all about parsing, and
the thread was originally about string processing, coroutines and Icon
generators. Here's an interesting question. A context-free grammar can
be seen as a set of simultaneous recursive definitions in which the
RHSs are polynomials (where + is alternation and * is concatenation;
more generally, this can be expressed in terms of semirings and formal
power series, with some very interesting advantages). A regexp pattern
matcher is just a special case (modulo complications with non-regular
extensions like matching against pattern variables set by earlier \(
... \) ). Now, could the mappings of (special cases of) such grammars
to deterministic automata be extended to interesting cases of monadic
expressions? I have some vague ideas on how this might go based on
recent work on weighted transducer determinization by my colleague
Mehryar Mohri -- basically, one defines an algebraic structure over
"outputs" so that it makes sense to talk about the greatest common
factor of alternative outputs generated from the same input prefix, and
one can give sufficient conditions for determinizability.
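
(A worked micro-example of the grammars-as-equations view, added for
concreteness; the notation is mine.  The CFG

    S -> a S b | epsilon

read with + as alternation and juxtaposition as concatenation, is the
least solution of the equation

    S = aSb + epsilon

and in the regular special case such equations have automaton-friendly
closed forms, e.g. Arden's lemma: X = AX + B has X = A*B as its least
solution, and as its unique solution when epsilon is not in A.)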
>
> Fernando Pereira (pereira@research.att.com) wrote:
> : And even then, some
> : optimizations, those involving the various black boxes, will be
> : blocked. Here's a simple example. Assume that there are two black boxes
> : A and B, and that you use monadic glue to say "A or B". A and B may be
> : deterministic, but the result will not be, and exhaustive search will
> : be required in the glue. Sure, I can recognize such cases by hand and
> : create a new black box "A or B" which is optimized, but that is an
> : error-prone hand optimization. And one may not notice subtler
> : inefficiencies.
>
> Your point appears to be that if you were programming only using automaton
> for parsing you wouldn't miss the above optimisation, but that somehow
> using combinators means you would miss the above optimisation. You
> and I must program differently then. If I were using automaton, even
> in the context of also using combinators, I would be aware of the
> potential for such optimisations and look for them.
You might be, by looking at the code and changing it by hand. But a
program would not, because in general the optimizable skeleton of
sequence, alternation and Kleene closure combinators is interspersed
with arbitrary functional code. The point of automata-based compilation
of regexps and CFGs is that it is *automatic*. Clearly, a monadic
program that is a direct transcription of a regexp or a CFG could be
similarly compiled. But from my readings (those mentioned in the thread
and others) as well as your comments, the whole point of monadic
combinators for parsing is to interweave the grammatical structure with
other computations, thus making the underlying grammatical structure
less accessible to optimization. As I suggested above, it would be
interesting to define classes of monadic programs that still allow (an
appropriate extension of) those optimizations. But AFAIK that has not
been done.

-- F
Stackless & String-processing [ In reply to ]
[Tim, on Graham & Fernando's combinator thread]
> Trust me on this one, fellows: you each understand the other,
> and have for several (interesting) msgs already. ...

[Graham]
> Tim I don't think that we are agreeing.

Who said anything about agreeing? You two obviously disagree, but
disagreement doesn't imply misunderstanding. Heck, you've both repeated the
same unbudging positions so often even *I* understand what you're both
saying <wink>.

less-filling-tastes-great-ly y'rs - tim
Stackless & String-processing [ In reply to ]
<graham@sloth.math.uga.edu> wrote:
> What your combinators do depends on your application. If all you are
> doing is parsing using primitive automaton based parsers then your
> combinators won't have much to do. But if you are doing more than
> that then they will have more to do, and hence be more complex. For
> example you can use the same combinators you define for parsing to do
> type checking, model state, etc -- things I would not want to do with
> an automaton. That is why I said that combinators are really sequencing
> constructs.
Fernando Pereira (pereira@research.att.com) wrote:
: *What* I've been saying is that there is a generality-efficiency
: tradeoff, and that the monadic approach goes for generality at the
: expense of efficiency, by precluding optimizations in those very
: important cases in which combinators are used for parsing.

And I am saying that this claim that generality implies loss of
efficiency is a red herring. If you are writing a program that uses
combinator based parsing there is nothing to stop you using automaton
based techniques in your program. You just add the appropriate
automaton based primitives (eg. for regular expressions). You
might end up adding in so many automaton based primitives that
you don't need to do much with your combinators, but whether or
not that happens depends entirely on your application and your
needs. The point is that combinator based parsing doesn't prevent
you from using other more efficient techniques as and where
appropriate!

Have you ever written a combinator based parsing program?

<graham@sloth.math.uga.edu> wrote:
> Your point appears to be that if you were programming only using automaton
> for parsing you wouldn't miss the above optimisation, but that somehow
> using combinators means you would miss the above optimisation. You
> and I must program differently then. If I were using automaton, even
> in the context of also using combinators, I would be aware of the
> potential for such optimisations and look for them.
Fernando Pereira (pereira@research.att.com) wrote:
: You might be, by looking at the code and changing it by hand. But a
: program would not, because in general the optimizable skeleton of
: sequence, alternation and Kleene closure combinators is interspersed
: with arbitrary functional code.

Could you please point out to me where I claimed that "a program could
optimise your combinator program". If you can't find such a quote then
what is your point?

Fernando Pereira (pereira@research.att.com) wrote:
: But from my readings (those mentioned in the thread
: and others) as well as your comments, the whole point of monadic
: combinators for parsing is to interweave the grammatical structure with
: other computations, thus making the underlying grammatical structure
: less accessible to optimization. As I suggested above, it would be
: interesting to define classes of monadic programs that still allow (an
: appropriate extension of) those optimizations. But AFAIK that has not
: been done.

It hasn't been done, it won't likely be done, and I never said it
could be done. So what is your point?

graham
--
The girls is all salty, the boys is all sweet,
the food ain't too shabby
an' they piss in the streets
In France, way down in France, way on down, way on down in France
