Mailing List Archive: Parsing

Parsing

May 3, 1999, 5:54 PM

Post #1 of 6 (824 views)

I am using Aycock's package to handle some parsing but I am having trouble
because the language I am parsing is highly context sensitive. I don't
have any trouble dealing with the context-sensitivity in the so-called
"context free grammar" part of the package (the parser) but in the scanner
it is killing me.

Let's pretend I am parsing a tagged (but non-SGML) language where there is
an element "URL". Within "URL" elements, the characters < and > are
illegal: they must be escaped as \< and \>.

Elsewhere they are not. Here is the grammar I would *like* to write
(roughly):

Element ::= <URL> urlcontent </URL>
urlcontent = (([^<>\/:]* ("\<"|"\>"|":"|"/"|"\\"))*
Element ::= <NOT-A-URL> anychar* </NOT-A-URL>

Of course this is a made-up syntax because I don't think you can put
regular expressions in Aycock's BNF. I've used tools that do allow this so
I'm not sure how to handle it. This is also a made-up (simplified) example
so demonstrating how I can do it all in the scanner is probably not
helpful.

I could handle it if I could switch scanners mid-stream (for URL elements)
but Aycock's scanner finishes up before the parser even gets under way!
Should I scan and then parse (at a high level) and then rescan and reparse
the URLs? Is there a package that allows me to mix the lexical and
syntactic levels more?

--
Paul Prescod - ISOGEN Consulting Engineer speaking for only himself
http://itrc.uwaterloo.ca/~papresco

Diplomatic term: "We had a frank exchange of views."
Translation: Negotiations stopped just short of shouting and
table-banging. (Brill's Content, Apr. 1999)

Parsing [ In reply to ]

paul at prescod

May 3, 1999, 6:42 PM

Post #2 of 6 (816 views)

Permalink

John Aycock wrote:
>
> In your particular case, my thought is that the correct form of "urlcontent"
> is more of a semantic issue, and so I'd want to handle it *after* scanning and
> parsing; you would likely be able to print a better diagnostic message
> that way too. Maybe I've misunderstood the problem you described?

I can see how handling it as a semantic issue would simplify
implementation but I don't follow how it is inherently a "semantic issue."
What could be more syntactic then a set of allowed characters?

Overall, I find your package to be quite intuitive and unbelievably tiny.
My problem is that I got used to the much more complex and large javacc
which let me embed lexical constraints in BNF rules. I want your package's
simplicity so that some less technical co-worker's can maintain the code.
But I miss some of that power.

--
Paul Prescod - ISOGEN Consulting Engineer speaking for only himself
http://itrc.uwaterloo.ca/~papresco

The first three Noble Truths of Python:
All that is not Python is suffering.
The origin of suffering lies in the use of not-Python.
The cessation of suffering can be achieved by not using not-Python.
http://www.pauahtun.org/4nobletruthsofpython.html

Parsing [ In reply to ]

aycock at csc

May 3, 1999, 11:28 PM

Post #3 of 6 (821 views)

Permalink

Paul Prescod (paul@prescod.net) wrote:
: I am using Aycock's package to handle some parsing but I am having trouble
: because the language I am parsing is highly context sensitive. I don't
: have any trouble dealing with the context-sensitivity in the so-called
: "context free grammar" part of the package (the parser) but in the scanner
: it is killing me.
..
Element ::= <URL> urlcontent </URL>
urlcontent = (([^<>\/:]* ("\<"|"\>"|":"|"/"|"\\"))*
Element ::= <NOT-A-URL> anychar* </NOT-A-URL>
..
: I could handle it if I could switch scanners mid-stream (for URL elements)
: but Aycock's scanner finishes up before the parser even gets under way!

True. The framework's design deliberately excluded lexical feedback. When
I needed to do a similar context-sensitive job with the framework, I had a
kludge in the scanner class which guessed the context.

In your particular case, my thought is that the correct form of "urlcontent"
is more of a semantic issue, and so I'd want to handle it *after* scanning and
parsing; you would likely be able to print a better diagnostic message
that way too. Maybe I've misunderstood the problem you described?

John

Parsing [ In reply to ]

gmcm at hypernet

May 4, 1999, 6:15 AM

Post #4 of 6 (822 views)

Permalink

> I am using Aycock's package to handle some parsing but I am having
> trouble because the language I am parsing is highly context
> sensitive. I don't have any trouble dealing with the
> context-sensitivity in the so-called "context free grammar" part of
> the package (the parser) but in the scanner it is killing me.
>
> Let's pretend I am parsing a tagged (but non-SGML) language where
> there is an element "URL". Within "URL" elements, the characters <
> and > are illegal: they must be escaped as \< and \>.
>
> Elsewhere they are not. Here is the grammar I would *like* to write
> (roughly):
>
> Element ::= <URL> urlcontent </URL>
> urlcontent = (([^<>\/:]* ("\<"|"\>"|":"|"/"|"\\"))*
> Element ::= <NOT-A-URL> anychar* </NOT-A-URL>
/snip/
>
> I could handle it if I could switch scanners mid-stream (for URL
> elements) but Aycock's scanner finishes up before the parser even
> gets under way!

You can use an "outer" and an "inner" scanner. The outer one
recognizes vanilla stuff, and the existence of a (e.g.) urlcontent.
The inner one picks apart the urlcontent, returns its list, and the
outer just concatenates those results to his own. So you come into
the parser with a merged set of tokens.

- Gordon

Parsing [ In reply to ]

aycock at csc

May 4, 1999, 10:14 AM

Post #5 of 6 (820 views)

Permalink

Paul Prescod (paul@prescod.net) wrote:
: John Aycock wrote:
: > In your particular case, my thought is that the correct form of "urlcontent"
: > is more of a semantic issue
:
: I can see how handling it as a semantic issue would simplify
: implementation but I don't follow how it is inherently a "semantic issue."
: What could be more syntactic then a set of allowed characters?

Fair enough. In general, one can even make the argument that all "static
semantics" can be expressed syntactically, and that the only true
semantics are "dynamic semantics". But let's not go there :-)

John

Parsing [ In reply to ]

philh at vision25

May 4, 1999, 2:59 PM

Post #6 of 6 (823 views)

Permalink

In article <372E4543.68A125EC@prescod.net>
paul@prescod.net "Paul Prescod" writes:
> I am using Aycock's package to handle some parsing but I am having trouble
> because the language I am parsing is highly context sensitive. I don't
> have any trouble dealing with the context-sensitivity in the so-called
> "context free grammar" part of the package (the parser) but in the scanner
> it is killing me.
>
> Let's pretend I am parsing a tagged (but non-SGML) language where there is
> an element "URL". Within "URL" elements, the characters < and > are
> illegal: they must be escaped as \< and \>.
>
> Elsewhere they are not. Here is the grammar I would *like* to write
> (roughly):
>
> Element ::= <URL> urlcontent </URL>
> urlcontent = (([^<>\/:]* ("\<"|"\>"|":"|"/"|"\\"))*
> Element ::= <NOT-A-URL> anychar* </NOT-A-URL>

I am currently writing a stream library for Python, which includes
an abstract PeekStream class which defines an abstract peek() method
allowing one to peek ahead of what's in the stream without reading
it, and several methods implemented using this method, which perform
parsing (e.g. isNextSkip(str) -- which if the next characters in the
input stream are (str), returns true and reads them, else returns
false and doesn't read them).

Your language would be processed something like this:

# use one of these to set up the PeekStream, depending on whether
# input comes from a string or file
ps = PeekString(someString)
ps = PeekFile(open('filename'))

while ps.hasMoreChars():
if ps.isNextSkip('<URL>'):
content = ps.readToAfter('</URL>')
processUrlContent(content)
continue
if ps.isNextSkip('<')
tagName = ps.readToAfter('>')
contents = ps.readToAfter('</' + tagName + '>')
processTag(tagName, contents)
continue
# ...stuff here to process if the next characters coming from the
# stream weren't '<URL>' or '<'...

> I could handle it if I could switch scanners mid-stream (for URL elements)
> but Aycock's scanner finishes up before the parser even gets under way!
> Should I scan and then parse (at a high level) and then rescan and reparse
> the URLs? Is there a package that allows me to mix the lexical and
> syntactic levels more?

My library is really aimed at lexical analysis rather than parsing,
although it could be used with a parser -- it is based on code I wrote
in C++ to code for a yylex() function to use with yacc because I got
fed up with trying to get lex to work.

--
Phil Hunt....philh@vision25.demon.co.uk