[String-SIG] Parsing
[Paul Prescod]
> I am using Aycock's package to handle some parsing but I am having
> trouble because the language I am parsing is highly context
> sensitive. I don't have any trouble dealing with the context-sensitivity
> in the so-called "context free grammar" part of the package (the parser)
> but in the scanner it is killing me.

Easiest way out is not to try dealing with it in the scanner at all.

> Let's pretend I am parsing a tagged (but non-SGML) language where there
> is an element "URL". Within "URL" elements, the characters < and > are
> illegal: they must be escaped as \< and \>.
>
> Elsewhere they are not. Here is the grammar I would *like* to write
> (roughly):
>
> Element ::= <URL> urlcontent </URL>
> urlcontent = ([^<>\/:]* ("\<"|"\>"|":"|"/"|"\\"))*
> Element ::= <NOT-A-URL> anychar* </NOT-A-URL>
>
> Of course this is a made-up syntax because I don't think you can put
> regular expressions in Aycock's BNF. I've used tools that do allow
> this so I'm not sure how to handle it.

Rewrite the regular-expression operators as plain BNF rules:

z = x | y  ->
    z ::= x
    z ::= y

z = x+  ->
    z ::= x
    z ::= x z

z = x?  ->
    z ::= x
    z ::=

and so on.
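
x* is just x+ made optional, so, applying those rewrites, your urlcontent
rule comes out roughly like this (the nonterminal and token names here are
invented for illustration):

urlcontent ::=
urlcontent ::= urlcontent urlitem
urlitem ::= PLAINRUN ESCAPE
urlitem ::= ESCAPE

where the scanner hands back a PLAINRUN token for each maximal [^<>\/:]+
run and an ESCAPE token for each of "\<", "\>", ":", "/" and "\\". A
scanner can't usefully emit empty tokens, so the grammar makes PLAINRUN
optional instead of letting the character class match nothing.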

> This is also a made-up (simplified) example so demonstrating how I can
> do it all in the scanner is probably not helpful.

Can't you enforce your "no < or > in URL" restriction in the parser instead?
View it as a semantic constraint instead of a syntactic one, and the problem
vanishes <wink>. Languages play this trick all the time; e.g., the C
expression "x + y" isn't legit unless some vrbl named "x" is in scope, but
no C compiler tries to enforce "no x here!" in the scanner. The scanner
just passes on everything that looks like an identifier, and leaves it to
the parser to gripe about bad cases. Similarly, you should be able to pass
on "forbidden" characters and gripe about them only in the contexts where
they're illegal.
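
In Aycock-ese that might look something like the following rough sketch
(the rule methods fire when their rules reduce; the grammar, token names,
and error handling are all invented here, and I'm assuming a scanner that
hands back plain strings):

from spark import GenericParser   # Aycock's package

class ElementParser(GenericParser):
    def __init__(self):
        GenericParser.__init__(self, 'element')

    def p_url(self, args):
        ' element ::= URLSTART urlcontent URLEND '
        text = args[1]
        # The scanner passed < and > through untouched; gripe here,
        # in the one context where they're illegal.
        if '<' in text or '>' in text:
            raise ValueError('unescaped < or > inside URL: %s' % repr(text))
        return text

    def p_urlcontent_empty(self, args):
        ' urlcontent ::= '
        return ''

    def p_urlcontent_more(self, args):
        ' urlcontent ::= urlcontent CHUNK '
        return args[0] + args[1]

The point is that the scanner stays dumb: it tokenizes <, > and everything
else the same way everywhere, and only the URL rule's action ever objects.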

> I could handle it if I could switch scanners mid-stream (for URL elements)
> but Aycock's scanner finishes up before the parser even gets under way!

Some versions of lex allow associating scanner expressions with "state
names", and let you dynamically fiddle the set of active state names.
Fastest way known to code an impossibly brittle scanner <0.7 wink>.
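
If you insist on playing that game in Python, the idea is easy enough to
fake (a minimal sketch, all names invented; note how the token sets and
the state flips have to agree with each other, which is exactly where the
brittleness creeps in):

import re

# One token set per state; seeing <URL> or </URL> flips the state.
RULES = {
    'normal': [('URLSTART', re.compile(r'<URL>')),
               ('TEXT',     re.compile(r'[^<]+|<'))],
    'in_url': [('URLEND',   re.compile(r'</URL>')),
               ('ESCAPE',   re.compile(r'\\[<>]')),
               ('URLCHAR',  re.compile(r'[^<>\\]+'))],
}

def tokenize(s):
    state, pos, tokens = 'normal', 0, []
    while pos < len(s):
        for name, pattern in RULES[state]:
            m = pattern.match(s, pos)
            if m:
                tokens.append((name, m.group(0)))
                pos = m.end()
                if name == 'URLSTART':
                    state = 'in_url'
                elif name == 'URLEND':
                    state = 'normal'
                break
        else:
            raise ValueError('bad character at %d in state %s' % (pos, state))
    return tokens

tokenize('<URL>a\<b</URL>') chews through fine; tokenize('<URL>a<b</URL>')
dies at the raw <, because no in_url rule will touch it.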

> Should I scan and then parse (at a high level) and then rescan and reparse
> the URLs?

I wouldn't.

> Is there a package that allows me to mix the lexical and syntactic
> levels more?

Nothing beats Marc-Andre's mxTextTools for speed or flexibility: it lets
you mix all possible levels in all possible ways, since it has no concept of
levels to get in the way. Or to help, either.

parsing-ain't-easy-reminding-ly y'rs - tim