Mailing List Archive

Parsing HTML tags with 're'
I'm trying to parse HTML using Python's 're' module. Generally things
are working fine, but I have one small problem left, and I'd also like to
know if I'm going in the wrong direction entirely.

I have a set of expressions I use for different stages of my parsing
process. These include:

TagStart = re.compile( '<[a-zA-Z]|<!--|</[a-zA-Z]' )
TagEnd = re.compile( '>' )
CommentEnd = re.compile( '-->' )

I use these to find the start and end of tags. Since I parse line by
line, I can't assume the tag will end in the same string as it started
in, so I don't look for '<*>' or anything similar. I look for a tag
start, and then I buffer lines until I find a matching tag end. Once I
have a complete tag, I want to parse out the tag type and the argument
list where appropriate (i.e. not in comments, and end tags don't have
arguments). To do this parsing, I use these expressions:

Structure = re.compile(r'^<(?P<type>[a-zA-Z_/]\w*)\s*(?P<args>[^>]+)*>')
Arglist = re.compile('(?P<name>[^=]+)=?(?P<value>.+)?')

The first one separates out the tag type and the optional argument list,
and the second one parses out the individual arguments.
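
For anyone following along, the buffering step described above (find a tag
start, then collect lines until the matching tag end) might look roughly
like the sketch below. The three expressions are the ones from this post;
the function name iter_tags and the choice to throw away text between tags
are assumptions made for illustration, not the actual code in question.

```python
import re

# Expressions copied from the post.
TagStart = re.compile('<[a-zA-Z]|<!--|</[a-zA-Z]')
TagEnd = re.compile('>')
CommentEnd = re.compile('-->')

def iter_tags(lines):
    """Yield each complete tag or comment found in an iterable of lines.

    Hypothetical helper: assumes a tag start marker itself is never split
    across two lines, and discards any text outside of tags.
    """
    buffer = ''
    for line in lines:
        buffer += line
        while True:
            start = TagStart.search(buffer)
            if start is None:
                buffer = ''  # no tag start in what we have; drop plain text
                break
            # Comments end at '-->'; every other tag ends at the next '>'.
            closer = CommentEnd if start.group() == '<!--' else TagEnd
            end = closer.search(buffer, start.end())
            if end is None:
                # Incomplete tag: keep it and wait for the next line.
                buffer = buffer[start.start():]
                break
            yield buffer[start.start():end.end()]
            buffer = buffer[end.end():]
```

Each string it yields is one whole tag, which is the form the Structure
expression expects. Note that a '>' inside a quoted argument would still end
the tag early, just as it would with TagEnd on its own.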

The only problem I'm having is with the arglist parser: if an argument
looks like arg="quoted arg with spaces", the spaces cause the argument
to break up. What's the best way to fix this?
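
One possible fix, offered as a sketch and not as the definitive answer, is
to stop splitting on whitespace and let a single expression walk the
argument string, treating a quoted value as one unit. The names Attr and
parse_args are made up for this example, and it assumes the input is the
string captured by the 'args' group of Structure.

```python
import re

# Hypothetical replacement for Arglist: a name, then optionally '=' and a
# double-quoted, single-quoted, or bare (unquoted) value.
Attr = re.compile(r'''
    (?P<name>[^\s=]+)            # argument name
    (?:\s*=\s*
        (?: "(?P<dq>[^"]*)"      # double-quoted value, spaces allowed
          | '(?P<sq>[^']*)'      # single-quoted value, spaces allowed
          | (?P<bare>[^\s>]+)    # unquoted value, ends at whitespace
        )
    )?
''', re.VERBOSE)

def parse_args(args):
    """Return a list of (name, value) pairs; value is None if absent."""
    result = []
    for m in Attr.finditer(args):
        value = m.group('dq')
        if value is None:
            value = m.group('sq')
        if value is None:
            value = m.group('bare')
        result.append((m.group('name'), value))
    return result
```

Called on the string arg="quoted arg with spaces" width=10 checked, this
returns ('arg', 'quoted arg with spaces'), ('width', '10') and
('checked', None), so the quoted value survives intact.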

As I said before, I would also be interested to know if there is a much
better way to do this. Did I miss a standard HTML parsing Python module?
If this approach is reasonable, feel free to scoop the code for your own
nefarious purposes.

- Bruce


Sent via Deja.com http://www.deja.com/
Share what you know. Learn what you don't.
Parsing HTML tags with 're'
In article <7okkte$25a$1@nnrp1.deja.com>,
Bruce Fletcher <befletch@my-deja.com> wrote:
>
>As I said before, I would also be interested to know if there is a much
>better way to do this. Did I miss a standard HTML parsing Python module?
>If this approach is reasonable, feel free to scoop the code for your own
>nefarious purposes.

sgmllib/htmllib
--
--- Aahz (@netcom.com)

Androgynous poly kinky vanilla queer het <*> http://www.rahul.net/aahz/
Hugs and backrubs -- I break Rule 6 (if you want to know, do some research)
Parsing HTML tags with 're'
In article <7oknev$kac@dfw-ixnews15.ix.netcom.com>,
aahz@netcom.com (Aahz Maruch) wrote:
> In article <7okkte$25a$1@nnrp1.deja.com>,
> Bruce Fletcher <befletch@my-deja.com> wrote:
> >
> >As I said before, I would also be interested to know if there is a much
> >better way to do this. Did I miss a standard HTML parsing Python module?
> >If this approach is reasonable, feel free to scoop the code for your own
> >nefarious purposes.
>
> sgmllib/htmllib
> --
> --- Aahz (@netcom.com)

Hmmm. Yes, I forgot about htmllib. Some time ago, when I started
looking into this, I looked at the docs for htmllib and decided it wasn't
working for me. Now sgmllib, that looks like maybe what I want.
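
A note for readers on Python 3: sgmllib and htmllib were removed there, and
the standard library's html.parser module covers the same ground. Below is
a minimal sketch of the line-by-line use case from this thread using
html.parser; the class name TagCollector and the sample lines are invented
for the example.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record start tags, end tags, and comments as they are parsed."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed.
        self.events.append(('start', tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_comment(self, data):
        self.events.append(('comment', data))

parser = TagCollector()
# feed() may be called one line at a time; a tag that is still incomplete
# is buffered until later input finishes it.
for line in ['<a href="x"\n', ' title="two words">text</a>\n', '<!-- note -->\n']:
    parser.feed(line)
parser.close()
```

After this runs, parser.events holds the start tag for 'a' with both of its
arguments, the matching end tag, and the comment, with no hand-written
expressions involved.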

Thanks,
- Bruce

