Mailing List Archive

htmllib: CR in CDATA
It appears that htmllib doesn't ignore returns in CDATA fields, as HTML 4.0
says it should:
http://www.w3.org/TR/REC-html40/types.html#type-cdata
http://www.w3.org/TR/REC-html40/sgml/dtd.html

As a result, htmllib improperly parses any CDATA attribute value that wraps
across a line; this affects elements like

<A href="foo.
gif">

I'm happy to work up a patch, but I thought I'd ask around first. It may be
a bit involved to fix it properly; every CDATA should be handled this way,
which practically means almost every tag attribute.

Regards,


Mark Nottingham, Melbourne Australia
mnot@pobox.com http://www.mnot.net/
htmllib: CR in CDATA [ In reply to ]
Whoops, never mind, I misread the spec -- carriage returns are turned into
spaces (which is what htmllib does) - *line feeds* should be ignored...

--
"Get me the phone book."
"Which one?"
"Doesn't matter."


----- Original Message -----
From: Mark Nottingham <mnot@pobox.com>
To: Python <python-list@cwi.nl>
Sent: Tuesday, June 22, 1999 12:55
Subject: htmllib: CR in CDATA


htmllib: CR in CDATA [ In reply to ]
OK, I'm starting to have a really nice conversation with myself now ;-)

htmllib DOESN'T change the newline to a single space - it leaves it in.

CDATA is a sequence of characters from the document character set and may
include character entities. User agents should interpret attribute values as
follows:
* Replace character entities with characters,
* Ignore line feeds,
* Replace each carriage return or tab with a single space.
User agents may ignore leading and trailing white space in CDATA
attribute values (e.g., " myval " may be interpreted as "myval").
Authors should not declare attribute values with leading or trailing white
space.
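
Read literally, the three rules quoted above amount to something like the
following sketch. It is written for Python 3 (an assumption; the thread
predates it), and normalize_cdata and its strip flag are made-up names, not
part of htmllib or sgmllib. The steps run in the order the list gives them,
which the spec text does not actually mandate.

```python
import html

# Tab and carriage return become a single space; line feed maps to None,
# which str.translate treats as "delete this character".
_CDATA_TABLE = str.maketrans({"\t": " ", "\r": " ", "\n": None})


def normalize_cdata(value, strip=True):
    """Apply the HTML 4.0 CDATA attribute rules (hypothetical helper)."""
    # 1. Replace character entities with characters.
    value = html.unescape(value)
    # 2. Ignore line feeds; 3. replace each CR or tab with a space.
    value = value.translate(_CDATA_TABLE)
    # Optional: user agents *may* drop leading and trailing white space.
    return value.strip() if strip else value


print(repr(normalize_cdata("image.\njpg")))
```

Under these rules the wrapped HREF from the first message collapses to a
single token with no embedded whitespace.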

If I'm wrong this time, someone please save us both the trouble and shoot
me.

Example:

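#!/opt/local/bin/python
# Demonstrates that a newline inside a quoted HREF survives parsing.
import formatter, htmllib

body = '''\
<HTML>
<A HREF="image.
jpg">
</HTML>
'''
parser = htmllib.HTMLParser(formatter.NullFormatter())
parser.feed(body)
# anchorlist holds each HREF the parser saw, embedded newline included.
print parser.anchorlist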



----- Original Message -----
From: Mark Nottingham <mnot@pobox.com>
To: Python <python-list@cwi.nl>
Sent: Tuesday, June 22, 1999 7:31
Subject: Re: htmllib: CR in CDATA


htmllib: CR in CDATA [ In reply to ]
And here's a go at a patch. Looking at the DTD, practically all attribute
types are CDATA, and those that aren't shouldn't have tabs or newlines in
them anyway. It uses string.translate; how efficient is this?


*** /opt/local/lib/python1.5/sgmllib.py Thu Apr 15 00:54:11 1999
--- sgmllib.py Tue Jun 22 21:02:05 1999
***************
*** 38,43 ****
--- 38,44 ----
'[%s]*([a-zA-Z_][-.a-zA-Z_0-9]*)' % string.whitespace
+ ('([%s]*=[%s]*' % (string.whitespace, string.whitespace))
+ r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./:+*%?!\(\)_#=~]*))?')
+ cdata_tr = string.maketrans('\t\n', '  ')


# SGML parser base class -- find tags and call handler functions.
***************
*** 251,256 ****
--- 252,258 ----
elif attrvalue[:1] == '\'' == attrvalue[-1:] or \
attrvalue[:1] == '"' == attrvalue[-1:]:
attrvalue = attrvalue[1:-1]
+ attrvalue = string.translate(attrvalue, cdata_tr, '\r')
attrs.append((string.lower(attrname), attrvalue))
k = match.end(0)
if rawdata[j] == '>':
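
For anyone without a 1.5 interpreter to hand, here is roughly what the two
added lines do to an attribute value, sketched in Python 3 (an assumption;
there str.maketrans takes the characters to delete as its third argument,
where string.translate took them in 1.5). patched_clean is a made-up name.
Note that the patch turns tabs and newlines into spaces and deletes carriage
returns, which is not quite the mapping in the HTML 4.0 text quoted earlier
in the thread.

```python
# Python 3 rendering of the patch's translation table: '\t' and '\n' each
# map to a space, and '\r' (third argument) is deleted outright.
cdata_tr = str.maketrans("\t\n", "  ", "\r")


def patched_clean(attrvalue):
    """Mimic the line the patch adds to sgmllib (hypothetical helper)."""
    return attrvalue.translate(cdata_tr)


print(repr(patched_clean("image.\njpg")))    # newline becomes a space
print(repr(patched_clean("image.\r\njpg")))  # CR dropped, LF becomes a space
```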
htmllib: CR in CDATA [ In reply to ]
Mark Nottingham wrote:
> It appears that htmllib doesn't ignore returns in CDATA fields, as HTML 4.0
> says it should.

well, htmllib doesn't claim to be HTML 4.0 compliant...

> OK, I'm starting to have a really nice conversation with myself now ;-)
>
> htmllib DOESN'T change the newline to a single space - it leaves it in.
>
> CDATA is a sequence of characters from the document character set and may
> include character entities. User agents should interpret attribute values as
> follows:
> * Replace character entities with characters,
> * Ignore line feeds,
> * Replace each carriage return or tab with a single space.
> User agents may ignore leading and trailing white space in CDATA
> attribute values (e.g., " myval " may be interpreted as "myval").
> Authors should not declare attribute values with leading or trailing white
> space.

...and it doesn't claim to be a "user agent", either...

</F>
htmllib: CR in CDATA [ In reply to ]
> well, htmllib doesn't claim to be HTML 4.0 compliant...

But it does claim 2.0:
"""HTML 2.0 parser.

See the HTML 2.0 specification:
http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_toc.html
"""

Fine. Now, if we have a look at
http://www.w3.org/MarkUp/html-spec/html-spec_9.html#SEC9.1
we'll see that attributes are marked as CDATA.

Unfortunately, I don't have a copy of the SGML specification, so I can't
definitively say that this is the proper way to treat CDATA; all I have is
the description in the HTML 4.0 docs, as previously referenced. So, it's
probably not a good idea to patch this in sgmllib.py (as I did). However, it
is IMHO reasonable to conclude that, since both 2.0 and 4.0 refer to SGML
for the definition of CDATA, we can apply what we know about it from one to
the other (SGML being a fairly stable spec AFAIK).

I'm certainly willing to admit that this isn't directly specified behaviour
for a 2.0 parser, but I still think it's the Right Thing. At a practical
level, I'm parsing HTML with these constructs in it; if I pass off an HREF
to httplib that has a newline in it, all sorts of bad things happen.

I've ended up calling a cleaning function each time I parse attributes in my
subclassed parser; this does the job nicely. However, IMHO the user of a
parser shouldn't have to do this sort of lexical processing/second-guessing.
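
The cleaning function itself isn't shown, so the following is only a guess at
the general shape, written against Python 3's html.parser (an assumption;
htmllib no longer ships with Python 3). CleaningParser, clean_attr and hrefs
are invented names.

```python
from html.parser import HTMLParser

# Assumed cleaning rule, following the HTML 4.0 text quoted earlier:
# drop line feeds, turn carriage returns and tabs into spaces.
_TABLE = str.maketrans({"\n": None, "\r": " ", "\t": " "})


def clean_attr(value):
    """Hypothetical cleaning function; valueless attributes arrive as None."""
    return value if value is None else value.translate(_TABLE).strip()


class CleaningParser(HTMLParser):
    """Collects anchor targets, cleaning every attribute value first."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = [(name, clean_attr(value)) for name, value in attrs]
        if tag == "a":
            self.hrefs.extend(v for n, v in attrs if n == "href" and v)


parser = CleaningParser()
parser.feed('<HTML>\n<A HREF="image.\njpg">\n</HTML>\n')
parser.close()
print(parser.hrefs)
```

Doing the cleanup in handle_starttag keeps it in the application layer, so
the parser itself still hands over attribute values untouched.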


> ...and it doesn't claim to be a "user agent", either...

*sigh*
Do we _really_ want to take a trip down this semantic rabbit warren? In HTML
2.0-land, user agent is:
A component of a distributed system that presents an interface and processes
requests on behalf of a user; for example, a www browser or a mail user
agent.

Now, htmllib certainly:
* is a component
* part of a distributed system (i.e., the Web)
* presents an interface (programmatic)
* processes requests on behalf of a user

I'm curious... if it's not a user agent in the quoted context, what is it?
htmllib: CR in CDATA [ In reply to ]
> Now, htmllib certainly:
> * is a component
> * part of a distributed system (i.e., the Web)
> * presents an interface (programmatic)
> * processes requests on behalf of a user

> I'm curious... if it's not a user agent in the quoted context, what is it?

I'm pretty sure you know what I meant, but
alright...

htmllib is a parser, just like the documentation says.
you have to add an application to get an HTML user
agent (see section 1.2.3 of the 2.0 spec for more
info on user agents).

imho, it's pretty reasonable for an SGML parser to
behave like an XML parser: split the document up
into pieces, but pass them all to the application as
untouched as possible. if you wish to implement
additional behaviour, do that on the application
level. otherwise, you'll end up in a situation where
some users cannot use the standard library...

(like I did only a few hours ago, trying to use sgmllib
to parse SGML data with case-sensitive tags. sigh...)

</F>
htmllib: CR in CDATA [ In reply to ]
> imho, it's pretty reasonable for an SGML parser to
> behave like an XML parser: split the document up
> into pieces, but pass them all to the application as
> untouched as possible. if you wish to implement
> additional behaviour, do that on the application
> level. otherwise, you'll end up in a situation where
> some users cannot use the standard library...

I see your point. I suppose in a perfect world, the standards would be
coherent enough that these kinds of choices wouldn't have to be made; this
shouldn't have to be additional behaviour.

Cheers,
htmllib: CR in CDATA [ In reply to ]
Mark Nottingham wrote:
> > imho, it's pretty reasonable for an SGML parser to
> > behave like an XML parser: split the document up
> > into pieces, but pass them all to the application as
> > untouched as possible. if you wish to implement
> > additional behaviour, do that on the application
> > level. otherwise, you'll end up in a situation where
> > some users cannot use the standard library...
>
> I see your point. I suppose in a perfect world, the standards would be
> coherent enough that these kinds of choices wouldn't have to be made; this
> shouldn't have to be additional behaviour.

well, I guess this is one of the main reasons why
XML was invented...

</F>