Mailing List Archive: HTML munging code rewritten

Feb 12, 2002, 12:18 AM

Post #1 of 8

I've rewritten wikiPage::removeHTMLtags again. (Checked into CVS, diff:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/wikipedia/phpwiki/fpw/wikiPage.php.diff?r1=1.60&r2=1.61)
Exciting new features:

* Removes unwanted tag attributes, such as the scripting attributes
(onclick, onmouseout, etc.), which can be used to create fake links
or automatically redirect the browser to another web site (see the
previous version of the [[Goatse.cx]] article for an example).

* Makes a more serious attempt to fix mismatched open/close tag pairs.
Relatedly, makes some attempt to normalize tables, i.e. <tr> is not
allowed outside of <table>, etc.

* Nested tables now work.

The function feels more weighty than it ought to be, but it works on
everything I've tried throwing at it so far, which is an improvement
over the previous versions.
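
A rough illustration of the attribute-stripping idea, written in Python rather than the PHP of wikiPage.php. It is only a sketch: the names ALLOWED_ATTRS and strip_attributes, and the particular attributes on the list, are assumptions made for this example and are not taken from the checked-in code.

```python
import re

# Hypothetical whitelist; the list actually used in wikiPage.php may differ.
ALLOWED_ATTRS = {"align", "border", "cellpadding", "cellspacing",
                 "colspan", "rowspan", "width"}

# Matches an opening or closing tag. Limitation of this sketch: an attribute
# value containing "<" or ">" will not be matched as part of the tag.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^<>]*)>")

# Matches name=value pairs with double-quoted, single-quoted or bare values.
ATTR_RE = re.compile(
    r"""([a-zA-Z_:][-\w:.]*)\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+)"""
)


def strip_attributes(text: str) -> str:
    """Rebuild every tag, keeping only whitelisted attributes."""

    def clean(match):
        slash, name, rest = match.groups()
        name = name.lower()
        if slash:
            # Closing tags never carry attributes.
            return f"</{name}>"
        kept = [
            f"{attr.lower()}={value}"
            for attr, value in ATTR_RE.findall(rest)
            if attr.lower() in ALLOWED_ATTRS
        ]
        attrs = " " + " ".join(kept) if kept else ""
        return f"<{name}{attrs}>"

    return TAG_RE.sub(clean, text)
```

Because every tag is rebuilt from the whitelist instead of having known-bad attributes deleted, a scripting attribute that nobody thought to blacklist is dropped as well.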

I also threw in fixes for:
* Character entities in <pre> sections
* ISBN numbers with letters in them
* == Section headers == at the edges of HTML tags

-- brion vibber (brion @ pobox.com)

Re: HTML munging code rewritten

Feb 12, 2002, 2:57 AM

Post #2 of 8

From: "Brion Vibber" <brion@pobox.com>
>
> * Nested tables now work.
>
> * == Section headers == at the edges of HTML tags

Do we want this to work? I'd prefer it if we took a conservative stance
towards the mark-up: we don't allow it unless it is needed. This is because
once you allow something it will be used, and then it will be hard to
disallow. This might one day become important when we try to optimize the
parser. So I feel we should discuss extensions of mark-up publicly before
implementing them.

As always we should strive for correctness and simplicity. The two go hand
in hand.

Sorry if I sound too critical, but I have strong feelings about this.

-- Jan Hidders

Re: HTML munging code rewritten

Feb 12, 2002, 3:47 AM

Post #3 of 8

On Tue, 2002-02-12 at 01:57, Jan Hidders wrote:
> From: "Brion Vibber" <brion@pobox.com>
> >
> > * Nested tables now work.
> >
> > * == Section headers == at the edges of HTML tags
>
> Do we want this to work? I'd prefer it if we took a conservative stance
> towards the mark-up: we don't allow it unless it is needed. This is because
> once you allow something it will be used, and then it will be hard to
> disallow. This might one day become important when we try to optimize the
> parser. So I feel we should discuss extensions of mark-up publicly before
> implementing them.

The section headers bit was a bugfix to replicate the previous behavior
of usemodwiki. Not strictly necessary though, I suppose. I'll take it
back out.

Tables are certainly in very active use as it is, and I see no reason to
break nested tables, which at least someone seems sufficiently
interested in to write a bug report about them not working. Nonetheless,
making them work was simply a side effect of my changes to remove the
dangerous scripting attributes, which previous too-simple versions let
pass through.
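
One way correct nesting can fall out of a general tag-matching fix is to track open tags on a stack. The Python sketch below shows that technique only; it is not a port of the PHP function, and the names balance_tags and VOID_TAGS are invented for the example.

```python
import re

TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^<>]*>")
VOID_TAGS = {"br", "hr"}  # tags that never take a closing tag


def balance_tags(text: str) -> str:
    """Close unclosed tags and drop closing tags that have no opener."""
    stack = []  # names of currently open tags, innermost last
    out = []
    pos = 0
    for m in TAG_RE.finditer(text):
        out.append(text[pos:m.start()])  # plain text before this tag
        pos = m.end()
        closing, name = m.group(1), m.group(2).lower()
        if name in VOID_TAGS:
            if not closing:
                out.append(m.group(0))
        elif not closing:
            stack.append(name)
            out.append(m.group(0))
        elif name in stack:
            # Close any inner tags left open, then the tag itself.
            while stack:
                top = stack.pop()
                out.append(f"</{top}>")
                if top == name:
                    break
        # Otherwise: a stray closing tag with no opener, which is dropped.
    out.append(text[pos:])
    # Close whatever is still open at the end of the text.
    out.extend(f"</{name}>" for name in reversed(stack))
    return "".join(out)
```

With a stack, an inner <table> is simply another entry above the outer one, so a well-formed nested table passes through unchanged while a dangling <i> inside <b> is closed before the </b>.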

Now, we *know* we don't want scripting. What *do* we want? Good
question. Here's what was allowed under usemodwiki:

Font effects: b i u font big small sub sup cite code em s strike
strong tt var
Paragraph formatting: br p hr pre center blockquote div
Headings: h1 h2 h3 h4 h5 h6
Lists: ol ul dl li dt dd
Tables: table caption td th tr

all with any attribute you cared to stuff in, including evil JavaScript
tricks. The first version of the PHP script that went online allowed
*everything* except a, script, title, html, body, and header (including
malicious scripting attributes). The next version replicated the
usemodwiki behavior exactly, broken tables and scripting attributes and
all.
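
For illustration, the usemodwiki list above can be expressed as a tag whitelist. This is a Python sketch, not the PHP in CVS; the function name escape_disallowed_tags is made up, and escaping disallowed tags (rather than deleting them) is an assumption of the example.

```python
import html
import re

# Tag names copied from the usemodwiki list quoted above.
ALLOWED_TAGS = set("""
    b i u font big small sub sup cite code em s strike strong tt var
    br p hr pre center blockquote div
    h1 h2 h3 h4 h5 h6
    ol ul dl li dt dd
    table caption td th tr
""".split())

TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^<>]*)>")


def escape_disallowed_tags(text: str) -> str:
    """Leave whitelisted tags alone; render every other tag as literal text."""

    def check(match):
        name = match.group(2).lower()
        if name in ALLOWED_TAGS:
            return match.group(0)
        # Anything not on the list is shown as text instead of being
        # interpreted by the browser.
        return html.escape(match.group(0))

    return TAG_RE.sub(check, text)
```

Note that this step alone leaves attributes untouched, which is exactly the hole described above: an allowed tag can still carry a scripting attribute unless attributes are filtered separately.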

Too much? Not enough? For the time being I've trimmed the allowed tags
back to the usemodwiki list (I'd slipped in some additional table tags
and some semantic markup tags that could theoretically be used to help
e.g. voice-based browsers for the blind render things more clearly), but
I'm sure as heck keeping those scripts out, and I can't imagine why we'd
want to break perfectly well-formed tables.

> As always we should strive for correctness and simplicity. The two go hand
> in hand.
>
> Sorry if I sound too critical, but I have strong feelings about this.

Agreed, which is why I'm not entirely happy with my current
implementation of the function, which is too complicated in places. But
the previous one, while relatively simple, was definitely not correct.

-- brion vibber (brion @ pobox.com)

Re: HTML munging code rewritten

Feb 12, 2002, 4:43 AM

Post #4 of 8

From: "Brion Vibber" <brion@pobox.com>
> On Tue, 2002-02-12 at 01:57, Jan Hidders wrote:
>
> The section headers bit was a bugfix to replicate the previous behavior
> of usemodwiki. Not strictly necessary though, I suppose. I'll take it
> back out.

Thank you.

> Tables are certainly in very active use as it is, and I see no reason to
> break nested tables, which at least someone seems sufficiently
> interested in to write a bug report about them not working.

Ah, sorry, I didn't know they had worked before. In that case they should
now also work.

> Now, we *know* we don't want scripting. What *do* we want?

In some sense that is not really up to us, but should be decided by
Wikipedia management and the rest of the gang in Wikipedia-l. However, what
is clear IMO is that the tags and attributes that worked under usemod and
have been used in useful ways should now also work. I would suggest that
when we are in doubt about something, we disallow it, wait until people
start to complain, and then ask around for opinions.

If we want to change or restrict the markup more drastically then this
probably needs to be discussed first publicly. (I have a rather extreme
opinion about this, but that is not relevant here.) By the way I really like
the fact that you have a list of allowed tags/attributes and don't allow
anything else. Maybe we should even have separate lists of allowed
attributes per tag, but that may be overkill.
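
To make the per-tag idea concrete, here is a small Python sketch of such a lookup. The tag and attribute choices are purely illustrative, and GLOBAL_ATTRS, ATTRS_BY_TAG and attribute_allowed are hypothetical names, not anything in the current script.

```python
# Hypothetical per-tag whitelist; the entries are examples only.
GLOBAL_ATTRS = {"align"}  # allowed on every permitted tag
ATTRS_BY_TAG = {
    "table": {"border", "cellpadding", "cellspacing", "width"},
    "td": {"colspan", "rowspan", "width"},
    "th": {"colspan", "rowspan", "width"},
    "font": {"color", "size"},
}


def attribute_allowed(tag: str, attr: str) -> bool:
    """Return True if attr may appear on tag under the lists above."""
    tag, attr = tag.lower(), attr.lower()
    return attr in GLOBAL_ATTRS or attr in ATTRS_BY_TAG.get(tag, set())
```

The cost is one more table to maintain; the benefit is that colspan on a <b> tag, for instance, is rejected instead of silently passed through.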

> > Sorry if I sound too critical, but I have strong feelings about this.
>
> Agreed, which is why I'm not entirely happy with my current
> implementation of the function, which is too complicated in places. But
> the previous one, while relatively simple, was definitely not correct.

Agreed, and I do believe that what you are doing at the moment will turn out
to be very important in the long run. For example, when we are going to
optimize the parser, export to XML or adapt the mark-up, then it will be
very important, if not crucial, that the contents of Wikipedia satisfy a
certain strict syntax. So the more the script safeguards this syntax, the
better it is.

-- Jan Hidders

Re: HTML munging code rewritten

lars at aronsson

Feb 12, 2002, 9:21 AM

Post #5 of 8

> The section headers bit was a bugfix to replicate the previous behavior
> of usemodwiki. Not strictly necessary though, I suppose. I'll take it
> back out.

Just an idea: Section headers can be quite useful. During the first
pass through the text (sub WikiToHTML in UseModWiki) you would store
the resulting text in a string variable and all found section headers
in a separate array. In the output HTML you would first present a
table of contents, i.e. a list of <a href="#header"> links, each
pointing down to the matching section header. This would allow
really long articles as a substitute for subpages, e.g. the subpages
Denmark/History Denmark/Government Denmark/Geography would instead be
subheaders in the same long page. You would want to support
[[Denmark#Geography]] type of links from other pages.
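
The two-step idea above (collect headers during the pass, then emit a table of contents first) might look roughly like this in Python. It is a sketch under assumptions: the "== Title ==" syntax is mapped to <h2> and so on by counting equals signs, anchors are the title with spaces turned into underscores, and render_with_toc is a made-up name, not a function in UseModWiki or the PHP script.

```python
import html
import re

# A header line: the same run of 2 to 6 "=" on both sides of the title.
HEADER_RE = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)


def render_with_toc(wikitext: str) -> str:
    """Turn header lines into anchored headings and prepend a linked list."""
    headers = []  # (anchor, title) pairs collected during the pass

    def replace(match):
        level = len(match.group(1))
        title = match.group(2)
        anchor = title.replace(" ", "_")
        headers.append((anchor, title))
        return (f'<h{level}><a name="{html.escape(anchor)}">'
                f'{html.escape(title)}</a></h{level}>')

    body = HEADER_RE.sub(replace, wikitext)
    items = "".join(
        f'<li><a href="#{html.escape(anchor)}">{html.escape(title)}</a></li>'
        for anchor, title in headers
    )
    toc = f"<ul>{items}</ul>\n" if headers else ""
    return toc + body
```

With anchors derived from the title, a link such as [[Denmark#Geography]] would only need to append "#Geography" to the page URL.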

Look at the user interface of encarta.msn.com.

--
Lars Aronsson
<lars@aronsson.se>
tel +46-70-7891609
http://aronsson.se/ http://elektrosmog.nu/ http://susning.nu/

Re: HTML munging code rewritten

lsanger at nupedia

Feb 12, 2002, 11:22 AM

Post #6 of 8

The notion of "section headers" sounds like one that should be discussed
on wikipedia-l before considering implementing it in software, I think.

The programmers, as much as we love 'em, should not be solely in charge of
specifying the system requirements...

Larry

On Tue, 12 Feb 2002, Lars Aronsson wrote:

> > The section headers bit was a bugfix to replicate the previous behavior
> > of usemodwiki. Not strictly necessary though, I suppose. I'll take it
> > back out.
>
> Just an idea: Section headers can be quite useful. During the first
> pass through the text (sub WikiToHTML in UseModWiki) you would store
> the resulting text in a string variable and all found section headers
> in a separate array. In the output HTML you would first present a
> table of contents, i.e. a list of <a href="#header"> links, each
> pointing down to the matching section header. This would allow
> really long articles as a substitute for subpages, e.g. the subpages
> Denmark/History Denmark/Government Denmark/Geography would instead be
> subheaders in the same long page. You would want to support
> [[Denmark#Geography]] type of links from other pages.
>
> Look at the user interface of encarta.msn.com.

Re: HTML munging code rewritten

Feb 12, 2002, 4:49 PM

Post #7 of 8

Brion,

I just uploaded from CVS and now the <B> and <I> tags don't seem to work
anymore. Frankly, I'm quite happy with that situation, but I suspect some
other people won't be. :)

-- Jan Hidders

Re: HTML munging code rewritten

Feb 12, 2002, 5:25 PM

Post #8 of 8

On Tue, 2002-02-12 at 15:49, Jan Hidders wrote:
> Brion,
>
> I just uploaded from CVS and now the <B> and <I> tags don't seem to work
> anymore. Frankly, I'm quite happy with that situation, but I suspect some
> other people won't be. :)

Hmm, news to me, they work on my test server... Let me grab a fresh copy
of everything and try again...

Yeah, still works for me. Could you send an example of wikitext that
triggers the bug?

-- brion vibber (brion @ pobox.com)