Mailing List Archive

some notes on html character entities
Some notes.

1.
From the English database dump (cur only),
everything that looked like a character entity was extracted.

2.
A lot of them were malformed: they didn't end with ;, but with <, a space,
or something like that.

Here are the results from some Perl scripting:

These were considered "html entities":
547567 &\S+?; &#\d+;? &#[xX][0-9a-fA-F]+;?

Other matches:
574064 &\S+?;?
516728 &\S+?; &#\d+; &#[xX][0-9a-fA-F]+;
29547 &nbsp
23015 &nbsp;
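
The counting above can be approximated with a short script. This is only a
sketch in Python, not the Perl that produced the numbers: the pattern is a
tighter variant of the ones listed (names are limited to letters and digits),
and the names CANDIDATE and count_candidates are made up for illustration.

```python
import re
from collections import Counter

# Assumed reading of the patterns above: a candidate is a hex reference,
# a decimal reference, or a named reference, with or without the closing ';'.
CANDIDATE = re.compile(r"&(?:#[xX][0-9a-fA-F]+|#\d+|[A-Za-z][A-Za-z0-9]*);?")

def count_candidates(text):
    """Count entity-like matches, split by whether they end with ';'."""
    counts = Counter()
    for match in CANDIDATE.finditer(text):
        ended = match.group(0).endswith(";")
        counts["ended" if ended else "unended"] += 1
    return counts

sample = "a&nbsp;b &nbsp c &#233; &#233< &#xE9; x&y"
print(count_candidates(sample))
```

On the sample string this finds three correctly ended candidates and three
that end with a space, a <, or the end of the text.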

3.
Of that 547567:
823 hex refs (all correctly ended)
134505 decimal refs (103661 ended correctly - only 77%)
412239 other (incorrectly ended already excluded)
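
A sketch of how that split could be made, plus a check of the arithmetic.
The function name classify is hypothetical; the three figures are the ones
quoted above.

```python
import re

HEX = re.compile(r"&#[xX][0-9a-fA-F]+;?")
DEC = re.compile(r"&#\d+;?")

def classify(candidate):
    """Sort one entity-like string into 'hex', 'decimal' or 'other'."""
    if HEX.fullmatch(candidate):
        return "hex"
    if DEC.fullmatch(candidate):
        return "decimal"
    return "other"

# The three groups add up to the total reported above.
total = 823 + 134505 + 412239
print(total)  # 547567

# Share of decimal refs that were ended correctly, in percent.
print(round(100 * 103661 / 134505))  # 77
```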

4.
After hex->dec conversion:
32701 unique entities
32415 unique numerical entities
286 unique named entities (on a second look, it seems that a lot of
these are things like the CGI part of URLs
etc. and not real HTML entities)
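
The hex->dec conversion matters because the same character can be written in
several ways. A minimal sketch, assuming a made-up helper called normalize
that maps every numeric reference to its decimal code point and leaves names
alone:

```python
def normalize(entity):
    """Return a canonical key: '#<decimal>' for numeric refs, else the name."""
    body = entity.lstrip("&").rstrip(";")
    if body[:2].lower() == "#x":
        return "#" + str(int(body[2:], 16))   # hex -> decimal
    if body.startswith("#"):
        return "#" + str(int(body[1:]))       # also drops leading zeros
    return body

found = ["&#xE9;", "&#233;", "&#0233;", "&eacute;", "&sup2;", "&sup2;"]
unique = {normalize(e) for e in found}
print(len(unique))  # 3
```

Here the three numeric spellings collapse into one entry, so the six matches
count as three unique entities.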

5.
30 most popular named entities:
294565 sup2
48034 deg
23015 nbsp
9558 middot
2449 eacute
2092 amp
1738 gt
1480 lt
1474 times
1417 radic
1273 quot
1210 mdash
909 ouml
759 uuml
743 alpha
708 rarr
651 lambda
629 aacute
615 pi
612 phi
511 epsilon
510 mu
508 egrave
455 iacute
409 ndash
396 oacute
395 omega
390 le
368 gamma
357 sigma

6.
XHTML absolutely can't contain malformed &-entities: the page is then no
longer well-formed XML, and it won't even display.
So if we want to move to XHTML and have goodies like MathML,
we must make the Wikipedia parser understand them.
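
To illustrate the point, here is a rough sketch of what "understanding" them
could mean before output goes to an XML parser. It is not how the Wikipedia
parser works; make_xml_safe is a hypothetical name, and the policy (turn known
HTML names into numeric refs, escape everything else) is just one assumption.
It does not check that numeric refs point at characters XML allows.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
# A '&', optionally followed by a correctly ended reference.
REFERENCE = re.compile(r"&(#[xX][0-9a-fA-F]+;|#\d+;|[A-Za-z][A-Za-z0-9]*;)?")

def make_xml_safe(text):
    def fix(match):
        ref = match.group(1)
        if ref is None:
            return "&amp;"                    # stray '&', e.g. "&nbsp " or "&a=b"
        if ref.startswith("#"):
            return match.group(0)             # numeric ref, kept as is
        name = ref[:-1]
        if name in XML_PREDEFINED:
            return match.group(0)             # defined by XML itself
        if name in name2codepoint:
            return "&#%d;" % name2codepoint[name]  # known HTML name -> numeric
        return "&amp;" + ref                  # unknown name, shown literally
    return REFERENCE.sub(fix, text)

raw = "a &nbsp b&nbsp;c &lt; d&e=f"
try:
    ET.fromstring("<p>" + raw + "</p>")
except ET.ParseError:
    print("raw text is rejected by the XML parser")

fixed = ET.fromstring("<p>" + make_xml_safe(raw) + "</p>")
print(repr(fixed.text))
```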

7.
Search would benefit a lot from replacing HTML entities with the proper
characters in the text that gets searched.
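
For example, in Python the standard html.unescape function does this
replacement, so a search index could be built from the decoded text. The
wrapper name searchable is made up; this is a sketch of the idea, not a
description of the existing search code.

```python
import html

def searchable(text):
    """Replace named and numeric character references with real characters."""
    return html.unescape(text)

print(searchable("caf&eacute; 20&deg;C m&sup2; &#233; &#xE9;"))
# café 20°C m² é é
```

A query for "café" then matches regardless of whether the article spells the
last letter as &eacute;, &#233; or &#xE9;.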

8.
If we want to add an option of generating PNGs for non-Latin characters,
then we must parse them.

9.
We may want small inline images, like the ones Sensei's Library generates
from W1, B3, etc. Of course we can't turn every W1 into an image,
but using &W1; would do fine. They will be needed if we ever add support
for game diagrams. Using a lot of [[Image:w1.png]] just doesn't seem right.
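
A sketch of how such stone entities might be expanded once the parser handles
&-entities. The pattern, the function name expand_stones and the lowercase
file naming are all assumptions, modelled on the [[Image:w1.png]] example.

```python
import re

# Hypothetical stone entity: W or B, a move number, and a closing ';'.
STONE = re.compile(r"&([WB])(\d+);")

def expand_stones(text):
    """Turn &W1; / &B3; into image links; plain 'W1' is left alone."""
    def image(match):
        return "[[Image:%s%s.png]]" % (match.group(1).lower(), match.group(2))
    return STONE.sub(image, text)

print(expand_stones("&W1; answers &B3;, but W1 stays text"))
# [[Image:w1.png]] answers [[Image:b3.png]], but W1 stays text
```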

10. Summary note:
We need to make the parser understand &-entities.
It's impossible to ignore the problem for much longer.

If you create a new parser for Wikipedia, please consider this issue.