Mailing List Archive

Fun with URL encoding
I finally got around to ferreting out the URL-encoding problem, which
produced some URLs that were encoded correctly, but others that
actually encoded the URL-encoded form. For instance, for a page titled
'Anátomy?' we might see hrefs that are incorrectly double-encoded like
these:

http://www.piclab.com/wikitest/wiki.phtml?title=An%25E1tomy%253F
http://www.piclab.com/wikitest/wiki.phtml?title=An%25E1tomy%253F&action=edit
http://www.piclab.com/wikitest/wiki.phtml?title=An%25E1tomy%253F&action=history
http://www.piclab.com/wikitest/wiki.phtml?title=Talk:An%25E1tomy%253F&action=edit

as well as correct ones like:
http://www.piclab.com/wikitest/wiki.phtml?title=Special:Userlogin&returnto=An%E1tomy%3F
http://www.piclab.com/wikitest/wiki.phtml?title=Special:Whatlinkshere&target=An%E1tomy%3F
http://www.piclab.com/wikitest/wiki.phtml?title=Special:Recentchangeslinked&target=An%E1tomy%3F

(Note that in the HTML source of the page, where I copied these from,
the &s in the URLs are written as &amp;, as they must be to be legal
HTML.)
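
To illustrate the escaping the note refers to, here is a minimal Python sketch (not the wiki's own PHP code) that writes one of the URLs above into an href attribute. The link text is made up for the example.

```python
from html import escape

# One of the correctly encoded URLs listed above.
url = ("http://www.piclab.com/wikitest/wiki.phtml"
       "?title=Special:Userlogin&returnto=An%E1tomy%3F")

# Inside an HTML attribute a bare & has to be written as &amp;.
# The percent-escapes are left alone: that is a separate layer of encoding.
href = escape(url, quote=True)
print(f'<a href="{href}">log in</a>')
```

The browser undoes this HTML-level escaping before it requests the URL, so it never interacts with the percent-encoding discussed in the rest of this message.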

A hackish redundant urldecode() in Title::newFromURL() was presumably
added to catch the first case. (PHP decodes URL variables before we get
them, so it's not necessary on correctly-encoded URLs.) I'd prefer to
remove it, but double-encoded URLs have been polluting the search
engines for some time and we have to retain compatibility.
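
A rough sketch of why one extra decoding pass keeps the old double-encoded links working, written in Python rather than the wiki's PHP. The function name tolerant_title is hypothetical, and the first unquote() call only stands in for the decoding the server-side framework does before our code runs.

```python
from urllib.parse import unquote


def tolerant_title(value: str) -> str:
    """Apply one extra percent-decoding pass to an already-decoded value."""
    return unquote(value, encoding="latin-1")


# What the framework would hand us after its own single decoding pass:
from_correct_url = unquote("An%E1tomy%3F", encoding="latin-1")
from_double_url = unquote("An%25E1tomy%253F", encoding="latin-1")

print(repr(from_correct_url))                 # already the real title
print(repr(from_double_url))                  # still percent-encoded once
print(repr(tolerant_title(from_double_url)))  # the extra pass recovers it
```

The extra pass does nothing to a title with no percent sign in it, which is why it is harmless on correctly encoded URLs. The cost is that a title containing a literal % followed by two hex digits would be altered by it.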

The culprit was wfLocalUrl(), which takes two parameters, a wiki page
title and a section for additional URL bits; it URL-encodes the title,
then tacks both onto the server's hostname... but the first one has
already been encoded by Title::getPrefixedURL(), so we get the fubar'd
double-encoding above.
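
The double-encoding can be reproduced in a few lines. This is a Python sketch, not the actual PHP: get_prefixed_url, local_url_buggy and local_url_fixed are stand-ins whose names and signatures are assumptions, and Latin-1 is assumed as the wiki's encoding.

```python
from urllib.parse import quote

BASE = "http://www.piclab.com/wikitest/wiki.phtml?title="


def get_prefixed_url(title: str) -> str:
    """Stand-in for Title::getPrefixedURL(): percent-encode the title once."""
    return quote(title, safe=":", encoding="latin-1")


def local_url_buggy(encoded_title: str, extra: str = "") -> str:
    """Encodes its argument again, so every % becomes %25."""
    url = BASE + quote(encoded_title, safe=":", encoding="latin-1")
    return url + ("&" + extra if extra else "")


def local_url_fixed(encoded_title: str, extra: str = "") -> str:
    """Trusts that the title arrives already encoded."""
    url = BASE + encoded_title
    return url + ("&" + extra if extra else "")


title = get_prefixed_url("Anátomy?")
print(local_url_buggy(title, "action=edit"))
print(local_url_fixed(title, "action=edit"))
```

The extra bits are appended untouched in both versions, which is why parameters carried there were never affected.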

The encoding stays correct in target=, returnto=, etc. because the
additional URL bits aren't encoded a second time (the &s can't be
URL-encoded or they lose their meaning).

I've removed the redundant encoding from wfLocalUrl(); I haven't come
across another mis-encoded URL since, and I've been trying.


Additionally, I've added a check to Title::newFromURL() that examines
the character encoding of links coming in from the outside; on a
latin-1 wiki, UTF-8 encoded links are detected and converted to latin-1,
and on a UTF-8 wiki, latin-1 links are detected and converted to UTF-8.
(The check is done in Language::checkTitleEncoding() and can be
customized by language; I've set up Polish to detect Latin-2 and
Esperanto to detect X-surrogates, so they'll be able to retain
compatibility with existing links once converted.)
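
One way such a check could work, sketched in Python. This is an assumption about the general approach, not the code in Language::checkTitleEncoding(); the function name and the two supported encodings are chosen for the example. It relies on the fact that Latin-1 text with accented characters is rarely well-formed UTF-8.

```python
def normalize_title_bytes(raw: bytes, wiki_encoding: str) -> bytes:
    """Re-encode raw title bytes into the wiki's encoding if they look foreign."""
    try:
        text = raw.decode("utf-8")  # strict: raises on malformed sequences
        valid_utf8 = True
    except UnicodeDecodeError:
        text = ""
        valid_utf8 = False

    if wiki_encoding == "latin-1":
        if not valid_utf8:
            return raw  # ordinary Latin-1 input, nothing to do
        try:
            return text.encode("latin-1")  # UTF-8 link into a Latin-1 wiki
        except UnicodeEncodeError:
            return raw  # characters Latin-1 cannot represent; leave as-is
    if wiki_encoding == "utf-8":
        if valid_utf8:
            return raw  # already UTF-8 (pure ASCII included)
        return raw.decode("latin-1").encode("utf-8")  # Latin-1 link
    raise ValueError(f"unsupported wiki encoding: {wiki_encoding}")


print(normalize_title_bytes(b"An\xc3\xa1tomy?", "latin-1"))
print(normalize_title_bytes(b"An\xe1tomy?", "utf-8"))
```

A per-language variant would swap the Latin-1 fallback for whatever legacy encoding that wiki used.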

This is needed for a few reasons:

* Some browsers (notably Internet Explorer) send URLs encoded in UTF-8
if you type them into the URL bar or follow a link that's not
URL-encoded. Thus we were getting mis-encoded titles from time to time
when someone typed a title with accented chars directly into the URL bar
or followed links from differently-encoded external sites.

* This should help with linking between the various language wikis, with
less need for manually adding URL-encoding to interlanguage links that
cross encodings.

* As noted above, compatibility with old URLs on some wikis.

Note that _theoretically_ a legal UTF-8 sequence could also be legal ISO
8859-1. (But not bloody likely -- an uppercase accented letter followed
by a single high punctuation mark or symbol, or a lowercase accented
letter followed by two or three high punctuation marks or symbols.)
Title URLs aren't checked or converted if the referer matches our
server, so one could still work with such a page; just set up a redirect
from the converted form for the benefit of outside links.
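
The referer exemption might look roughly like this in Python. The function name is hypothetical, and comparing only the host name is an assumption about what "matches our server" means.

```python
from typing import Optional
from urllib.parse import urlsplit


def came_from_our_server(referer: Optional[str], server_host: str) -> bool:
    """True when the Referer header names our own host."""
    if not referer:
        return False  # typed into the URL bar, or the header was stripped
    host = urlsplit(referer).hostname or ""
    return host.lower() == server_host.lower()


internal = "http://www.piclab.com/wikitest/wiki.phtml?title=Main_Page"
external = "http://example.org/links.html"

# Only titles arriving from elsewhere get the encoding check.
for referer in (internal, external, None):
    check = not came_from_our_server(referer, "www.piclab.com")
    print(referer, "-> check encoding:", check)
```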

-- brion vibber (brion @ pobox.com)
Re: Fun with URL encoding
"Brion VIBBER" wrote:

> Note that _theoretically_ a legal UTF-8 sequence could also be legal ISO
> 8859-1.

Really?
I thought that at the start of the second half of
ISO-8859-1 some numbers are reserved (control codes):
128 to 159, if I remember correctly. Those are the octets
of the form 100xxxxx, which can occur in UTF-8 (as the
second or a following octet of a UTF-8 encoded character).


Pauxlo
Re: Fun with URL encoding
Paul Ebermann wrote:
> "Brion VIBBER" wrote:
>>Note that _theoretically_ a legal UTF-8 sequence could also be legal ISO
>>8859-1.
>
> Really?
> I thought that at the start of the second half of
> ISO-8859-1 some numbers are reserved (control codes):
> 128 to 159, if I remember correctly. Those are the octets
> of the form 100xxxxx, which can occur in UTF-8 (as the
> second or a following octet of a UTF-8 encoded character).

Sure, but not all UTF-8 codes will find themselves in the reserved
range; if the tail byte(s) are in the form 101xxxxx they'll be in the
160-191 range, which is populated by various punctuation marks and
symbols. For instance:

Ã¡ -> á
0xC3 0xA1 -> 0x00E1
110(00011) 10(100001) -> 0000000011100001

Not a terribly likely sequence of bytes in Latin-1, but it's legal.
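
The same arithmetic can be checked in Python. This only illustrates the two-byte case from the example and is not a general UTF-8 decoder.

```python
data = bytes([0xC3, 0xA1])
lead, tail = data

# A two-byte UTF-8 sequence has the shape 110xxxxx 10xxxxxx.
assert lead >> 5 == 0b110
assert tail >> 6 == 0b10

# Glue the five payload bits of the lead byte to the six of the tail byte.
code_point = ((lead & 0b00011111) << 6) | (tail & 0b00111111)
print(hex(code_point), chr(code_point))

# The same two bytes decode without error under both encodings.
as_utf8 = data.decode("utf-8")      # one character
as_latin1 = data.decode("latin-1")  # two characters
print(repr(as_utf8), repr(as_latin1))

# The tail byte sits in 160-191, outside the 128-159 control range.
print(160 <= tail <= 191)
```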

-- brion vibber (brion @ pobox.com)
Re: Fun with URL encoding
"Brion VIBBER" wrote:

> Paul Ebermann wrote:
> > "Brion VIBBER" wrote:
> >>Note that _theoretically_ a legal UTF-8 sequence could also be legal ISO
> >>8859-1.
> >
> > Really?
[...]
>
> Sure, but not all UTF-8 codes will find themselves in the reserved
> range; [...]

I understood you to be saying "every legal UTF-8 sequence is
legal ISO-8859-1"; now I understand that you said "there
are legal UTF-8 sequences which are also legal ISO-8859-1".
Sorry for the confusion.

Pauxlo