Mailing List Archive

International upgrades
So, where do we stand on the issue of international upgrades?

I'd like to get back to these quickly, if possible. Starting with Esperanto, and then
Polish. And then probably Spanish, although of course we'll now need to co-ordinate with
the forpas forked group, so that we minimize the extent of the forkage in the hopes of
bringing things back together soon.

--Jimbo
Re: International upgrades
On Tue, 5 Mar 2002, Jimmy Wales wrote:
> So, where do we stand on the issue of international upgrades?
>
> I'd like to get back to these quickly, if possible. Starting with esperanto, and then
> polish. And then probably spanish, although of course we'll now need to co-ordinate with
> the forpas forked group, so that we minimize the extent of the forkage in the hopes of
> bringing things back together soon.

There was a discussion some days ago on how best to implement a more or
less character-set-independent underlying system, but it sort of died out
without any clear consensus. In particular, no response from lcrocker, who
made the initial claim that the present system of using language-tied
encodings was Wrong.

As it left off, Jan and I had more or less agreed on something like:
* Check the browser's Accept-charset header; if it indicates UTF-8 support, use
UTF-8. If not, fall back to the most likely encoding for that wiki's language (a
rough sketch of this negotiation follows after the list).
* Where necessary, convert characters into/out of HTML entities so that
non-Unicode browsers can safely handle all characters.
* Internally, non-ASCII characters will need to be escaped somehow in the
search index field to allow correct indexing.
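
For the first point, the negotiation could be as simple as the following rough
sketch (the function and variable names are made up for illustration, not
anything in the current code):

  // Decide which encoding to send to this client. $wikiCharset would be the
  // most likely legacy encoding for this wiki's language, e.g. ISO-8859-2
  // for Polish (illustrative only).
  function pickOutputCharset( $wikiCharset ) {
      $accept = getenv( 'HTTP_ACCEPT_CHARSET' );
      // If the browser says it can take UTF-8 (or takes anything), use UTF-8.
      if ( $accept && ( stristr( $accept, 'utf-8' ) || strstr( $accept, '*' ) ) ) {
          return 'UTF-8';
      }
      // Otherwise fall back to the wiki's native legacy encoding.
      return $wikiCharset;
  }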

Somewhat less solid was how to store the actual text internally: Lee
suggested Latin-1 (or ASCII?) with HTML entities, I suggested UTF-8.

* UTF-8 is more space- and bandwidth-efficient and doesn't require
outgoing transliteration for the many users with relatively current,
UTF-8-savvy browsers, but it needs to be translated into native code /
HTML entities for non-UTF-8-savvy browsers (very old ones, and Netscape 4,
which has very buggy Unicode support); a sketch of the entity fallback
follows below.

* ASCII+HTML entities won't require outgoing translation for
non-UTF-8-savvy browsers that nonetheless understand Unicode-numbered
character entities, but may not be much of an improvement for older
browsers that don't know that numeric character entities always refer to
Unicode code points, not to the current character set. Thus outgoing
translation to the browser's character set is recommended to be safe.
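
Either way, the entity fallback itself is cheap if the mbstring extension
happens to be compiled in; otherwise we'd have to roll it by hand. A sketch
only (utf8ToNumericEntities is a made-up name):

  // Turn non-ASCII characters of a UTF-8 string into &#nnn; references for
  // browsers that can't take raw UTF-8 but do understand numeric entities.
  function utf8ToNumericEntities( $text ) {
      if ( function_exists( 'mb_convert_encoding' ) ) {
          return mb_convert_encoding( $text, 'HTML-ENTITIES', 'UTF-8' );
      }
      // No mbstring available: leave the text alone rather than mangle it.
      return $text;
  }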

Incoming translation is always required, as edited text will come to us in
the character encoding used by the browser and may or may not have HTML
entities typed by the user mixed in.

The character set translation can probably be done mostly via PHP's iconv
support -- however, this is an optional component and must be enabled at
compile time (same as the annoying 4-letter minimum for search index
terms). Also, some slight customisation of the process is necessary for,
for instance, the Esperanto transliteration scheme (basically already in
place in $RecodeInput/$RecodeOutput).
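
For the plain charset conversions, that boils down to roughly this
(recodeInput here is only illustrative, not the existing hook):

  // Convert text coming in from the browser into our internal encoding,
  // e.g. ISO-8859-2 -> UTF-8 for a Polish user on an old browser.
  function recodeInput( $text, $fromCharset, $toCharset = 'UTF-8' ) {
      if ( $fromCharset == $toCharset ) {
          return $text;
      }
      // iconv() is only present if PHP was built --with-iconv.
      return iconv( $fromCharset, $toCharset, $text );
  }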

If there's some consensus on this, we can get crackin' and get this
implemented so the upgrades can proceed.

-- brion vibber (brion @ pobox.com)
Re: International upgrades
From: "Brion Vibber" <brion@pobox.com>
>
> Somewhat less solid was how to store the actual text internally: Lee
> suggested Latin-1 (or ASCII?) with HTML entities, I suggested UTF-8.

We cannot index UTF-8. But if we introduce a redundant indexable field where
all characters (even the ASCII ones) are represented by their Unicode
numbers, then we would have a way around the 4-letter indexing boundary and
the problem that you cannot index anything but letters. So in that case I
would vote for UTF-8, since that would probably be the most efficient anyway.
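
Roughly what I have in mind, as a sketch only (the names are made up, it
assumes PCRE was built with UTF-8 support, and it ignores anything outside
the 16-bit range):

  // Rewrite a UTF-8 string as runs of 'uXXXX' tokens, one per character, so
  // that the index column contains only [a-z0-9] and even short or accented
  // words become indexable by MySQL.
  function indexableForm( $text ) {
      preg_match_all( '/./us', $text, $m );   // split into UTF-8 characters
      $out = '';
      foreach ( $m[0] as $c ) {
          // Decode one UTF-8 character into its Unicode code point.
          $b = ord( $c[0] );
          if ( $b < 0x80 ) {
              $cp = $b;
          } elseif ( $b < 0xE0 ) {
              $cp = ( ( $b & 0x1F ) << 6 ) | ( ord( $c[1] ) & 0x3F );
          } else {
              $cp = ( ( $b & 0x0F ) << 12 ) | ( ( ord( $c[1] ) & 0x3F ) << 6 )
                  | ( ord( $c[2] ) & 0x3F );
          }
          if ( $cp <= 0x20 ) {
              $out .= ' ';                    // keep word boundaries as spaces
          } else {
              if ( $cp >= 0x41 && $cp <= 0x5A ) {
                  $cp += 0x20;                // fold ASCII case; other scripts
              }                               // would need a lookup table
              $out .= sprintf( 'u%04x', $cp );
          }
      }
      return $out;
  }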

> If there's some consensus on this, we can get crackin' and get this
> implemented so the upgrades can proceed.

Er, I would suggest that before coding we set up a document that describes
what the consensus is. It should say which encodings are used for what, when,
and where. It should also say which functions take care of this encoding. This
would also include the encoding used in URLs.

-- Jan Hidders
Re: International upgrades
On ĵaŭ, 2002-03-07 at 10:50, Jan Hidders wrote:
> From: "Brion Vibber" <brion@pobox.com>
> >
> > Somewhat less solid was how to store the actual text internally: Lee
> > suggested Latin-1 (or ASCII?) with HTML entities, I suggested UTF-8.
>
> We cannot index UTF-8.

Nor HTML entities, naturally.

> But if we introduce a redundant indexable field where
> all characters (even the ASCII ones) are represented with their unicode
> number, then we would have a way around the 4-letter indexing boundary and
> the problem that you cannot index anything but letters. So in that case I
> would vote for UTF-8 since that would probably be the most efficient anyway.

Hmm, that's an idea.

[Incidentally: if we are to switch to UTF-8, we'll obviously want to do
something about the fact that the current English wikipedia uses
ISO-8859-1 high characters extensively. These pages can be converted
fairly easily, either as a one-time search & replace or as a
normalise-an-old-page-when-we-first-load-it thing.]
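
The one-time conversion really is only a few lines, since utf8_encode() is
exactly the ISO-8859-1 -> UTF-8 converter we need. Something like this, where
the table/column names and connection details are just placeholders for
whatever the real schema uses:

  mysql_connect( 'localhost', 'wikiuser', 'password' );   // placeholders
  mysql_select_db( 'wikidb' );
  // One-off pass: re-encode every stored page text from ISO-8859-1 to UTF-8.
  $res = mysql_query( 'SELECT id, text FROM page' );
  while ( $row = mysql_fetch_object( $res ) ) {
      $new = utf8_encode( $row->text );   // Latin-1 -> UTF-8, built into PHP
      mysql_query( "UPDATE page SET text='" . mysql_escape_string( $new ) .
                   "' WHERE id=" . intval( $row->id ) );
  }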

> > If there's some consensus on this, we can get crackin' and get this
> > implemented so the upgrades can proceed.
>
> Er, I would suggest that before coding we set up a document that describes
> what the consensus is. It should say what codings are used for what, when and
> where. It should also say which functions take care of this coding. This
> would also include the coding used in URLs.

Well, I was hoping there would be some evidence of some kind of
consensus before anyone goes writing documents or code! :)

We can probably add this stuff to
http://meta.wikipedia.com/wiki.phtml?title=Proposed_Wikipedia_policy_on_foreign_characters

The only sane format for URLs would be URL-encoded UTF-8. This is the
recommended norm (http://www.w3.org/International/O-URL-and-ident.html),
it is the most future-proof (can you imagine if we kept all our URLs in
EBCDIC instead of ASCII because everybody still had links & bookmarks
from their old IBM mainframe days?), and it allows links across
languages to be consistently represented.
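
In PHP that's nothing more than rawurlencode() over the UTF-8 title (the
wiki.phtml path below is just for illustration), and PHP already undoes the
escaping for us when the request comes back in:

  // Every non-ASCII byte of the UTF-8 title becomes a %XX escape, so the
  // URL itself stays plain ASCII: "ĵaŭdo" -> "%C4%B5a%C5%ADdo".
  $title = 'ĵaŭdo';   // already UTF-8 internally
  $url = '/wiki.phtml?title=' . rawurlencode( $title );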

-- brion vibber (brion @ pobox.com)
Re: International upgrades
From: "Brion L. VIBBER" <brion@pobox.com>
> On ĵaŭ, 2002-03-07 at 10:50, Jan Hidders wrote:
> > From: "Brion Vibber" <brion@pobox.com>
> > >
> > > Somewhat less solid was how to store the actual text internally: Lee
> > > suggested Latin-1 (or ASCII?) with HTML entities, I suggested UTF-8.
> >
> > We cannot index UTF-8.
>
> Nor HTML entities, naturally.

That's not so obvious. If you replace the &, # and ; with something that is
indexed, like ' and _, then indexing would work. But like I said, I favour
UTF-8 anyway, and we can solve the indexing problem with an extra column
where everything is more or less represented as entities anyway. (We could
even replace uppercase with lowercase there and have case-insensitive
indexing.)
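
For the entity variant, something as blunt as this would already do it (a
sketch only, assuming the entities are numeric):

  // "v&#246;gel" -> "v_246_gel": the entity punctuation is mapped onto
  // characters that the indexer treats as part of a word.
  $entityText = 'v&#246;gel';
  $indexable = strtr( $entityText, array( '&#' => '_', ';' => '_' ) );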

> [Incidentally: if we are to switch to UTF-8, we'll obviously want to do
> something about the fact that the current English wikipedia uses
> ISO-8859-1 high characters extensively. These pages can be converted
> fairly easily, either as a one time search & replace or as a
> normalise-an-old-page-when-we-first-load-it thing.]

I like the one-time-search-and-replace approach. No need to complicate
and/or slow down the run-time code with checks and translation code.

> Well, I was hoping there would be some evidence of some kind of
> consensus before anyone goes writing documents or code! :)

Of course. :-) I am still wondering what our great leader thinks of all
this.

> We can probably add this stuff to
>
> http://meta.wikipedia.com/wiki.phtml?title=Proposed_Wikipedia_policy_on_foreign_characters

This page is more about the local policy on the English Wikipedia. I would
like to see a page with a title like "A common architecture for Wikipedias
in all languages". I'd start writing it, but work is really busy at the
moment.

> The only sane format for URLs would be url-encoded UTF-8. This is the
> recommended norm (http://www.w3.org/International/O-URL-and-ident.html),
> it is the most future-proof (can you imagine if we kept all our URLs in
> EBCDIC instead of ASCII because everybody still had links & bookmarks
> from their old IBM mainframe days?), and it allows links across
> languages to be consistently represented.

Completely agreed.

-- Jan Hidders
Re: International upgrades
On ven, 2002-03-08 at 01:10, Jan Hidders wrote:
> From: "Brion L. VIBBER" <brion@pobox.com>
> > On ĵaŭ, 2002-03-07 at 10:50, Jan Hidders wrote:
> > > From: "Brion Vibber" <brion@pobox.com>
> > > >
> > > > Somewhat less solid was how to store the actual text internally: Lee
> > > > suggested Latin-1 (or ASCII?) with HTML entities, I suggested UTF-8.
> > >
> > > We cannot index UTF-8.
> >
> > Nor HTML entities, naturally.
>
> That's not so obvious. If you replace the &, # and ; with something that is
> indexed like ' and _ then indexing would work.

Exactly what I'm saying; in both cases we can't index words including
non-ASCII characters unless we munge the text somehow.

Which reminds me, this will not work well for Japanese and Chinese,
which don't separate words by spaces... ugh!

> But like I said, I favour
> UTF-8 anyway and we can solve the indexing problem with an extra column
> where everything is more or less represented as entities anyway.

Don't we already have a separate index column?

> (We could
> even replace uppercase with lowercase there and have case insensitive
> indexing.)

Hmm, can be done.

> > [Incidentally: if we are to switch to UTF-8, we'll obviously want to do
> > something about the fact that the current English wikipedia uses
> > ISO-8859-1 high characters extensively. These pages can be converted
> > fairly easily, either as a one time search & replace or as a
> > normalise-an-old-page-when-we-first-load-it thing.]
>
> I like the one-time-search-and-replace approach. No need to complicate
> and/or slow down the run-time code with checks and translation code.

Okay. We'll need another update script for poor Jimbo, then!

> > Well, I was hoping there would be some evidence of some kind of
> > consensus before anyone goes writing documents or code! :)
>
> Of course. :-) I am still wondering what our great leader thinks of all
> this.

The ways of the Great Leader are mysterious indeed, we must await
revelation... :)

> > We can probably add this stuff to
> >
> > http://meta.wikipedia.com/wiki.phtml?title=Proposed_Wikipedia_policy_on_foreign_characters
>
> This page is more about the local policy on the English Wikipedia. I would
> like to see a page with a title like "A common architecture for Wikipedias
> in all languages". I'd start writing it, but work is really busy at the
> moment.

Okay, but they are clearly related topics since we'd be affecting the
English wikipedia as well! I might throw up a quick page in the morning
to start with.

> > The only sane format for URLs would be url-encoded UTF-8. This is the
> > recommended norm (http://www.w3.org/International/O-URL-and-ident.html),
> > it is the most future-proof (can you imagine if we kept all our URLs in
> > EBCDIC instead of ASCII because everybody still had links & bookmarks
> > from their old IBM mainframe days?), and it allows links across
> > languages to be consistently represented.
>
> Completely agreed.

That's you, me, and Carey Evans then...

-- brion vibber (brion @ pobox.com)
Re: International upgrades
Jan Hidders wrote:
> We cannot index UTF-8.

We shouldn't. We should strip down to 7-bit US-ASCII before
indexing. Searching for o should find any occurrence of ö, ó or ô.
This works great for English, Swedish, Norwegian, Danish, Finnish, and
German. I have successfully tried this on other websites before, but
I cannot speak for other languages. Of course, the search expression
must be stripped in the same way before the search is performed.

Also, in the stripping down, any E following a vowel could be removed,
to avoid the confusion between spellings like Gottingen, Goettingen,
and Göttingen, and that Danish poet Oehlenschläger.
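
In PHP the stripping could be little more than this (a sketch only; what
//TRANSLIT produces depends on the iconv implementation and locale):

  // Fold text down to plain a-z, for the index and for the search
  // expression alike. Covers only the Latin/accented range.
  function asciiFold( $text ) {
      // Transliterate accents away; drop anything iconv cannot map.
      $text = iconv( 'UTF-8', 'ASCII//TRANSLIT//IGNORE', $text );
      $text = strtolower( $text );
      // Remove an E after a vowel, so Goettingen matches Gottingen.
      return preg_replace( '/([aeiouy])e/', '$1', $text );
  }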

This sort of search will yield a few hits too many, which is good.
I'm not advocating soundex matching here, but soundex could be
implemented in the same way.


--
Lars Aronsson (lars@aronsson.se)
Aronsson Datateknik
Teknikringen 1e, SE-583 30 Linuxköping, Sweden
tel +46-70-7891609
http://aronsson.se/ http://elektrosmog.nu/ http://susning.nu/
Re: International upgrades
On sab, 2002-03-09 at 14:21, Lars Aronsson wrote:
> Jan Hidders wrote:
> > We cannot index UTF-8.
>
> We shouldn't. We should strip down to 7bit U.S. ASCII before
> indexing. Searching for o should find any occurrence of ö, ó or ô.
> This works great for English, Swedish, Norwegian, Danish, Finnish, and
> German. I have successfully tried this on other websites before, but
> I cannot speak for other languages. Of course, the search expression
> must be stripped in the same way before the search is performed.

That's only relevant for accented Latin characters, obviously. Hebrew,
Arabic, Cyrillic, Greek, Chinese and Japanese characters still need to
be retained and searchable. (However, we can similarly fold together
cases and accents for Greek, perhaps final/medial forms for Greek,
Hebrew, and Arabic, and possibly katakana/hiragana for Japanese.)

So yes, we need to index UTF-8 if we're using it.

> Also, in the stripping down, any E following a vowel could be removed,
> to avoid the confusion between spellings like Gottingen, Goettingen,
> and Göttingen, and that Danish poet Oehlenschläger.
>
> This sort of search will yield a few hits too many, which is good.
> I'm not advocating soundex matching here, but soundex could be
> implemented in the same way.

I have no objection to the above. Would match potato/potatoe, too. :)

-- brion vibber (brion @ pobox.com)
Re: International upgrades
Brion L. VIBBER wrote:
> That's only relevant for accented Latin characters, obviously. Hebrew,
> Arabic, Cyrillic, Greek, Chinese and Japanese characters still need to
> be retained and searchable.

Are we talking about Greek/Hebrew characters in the English/German
Wikipedia now? I think users of the English/German Wikipedia won't
have Greek/Hebrew keyboards, so ASCII searching would do just fine.

I have no idea how to implement search in the Greek/Hebrew Wikipedia.


--
Lars Aronsson (lars@aronsson.se)
Aronsson Datateknik
Teknikringen 1e, SE-583 30 Linuxköping, Sweden
tel +46-70-7891609
http://aronsson.se/ http://elektrosmog.nu/ http://susning.nu/
Re: International upgrades
On sab, 2002-03-09 at 18:50, Lars Aronsson wrote:
> Brion L. VIBBER wrote:
> > That's only relevant for accented Latin characters, obviously. Hebrew,
> > Arabic, Cyrillic, Greek, Chinese and Japanese characters still need to
> > be retained and searchable.
>
> Are we talking about Greek/Hebrew characters in the English/German
> Wikipedia now? I think users of the English/German Wikipedia won't
> have Greek/Hebrew keyboards,

Excepting Greeks and Israelis, obviously. ;)

> so ASCII searching would do just fine.

But why bother creating a special separate ASCII-only search, when the
non-Latin code is necessary for other languages and we're using a
unified character set?

Why *shouldn't* I be able to search for the occasional Greek, Hebrew, or
Japanese word in the original spelling on the English wikipedia, if we
allow people to put them in in the first place?

> I have no idea how to implement search in the Greek/Hebrew Wikipedia.

As stated above: do whatever accent/case/other equivalent conversion is
necessary (exactly as you propose for Latin characters), and perform
some conversion so that MySQL doesn't reject the UTF-8 non-ASCII
characters as word separators (in an ideal world, we'd just configure
MySQL to understand UTF-8; otherwise, replacing raw bytes with hex codes
should work fine).
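
I.e. the byte munging for the index column could be as dumb as this (a sketch
only; the accent/case folding would happen before this step):

  // Replace each non-ASCII byte of a UTF-8 word with its two hex digits, so
  // MySQL sees only [a-z0-9] and keeps the word in one piece:
  // "ĵaŭdo" -> "c4b5ac5addo". Collisions with literal ASCII are possible but
  // harmless for search purposes.
  function hexMunge( $word ) {
      $out = '';
      for ( $i = 0; $i < strlen( $word ); $i++ ) {
          $b = ord( $word[$i] );
          $out .= ( $b < 0x80 ) ? strtolower( $word[$i] ) : dechex( $b );
      }
      return $out;
  }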

-- brion vibber (brion @ pobox.com)