Mailing List Archive

Internationalization issues
A lot of the character-encoding stuff in the present code is a mess
too. I understand well and can handle all the details between server
and browser, but two things I don't know all the quirks of are PHP and
MySQL, so this is my attempt to pick the brains of those who have
already found those problems:

(1) Is MySQL 8-bit clean? If I store a chunk of 8-bit bytes in a
text field, will I get them back unmolested, or will MySQL try to be
"helpful" and fuck them up? If the latter, what are the limitations
of what can be stored in a text field and where is that documented?

(2) Are PHP strings 8-bit clean? I'd be amazed if they weren't,
considering how much of PHP is modelled on Perl.

(3) Is the PHP on wikipedia.com compiled with the "iconv" library
(an optional thing), and does PHP use it as documented?

--
Lee Daniel Crocker <lee@piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC
Re: Internationalization issues
On mer, 2002-05-08 at 20:33, Lee Daniel Crocker wrote:
> A lot of the character-encoding stuff in the present code is a mess
> too.

I appreciate your polite understatement.

> I understand well and can handle all the details between server
> and browser, but two things I don't know all the quirks of are PHP and
> MySQL, so this is my attempt to pick the brains of those who have
> already found those problems:
>
> (1) Is MySQL 8-bit clean? If I store a chunk of 8-bit bytes in a
> text field, will I get them back unmolested, or will MySQL try to be
> "helpful" and fuck them up? If the latter, what are the limitations
> of what can be stored in a text field and where is that documented?

As far as I know, yes (the former). MySQL has no direct support for
UTF-8, but it does have explicit support for a number of single and
multibyte encodings, so one would expect it to be generally 8-bit clean,
and implementations of sticking UTF-8 strings into MySQL using the
default ISO-8859-1 setting abound. However there are limitations --
MySQL doesn't know about proper case matching or accent folding, for
instance, which would be nice for searching. (This is get-aroundable if
we do our own case/accent-folding when saving a page and storing it in a
separate field just for searches, which has been discussed from time to
time.)
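
A minimal sketch of the round trip discussed above, in Python rather than PHP, with Python's "latin-1" codec standing in for a MySQL column declared as ISO-8859-1 (that stand-in is an assumption, not a test against MySQL itself). The helper names are hypothetical. It shows that the bytes survive a byte-for-byte store, and that any Latin-1-aware case conversion inside the store can corrupt them:

```python
# Sketch only: the "latin-1" codec plays the role of an ISO-8859-1 text column.

def store_as_latin1(raw: bytes) -> str:
    """Interpret each byte as one ISO-8859-1 character, as the column would."""
    return raw.decode("latin-1")


def fetch_from_latin1(stored: str) -> bytes:
    """Turn the stored characters back into the original bytes."""
    return stored.encode("latin-1")


def lowercase_in_store(raw: bytes) -> bytes:
    """Lowercase the data the way a Latin-1-aware store would see it."""
    return raw.decode("latin-1").lower().encode("latin-1")


def is_valid_utf8(raw: bytes) -> bool:
    """Report whether the bytes still decode as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


original = "Tür"
raw = original.encode("utf-8")

# A plain store and fetch returns exactly the bytes that went in.
assert fetch_from_latin1(store_as_latin1(raw)) == raw

# The store counts bytes, not characters: 3 characters, 4 stored units.
print(len(original), len(store_as_latin1(raw)))

# Case conversion on the stored form damages the UTF-8 sequence.
print(is_valid_utf8(lowercase_in_store(raw)))
```

This is why a separately folded search field is attractive: the folding happens before the text reaches the database, where the real encoding is still known.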

> (2) Are PHP strings 8-bit clean? I'd be amazed if they weren't,
> considering how much of PHP is modelled on Perl.

Claims to be.

> (3) Is the PHP on wikipedia.com compiled with the "iconv" library
> (an optional thing),

I don't believe it is. It really should be.

> and does PHP use it as documented?

That's a good question!

-- brion vibber (brion @ pobox.com)
Re: Internationalization issues
> I'm not sure I'm the right person to raise this question but
> I wondered what the current thinking is on adapting the code
> for other character sets. If I recall correctly we are now
> assuming UTF-8, right? What exactly does that mean, btw? That
> we changed the MySQL character tables for those above 7F?
> Anything else?

The English Wikipedia, and the German one being tested now, are
both ISO-8859-1, not UTF-8. UTF-8 will be needed for Polish and
other languages. There won't be much software change involved;
just telling MySQL to index the right way.

As for a special notation for accented characters, I'm not fond
of the idea. Foreign users should have foreign keyboards. Others
should still be able to enter accents by whatever means their OS
and browser allow, and I'm not aware of any that don't have some
feature for it. I don't like duplicating effort that should be
already done elsewhere.
Re: Internationalization issues
On Thu, Aug 22, 2002 at 10:11:41AM -0700, lcrocker@nupedia.com wrote:
>
> The English Wikipedia, and the German one being tested now, are
> both ISO-8859-1, not UTF-8. UTF-8 will be needed for Polish and
> other languages. There won't be much software change involved;
> just telling MySQL to index the right way.

That may in fact involve defining our own new character set for MySQL that
defines the properties of the subset of UTF-8 that covers English, German
and Polish. Or is each Wikipedia going to get its own mysql server? Anyway,
I'll start asking around whether something like that already exists
somewhere.

> As for a special notation for accented characters, I'm not fond
> of the idea. Foreign users should have foreign keyboards. Others
> should still be able to enter accents by whatever means their OS
> and browser allow, and I'm not aware of any that don't have some
> feature for it.

All I know at the moment is that the request has been made by a member of
the German community. I don't know how many people asked for it, why they
wanted it or how badly they need it, but I'll ask them. I'm a bit surprised
that Magnus hasn't brought this up, (I'm not German) but I have the
impression he has been busy lately.

> I don't like duplicating effort that should be
> already done elsewhere.

The question is not if you would implement it, but only if it would be Ok to
define some hooks so that they can implement it themselves if they wanted
to without changing any common code.

-- Jan Hidders
Re: Internationalization issues
> As for a special notation for accented characters, I'm not fond
> of the idea. Foreign users should have foreign keyboards.

Of course that's not the problem.

> Others
> should still be able to enter accents by whatever means their OS
> and browser allow, and I'm not aware of any that don't have some
> feature for it.

I don't know which feature you mean. Some foreign contributors use html
entities for umlauts, others type ae for ä, oe for ö and ue for ü. The
first one makes the EditBox look ugly, and links with entities in them
don't work, and the second one has to be corrected by someone. At the
moment it's not a big deal, just annoying. But I think it could be
easily automated, if entities would automatically be turned into umlauts
when the text is saved. An easier way of entering umlauts, like \"o,
would make foreign contributors even happier, but that should be
standardized then in all wikipedias that use umlauts, accents, etc.


Could someone please configure this mailing list the same way as
wikipedia-l, so that replies go to the list?


Kurt
Re: Internationalization issues
At 2002-08-23 07:43 +0200, Kurt Jansson wrote:
>> As for a special notation for accented characters, I'm not fond
>> of the idea. Foreign users should have foreign keyboards.
>
>Of course that's not the problem.
>
>> Others
>> should still be able to enter accents by whatever means their OS
>> and browser allow, and I'm not aware of any that don't have some
>> feature for it.
>
>I don't know which feature you mean. Some foreign contributors use html
>entities for umlauts, others type ae for ä, oe for ö and ue for ü. The
>first one makes the EditBox look ugly, and links with entities in them
>don't work, and the second one has to be corrected by someone. At the
>moment it's not a big deal, just annoying. But I think it could be
>easily automated, if entities would automatically be turned into umlauts
>when the text is saved. An easier way of entering umlauts, like \"o,
>would make foreign contributors even happier, but that should be
>standardized then in all wikipedias that use umlauts, accents, etc.

As far as I can see, the best way to use accented
letters in this (HTML rendering) environment is to use '&auml;'
etc., since every visitor can make sure his browser can render
it correctly. NN and IE have been able to do it since version 4 or
even earlier.

Every other method will depend on whatever font the editor
of an article is using and that may not be the same as what
the next editor is using or what the Wikipedia web-server
is saying to the visitor it is serving.

Perhaps the Wikipedia software could try to translate ae and
such to the appropriate HTML abbreviations like &auml; but
that would be risky, because it would have to know in what
language each word was written. In Dutch we have the 'oe'
as a valid combination which is not equal to '&ouml;', so
if Dutch and German were mixed in an article it would cause
problems.

Please also consider that font problems may seem to be solved
for now, but how local is that solution? And how new and
MS-based does the system have to be and how long will it
last? Perhaps someone invents a much better system in ten
years time. Will all texts become worthless then? At least
with the '&auml;'-system it's all in ASCII and therefore
human-readable and even understandable with a little effort.

Also consider that a lot of PDAs offer only a few
standard fonts.

>Could someone please configure this mailing list the same way as
>wikipedia-l, so that replies go to the list?

I agree. (Add the 'Reply-To:' header, don't change the
'From:' header.)

Greetings,
Jaap
Re: Internationalization issues
> As far as I can see, the best way to use accented
> letters in this (HTML rendering) environment is to use '&auml;'
> etc., since every visitor can make sure his browser can render
> it correctly. NN and IE have been able to do it since version 4 or
> even earlier.

You mean every umlaut should be shown in the EditBox as its entity?
That would make most German articles very hard to read. It should be
possible to enter entities, but they should be shown as umlauts to the
'normal' user. If foreign contributors don't see umlauts but strange
characters in their EditBox, maybe they could have an option in their
preferences so that they are shown as entities. But I haven't heard of
this problem.


> Perhaps the Wikipedia software could try to translate ae and
> such to the appropriate HTML abbreviations like &auml; but
> that would be risky, because it would have to know in what
> language each word was written. In Dutch we have the 'oe'
> as a valid combination which is not equal to '&ouml;', so
> if Dutch and German were mixed in an article it would cause
> problems.

I would never suggest this, because 'oe' is also a valid combination in
German (e.g. the musical instrument "Oboe", or a city I lived in,
"Itzehoe"). Same with "ae" and "ue".
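
To make the risk concrete, here is a small Python sketch of the naive digraph replacement being argued against. The table and function name are hypothetical, and nothing like this is proposed for the actual code:

```python
# Sketch of the naive approach: replace every digraph, regardless of the word.
DIGRAPHS = {"ae": "ä", "oe": "ö", "ue": "ü"}


def naive_umlauts(text: str) -> str:
    """Blindly turn ae/oe/ue into umlauts (deliberately too simple)."""
    for digraph, umlaut in DIGRAPHS.items():
        text = text.replace(digraph, umlaut)
    return text


# "schoen" is a transliteration and comes out right; the other two are
# correct German spellings and get mangled.
for word in ["schoen", "Oboe", "Itzehoe"]:
    print(word, "->", naive_umlauts(word))
```

Telling the two cases apart needs a word list or a human, not a string replacement.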


Kurt
Re: Internationalization issues
At 2002-08-23 07:43 +0200, Kurt Jansson wrote:
>
> [...] Some foreign contributors use html entities for umlauts, others type
> ae for ä, oe for ö and ue for ü. The first one makes the EditBox look
> ugly, and links with entities in them don't work, and the second one has
> to be corrected by someone. At the moment it's not a big deal, just
> annoying. But I think it could be easily automated, if entities would
> automatically be turned into umlauts when the text is saved. An easier way
> of entering umlauts, like \"o, would make foreign contributors even
> happier, but that should be standardized then in all wikipedias that use
> umlauts, accents, etc.

Well, in some sense it is easy, and in some sense it isn't. It would mean
introducing for the first time a difference in functionality depending upon
the language of the Wikipedia concerned, because for example the translation
of entities is not the desired behavior on the English Wikipedia at the
moment.

In itself this is not hard to implement by defining certain functions that
are called in the common code and defined in the language-specific part of
the code, but
1. deciding which functions we define and what they should do
requires some deep architectural thinking, and
2. it would add an extra layer of complexity that makes it a bit harder to
determine, for example, what the cause of a certain bug is (it could be
language specific).

So Lee is probably right if he wants a good justification for such an
addition. If you say that at the moment it is not a big deal (I assume that
only a very small minority of the contributors of the German Wikipedia is
working on a non-German keyboard, and I only saw a few instances where
people had used "&ouml;") then it is probably best to wait until the moment it
does become a big deal.

-- Jan Hidders
Re: Internationalization issues
> So Lee is probably right if he wants a good justification for such an
> addition.

Okay, it seems I was a bit naive about how easy this would be to
implement.

Would it be easier to make a script that converts every &szlig; to ß,
&ouml; to ö, &uuml; to ü and &auml; to ä and the same for Ä, Ö, Ü in the
German database?
I could then call it by hand if you tell me how.

If this also is too complicated or dangerous - forget about it :-)
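
The conversion step itself is small. A hedged Python sketch, assuming the page text is already available as a string (how it is read from and written back to the database is left out, and the names are hypothetical):

```python
# Sketch: replace only the German letter entities, leave everything else alone.
GERMAN_ENTITIES = {
    "&auml;": "ä", "&ouml;": "ö", "&uuml;": "ü",
    "&Auml;": "Ä", "&Ouml;": "Ö", "&Uuml;": "Ü",
    "&szlig;": "ß",
}


def replace_german_entities(text: str) -> str:
    """Turn the seven listed entities into characters; keep all others as-is."""
    for entity, char in GERMAN_ENTITIES.items():
        text = text.replace(entity, char)
    return text


print(replace_german_entities("Stra&szlig;e, K&ouml;ln &amp; &Uuml;bersee"))
```

Restricting the table to these seven entities keeps markup-relevant ones such as &amp; and &lt; untouched, which a general-purpose unescape would not.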


Kurt
Re: Internationalization issues
On Sat, Aug 24, 2002 at 07:08:16PM +0200, Kurt Jansson wrote:
>
> Would it be easier to make a script that converts every &szlig; to ß,
> &ouml; to ö, &uuml; to ü and &auml; to ä and the same for Ä, Ö, Ü in the
> German database?
> I could then call it by hand if you tell me how.

I assume you want to do this on a regular basis? In that case the script has
to behave as a normal user, i.e., the change should look like any other
regular minor update. I suppose I could write a script in PHP that you could
run and that simply interacts with the site as a normal user. It would do a
search (simply by getting the corresponding URL) for "ouml", "uuml" et
cetera, and do a minor edit that changes them. But note that this is also
pretty easily done by hand.

-- Jan Hidders
Re: Internationalization issues
Jaap van Ganswijk wrote:

> Perhaps the Wikipedia software could try to translate ae and
> such to the appropriate HTML abbreviations like &auml; but
> that would be risky, because it would have to know in what
> language each word was written. In Dutch we have the 'oe'
> as a valid combination which is not equal to '&ouml;', so
> if Dutch and German were mixed in an article it would cause
> problems.
>
To do this the system also needs to distinguish between an umlaut and a
diaeresis. The famous painter Raphael is often spelled with a diaeresis
as "Raphaël". It wouldn't do to have him automagically turned into
"Raphäl". The important thing to me is not in having the machines put
in umlauts or other accents, but having the search engine regard
spellings with or without accents as equivalent. This would be a great
help for the searcher who doesn't know if a word has an accent or
exactly what accent it has. No non-French speaking person should be
required to know about how some verbs change their accent patterns, or
the subtleties about an acute or grave accent on a final "e" in Catalan.
The uniquely German treatment of umlauted vowels can then probably be
treated with redirects.
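
A rough Python sketch of both points, using the standard unicodedata module; the function names are hypothetical and this is one possible folding, not a description of any existing search code. It folds accents for comparison, and it shows that Unicode gives the German umlaut and the French diaeresis the same combining mark, so software cannot tell them apart from the characters alone:

```python
import unicodedata


def fold_for_search(text: str) -> str:
    """Strip combining accents and case so accented and plain forms compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def combining_marks(ch: str) -> list:
    """Name the combining marks that a character decomposes into."""
    return [
        unicodedata.name(c)
        for c in unicodedata.normalize("NFD", ch)
        if unicodedata.combining(c)
    ]


# With and without the accent, the search keys match.
print(fold_for_search("Raphaël") == fold_for_search("Raphael"))

# Umlaut and diaeresis decompose to the same mark.
print(combining_marks("ä"), combining_marks("ë"))
```

Note that this folds ü to u, not to ue, so the German ue spellings would still need redirects, and letters without a decomposition, such as the Polish ł, pass through unchanged.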

Someone also made an anglo-centric comment about foreign users
using foreign keyboards. What does this mean if I write in more than
one language? Maybe I should connect a separate keyboard for each
language. The computer should be smart enough to know that it does
things differently when I'm on my Russian or Turkish or Devanagari
keyboard. [;-)] .

Eclecticology

PS: At first I thought that I was replying to the list but apparently it
only went to Jaap. I suppose I should use reply-all when
answering to the list. -Ec
Re: Internationalization issues
On Monday 26 August 2002 11:13, Ray Saintonge wrote:
> To do this the system also needs to distinguish between an umlaut and a
> diaeresis. The famous painter Raphael is often spelled with a diaeresis
> as "Raphaël". It wouldn't do to have him automagically turned into
> "Raphäl". The important thing to me is not in having the machines put
> in umlauts or other accents, but having the search engine regard
> spellings with or without accents as equivalent. This would be a great
> help for the searcher who doesn't know if a word has an accent or
> exactly what accent it has. No non-French speaking person should be
> required to know about how some verbs change their accent patterns, or
> the subtleties about an acute or grave accent on a final "e" in Catalan.
> The uniquely German treatment of umlauted vowels can then probably be
> treated with redirects.

There are also a few German words in which "ue" is *not* equivalent to "ü":
"Tuer" means "doer", the actor form of "tun", and is different from "Tür"
meaning "door", and "Guericke", the Magdeburger who made the vacuum pump, is
written sometimes "Gericke" but never "Güricke".

phma
Re: Internationalization issues
Ray Saintonge wrote:
> Someone also made an anglo-centric comment about foreign users
> using foreign keyboards. What does this mean if I write in more than
> one language? Maybe I should connect a separate keyboard for each
> language. The computer should be smart enough to know that it does
> things differently when I'm on my Russian or Turkish or Devanagari
> keyboard. [;-)] .

That's why modern computers have software-switchable keyboard maps...

blá blà blä
bła bła bła
бла бла бла
μβλα μβλα μβλα
ブラー ブラー ブラー

-- brion vibber (brion @ pobox.com)