Encoding for non-Latin1 wikipedias (was Re: New case conversion functions)
(I'm cross-posting this to intlwiki-l; for those who have not been
reading wikitech-l, there's been some recent debate about how to encode
non-Latin1 characters using the new PHP-based wikipedia software. In
summary:
* I had been preparing to convert the Esperanto wiki to the new
software, under the assumption of continuity with the old software in
which native character encodings were used: UTF-8 on eo.wikipedia.com,
ISO-8859-2 on pl.wikipedia.com, etc.
* Lee Crocker claimed that the plan all along was actually to use
ISO-8859-1 with HTML character entities for all wikis, so that users
could easily cut and paste text between wikis with different languages.
* I pointed out that this had never been implemented on the wikis that
are being actively used, and there were obvious problems with that plan
for people who were going to edit text in non-Latin languages; 100%
HTML-entity text is not very legible.
* Jan Hidders suggested checking which character encoding the browser
says it prefers, and, based on that, converting the character entities
covered by that encoding so that they display as normal characters in
the edit box. )

On sab, 2002-02-23 at 08:24, Jan Hidders wrote:
>
> From: "Brion Vibber" <brion@pobox.com>
> > On ven, 2002-02-22 at 02:33, Jan Hidders wrote:
> > >
> > > No, no, of course not. What I meant was that we check the display
> > > character encoding (I said character set, but that is probably not
> > > the right word) that is given with the request. [....]
> >
> > (Not necessarily, just lots of transliteration tables. Or perhaps
> > compile in iconv support... does iconv allow partial transliteration
> > between HTML entities and other character sets? I.e., are HTML
> > entities that are not in the destination charset left intact?)
>
> Ah, I didn't know about iconv. My guess would be that it doesn't know
> about entities: that shouldn't be the default behaviour, and I don't
> see any flags to indicate that it could be enabled.
>
> > > The nice thing about this would be that you can cut'n'paste
> > > anything from any other Wikipedia by cutting it from the edit box.
> >
> > If I understand correctly, you're suggesting that the default character
> > encoding should *not* be based on the language used, but on some ability
> > of the browser to specify a preferred encoding (for instance, the HTTP
> > Accept-Charset header), such that the same user would see wikipedias in
> > different languages come up with the same character encoding?
>
> Yes, and the accept-charset header is indeed the meta-tag I was thinking of.

Does it vary appropriately with the default language of the browser? I
tried a few browsers on my system (US English or non-specific versions)
and got:

Mozilla 0.9.8: ISO-8859-1, utf-8;q=0.66, *;q=0.66
Konqueror: unicode, utf-8, *
Netscape 4.78: iso-8859-1,*,utf-8
Opera 6: windows-1252;q=1.0, utf-8;q=1.0, utf-16;q=1.0,iso-8859-1;q=0.6, *;q=0.1
Internet Explorer 5.0: (nothing)
lynx: (nothing)

Could people using other languages check what they get? I set up a quick
little script to check the relevant headers and spit them out:
http://leuksman.com/misc/charset.php
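
(The script itself is trivial; since the code isn't shown here, this is
just a minimal sketch of what it might boil down to:)

  <?php
  // Minimal sketch of a header-dumping script like charset.php; the
  // actual script is not shown in this thread, so details are assumed.
  header( 'Content-Type: text/plain' );
  $headers = array( 'HTTP_ACCEPT_CHARSET', 'HTTP_ACCEPT_LANGUAGE',
                    'HTTP_USER_AGENT' );
  foreach ( $headers as $key ) {
      $value = isset( $_SERVER[$key] ) ? $_SERVER[$key] : '(nothing)';
      echo "$key: $value\n";
  }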

In the case of multiple supported encodings, do we use the first one, or
the "best" one?

With regard to L. Crocker's earlier comment about UTF-8 being
potentially problematic with older browsers: after checking the
Accept-Charset field we *know* whether a browser supports UTF-8, and can
provide an alternate, more limited, encoding if not.

That, I think, would be my personal preference.
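Concretely, the check could look something like this minimal sketch (the
function name and the details are mine, just to illustrate the idea):

  <?php
  // Minimal sketch, not existing code: serve UTF-8 when the browser's
  // Accept-Charset header says it is acceptable, otherwise fall back.
  function wikiCheckUtf8Support( $acceptCharset, $fallback = 'ISO-8859-1' ) {
      if ( $acceptCharset == '' ) {
          return $fallback;          // e.g. IE 5.0 and lynx send nothing
      }
      foreach ( explode( ',', $acceptCharset ) as $item ) {
          $parts   = explode( ';', trim( $item ) );
          $charset = strtolower( trim( $parts[0] ) );
          $q = 1.0;                  // default quality factor in HTTP
          if ( isset( $parts[1] ) && preg_match( '/q=([0-9.]+)/', $parts[1], $m ) ) {
              $q = (float) $m[1];
          }
          // "*" means the browser claims to accept any charset at all
          if ( ( $charset == 'utf-8' || $charset == '*' ) && $q > 0 ) {
              return 'UTF-8';
          }
      }
      return $fallback;
  }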

While I'm at it -- if instead of HTML entities we were to use UTF-8
internally, we could get around the search problem by simply changing
high characters in the fulltext index to hex codes. If we're already
converting things right and left, there's little advantage to using the
entities internally -- particularly if most users have nice, modern
browsers with Unicode support and thus require no other conversion.
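
Something like this, say (the function name and the "u" prefix are my
own invention, just to show the idea):

  <?php
  // Minimal sketch: rewrite bytes outside 7-bit ASCII as hex codes so
  // that MySQL's fulltext tokenizer, which only knows Latin-1 word
  // characters, still sees searchable "words".
  function wikiIndexFoldUtf8( $text ) {
      return preg_replace_callback(
          '/[\x80-\xff]+/',
          function ( $m ) {
              $out = '';
              foreach ( str_split( $m[0] ) as $byte ) {
                  $out .= sprintf( 'u%02x', ord( $byte ) );
              }
              return $out;
          },
          $text
      );
  }
  // E.g. "ŝanĝo" (UTF-8) folds to "uc5u9danuc4u9do", which survives a
  // tokenizer limited to [a-z0-9_'] intact.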

> > I'm not convinced that the default value of that would always (or even
> > often) be acceptable, and most users won't know how to change it.
>
> Can you give an example? What is at the moment the behaviour of the
> Esperanto Wikipedia?

The current behavior that you'll find on visiting eo.wikipedia.com
(running the old software):
- Article contents, titles, comments, user interface messages, etc. are
shown in UTF-8.
- The article edit box is in limited UTF-8, with the standard diacritics
transliterated (X-system)
- Any input (edit box, comment field, search box, username) is
internally normalised to X-system, so users can type any way they wish.
- On selecting a convenient link at the top of the screen, article
contents, user interface messages etc. are optionally shown in X-system.
The user does not have to log in to do this, just as he or she does not
have to create a named user account to set any other preferences in the
UseMod wiki; however, cookies are required.

As currently sitting in the CVS repository for the new software, the
only significant differences are:
- Full UTF-8 is used internally instead of X-system
- Since preferences can't be set without creating a user account,
cookies aren't saved for the encoding preference of not-logged-in users.
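
(For reference, the X-system mapping itself is purely mechanical: each
of the six Esperanto diacritic letters becomes its base letter plus an
"x". A minimal sketch of the output direction, with an invented function
name rather than the actual wiki code:)

  <?php
  // Minimal sketch: map the six Esperanto diacritic letters, as UTF-8
  // strings, to base letter plus "x". strtr() replaces whole
  // multi-byte sequences, so this is safe to run over UTF-8 text.
  function wikiToXSystem( $text ) {
      static $map = array(
          'ĉ' => 'cx', 'Ĉ' => 'Cx',
          'ĝ' => 'gx', 'Ĝ' => 'Gx',
          'ĥ' => 'hx', 'Ĥ' => 'Hx',
          'ĵ' => 'jx', 'Ĵ' => 'Jx',
          'ŝ' => 'sx', 'Ŝ' => 'Sx',
          'ŭ' => 'ux', 'Ŭ' => 'Ux',
      );
      return strtr( $text, $map );
  }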

> > Simple, obvious-at-first-sight manual switching between UTF-8 and a
> > standard transliteration format is a non-negotiable requirement for the
> > Esperanto 'pedia, so retaining the manual override is necessary.
>
> Of course, users will still be able to choose their encoding (provided they
> log in). Note that I was only talking about the editable boxes such as the
> edit box and the search box. It makes sense to me to ask people to log in if
> they want special behavior there. The encoding for presentation of pages is
> another matter that can be decided separately.

Okay, but what's "special behavior" and what's "default behavior"? Say I
visit the Japanese wikipedia, the French wikipedia, and the Esperanto
wikipedia; is using a native encoding for native editing in each one
"special behavior"? Or is using a common encoding for all "special
behavior"?

I find the login process on the new software far too heavy and
intimidating, so I'm loath to require people to go through it just to
change one tiny option.

> > > > cf $wikiRecodeInput(), $wikiRecodeOutput() if you want a ready
> > > > place to do this.
> > >
> > > It doesn't have the right arguments. But these are implementation
> > > details. We first should agree on the architecture.
> >
> > There's no character set argument because that's a global variable. At
> > present $wikiCharset specifies the default encoding (that used in the
> > database),
>
> Ok, then it would be the right place. But note that, for indexing
> reasons, the meaning you gave for that variable can no longer be
> correct. The encoding in the database *has* to be different from the
> representation in the edit box, and there may even be another encoding
> used for representing the page.

Right, it can change as necessary. Whatever happens, though, we still
need the custom transliteration functions available for the X-system
conversion. Those users without special keyboard drivers (or worse yet,
with Netscape 4's Unicode support) need to be able to type and sometimes
read that way. HTML entities are unacceptably difficult for typing, and
not helpful for reading -- a browser that doesn't show the UTF-8 won't
do any better with the entities.

-- brion vibber (brion @ pobox.com)

Re: Encoding for non-Latin1 wikipedias (was Re: New case conversion functions)
From: "Brion Vibber" <brion@pobox.com>
> On sab, 2002-02-23 at 08:24, Jan Hidders wrote:
> >
> > Yes, and the accept-charset header is indeed the meta-tag I was
> > thinking of.
>
> Does it vary appropriately with the default language of the browser? I
> tried a few browsers on my system (US English or non-specific versions)
> and got:
>
> Mozilla 0.9.8: ISO-8859-1, utf-8;q=0.66, *;q=0.66
> Konqueror: unicode, utf-8, *
> Netscape 4.78: iso-8859-1,*,utf-8
> Opera 6: windows-1252;q=1.0, utf-8;q=1.0, utf-16;q=1.0, iso-8859-1;q=0.6, *;q=0.1
> Internet Explorer 5.0: (nothing)
> lynx: (nothing)

FWIW, I tried it with nl-be and, as you would expect, the results for
Netscape and Explorer are the same as above. I'll see what happens if I
turn on multilanguage support in Windows.

> In the case of multiple supported encodings, do we use the first one, or
> the "best" one?

I would say the best one. The only reason not to do so would be to show
users how to write certain characters as entities, but if somebody
decides to edit a page on the Russian Wikipedia we may expect that this
person knows how to enter Cyrillic characters on his or her keyboard, or
else knows how to enter them as entities. I'm not sure what to do when
q != 1.0 for the best one.

> With regard to L. Crocker's earlier comment about UTF-8 being
> potentially problematic with older browsers: after checking the
> Accept-Charset field we *know* whether a browser supports UTF-8, and
> can provide an alternate, more limited, encoding if not.
>
> That, I think, would be my personal preference.

Mine too.

> While I'm at it -- if instead of HTML entities we were to use UTF-8
> internally, we could get around the search problem by simply changing
> high characters in the fulltext index to hex codes.

Just to be clear on this: if we want indexing to work, we need to encode
the text in a format that uses only the Latin-1 digits, letters, "_" and
"'". The alternative is to run different MySQL servers with appropriate
character sets for different Wikipedias.
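
So under Brion's hex-code idea, both sides of the match would have to go
through the same folding; roughly, reusing the hypothetical function
sketched earlier:

  // Both the indexed text and the user's search terms pass through
  // the same (hypothetical) folding, so high characters match:
  $foldedText  = wikiIndexFoldUtf8( $articleText );  // into the index
  $foldedQuery = wikiIndexFoldUtf8( $searchTerms );  // into the MATCH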

> > Can you give an example? What is at the moment the behaviour of the
> > Esperanto Wikipedia?
>
> The current behavior that you'll find on visiting eo.wikipedia.com
> (running the old software):
> - Article contents, titles, comments, user interface messages, etc. are
> shown in UTF-8.
> - The article edit box is in limited UTF-8, with the standard diacritics
> transliterated (X-system)
> - Any input (edit box, comment field, search box, username) is
> internally normalised to X-system, so users can type any way they wish.
> - On selecting a convenient link at the top of the screen, article
> contents, user interface messages etc. are optionally shown in X-system.

Ok. I see your point. Does this also work for Windows users?

> Okay, but what's "special behavior" and what's "default behavior"? Say I
> visit the Japanese wikipedia, the French wikipedia, and the Esperanto
> wikipedia; is using a native encoding for native editing in each one
> "special behavior"? Or is using a common encoding for all "special
> behavior"?

I would say that the default behaviour is that the edit box uses the
best character encoding your browser accepts. If the browser doesn't
specify one, then the site falls back to Latin-1 or whatever is
specified as $defaultEncoding on that site.
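
In pseudo-PHP, reusing the hypothetical wikiCheckUtf8Support() sketched
earlier (only $defaultEncoding is a name from this discussion):

  // Sketch of the proposed default: best accepted encoding, with a
  // per-site fallback.
  $defaultEncoding = 'ISO-8859-1';
  $accept   = isset( $_SERVER['HTTP_ACCEPT_CHARSET'] )
            ? $_SERVER['HTTP_ACCEPT_CHARSET'] : '';
  $encoding = wikiCheckUtf8Support( $accept, $defaultEncoding );
  header( "Content-Type: text/html; charset=$encoding" );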

> [...] Whatever happens, though, we still
> need the custom transliteration functions available for the X-system
> conversion. Those users without special keyboard drivers (or worse yet,
> with Netscape 4's Unicode support) need to be able to type and sometimes
> read that way.

By the way, is the charset specified with the POST?

> HTML entities are unacceptably difficult for typing, and
> not helpful for reading -- a browser that doesn't show the UTF-8 won't
> do any better with the entities.

My main interest is that we maintain one software package that can be
adapted by setting some variables and includes. If we really cannot get
around having extra encoding links that can be turned on or off
depending on some variables, then so be it.

-- Jan Hidders

Re: [Intlwiki-l] Re: Encoding for non-Latin1 wikipedias (was Re: New case conversion functions)
On lun, 2002-02-25 at 07:55, Jan Hidders wrote:
>
> From: "Brion Vibber" <brion@pobox.com>
> > On sab, 2002-02-23 at 08:24, Jan Hidders wrote:
> > >
> > > Yes, and the accept-charset header is indeed the meta-tag I was
> > > thinking of.
> >
> > Does it vary appropriately with the default language of the browser? I
> > tried a few browsers on my system (US English or non-specific versions)
> > and got:
> >
> > Mozilla 0.9.8: ISO-8859-1, utf-8;q=0.66, *;q=0.66
> > Konqueror: unicode, utf-8, *
> > Netscape 4.78: iso-8859-1,*,utf-8
> > Opera 6: windows-1252;q=1.0, utf-8;q=1.0, utf-16;q=1.0, iso-8859-1;q=0.6, *;q=0.1
> > Internet Explorer 5.0: (nothing)
> > lynx: (nothing)
>
> FWIW, I tried it with nl-be and, as you would expect, the results for
> Netscape and Explorer are the same as above. I'll see what happens if
> I turn on multilanguage support in Windows.
>
> > In the case of multiple supported encodings, do we use the first one, or
> > the "best" one?
>
> I would say the best one. The only reason not to do so would be to
> show users how to write certain characters as entities, but if
> somebody decides to edit a page on the Russian Wikipedia we may expect
> that this person knows how to enter Cyrillic characters on his or her
> keyboard, or else knows how to enter them as entities. I'm not sure
> what to do when q != 1.0 for the best one.

Besides, if they would benefit from seeing the entities, they would
benefit just as well from cut-n-paste, wouldn't they? Who's really going
to memorise the numbers?

> > With regard to L. Crocker's earlier comment about UTF-8 being
> > potentially problematic with older browsers: after checking the
> > Accept-Charset field we *know* whether a browser supports UTF-8, and
> > can provide an alternate, more limited, encoding if not.
> >
> > That, I think, would be my personal preference.
>
> Mine too.
>
> > While I'm at it -- if instead of HTML entities we were to use UTF-8
> > internally, we could get around the search problem by simply changing
> > high characters in the fulltext index to hex codes.
>
> Just to be clear on this: if we want indexing to work, we need to
> encode the text in a format that uses only the Latin-1 digits,
> letters, "_" and "'". The alternative is to run different MySQL
> servers with appropriate character sets for different Wikipedias.


> > > Can you give an example? What is at the moment the behaviour of the
> > > Esperanto Wikipedia?
> >
> > The current behavior that you'll find on visiting eo.wikipedia.com
> > (running the old software):
> > - Article contents, titles, comments, user interface messages, etc. are
> > shown in UTF-8.
> > - The article edit box is in limited UTF-8, with the standard diacritics
> > transliterated (X-system)
> > - Any input (edit box, comment field, search box, username) is
> > internally normalised to X-system, so users can type any way they wish.
> > - On selecting a convenient link at the top of the screen, article
> > contents, user interface messages etc. are optionally shown in X-system.
>
> Ok. I see your point. Does this also work for Windows users?

Yes, it works for everybody.

> > Okay, but what's "special behavior" and what's "default behavior"? Say I
> > visit the Japanese wikipedia, the French wikipedia, and the Esperanto
> > wikipedia; is using a native encoding for native editing in each one
> > "special behavior"? Or is using a common encoding for all "special
> > behavior"?
>
> I would say that the default behaviour is that the edit box uses the
> best character encoding your browser accepts. If the browser doesn't
> specify one, then the site falls back to Latin-1 or whatever is
> specified as $defaultEncoding on that site.

Sounds reasonable to me.

> > [...] Whatever happens, though, we still
> > need the custom transliteration functions available for the X-system
> > conversion. Those users without special keyboard drivers (or worse yet,
> > with Netscape 4's Unicode support) need to be able to type and sometimes
> > read that way.
>
> By the way, is the charset specified with the POST?

There's no real way to get that information, to my knowledge. This
particular conversion is clean enough that it can be run over any input
the user gives, though, so my preference would be to just keep filter
functions available over all editable text.
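
(The conversion is clean because the letter "x" is not in the Esperanto
alphabet at all, so the digraphs cx, gx, hx, jx, sx, ux don't occur in
native words. A minimal sketch of the inverse filter, with an invented
name, pairing with the earlier sketch:)

  <?php
  // Minimal sketch: turn X-system digraphs back into UTF-8 diacritic
  // letters. Safe to run over all submitted text, since "x" is not an
  // Esperanto letter and the digraphs occur only as transliterations.
  function wikiFromXSystem( $text ) {
      static $map = array(
          'cx' => 'ĉ', 'Cx' => 'Ĉ', 'CX' => 'Ĉ',
          'gx' => 'ĝ', 'Gx' => 'Ĝ', 'GX' => 'Ĝ',
          'hx' => 'ĥ', 'Hx' => 'Ĥ', 'HX' => 'Ĥ',
          'jx' => 'ĵ', 'Jx' => 'Ĵ', 'JX' => 'Ĵ',
          'sx' => 'ŝ', 'Sx' => 'Ŝ', 'SX' => 'Ŝ',
          'ux' => 'ŭ', 'Ux' => 'Ŭ', 'UX' => 'Ŭ',
      );
      return strtr( $text, $map );
  }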

> > HTML entities are unacceptably difficult for typing, and
> > not helpful for reading -- a browser that doesn't show the UTF-8 won't
> > do any better with the entities.
>
> My main interest is that we maintain one software package that can be
> adapted by setting some variables and includes. If we really cannot
> get around having extra encoding links that can be turned on or off
> depending on some variables, then so be it.

Agreed, which is why I added the little recoding functions in the first
place. They wouldn't be needed most of the time, but when they were, a
function or two could be dropped in with the localised messages.
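
For instance (a sketch of the drop-in idea: the variable names
$wikiRecodeInput/$wikiRecodeOutput come from earlier in the thread, but
the bodies and the X-system helpers are my own illustration):

  // PHP variable functions, set alongside the localised messages for
  // a given language. Bodies and helper functions are assumptions.
  function recodeEsperantoInput( $text ) {
      return wikiFromXSystem( $text );  // normalise X-system to UTF-8
  }
  function recodeEsperantoOutput( $text ) {
      global $useXSystem;               // hypothetical user preference
      return $useXSystem ? wikiToXSystem( $text ) : $text;
  }
  $wikiRecodeInput  = 'recodeEsperantoInput';
  $wikiRecodeOutput = 'recodeEsperantoOutput';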

-- brion vibber (brion @ pobox.com)