Mailing List Archive

New case conversion functions
I've noticed that the traditional locale-based case conversion functions
(ucfirst(), strtolower(), etc.) aren't too reliable for anything but
English. Even when they do work, it's very dependent on the system
configuration, and thus isn't really transparently portable.

So, I've added new case conversion functions ucfirstIntl(),
strtoupperIntl(), and strtolowerIntl() which can more or less properly
convert cases in a system-independent manner. For single-byte character
encodings this is very simple, based on the PHP strtr() function; just
define strings $wikiUpperChars containing all the uppercase characters
and $wikiLowerChars containing all the lowercase chars. (See example for
iso-8859-1 in wikiTextEn.php)
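
Roughly, the single-byte version amounts to something like the sketch
below (abbreviated; the real $wikiLowerChars/$wikiUpperChars strings in
wikiTextEn.php list every ISO-8859-1 pair, and the actual function bodies
may differ a bit):

<?php
// Parallel strings: position N in $wikiLowerChars corresponds to
// position N in $wikiUpperChars.
$wikiLowerChars = "abcdefghijklmnopqrstuvwxyzàéö";  // ...and so on
$wikiUpperChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZÀÉÖ";  // ...and so on

function strtoupperIntl( $str ) {
    global $wikiLowerChars, $wikiUpperChars;
    return strtr( $str, $wikiLowerChars, $wikiUpperChars );
}

function strtolowerIntl( $str ) {
    global $wikiLowerChars, $wikiUpperChars;
    return strtr( $str, $wikiUpperChars, $wikiLowerChars );
}
?>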

For multibyte character sets it's a little more complex, using the same
function in an array mode that associates byte sequences. Most multibyte
character sets are for Asian languages which don't have a case
distinction, so it's not likely to come up often except for those using
UTF-8. I've included conversion arrays for UTF-8 in utf8Case.php which
should cover just about everything, so any future 'pedias that may use
UTF-8 need just include that (as does wikiTextEo.php).
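
For the curious, the array mode boils down to something like this (a
sketch only; utf8Case.php holds the full machine-generated tables, and
the array name here is just illustrative):

<?php
// Map lowercase UTF-8 byte sequences directly to their uppercase forms.
$utf8ToUpper = array(
    "\xc3\xa9" => "\xc3\x89",  // é -> É
    "\xc4\x89" => "\xc4\x88",  // ĉ -> Ĉ (Esperanto)
    "\xc5\xad" => "\xc5\xac",  // ŭ -> Ŭ (Esperanto)
    // ...many more pairs, generated from the Unicode data tables
);

// In array mode strtr() tries the longest keys first, so multibyte
// sequences are replaced whole and never split mid-character.
$title = strtr( "ĉe la komenco", $utf8ToUpper );  // "Ĉe la komenco"
?>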

Also, it should be possible to extend ucfirstIntl() a bit to allow for
multiple-character first letter sequences (for instance treating ij->IJ
as one letter, which I believe is the officially correct behavior for
Dutch).
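
A very rough idea of how that could work, assuming a made-up
$wikiFirstLetterDigraphs array (nothing like it exists in the code yet):

<?php
$wikiFirstLetterDigraphs = array( "ij" => "IJ" );  // Dutch; others could be added

function ucfirstIntl( $str ) {
    global $wikiFirstLetterDigraphs, $wikiLowerChars, $wikiUpperChars;
    // Check for a multiple-character "first letter" before the normal path.
    foreach ( $wikiFirstLetterDigraphs as $lower => $upper ) {
        if ( strncmp( $str, $lower, strlen( $lower ) ) == 0 ) {
            return $upper . substr( $str, strlen( $lower ) );
        }
    }
    return strtr( substr( $str, 0, 1 ), $wikiLowerChars, $wikiUpperChars )
        . substr( $str, 1 );
}

// ucfirstIntl( "ijsselmeer" ) would then give "IJsselmeer".
?>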

-- brion vibber (brion @ pobox.com)
Re: New case conversion functions [ In reply to ]
On mer, 2002-02-20 at 11:33, lcrocker@nupedia.com wrote:
> No! No! The text stored in the database is _always_ single-byte
> ISO-8859-1, no exceptions, even for the foreign wikis. Some of
> those ISO-8859-1 characters may spell out HTML entity references
> to Unicode characters outside the set, but the database should not
> know or care about that.

I'm sorry you feel that way, but that is in fact NOT TRUE. Please take a
look at the non-English non-ISO-8859-1 wikipedias sometime.

Hundreds of pages, with correct charset headers:
ISO-8859-2:
http://pl.wikipedia.com/
UTF-8 with a custom conversion function for certain character
sequences:
http://eo.wikipedia.com/

Stubs:
CP-1251:
http://ru.wikipedia.com/
Shift-JIS:
http://ja.wikipedia.com/
GB-2312 with a few character references thrown in:
http://zh.wikipedia.com/
Not sure which encodings, but certainly not ISO-8859-1:
http://ar.wikipedia.com/
http://he.wikipedia.com/

Now, if you honestly think that people are going to edit text that
consists *entirely* of HTML character entity references, you're
obviously not concerned about anything like "ease of use".

On top of which, the consensus seems to be to not allow &s (and thus
character entities) into page titles, which would effectively require
all page titles to be in ASCIIized roman characters. Can you imagine
this being acceptable on, say, the Chinese wiki if anyone actually used
it?

Gee, maybe someone *would* use it if they could use an appropriate
character set for their language!

> This policy might have to be changed for the Asian wikis if something
> like shift-JIS is universal enough and dealing with HTML entities
> problematic enough to make working with it difficult,

The mind boggles that you might imagine the situation to be otherwise.

> but in that
> case we'll still standardize on one and only one internal character
> representation for that particular wiki. For all others, that
> internal representation (and also the encoding which is served via
> HTTP) is ISO-8859-1.

Bullshit. Ask the Poles if they'd like to convert their wikipedia to
ISO-8859-1 with HTML character entities.

> If you need to "uppercase" words in titles (as our consensus on
> canonization of titles specifies), go ahead and hard-code the
> function to deal with ISO-8859-1.

Gee, that would be great if such a function would do anything at all for
anything other than ISO-8859-1 characters. But, somehow I can't quite
see a function hardcoded to deal with ISO-8859-1 being the slightest bit
useful for anything else.

-- brion vibber (brion @ pobox.com)

Re: New case conversion functions [ In reply to ]
I'm reposting this to wikitech-l so that discussion doesn't get lost.

On mer, 2002-02-20 at 15:20, lcrocker@nupedia.com wrote:
> You Wrote:
> > Please take a
> > look at the non-English non-ISO-8859-1 wikipedias sometime.
> >
> >Hundreds of pages, with correct charset headers:
> > ISO-8859-2:
> > http://pl.wikipedia.com/
> > UTF-8 with a custom conversion function for certain character
> > sequences:
> > http://eo.wikipedia.com/
>
>
> You're right. Last time I looked at these, the test pages I retrieved
> gave 404s, and the 404 page is still served as ISO-8859-1, but the
> headers of contentful pages are indeed as you say: 8859-2 for "pl"
> and UTF-8 for "eo", etc.
>
> OK, then, I guess we do have to wade into the morass of national
> character sets.

Unless you want to switch to UTF-8, that is a given.

> I have little or no experience using actual foreign-
> made computers; but I /do/ have extensive knowledge about character
> sets and communication protocols, so I'm just trying to make sure we
> don't make the same mistakes hundreds of others have made in the past
> by not getting this stuff right up front, but just diving headlong
> into coding without stepping back a moment to design something that
> will be usable and maintainable in the future.
>
> The way it is now, for example, we won't be able to cut-and-paste
> between wikis if, say, I wanted to include a quote from some Polish
> leader or something.

Sad but true.

> Maybe that's a reasonable sacrifice for ease of
> editing on those wikis.

Lee, let me put it this way. Imagine, if you will, that history had gone
somewhat differently. Let's say that the first computers had been
developed in a politically free, economically strong, highly
industrialized Russia and the standard computer character set around the
world had been based on the Cyrillic alphabet.

In our hypothetical world, there's a Russian version of what we would
have called Wikipedia. They set up some subsites in other languages, one
of which is English, which uses the Latin alphabet.

Now, you want to add some articles to the English site, but the site
administrators have declared that only the standard cyrillic character
set is to be used, with special markup to allow other characters through
the use of numerical codes. This means:
* Pages display fine for viewing, but when you edit, you see nothing
but numeric escape codes.
* You can't type *a single letter of English text* without using a
special numeric escape code.
* All page titles have to be transliterated into Cyrillic, because the
escape codes aren't allowed in titles.

Now, can you honestly tell me that you expect the average
English-speaking wiki contributor to edit a page that looks something
like this:
[[óèêèïçäèà:Óçëêîìç íçóêîìçðñ|Welcome]] to [[Óèêèïçäèà|Wikipedia]], a
collaborative project to produce a complete encyclopedia from scratch. We
started in January 2001 and already have over '''23,000 articles'''.
?

I can't imagine that you would expect that to be acceptable to anyone
else! You'll notice that the two non-ISO-8859-1-language 'pedias that
have actual content (Polish and Esperanto) both use the Latin alphabet
with a few diacritics. So theoretically, they would be the *most*
amenable to using HTML entities -- you can almost read text in the edit
box that way -- yet users of both wikipedias took the effort to tweak
the program to make their customary character encodings work so that
they could actually find people who would be willing to edit pages.

HTML entities are great for tossing in an occasional foreign letter or
word, but at the user level they are poor for regularly used diacritics
and utterly useless for text in other alphabets.

> We could, alternatively, serve UTF-8 on all
> of them, but that would risk breaking older browsers. There are side
> issues of what is stored in the database, and what is allowable in
> titles/URLs, etc.

Another alternative is to use the entities internally in the database,
but work some mojo to make them appear as normal characters in the edit
box. Which means you get zero advantage over simply using the national
character set -- you still have to send a character set header, you have
to know which Unicode characters can be passed through safely and which
need to be escaped, the search engine still breaks words, you still
can't capitalize non-ISO8859-1 titles, you still can't cut-n-paste, etc
etc etc. All of the pain, none of the gain.

> We really need to sit down and spec this out before we get too far
> down the road. That's one reason why I posted the proposed policy on
> foreign characters for the English Wiki; it is explicitly for the
> English one only, but we need something equivalent for the other
> ones.
>
> We had a lot of discussion about these topics in the early months of
> the project: I don't want us to ignore everything we learned back
> then just because the folks working on the code now weren't around
> back then.

Indeed. What were the conclusions of these discussions, and the
reasoning behind them?

-- brion vibber (brion @ pobox.com)
Re: New case conversion functions [ In reply to ]
Right now there is a localization problem wrt. indexing. The fulltext index
indexes single words and defines these as series of letters, numbers, and
the odd "'" and "_". Since the standard character set of MySQL is ISO 8859-1
I assume that it knows what are letters in that character set. I really
don't know how this behaves when the character set of MySQL is changed.
Available, by the way, are big5, cp1251, cp1257, czech, danish, dec8, dos,
euc_kr, gb2312, gbk, german1, hebrew, hp8, hungarian, koi8_ru, koi8_ukr,
latin1, latin2, sjis, swe7, tis620, ujis, usa7, and win1251ukr. But I don't
think we want to go that way because then (if I understand the documentation
correctly) we need a separate MySQL server for every character set. Anyway,
in all cases the indexing breaks down for entities because it doesn't index
words with '&' and ';' in them, so it sees "G&ouml;del" as "G" and "del"
with some funny symbols in between that it doesn't index. The indexing also
has no idea that this has something to do with "Godel".

Admittedly unaware of any previous discussion on this before, I would
suggest the following:
1. Internally, i.e., in the database fields and URLs we use for bodies and
titles only standard ASCII plus HTML entities. However, to allow indexing we
encode &#101; as something like '_101_' in the database fields.
2. Externally in search and edit boxes the user can type any character the
browser allows, but we always translate internally the non-ASCII ones to
entities.
3. When a request for a page is made we always translate the entities as
much as possible to the character set specified in the request, including
the contents edit boxes.

The main thing is to define the translation functions:

- string encodeEntities ( mb-string external-string, string character-set )
- mb-string decodeEntities ( string internal-string, string character-set )

(With mb-string I mean a multi-byte character string.)
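
To make that concrete, a minimal sketch of encodeEntities() for a
single-byte character set could look like this (the $charsetToUnicode
lookup table is invented and would have to be defined per supported
character set; genuinely multibyte input would need more care):

<?php
function encodeEntities( $externalString, $charset ) {
    global $charsetToUnicode;  // invented: [charset][byte value] => code point
    $out = "";
    $len = strlen( $externalString );
    for ( $i = 0; $i < $len; $i++ ) {
        $c = $externalString[$i];
        if ( ord( $c ) < 128 ) {
            $out .= $c;  // plain ASCII passes through untouched
        } else {
            // Everything else becomes a numeric character reference.
            $out .= "&#" . $charsetToUnicode[$charset][ord( $c )] . ";";
        }
    }
    return $out;
}
?>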

For localization we define the following functions:

- string canonicalTitle ( string internal-string ) translates an internal
title to its canonical form. It deals with capitalization, for example. If
two strings are translated to the same canonical form they are formally the
same title. If a string is translated to an empty string it is not a valid
title. If you don't want entities in your titles, you can define that here.
- string urlTitle ( string internal-string ) translates an internal
canonized title to its URL form. It probably only replaces space characters
with "+" and escapes ASCII characters that need to be escaped in an URL.

For these functions we also need to define arrays that associate entities
with their uppercase equivalents, and vice versa, for the relevant character
sets.
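
As a sketch of how those arrays and canonicalTitle() might fit together
(the names and rules below are guesses, not a spec):

<?php
// Invented name: entity -> uppercase entity, per relevant character set.
$entityToUpper = array(
    "&eacute;" => "&Eacute;",
    "&ouml;"   => "&Ouml;",
    "&#248;"   => "&#216;",   // ø -> Ø as numeric references
);

function canonicalTitle( $internalString ) {
    global $entityToUpper;
    $title = trim( $internalString );
    if ( $title == "" ) {
        return "";  // an empty result marks an invalid title
    }
    // Capitalize the first "letter", whether it is plain ASCII or an entity.
    foreach ( $entityToUpper as $lower => $upper ) {
        if ( strncmp( $title, $lower, strlen( $lower ) ) == 0 ) {
            return $upper . substr( $title, strlen( $lower ) );
        }
    }
    return ucfirst( $title );
}
?>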

Having said all this I also want to emphasize that we first need to have a
document that describes exactly how we are going to do this, before we code
another line for localization. We have to realize that we are a real project
now.

-- Jan Hidders
Re: New case conversion functions [ In reply to ]
On ĵaŭ, 2002-02-21 at 09:59, Jan Hidders wrote:
> Right now there is a localization problem wrt. indexing. The fulltext index
> indexes single words and defines these as series of letters, numbers, and
> the odd "'" and "_". Since the standard character set of MySQL is ISO 8859-1
> I assume that it knows what are letters in that character set. I really
> don't know how this behaves when the character set of MySQL is changed.
> Available, by the way, are big5, cp1251, cp1257, czech, danish, dec8, dos,
> euc_kr, gb2312, gbk, german1, hebrew, hp8, hungarian, koi8_ru, koi8_ukr,
> latin1, latin2, sjis, swe7, tis620, ujis, usa7, and win1251ukr. But I don't
> think we want to go that way because then (if I understand the documentation
> correctly) we need a separate MySQL server for every character set. Anyway,
> in all cases the indexing breaks down for entities because it doesn't index
> words with '&' and ';' in them, so it sees "G&ouml;del" as "G" and "del"
> with some funny symbols in between that it doesn't index. The indexing also
> has no idea that this has something to do with "Godel".
>
> Admittedly unaware of any previous discussion on this before, I would
> suggest the following:
> 1. Internally, i.e., in the database fields and URLs we use for bodies and
> titles only standard ASCII plus HTML entities. However, to allow indexing we
> encode &#101; as something like '_101_' in the database fields.
> 2. Externally in search and edit boxes the user can type any character the
> browser allows, but we always translate internally the non-ASCII ones to
> entities.
> 3. When a request for a page is made we always translate the entities as
> much as possible to the character set specified in the request, including
> the contents edit boxes.

Ugh. Doable, though. Presumably the point of this is so that someone can
type either:
ö (actual o-with-umlaut in the display character encoding)
&ouml;
&#214;
&#xd6;
&#x00D6;
or any number of other alternatives in the edit box and put the same
actual sequence of bytes into the data?
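
(One way to start folding those alternatives together would be to rewrite
hexadecimal references as decimal, as in the quick sketch below; named
entities like &ouml; would still need their own lookup table. This is only
an illustration, not existing code.)

<?php
function normalizeNumericEntities( $text ) {
    // &#xd6; and &#x00D6; both become &#214;.
    return preg_replace_callback(
        '/&#[xX]([0-9A-Fa-f]+);/',
        function ( $m ) { return "&#" . hexdec( $m[1] ) . ";"; },
        $text
    );
}
?>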

Also remember that we'll still have to escape entities that _aren't_ in
the display character set in all edit boxes, so that they won't be
disappeared or converted into "?"s when the user hits submit. (I'm
assuming that you don't want to put the raw HTML entities for _every_
non-ASCII character into the edit box appearing as the entity codes? See
my previous message on this subject for why that's a Very Bad Idea.)

> The main thing is to define the translation functions:
>
> - string encodeEntities ( mb-string external-string, string character-set )
> - mb-string decodeEntities ( string internal-string, string character-set )
>
> (With mb-string I mean a multi-byte character string.)

cf $wikiRecodeInput(), $wikiRecodeOutput() if you want a ready place to
do this.

(Keep in mind that ASCII-with-HTML-entities is for all intents and
purposes a multibyte character encoding. It switches from single-byte to
double-byte mode when encountering a "&", and self-recovers if it is not
followed by a correct multibyte code string ending in ";".)
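
(To illustrate, splitting such text into logical characters might look like
the sketch below -- just an example, not code that exists anywhere.)

<?php
function splitEntityChars( $text ) {
    $chars = array();
    $len = strlen( $text );
    for ( $i = 0; $i < $len; $i++ ) {
        if ( $text[$i] == '&'
            && preg_match( '/^&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);/',
                substr( $text, $i ), $m ) ) {
            $chars[] = $m[0];         // a well-formed entity is one "character"
            $i += strlen( $m[0] ) - 1;
        } else {
            $chars[] = $text[$i];     // single byte, including a stray "&"
        }
    }
    return $chars;
}

// splitEntityChars( "G&ouml;del & co" )
//   => array( "G", "&ouml;", "d", "e", "l", " ", "&", " ", "c", "o" )
?>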

> For localization we define the following functions:
>
> - string canonicalTitle ( string internal-string ) translates an internal
> title to its canonical form. It deals with capitalization, for example. If
> two strings are translated to the same canonical form they are formally the
> same title. If a string is translated to an empty string it is not a valid
> title. If you don't want entities in your titles, you can define that here.
> - string urlTitle ( string internal-string ) translates an internal
> canonized title to its URL form. It probably only replaces space characters
> with "+" and escapes ASCII characters that need to be escaped in an URL.

cf. wikiTitle->makeSecureTitle()

> For these functions we also need to define arrays that associate entities
> with their uppercase equivalents, and vice versa, for the relevant character
> sets.

Easy enough, I can generate that from the Unicode data tables.

> Having said all this I also want to emphasize that we first need to have a
> document that describes exactly how we are going to do this, before we code
> another line for localization. We have to realize that we are a real project
> now.

Yes, a real project that's already running and has thousands of pages
that don't conform to the as-yet-nonexistent document. Hopefully we can
munge them together!

-- brion vibber (brion @ pobox.com)
Re: New case conversion functions [ In reply to ]
From: "Brion Vibber" <brion@pobox.com>
>
> Ugh. Doable, though. Presumably the point of this is so that someone can
> type either:
> ö (actual o-with-umlaut in the display character encoding)
> &ouml;
> &#214;
> &#xd6;
> &#x00D6;
> or any number of other alternatives in the edit box and put the same
> actual sequence of bytes into the data?

Exactly.

> Also remember that we'll still have to escape entities that _aren't_ in
> the display character set in all edit boxes, so that they won't be
> disappeared or converted into "?"s when the user hits submit. (I'm
> assuming that you don't want to put the raw HTML entities for _every_
> non-ASCII character into the edit box appearing as the entity codes? See
> my previous message on this subject for why that's a Very Bad Idea.)

No, no, of course not. What I meant was that we check what the display
character encoding (I said character set, but that is probably not the right
word) is that is given with the request. Suppose someone asks for an edit
page and the browser tells us that it uses ISO-8859-5 (which supports
Cyrillic) then we present the contents of the edit box such that all
entities that have a direct encoding in ISO-8859-5 are translated and all
the other entities simply stay themselves. I think we need
multibyte-character support for this.
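
Concretely, for a single-byte target like ISO-8859-5 the decoding side
could be sketched as follows (the $unicodeToCharset table is invented,
one map per supported character set, and named entities would first have
to be normalized to numeric form):

<?php
function decodeEntities( $internalString, $charset ) {
    global $unicodeToCharset;  // invented: [charset][code point] => byte
    $table = $unicodeToCharset[$charset];
    return preg_replace_callback(
        '/&#([0-9]+);/',
        function ( $m ) use ( $table ) {
            $cp = (int)$m[1];
            if ( $cp < 128 ) {
                return chr( $cp );   // plain ASCII
            }
            if ( isset( $table[$cp] ) ) {
                return $table[$cp];  // representable: emit the real character
            }
            return $m[0];            // not representable: leave the entity alone
        },
        $internalString
    );
}
?>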

The nice thing about this would be that you can cut'n paste anything from
any other Wikipedia by cutting it from the edit box.

> > The main thing is to define the translation functions:
> >
> > - string encodeEntities ( mb-string external-string, string character-set )
> > - mb-string decodeEntities ( string internal-string, string character-set )
> >
> > (With mb-string I mean a multi-byte character string.)
>
> cf $wikiRecodeInput(), $wikiRecodeOutput() if you want a ready place to
> do this.

It doesn't have the right arguments. But these are implementation details.
We first should agree on the architecture.

-- Jan Hidders
Re: New case conversion functions [ In reply to ]
On ven, 2002-02-22 at 02:33, Jan Hidders wrote:
> From: "Brion Vibber" <brion@pobox.com>
> >
> > Ugh. Doable, though. Presumably the point of this is so that someone can
> > type either:
> > ö (actual o-with-umlaut in the display character encoding)
> > &ouml;
> > &#214;
> > &#xd6;
> > &#x00D6;
> > or any number of other alternatives in the edit box and put the same
> > actual sequence of bytes into the data?
>
> Exactly.
>
> > Also remember that we'll still have to escape entities that _aren't_ in
> > the display character set in all edit boxes, so that they won't be
> > disappeared or converted into "?"s when the user hits submit. (I'm
> > assuming that you don't want to put the raw HTML entities for _every_
> > non-ASCII character into the edit box appearing as the entity codes? See
> > my previous message on this subject for why that's a Very Bad Idea.)
>
> No, no, of course not. What I meant was that we check what the display
> character encoding (I said character set, but that is probably not the right
> word) is that is given with the request. Suppose someone asks for an edit
> page and the browser tells us that it uses ISO-8859-5 (which supports
> Cyrillic) then we present the contents of the edit box such that all
> entities that have a direct encoding in ISO-8859-5 are translated and all
> the other entities simply stay themselves. I think we need
> multibyte-character support for this.

(Not necessarily, just lots of transliteration tables. Or perhaps
compile in iconv support... does iconv allow partial transliteration
between HTML entities and other character sets? ie, HTML entities that
are not in the destination charset are left intact?)

> The nice thing about this would be that you can cut'n paste anything from
> any other Wikipedia by cutting it from the edit box.

If I understand correctly, you're suggesting that the default character
encoding should *not* be based on the language used, but on some ability
of the browser to specify a preferred encoding (for instance, the HTTP
Accept-Charset header), such that the same user would see wikipedias in
different languages come up with the same character encoding?

I'm not convinced that the default value of that would always (or even
often) be acceptable, and most users won't know how to change it.
Simple, obvious at first sight manual switching between UTF-8 and a
standard transliteration format is a non-negotiable requirement for the
Esperanto 'pedia, so retaining the manual override is necessary.

> > > The main thing is to define the translation functions:
> > >
> > > - string encodeEntities ( mb-string external-string, string character-set )
> > > - mb-string decodeEntities ( string internal-string, string character-set )
> > >
> > > (With mb-string I mean a multi-byte character string.)
> >
> > cf $wikiRecodeInput(), $wikiRecodeOutput() if you want a ready place to
> > do this.
>
> It doesn't have the right arguments. But these are implementation details.
> We first should agree on the architecture.

There's no character set argument because that's a global variable. At
present $wikiCharset specifies the default encoding (that used in the
database), and optional alternate external encodings are in
$wikiCharsetEncodings[] with the user-selected index in
$user->options["encoding"].
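
For illustration, a language file might set something along these lines
(the values below are guesses, not a copy of any real wikiText*.php file):

<?php
$wikiCharset = "utf-8";  // default encoding, i.e. what the database holds
$wikiCharsetEncodings = array(
    "utf-8"   => "UTF-8",
    "latin-3" => "ISO-8859-3",  // an optional alternate external encoding
);
// A logged-in user's preference selects one of the keys above:
// $user->options["encoding"] = "latin-3";
?>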

-- brion vibber (brion @ pobox.com)
Re: New case conversion functions [ In reply to ]
From: "Brion Vibber" <brion@pobox.com>
> On ven, 2002-02-22 at 02:33, Jan Hidders wrote:
> >
> > No, no, of course not. What I meant was that we check what the display
> > character encoding (I said character set, but that is probably not the right
> > word) is that is given with the request. [....]
>
> (Not necessarily, just lots of transliteration tables. Or perhaps
> compile in iconv support... does iconv allow partial transliteration
> between HTML entities and other character sets? ie, HTML entities that
> are not in the destination charset are left intact?)

Ah, I didn't know about iconv. My guess would be that it doesn't know about
entities: not handling them should be the default behaviour, and I don't see
any flags to indicate that it can.

> > The nice thing about this would be that you can cut'n paste anything from
> > any other Wikipedia by cutting it from the edit box.
>
> If I understand correctly, you're suggesting that the default character
> encoding should *not* be based on the language used, but on some ability
> of the browser to specify a preferred encoding (for instance, the HTTP
> Accept-Charset header), such that the same user would see wikipedias in
> different languages come up with the same character encoding?

Yes, and the Accept-Charset header is indeed what I was thinking of.

> I'm not convinced that the default value of that would always (or even
> often) be acceptable, and most users won't know how to change it.

Can you give an example? What is the current behaviour of the Esperanto
Wikipedia?

> Simple, obvious at first sight manual switching between UTF-8 and a
> standard transliteration format is a non-negotiable requirement for the
> Esperanto 'pedia, so retaining the manual override is necessary.

Of course, users will still be able to choose their encoding (provided they
log in). Note that I was only talking about the editable boxes such as the
edit box and the search box. It makes sense to me to ask people to log in if
they want special behavior there. The encoding for presentation of pages is
another matter that can be decided separately.

> > > cf $wikiRecodeInput(), $wikiRecodeOutput() if you want a ready place to
> > > do this.
> >
> > It doesn't have the right arguments. But these are implementation details.
> > We first should agree on the architecture.
>
> There's no character set argument because that's a global variable. At
> present $wikiCharset specifies the default encoding (that used in the
> database),

Ok, then it would be the right place. But note that for indexing reasons the
meaning you gave for that variable can no longer be correct. The encoding
in the database *has* to be different from the representation in the edit
box, and there may even be another encoding used for representing the page.

-- Jan Hidders