Mailing List Archive

Sort order
At some point in the near future I'll be adding in a per-language sort
order adjustment, so that various sorted lists should turn out in more
or less correct order for a change. :)

I'd appreciate pointers to descriptions of various languages' sorting
requirements so I can try to get them right.

I don't know if we can handle Japanese and Chinese sensibly, but
alphabetic languages should generally work fairly well by making a
munged copy of the string such that, eg, if "ó" sorts as the same as
"o" we just change it to "o"; if "ó" sorts after "o" (as in Polish
IIRC), it becomes "o~", which should always sort after any "o" and
before any "p" in a binary ASCII-order string sort.

Simple replacements should generally work, though we can also do more
complicated replacements of certain sequences of characters.
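
A minimal sketch of the munged-key idea, written in Python for illustration rather than in the wiki's own code. The names POLISH_TABLE and sort_key are made up here, and the table covers only a handful of letters, so treat it as an assumption about how such a table might look, not a complete Polish ordering.

```python
# Hypothetical replacement table: each accented letter becomes its base
# letter plus "~", which sorts after every ASCII letter.
POLISH_TABLE = {
    "ą": "a~", "ć": "c~", "ę": "e~", "ł": "l~",
    "ń": "n~", "ó": "o~", "ś": "s~",
}

def sort_key(title, table=POLISH_TABLE):
    """Return a munged copy of title for a plain binary string sort."""
    return "".join(table.get(ch, ch) for ch in title.lower())

words = ["opat", "ósmy", "oset", "pan"]
print(sorted(words, key=sort_key))
```

With the munged key, "ósmy" lands after every plain "o" word and before "pan"; without it, a code point sort would push "ósmy" to the very end.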

-- brion vibber (brion @ pobox.com)
Re: Sort order [ In reply to ]
On Tue, May 20, 2003 at 11:29:31AM -0700, Brion Vibber wrote:
> I don't know if we can handle Japanese and Chinese sensibly, but

FYI, Japanese is complicated to sort.

I believe this is the order:

Kana:
a i u e o
ka/ga ki/gi (kya/gya kyu/gyu kyo/gyo) ku/gu ke/ge ko/go
sa/za shi/ji (sha/ja shu/ju sho/jo) su/zu se/ze so/zo
ta/da chi/X (cha chu cho) tsu/dzu te/de to/do
na ni (nya nyu nyo) nu ne no
ha/ba/pa hi/bi/pi (hya/bya/pya hyu/byu/pyu hyo/byo/pyo) fu/bu/pu \
he/be/pe ho/bo/po
ma mi (mya myu myo) mu me mo
ya yu yo
ra ri (rya ryu ryo) ru re ro
wa (wo [particle, tho])
n

As far as Kanji goes, I believe it is sorted (any of these is OK):
* First by a character's primary radical, then by remaining strokes
* First by total strokes, then by primary radical
* Sound

Good luck! ;)
--
Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN
Re: Sort order [ In reply to ]
On Tue, May 20, 2003 at 11:29:31AM -0700, Brion Vibber wrote:
> At some point in the near future I'll be adding in a per-language sort
> order adjustment, so that various sorted lists should turn out in more
> or less correct order for a change. :)
>
> I'd appreciate pointers to descriptions of various languages' sorting
> requirements so I can try to get them right.
>
> I don't know if we can handle Japanese and Chinese sensibly, but
> alphabetic languages should generally work fairly well by making a
> munged copy of the string such that, eg, if "ó" sorts as the same as
> "o" we just change it to "o"; if "ó" sorts after "o" (as in Polish
> IIRC), it becomes "o~", which should always sort after any "o" and
> before any "p" in a binary ASCII-order string sort.
>
> Simple replacements should generally work, though we can also do more
> complicated replacements of certain sequences of characters.

1.
In some languages certain letter pairs are treated as a single letter.
For example, in Czech "ch" is a letter that sorts after "h", so
"ca", "cz", "da", "ha", "ch" would be the correct sort order ;)
Polish is 100% sane about that, with the possible exception of having
two diacritics based on z (order: y z ź ż).
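
A rough sketch of how a letter pair could be handled with a sequence replacement, assuming the usual Czech alphabet in which "ch" follows "h". The function name czech_key is invented for this example, and the sketch ignores Czech diacritics entirely.

```python
def czech_key(word):
    """Munge "ch" so that it sorts as one letter between "h" and "i"."""
    # "h~" is greater than "h" followed by any ASCII letter, and less
    # than "i", so the pair behaves like a single letter after "h".
    return word.lower().replace("ch", "h~")

words = ["chata", "cena", "hrad", "dub", "ihned"]
print(sorted(words, key=czech_key))
```

A plain sort would put "chata" between "cena" and "dub"; with the munged key it moves to after "hrad".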

2.
Some languages sort first by primary then by secondary characteristics,
so it's *not lexicographical order*
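For example, to sort Japanese kana you have to do something like this
(strip_marks and compare are made-up helper names):
    a = strip_marks(x);  /* drop dakuten (") and handakuten (o) */
    b = strip_marks(y);
    if (compare(a, b) != 0)
        return compare(a, b);  /* primary: the bare kana */
    else
        return compare(x, y);  /* secondary: the marks break ties */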

So order is like: kou gou kouin.
Then, sorting kanji is even worse.
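
The two-level comparison for kana can be sketched in Python with a tuple key. This is only an illustration: strip_marks and kana_key are hypothetical names, it relies on Unicode decomposition to separate the voicing marks, and it makes no attempt at kanji or at long vowel marks.

```python
import unicodedata

# Combining dakuten (U+3099) and handakuten (U+309A).
MARKS = {"\u3099", "\u309a"}

def strip_marks(kana):
    """Remove dakuten and handakuten, e.g. "ご" becomes "こ"."""
    decomposed = unicodedata.normalize("NFD", kana)
    bare = "".join(ch for ch in decomposed if ch not in MARKS)
    return unicodedata.normalize("NFC", bare)

def kana_key(kana):
    # Primary: the unmarked kana. Secondary: the original string.
    return (strip_marks(kana), kana)

words = ["こういん", "ごう", "こう"]  # kouin, gou, kou
print(sorted(words, key=kana_key))
```

The tuple key gives the order kou, gou, kouin, whereas a purely lexicographic sort would put kouin before gou.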
Re: Sort order [ In reply to ]
> (Brion Vibber <brion@pobox.com>):
> At some point in the near future I'll be adding in a per-language sort
> order adjustment, so that various sorted lists should turn out in more
> or less correct order for a change. :)
>
> I'd appreciate pointers to descriptions of various languages' sorting
> requirements so I can try to get them right.

Collation rules for all languages are defined in the Unicode spec;
I believe MySQL contains many of them, but I'm not sure how to tell
it how to use them. It's often a lot more complex than doing a few
character substitutions, even for some fairly common languages (for
example, Spanish requires some 2-to-1 subs, German a 1-to-2, and
French uses accents only when necessary).
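
To illustrate the German 1-to-2 case, here is a small Python sketch. It assumes the dictionary convention in which umlauts sort as their base vowel and "ß" sorts as "ss"; GERMAN_TABLE and german_key are names invented for the example, and this is nowhere near the full Unicode collation algorithm.

```python
# Hypothetical table: umlauts sort as their base vowel and "ß" expands
# to the two letters "ss".
GERMAN_TABLE = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})

def german_key(word):
    """Return a lowercase key with umlauts folded and "ß" expanded."""
    return word.lower().translate(GERMAN_TABLE)

words = ["Straße", "Strand", "Stuhl", "Öl", "Ofen"]
print(sorted(words, key=german_key))
```

Words that differ only in the replaced letters (such as "Strasse" and "Straße") get identical keys here, which is one place where a real collation needs a second comparison level.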

--
Lee Daniel Crocker <lee@piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC
Re: Sort order [ In reply to ]
On Tue, 20 May 2003, Lee Daniel Crocker wrote:
> Collation rules for all languages are defined in the Unicode spec;

Well, that could be handy. :) I'll see if I can dig them up.

hmm... This looks like a place to start:
http://www.unicode.org/unicode/reports/tr10/


> I believe MySQL contains many of them, but I'm not sure how to tell
> it how to use them.

MySQL's really ugly in this regard. First, no UTF-8 support at all.* The
collation order modules that it does have (for some 8-bit charsets and
some multibyte) can only be enabled on a server-wide basis, so we can't
say "this database sorts as english, this one sorts as german, this one
sorts as polish" unless we run separate instances of MySQL.

* Allegedly 4.1 has/will have some unicode support. It's not stable
though.

** Yes, I know PostgreSQL has Unicode support. :) I don't know if it
supports per-table or per-column selection of collation order, and there
would be much other work to get Wikipedia running on it.

-- brion vibber (brion @ pobox.com)
Re: Sort order [ In reply to ]
On Tue, May 20, 2003 at 01:46:48PM -0700, Brion Vibber wrote:
> On Tue, 20 May 2003, Lee Daniel Crocker wrote:
> > Collation rules for all languages are defined in the Unicode spec;
>
> Well, that could be handy. :) I'll see if I can dig them up.
>
> hmm... This looks like a place to start:
> http://www.unicode.org/unicode/reports/tr10/
>
>
> > I believe MySQL contains many of them, but I'm not sure how to tell
> > it how to use them.
>
> MySQL's really ugly in this regard. First, no UTF-8 support at all.* The
> collation order modules that it does have (for some 8-bit charsets and
> some multibyte) can only be enabled on a server-wide basis, so we can't
> say "this database sorts as english, this one sorts as german, this one
> sorts as polish" unless we run separate instances of MySQL.
>
> * Allegedly 4.1 has/will have some unicode support. It's not stable
> though.
>
> ** Yes, I know PostgreSQL has Unicode support. :) I don't know if it
> supports per-table or per-column selection of collation order, and there
> would be much other work to get Wikipedia running on it.

Well, PostgreSQL allows you to set the encoding on a per database basis.
So, you can have some databases with UTF-8, some with EUC_JP, etc. I
don't think you can have some ASCII rows and some unicode rows, although
I could certainly be wrong. Its collation rules are based on whatever
character set the database is.

--
Nick Reinking -- eschewing obfuscation since 1981 -- Minneapolis, MN
Re: Sort order [ In reply to ]
Nick Reinking wrote:
>
> Well, PostgreSQL allows you to set the encoding on a per database basis.
> So, you can have some databases with UTF-8, some with EUC_JP, etc. I
> don't think you can have some ASCII rows and some unicode rows, although
> I could certainly be wrong. Its collation rules are based on whatever
> character set the database is.
>

Database is correct. See:
http://www.postgresql.org/docs/view.php?version=7.3&idoc=0&file=multibyte.html


--
Smurf

smurf@AdamAnt.mud.de
------------------------- Anthill inside! ---------------------------
Re: Sort order [ In reply to ]
Brion Vibber wrote:
> At some point in the near future I'll be adding in a per-language sort
> order adjustment, so that various sorted lists should turn out in more
> or less correct order for a change. :)
>
> I'd appreciate pointers to descriptions of various languages' sorting
> requirements so I can try to get them right.

If you set the environment variable LC_COLLATE to sv_SE.ISO8859-1 or
sv_SE.UTF-8, Linux sort(1), strcoll(3) (for example as the comparison
function passed to qsort(3)) and MySQL will do the right thing for
Swedish. I think this is true for PHP as well.
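
For comparison, a sketch of the same locale-based approach from Python's standard locale module. The helper name sort_with_locale is made up, and the result depends on the sv_SE.UTF-8 locale actually being installed on the machine, which is an assumption.

```python
import locale

def sort_with_locale(words, name="sv_SE.UTF-8"):
    """Sort words by the collation rules of the named locale.

    Returns None if that locale is not installed on this system.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None
    try:
        # strxfrm() turns each string into a key that compares in
        # locale order.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)

print(sort_with_locale(["ö", "z", "å", "a", "ä"]))
```

The Swedish alphabet places å, ä, ö after z, so that is the order to look for when the locale is available.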


--
Lars Aronsson (lars@aronsson.se)
Aronsson Datateknik - http://aronsson.se/
Re: [Intlwiki-l] Sort order [ In reply to ]
> At some point in the near future I'll be adding in a per-language sort
> order adjustment, so that various sorted lists should turn out in more
> or less correct order for a change. :)
>
> I'd appreciate pointers to descriptions of various languages' sorting
> requirements so I can try to get them right.

I have recently made a little PHP script that sorts correctly in Danish.
You can find the result here: http://www.wikipedia.dk/wiki/sortering.php
The code snippet below does the actual sorting; maybe it can give you some
inspiration on how to do it.
The key functions here are strtr(), which replaces all the weird characters
with the correct characters for sorting in Danish, and usort(), which does
the actual sorting.

<?php
// Map each character to a sort key so that plain byte order gives Danish
// order (... z æ ø å). Assumes a single-byte charset such as ISO-8859-1;
// strtr() works on bytes and will not handle UTF-8 input correctly.
function sortkey($s)
{
    return strtr($s,
        "SOZsozY¥µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿABCDEFGHIJKLMNOPQRSTUVWXYZ",
        "sozsozyyuaaaaaøåceeeeiiiidnoooooæuuuuysaaaaaøåceeeeiiiionoooooæuuuuyyabcdefghijklmnopqrstuvwxyz");
}

function cmp($a, $b)
{
    // strcmp() compares byte by byte; a string that is a prefix of
    // another sorts before the longer one.
    $result = strcmp(sortkey($a), sortkey($b));
    // "Stigende" means ascending; anything else sorts descending.
    return ($_POST["Orden"] == "Stigende") ? $result : -$result;
}

$tekst = $_POST["foo"];
$myarray = explode("\n", $tekst);
usort($myarray, "cmp");
$tekst = implode("\n", $myarray);
print $tekst;
?>

Regards
Christian

BTW: I have bought wikipedia.dk and made it redirect to da.wikipedia.org,
and it will stay that way until the folks at da.wikipedia.org decide
otherwise. I will also place little scripts like this Danish sorting script
on http://www.wikipedia.dk/wiki/sortering.php