Mailing List Archive

Locales, sorting, and character encodings
Hi everybody,

I've just been fighting with sorting and alphabetical ordering in
multiple languages, and I've got things to work, but I'm a little
puzzled about how. So if anybody has any insight, I'd be grateful.

This is for IFEX, on something called the "Digest." It's a
regularly-published list of items recently published on the site. You
can see an example here:

http://www.ifex.org/2010/02/12/digest/

It's a big alphabetical list of regions (OK, "International" is at the
top), and within each region is an alphabetical list of countries.

I had been doing the alphabetization with the Schwartz, looking up the
name of each country according to the output channel:

my @alphabetized_cats =
    map  { $_->[0] }
    sort { $a->[1] cmp $b->[1] }
    map  { [ $_ => $m->scomp('/util/translations.mc', word => $_) ] }
    keys(%all_cats);

(translations.mc maps category URIs to country names based on the
current OC).

This was mostly fine, except that the vanilla Perl sort is really only
good for asciibetical order. In Friday's Digest, "Rwanda" was coming
before "République démocratique du Congo."
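A minimal sketch of why the plain sort does this, assuming the names are decoded character strings (the two names are the ones from the Digest; everything else is illustrative). Without "use locale", cmp compares code point by code point, and "é" has a higher code point than any unaccented ASCII letter:

```perl
use strict;
use warnings;
use utf8;    # the string literals below are UTF-8 source text

binmode(STDOUT, ':encoding(UTF-8)');

my @names = ('République démocratique du Congo', 'Rwanda');

# No "use locale" in scope, so cmp is a plain code point comparison.
my @sorted = sort { $a cmp $b } @names;

# Show the code points of the two second letters that decide the order.
printf "%s = U+%04X\n", $_, ord($_) for ('w', 'é');

print "$_\n" for @sorted;
```

Both names start with "R", so the second letter decides, and "w" (U+0077) sorts before "é" (U+00E9).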

So I've been trying to use locales, like this:

my %ocs_to_locales = (
    'Web (French)'  => 'fr_FR.utf8',
    'Web (Spanish)' => 'es_ES.utf8',
    'Web (Russian)' => 'ru_RU.utf8',
    'Web (Arabic)'  => 'ar_EG.utf8',
);

use POSIX;
use locale;
if ($ocs_to_locales{$burner->get_oc->get_name}) {
    POSIX::setlocale(LC_COLLATE,
        $ocs_to_locales{$burner->get_oc->get_name});
}

...then do the sort, and then add this line afterward:

no locale;
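
One thing that may be worth checking with this approach: POSIX::setlocale returns undef when the requested locale is not installed, and in that case the collation silently stays whatever it was before. The sketch below is only an illustration under that assumption; sort_in_locale is a hypothetical helper name, not something from the code above. It checks the return value, keeps "use locale" confined to the block that does the sort, and restores the previous collation afterward:

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_COLLATE);

# Hypothetical helper: sort @names using $locale for collation,
# falling back to the plain sort if the locale cannot be set.
sub sort_in_locale {
    my ($locale, @names) = @_;

    # Calling setlocale without a locale name queries the current setting.
    my $previous = setlocale(LC_COLLATE);

    my @sorted;
    if (defined setlocale(LC_COLLATE, $locale)) {
        {
            use locale;    # lexically scoped: only this block's cmp is affected
            @sorted = sort { $a cmp $b } @names;
        }
        # Put the collation back the way it was.
        setlocale(LC_COLLATE, $previous) if defined $previous;
    }
    else {
        warn "locale '$locale' is not available; using plain sort\n";
        @sorted = sort { $a cmp $b } @names;
    }
    return @sorted;
}
```

Called as sort_in_locale('fr_FR', @country_names), this would at least make a missing locale visible in the logs instead of quietly producing the asciibetical order.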


Sadly, the utf8 locales seem to have the characters in completely nutty
order. "Rwanda" still came before "République démocratique du Congo."

Dropping the ".utf8" from the French locale name, and using just "fr_FR"
works, though. So I'm full of hope for Spanish and Arabic.

Now, everything in the site is all UTF8, so I'm puzzled about why the
".utf8" locales turned out to be bad choices. Does anybody have any
idea?


Thanks,

Bret



--
Bret Dawson
Producer
Pectopah Productions Inc.
(416) 895-7635
bret@pectopah.com
www.pectopah.com
Re: Locales, sorting, and character encodings
Hi Bret - this looks complicated.

Did you ever get it to work?

Dawn

On 15-Feb-10, at 1:37 PM, Bret Dawson wrote:

> Hi everybody,
>
> I've just been fighting with sorting and alphabetical ordering in
> multiple languages, and I've got things to work, but I'm a little
> puzzled about how. So if anybody has any insight, I'd be grateful.
Re: Locales, sorting, and character encodings
Hi Dawn,


Yes, it works just fine. I was just confused about why.

When you use the ".utf8" locales, characters are sorted in the wrong
order, so that accented letters come at the end of the alphabet.

Drop the extension, though, and just use something like "fr_FR," and
your sorting comes out perfect.

I had sort of expected the opposite to be the case and wondered if
anyone knew why.

But the happy news is that locale-based alphabetical sorting works just
great, provided the locales you need are installed. (Thanks Alex!)


Cheers,

Bret



On Sat, 2010-03-06 at 22:02 -0500, Dawn Buie wrote:
> Hi Bret - this looks complicated.
>
> Did you ever get it to work?
>
> Dawn


Re: Locales, sorting, and character encodings
On Mar 7, 2010, at 10:11 AM, Bret Dawson wrote:

> Yes, it works just fine. I was just confused about why.
>
> When you use the ".utf8" locales, characters are sorted in the wrong
> order, so that accented letters come at the end of the alphabet.
>
> Drop the extension, though, and just use something like "fr_FR," and
> your sorting comes out perfect.
>
> I had sort of expected the opposite to be the case and wondered if
> anyone knew why.
>
> But the happy news is that locale-based alphabetical sorting works just
> great, provided the locales you need are installed. (Thanks Alex!)

Different locales use different collations. The utf8 locales might use
utf8-abetical collation, which may not be what you want. But different
vendors provide different versions of locales with different collations.
For example, I've noticed that using en_US.utf8 in PostgreSQL leads to
different sort ordering on Linux than on Mac OS X. IIRC, OS X sorted
accented characters the way you want (and I want), while Linux did not.

The situation with locales and collations is, frankly, a complete
clusterfuck. It seriously needs standardization.

Best,

David
Re: Locales, sorting, and character encodings
Ah. That makes sense.

This was Linux (Gentoo) + PostgreSQL, so maybe a good general rule is to
avoid the utf8 locales on Linux. Either that, or keep trying until you
find a locale that does what you want. :/


Cheers,

Bret




On Sun, 2010-03-07 at 10:22 -0800, David E. Wheeler wrote:
> Different locales use different collations. The utf8 locales might use utf8-abetical collation, which may not be what you want. But different vendors provide different versions of locales with different collations. For example, I've noticed that using en_US.utf8 in PostgreSQL leads to different sort ordering on Linux than on Mac OS X. IIRC, OS X sorted accented characters the way you want (and I want), while Linux did not.
>
> The situation with locales and collations is, frankly, a complete clusterfuck. It seriously needs standardization.
>
> Best,
>
> David
Re: Locales, sorting, and character encodings
On Mar 7, 2010, at 12:45 PM, Bret Dawson wrote:

> This was Linux (Gentoo) + PostgreSQL, so maybe a good general rule is to
> avoid the utf8 locales on Linux. Either that, or keep trying until you
> find a locale that does what you want. :/

No, you want the utf8 locales. But there needs to be a better way to get
good ones on Linux. I don't know what it is, though.

David
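
Nobody in the thread proposes this, so treat it as a side note rather than the answer: one way to avoid depending on the system's locale definitions at all is the Unicode::Collate module that ships with Perl, which implements the Unicode Collation Algorithm. A minimal sketch, assuming the names are already decoded character strings and that the default (untailored) collation is acceptable for the language in question:

```perl
use strict;
use warnings;
use utf8;    # the string literals below are UTF-8 source text
use Unicode::Collate;

binmode(STDOUT, ':encoding(UTF-8)');

my @names = ('Rwanda', 'République démocratique du Congo');

# The default collator compares base letters first, so "é" is
# treated as an "e" with an accent rather than as a high code point.
my $collator = Unicode::Collate->new;
my @sorted   = $collator->sort(@names);

print "$_\n" for @sorted;
```

Here "République démocratique du Congo" sorts ahead of "Rwanda", because the base letter "e" comes before "w". Language-specific rules (Spanish "ñ", for instance) are a separate question that the default collator does not settle.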