UTF-8 HOWTO
I finally got around to converting our Apache::ASP application so that
it uses UTF-8 throughout, instead of Latin-1. I learned a few things
that aren't discussed in the archives, so I'm setting them down here for
others to find.

1. Use as recent a Perl as you can. 5.8.0 is adequate, but has known
bugs in its Unicode handling. Under 5.8.0, our program exhibits a
double UTF-8 conversion in one circumstance, while every other screen
shows the data correctly. Under 5.8.5, the same program shows correct
data on every screen. While it's theoretically possible to get Perl
5.6.x to cope with UTF-8 data, I don't recommend messing with it. A few
years ago, when I first tried using UTF-8 under 5.6, I had many problems
with Perl smashing my data back to Latin-1 incorrectly.

2. Also use the newest mod_perl you can. There are known Unicode bugs
in mod_perl 1.99_09 and older.

3. You must say "use utf8;" at the top of each ASP file. If you use
$Response->Include(), each included file also has to say "use utf8;".
The same goes for any Perl modules you use, if you will be passing UTF-8
strings through them.
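
One documented effect of the pragma, which may or may not be the whole
story for the pages described above, is that it tells Perl the source
text itself is UTF-8, so non-ASCII string literals become characters
rather than raw bytes. A minimal sketch, assuming the file is saved as
UTF-8 (the variable names are only for illustration):

```perl
use strict;
use warnings;
use utf8;    # declare that the literals in this file are UTF-8 encoded

# With the pragma in effect, the literal is a single character, U+00E9.
my $with_pragma = "é";

my $without_pragma;
{
    no utf8;    # lexically scoped: inside this block literals are raw bytes
    $without_pragma = "é";
}

print length($with_pragma),    "\n";   # 1 (one character)
print length($without_pragma), "\n";   # 2 (the bytes 0xC3 0xA9)
```

A two-byte string like the second one, if it is later encoded to UTF-8
again on output, is one way to end up with the doubled encoding
described in this thread.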

4. mod_perl doesn't set the LANG environment variable unless you ask it
to. Perl 5.8.0 uses the locale environment variables (LANG among them)
to decide whether to use UTF-8 by default; as far as I can tell from
the release notes, 5.8.1 and newer dropped that implicit behavior in
favor of the -C switch and the PERL_UNICODE variable. I didn't find it
necessary to ask mod_perl to set LANG for my program, but it can't hurt
to do it. If nothing else, it's one less thing you have to blame if
your pages aren't showing the right data. In your httpd.conf, right
after "PerlModule Apache::ASP", say "PerlPassEnv LANG". This will pass
your system's default value for LANG through to the mod_perl instances,
and thus to Apache::ASP.
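
Put together, the relevant httpd.conf lines would look roughly like
this. It is a fragment only, and it assumes LANG is actually set in the
environment that Apache is started from:

```
PerlModule Apache::ASP
# Pass the server's LANG value through to the mod_perl interpreters.
PerlPassEnv LANG
```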

5. Ensure that your data source is passing UTF-8 data correctly. In our
program, the data comes in via an XML path, so we needed to inform the
XML parser that the data is UTF-8. Otherwise, the XML parser assumes
it's Latin-1, and you get a double UTF-8 conversion.
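
The setting that does this depends on the XML parser, so the sketch
below makes the same point with plain PerlIO layers instead. It is a
stand-in, not the parser configuration used in our program, and the
sample bytes are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical input: the UTF-8 bytes for "café" plus a newline, as they
# might arrive from a file or socket.
my $bytes = "caf\xC3\xA9\n";

# No encoding declared: each byte is taken as one Latin-1 character.
open my $raw, '<', \$bytes or die "open: $!";
my $undecoded = <$raw>;
close $raw;
chomp $undecoded;

# Encoding declared: the two bytes C3 A9 are decoded into one character.
open my $in, '<:encoding(UTF-8)', \$bytes or die "open: $!";
my $decoded = <$in>;
close $in;
chomp $decoded;

print length($undecoded), "\n";   # 5
print length($decoded),   "\n";   # 4
```

If the five-character version is later written out as UTF-8, the two
bytes of the accented letter are each encoded again, which is the
double conversion described above.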

6. Finally, you need to communicate that the data is UTF-8 to the
browser. This is done with the Content-Type HTTP header, which you can
set in a number of ways. I like to do it in a <meta> tag at the top of
each file that will contain UTF-8 data:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Alternately, if all documents on your server should be treated as UTF-8,
there's an Apache configuration directive to force all output to be
declared as UTF-8.
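
I'm assuming the directive meant here is Apache's AddDefaultCharset,
which adds a charset parameter to text/plain and text/html responses
that don't already declare one. As an httpd.conf fragment:

```
# Declare UTF-8 for text/html and text/plain responses lacking a charset.
AddDefaultCharset utf-8
```

Keep in mind that a charset sent in the real HTTP header is meant to
take precedence over one given in a <meta> tag, so the two should agree.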

---------------------------------------------------------------------
To unsubscribe, e-mail: asp-unsubscribe@perl.apache.org
For additional commands, e-mail: asp-help@perl.apache.org
Re: UTF-8 HOWTO
Joshua Chamas wrote:
> Do you know why it is that "use utf8" is needed
> at the top of each script?

No, I'm not sure. At this point, I just know that there are pages
where, if I remove the pragma, the UTF-8 characters get munged. I
haven't tried to isolate the Perl constructs in which this happens.

> What precisely were the problems that you were running into without this
> setting?

The most common symptom was what looked like double UTF-8 encodings.
That is, Unicode characters that should have encoded as 2 bytes in UTF-8
were showing up as 4 bytes. I didn't try to reverse the double
conversion to make sure this is what was happening, but I can't think of
a more likely explanation for the symptom.
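
For anyone who wants to check for this, here is a small sketch using
the core Encode module. It reproduces a double encoding on a sample
character and then reverses it. The sample character and variable names
are my own, not taken from our program:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $char = "\x{e9}";                  # U+00E9, e with acute accent

my $once = encode('UTF-8', $char);    # correct encoding: bytes C3 A9

# The bug: already-encoded bytes are mistaken for Latin-1 text and
# encoded to UTF-8 a second time.
my $mistaken = decode('ISO-8859-1', $once);   # two characters: U+00C3 U+00A9
my $twice    = encode('UTF-8', $mistaken);    # bytes C3 83 C2 A9

printf "once:  %d bytes (%s)\n", length($once),  uc unpack('H*', $once);
printf "twice: %d bytes (%s)\n", length($twice), uc unpack('H*', $twice);

# Reversing the extra step: decode as UTF-8, then map the resulting
# characters back to single bytes via Latin-1.
my $repaired = encode('ISO-8859-1', decode('UTF-8', $twice));
print "repaired bytes match the single encoding\n" if $repaired eq $once;
```

If running the last step over munged output gives back valid UTF-8,
that is decent evidence a double conversion is what happened.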

> The opportunity here is that we could automatically add something like this
> to the top of each page.

I'll consider investigating deeper.
