Mailing List Archive

Character Encoding Trouble
Folks,

I'm trying to figure out a character encoding problem. I'm entering Unicode characters into Bricolage (curly quotes, em dashes, Vietnamese characters, etc.), and they're being horribly mangled every time I save the document. Each subsequent save builds on encoding problems, and after a few saves, a simple right double quote has become a dozen-character string of gobbledegook.

Apache is configured to serve both my public-facing site and Bricolage as UTF-8. LOAD_CHAR_SETS in bricolage.conf is set to UTF-8. AddDefaultCharset is set to utf-8 in httpd.conf in the Bricolage conf directory.

If I set my browser to use UTF-8 text encoding, I can enter Unicode characters, and I can "Save and Stay" as many times as I want and nothing goes wonky. But as soon as I do something like "Check in and Publish," I get (for example) this:

> Story "Hope’s Coffin" saved and checked in to "Publish".
> "Hope’s Coffin" published to VQR.

Note how the first line has the apostrophe rendered properly, while it's been mangled a fraction of a second later, apparently having been through some part of Bricolage that hasn't gotten the UTF-8 memo.
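
[Editor's sketch] The compounding damage described above is what you would expect if UTF-8 bytes are re-read as a single-byte encoding on every save. The following is a minimal Python illustration of that mechanism, not Bricolage code; the function names are hypothetical, and the choice of Latin-1 and CP1252 as the "wrong" encodings is an assumption.

```python
def mangle_once(text, wrong_codec="latin-1"):
    """Encode text as UTF-8, then decode those bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


def mangle_history(text, rounds):
    """Return the text after 0, 1, ..., `rounds` bad round trips."""
    history = [text]
    for _ in range(rounds):
        history.append(mangle_once(history[-1]))
    return history


# A right single quote (U+2019) whose UTF-8 bytes are read as CP1252.
apostrophe_once = mangle_once("\u2019", "cp1252")

# A right double quote (U+201D) pushed through three bad round trips.
quote_lengths = [len(s) for s in mangle_history("\u201d", 3)]

print(repr(apostrophe_once))
print(quote_lengths)
```

Under these assumptions, one character becomes three after the first bad round trip and doubles on each trip after that, which would match a single quote growing into a dozen characters after a few saves.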

Can anybody suggest what I might be doing wrong?

Best,
Waldo

---
Virginia Quarterly Review
One West Range, Box 400223
University of Virginia
Charlottesville, VA 22904-4223
434-243-4995
Re: Character Encoding Trouble [ In reply to ]
Hey there Waldo,

There's *lots* of banter in the mailing list archives about character
encoding:
http://bit.ly/5YnO0l

Probably a good place to start.

Also, though not what you're looking for, I've taken to using David's
Encode::ZapCP1252 module (http://search.cpan.org/dist/Encode-ZapCP1252/)
to "Zap Windows Western Gremlins," which tends to be the encoding
problem I see most often.
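
[Editor's sketch] For readers unfamiliar with these "gremlins": they are typically CP1252 bytes in the 0x80-0x9F range (curly quotes, dashes, ellipses) sitting in text that was read as Latin-1, where the same byte values are invisible control characters. The Python below is only an illustration of the general idea, not the module's code or API; `fix_c1_gremlins` is a hypothetical name.

```python
def fix_c1_gremlins(text):
    """Replace C1 control characters (U+0080..U+009F) with the CP1252
    characters that the same byte values stand for."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # this byte has no CP1252 assignment; leave it alone
        out.append(ch)
    return "".join(out)


# CP1252 bytes for curly quotes, misread as Latin-1 control characters.
misread = b"\x93Hope\x92s Coffin\x94".decode("latin-1")
repaired = fix_c1_gremlins(misread)
print(repr(misread))
print(repaired)
```

A repair like this is only safe when the text really was CP1252 to begin with; applied to correctly decoded text that contains genuine C1 controls, it would change the data.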

Hope that helps a bit. I'm sure David will have more to say.

Phillip.

On 25-Jan-10, at 6:41 PM, Waldo Jaquith wrote:

> Folks,
>
> I'm trying to figure out a character encoding problem. I'm entering
> Unicode characters into Bricolage (curly quotes, em dashes,
> Vietnamese characters, etc.), and they're being horribly mangled
> every time I save the document. Each subsequent save builds on
> encoding problems, and after a few saves, a simple right double
> quote has become a dozen-character string of gobbledegook.
>
> Apache is configured to serve both my public-facing site and
> Bricolage as UTF-8. LOAD_CHAR_SETS in bricolage.conf is set to
> UTF-8. AddDefaultCharset is set to utf-8 in httpd.conf in the
> Bricolage conf directory.
>
> If I set my browser to use UTF-8 text encoding, I can enter Unicode
> characters, and I can "Save and Stay" as many times as I want and
> nothing goes wonky. But as soon as I do something like "Check in and
> Publish," I get (for example) this:
>
>> Story "Hope’s Coffin" saved and checked in to "Publish".
>> "Hope’s Coffin" published to VQR.
>
> Note how the first line has the apostrophe rendered properly, while
> it's been mangled a fraction of a second later, apparently having
> been through some part of Bricolage that hasn't gotten the UTF-8 memo.
>
> Can anybody suggest what I might be doing wrong?
>
> Best,
> Waldo

--
Phillip Smith // Simplifier of Technology // COMMUNITY BANDWIDTH
www.communitybandwidth.ca // www.phillipadsmith.com
Re: Character Encoding Trouble [ In reply to ]
Phillip,

On Jan 26, 2010, at 6:36 AM, Phillip Smith wrote:
> There's *lots* of banter in the mailing list archives about character encoding:
> http://bit.ly/5YnO0l
>
> Probably a good place to start.

For whatever reason, it didn't once cross my mind to check the list archives about this problem...despite starting there with every other problem. :) Thanks for the suggestion!


> Also, though not what you're looking for, I've taken to using David's Encode::ZapCP1252 module (http://search.cpan.org/dist/Encode-ZapCP1252/) to "Zap Windows Western Gremlins," which tends to be the problem I see the most of related to encoding.

Though that may be helpful, and I'll definitely give it a whirl, I'm typing Unicode characters directly into the Bricolage textarea, in Safari on Mac OS X, without doing anything in Windows or Western encoding. But, man, when I do encounter those crazy Windows Western gremlins, it always takes me a good half hour to figure out what I'm dealing with. Whoever added invisible control characters to the Western set should be dragged out in the street and shot. </rant>

Best,
Waldo

Re: Character Encoding Trouble [ In reply to ]
On Jan 26, 2010, at 7:30 AM, Waldo Jaquith wrote:

> Though that may be helpful, and I'll definitely give it a whirl, I'm typing Unicode characters directly into the Bricolage textarea, in Safari on Mac OS X, without doing anything in Windows or Western encoding. But, man, when I do encounter those crazy Windows Western gremlins, it always takes me a good half hour to figure out what I'm dealing with. Whoever added invisible control characters to the Western set should be dragged out in the street and shot. </rant>

/me chuckles

Your problem doesn't sound like CP1252. But what is your Character Set preference set to? And remember, there is the system-wide preference and then potentially your own user preference.

Best,

David
Re: Character Encoding Trouble [ In reply to ]
On Jan 26, 2010, at 1:26 PM, David E. Wheeler wrote:
> On Jan 26, 2010, at 7:30 AM, Waldo Jaquith wrote:
>
>> Though that may be helpful, and I'll definitely give it a whirl, I'm typing Unicode characters directly into the Bricolage textarea, in Safari on Mac OS X, without doing anything in Windows or Western encoding. But, man, when I do encounter those crazy Windows Western gremlins, it always takes me a good half hour to figure out what I'm dealing with. Whoever added invisible control characters to the Western set should be dragged out in the street and shot. </rant>
>
> /me chuckles
>
> Your problem doesn't sound like CP1252. But what is your Character Set preference set to? And remember, there is the system-wide preference and then potentially your own user preference.

I had no idea that there were so many places to set character sets in Bricolage. :) Under Admin->System->Preferences, character set is set to UTF-8. "Can be Overridden" is set to "No." So everything is apparently OK in Bricolage. I've checked the database (MySQL), and all is well there: every database, table, and column that I've seen is utf8_unicode_ci. I'm running out of places to specify the character set. :)

Incidentally, I just spotted a bug there—overridden is misspelled in the column head as "overriden," with just one "d". Merriam Webster says that only the double-d version is correct.

Best,
Waldo


Re: Character Encoding Trouble [ In reply to ]
On Jan 26, 2010, at 10:55 AM, Waldo Jaquith wrote:

> I had no idea that there were so many places to set character sets in Bricolage. :) Under Admin->System->Preferences, character set is set to UTF-8. "Can be Overridden" is set to "No." So everything is apparently OK in Bricolage. I've checked the database (MySQL), and all is well there: every database, table, and column that I've seen is utf8_unicode_ci. I'm running out of places to specify the character set. :)

Oh, I forgot you were using MySQL. I'll bet that the issue is that Perl isn't getting the data out of MySQL with the utf8 flag set. Try this patch:

--- a/lib/Bric/Util/DBD/mysql.pm
+++ b/lib/Bric/Util/DBD/mysql.pm
@@ -52,6 +52,7 @@ use constant DSN_STRING => 'database=' . DB_NAME
 
 # This is to set up driver-specific database handle attributes.
 use constant DBH_ATTR => (
+    mysql_enable_utf8 => 1,
     Callbacks => {
         connected => sub {
             my $dbh = shift;
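
[Editor's sketch] Why a driver-level flag could matter: if the driver hands back UTF-8 bytes that the application then treats as one character per byte, encoding them as UTF-8 again on output double-encodes every non-ASCII character. The Python below simulates that under stated assumptions; it does not use DBI or MySQL, and `fetch_title` and `render` are hypothetical stand-ins for the driver and the output layer.

```python
# Bytes as a UTF-8 column would hold them.
RAW_FROM_DB = "Hope\u2019s Coffin".encode("utf-8")


def fetch_title(raw, decode_utf8):
    """Simulated driver read: decode as UTF-8 only when asked to."""
    if decode_utf8:
        return raw.decode("utf-8")
    # Without decoding, each byte is treated as its own character.
    return raw.decode("latin-1")


def render(title):
    """Simulated output layer that always emits UTF-8."""
    return title.encode("utf-8")


with_flag = render(fetch_title(RAW_FROM_DB, decode_utf8=True))
without_flag = render(fetch_title(RAW_FROM_DB, decode_utf8=False))

print(with_flag == RAW_FROM_DB)
print(len(RAW_FROM_DB), len(without_flag))
```

In this simulation the decoded path round-trips cleanly, while the undecoded path grows by one byte for each non-ASCII byte in the original, which is consistent with a title that displays correctly before the database read and mangled after it.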

> Incidentally, I just spotted a bug there—overridden is misspelled in the column head as "overriden," with just one "d". Merriam Webster says that only the double-d version is correct.

Where? In the database?

David