Mailing List Archive

Getting mad with UTF-8
Hello,

Can someone help me understand what could cause this :

warn "\$content : ".(utf8::is_utf8($content) ? "utf8" : "not utf8");
warn "\$ticketdata[0]->[0] : ".(utf8::is_utf8($ticketdata[0]->[0]) ? "utf8" : "not utf8");
warn "content4=$content";
if ($ticketdata[0]->[0] ne $content) {
warn "content5=$content";
#
warn "content6=$content stored=".$ticketdata[0]->[0];
warn "content7=$content";
}

In apache2 error.log :

[Wed Jun 12 16:35:56 2013] [warn] [12504]ERR: 32: Warning in Perl code: $content : not utf8 at /var/www/sites/recia/rtgi3/rtgilib.pm line 382, <GEN46> line 13.
[Wed Jun 12 16:35:56 2013] [warn] [12504]ERR: 32: Warning in Perl code: $ticketdata[0]->[0] : utf8 at /var/www/sites/recia/rtgi3/rtgilib.pm line 383, <GEN46> line 13.
[Wed Jun 12 16:29:13 2013] [warn] [10974]ERR: 32: Warning in Perl code: content4=h\xc3\xa9 at /var/www/sites/recia/rtgi3/rtgilib.pm line 381, <GEN47> line 13.
[Wed Jun 12 16:29:13 2013] [warn] [10974]ERR: 32: Warning in Perl code: content5=h\xc3\xa9 at /var/www/sites/recia/rtgi3/rtgilib.pm line 383, <GEN47> line 13.
[Wed Jun 12 16:29:13 2013] [warn] [10974]ERR: 32: Warning in Perl code: content6=h\xc3\x83\xc2\xa9 stored=h\xc3\xa9 at /var/www/sites/recia/rtgi3/rtgilib.pm line 385, <GEN47> line 13.
[Wed Jun 12 16:29:13 2013] [warn] [10974]ERR: 32: Warning in Perl code: content7=h\xc3\xa9 at /var/www/sites/recia/rtgi3/rtgilib.pm line 386, <GEN47> line 13.

As you see, the $content variable changes from one line to the other ?!?
$ticketdata[0]->[0] contains "hé" coming from a DB (configured as UTF-8) and the test should not fail.

I guess the problem comes from the fact that on the same line I have one utf-8 variable and one non-utf8 one.

$content comes from $fdat{content} (not marked as utf8 while the page encoding is declared and recognized as utf-8).

What can I do to force embperl to always set the utf-8 flag on $fdat{...} ?

If you know a way of telling Apache/EmbPerl that no encoding other than UTF-8 exists in the world, I'll take it. And it's not a problem if I'm incompatible with anything.

Thanks for your help,

(using libembperl-perl 2.5.0~rc3-1 on Debian/wheezy with apache2-mpm-prefork 2.2.22-13)

---------------------------------------------------------------------
To unsubscribe, e-mail: embperl-unsubscribe@perl.apache.org
For additional commands, e-mail: embperl-help@perl.apache.org
Re: Getting mad with UTF-8 [ In reply to ]
Hello Jean-Christophe,

On 12.06.2013 at 16:44, Jean-Christophe Boggio wrote:

> Hello,
>
> Can someone help me understand what could cause this :
>
> warn "\$content : ".(utf8::is_utf8($content) ? "utf8" : "not utf8");
> warn "\$ticketdata[0]->[0] : ".(utf8::is_utf8($ticketdata[0]->[0]) ? "utf8" : "not utf8");
> warn "content4=$content";
> if ($ticketdata[0]->[0] ne $content) {
> warn "content5=$content";
> #
> warn "content6=$content stored=".$ticketdata[0]->[0];
> warn "content7=$content";
> }
>

[...]

> I guess the problem comes from the fact that on the same line I have one utf-8 variable and one non-utf8 one.
>
> $content comes from $fdat{content} (not marked as utf8 while the page encoding is declared and recognized as utf-8).
>
> What can I do to force embperl to always set the utf-8 flag on $fdat{...} ?
>
> If you know a way of telling Apache/EmbPerl that no encoding other than UTF-8 exists in the world, I'll take it. And it's not a problem if I'm incompatible with anything.



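I guess your guess is right: when a utf8-flagged string and a non-flagged string meet in one expression (concatenation, interpolation, comparison), Perl upgrades the non-flagged one, and it treats each of its bytes as an ISO-8859-1 character for that conversion!
So the UTF-8 bytes in the combined string end up double-encoded (h\xc3\x83\xc2\xa9 in your content6 line), while $content itself is unchanged, as content7 shows. The same thing happens when you use FreezeThaw or Data::Dumper, which is bad for serializing and storing something in a database :-(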
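
A minimal sketch that reproduces the effect outside Embperl, assuming $content arrives as raw UTF-8 bytes and the DB value arrives as a decoded string. The variable names are made up for the illustration, and the hex escaping only imitates what Apache does in error.log:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Byte string: the UTF-8 encoding of "hé", utf8 flag off (like $fdat{content}).
my $bytes = "h\xc3\xa9";
# Character string: the same text decoded, utf8 flag on (like the DB value).
my $chars = decode('UTF-8', "h\xc3\xa9");

# The comparison upgrades $bytes as Latin-1, so the two strings differ.
print "strings differ\n" if $chars ne $bytes;

# Interpolating both into one string upgrades the bytes the same way.
my $mixed = "content6=$bytes stored=$chars";

# Encode the result for output and escape the non-ASCII bytes as \xNN.
my $out = encode('UTF-8', $mixed);
(my $hex = $out) =~ s/([^\x20-\x7e])/sprintf('\\x%02x', ord $1)/ge;
print "$hex\n";    # content6=h\xc3\x83\xc2\xa9 stored=h\xc3\xa9

# $bytes itself is untouched; only the combined copy was upgraded.
print utf8::is_utf8($bytes) ? "flag on\n" : "flag still off\n";
```

This matches the content6 line in the log: the first value is double-encoded, the second is fine.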

Embperl decides for itself whether the %fdat parameters are utf8 or not. I don't know how it does so; maybe Gerald could say something about that. We have had a lot of "funny" things in the past regarding this problem. Our website uses different encodings (neither UTF-8 nor ISO-8859-1), so we ran into trouble. We implemented our own "thaw" method, which tries to thaw the data and, if that fails, converts the data to utf8 and thaws it again...

A solution for you could be: use "$content = decode('UTF-8', $content)" (decode comes from the Encode module) to turn your variable into a decoded, utf8-flagged string, or walk over %fdat and do the same for all values which are not already utf8-flagged. After that, you should have only decoded strings and everything works as expected.
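
The walk over %fdat could look roughly like this. It is only a sketch: the %fdat here is a hand-filled stand-in for the hash Embperl provides, and using the utf8 flag as the "already decoded" test is the heuristic suggested above, not something Embperl guarantees:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical stand-in for Embperl's %fdat; in a real page Embperl fills it.
my %fdat = (
    content => "h\xc3\xa9",                      # raw UTF-8 bytes, flag off
    title   => decode('UTF-8', "caf\xc3\xa9"),   # already decoded, flag on
);

# Decode every value that is still a byte string; leave the others alone.
for my $key (keys %fdat) {
    next unless defined $fdat{$key};
    next if ref $fdat{$key};                     # skip references
    next if utf8::is_utf8($fdat{$key});          # assumed to be decoded already
    $fdat{$key} = decode('UTF-8', $fdat{$key});
}

printf "%s: %d characters\n", $_, length $fdat{$_} for sort keys %fdat;
```

With the default settings, decode replaces malformed input with U+FFFD instead of dying, so form data that is not valid UTF-8 would be silently altered by this loop.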

One little additional comment: using non-utf8-flagged variables with UTF-8 content (like your $content variable) breaks a lot of Perl functions and operators: lc, uc, cmp, le, gt, length, sort, ...
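
Two of these are easy to see in a short sketch (variable names are made up; the byte string stands for undecoded form data):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "h\xc3\xa9";                # UTF-8 bytes for "hé", not decoded
my $chars = decode('UTF-8', $bytes);    # the two characters "h" and "é"

print length($bytes), "\n";             # 3: length counts bytes here
print length($chars), "\n";             # 2: length counts characters here

# uc() on the decoded string upper-cases the accented letter as well.
my $upper = uc $chars;
print $upper eq "H\x{c9}" ? "decoded: upper-cased to HÉ\n" : "decoded: unexpected\n";

# On the byte string only the ASCII "h" changes; the two bytes of "é" stay.
my $upper_bytes = uc $bytes;
print $upper_bytes eq "H\xc3\xa9" ? "bytes: only h changed\n" : "bytes: unexpected\n";
```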


With best regards,

Dirk Melchers
/// IT/Software-Development ///

NUREG GmbH ///
Dorfäckerstraße 31 | 90427 Nürnberg | Germany
Tel. +49-911-32002-256 | Fax +49-911-32002-299
Mobile +49-172-9354670 | www.nureg.de
Nürnberg HRB 22653 | VAT ID DE 814 685 653
Managing Directors: Michael Schmidt, Stefan Boas


AW: Getting mad with UTF-8 [ In reply to ]
Hi,

sorry for the late reply.

Perl's utf8 flag does NOT say whether your data is UTF-8 or not. It tells us something about the internal representation of your data inside Perl. So UTF-8 data can have the utf8 flag set, but it need not, and everything is still all right.
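
A small sketch of that point, using only the core utf8:: functions (the variable names are made up): the same text can be stored with or without the flag, and Perl treats both copies as equal.

```perl
use strict;
use warnings;

my $plain    = "h\xe9";      # "hé" as two characters, flag off
my $upgraded = $plain;
utf8::upgrade($upgraded);    # same characters, stored as UTF-8 internally, flag on

print utf8::is_utf8($plain)    ? "flag on\n" : "flag off\n";    # flag off
print utf8::is_utf8($upgraded) ? "flag on\n" : "flag off\n";    # flag on

# The flag differs, the string value does not.
print $plain eq $upgraded ? "equal\n" : "different\n";          # equal
print length($plain) == length($upgraded) ? "same length\n" : "different length\n";
```

So the flag alone cannot tell whether a value still holds undecoded UTF-8 bytes; what matters is whether the data has been decoded.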

Unfortunately when I wrote the utf8 %fdat handling I was not fully aware of this fact.

It might help to access your %fdat data via

$data = Encode::decode_utf8 ($fdat{foo}) ;

decode_utf8 will convert the UTF-8 data (that Embperl delivers) to the correct internal representation.

I will fix this in a future release.

Hope this helps

Gerald

Re: AW: Getting mad with UTF-8 [ In reply to ]
Gerald,

On 03/07/2013 17:47, Gerald Richter - ECOS wrote:
> sorry for the late reply.

No problem.

> Perl's utf8 flag does NOT say whether your data is UTF-8 or not. It tells
> us something about the internal representation of your data inside
> Perl.

I agree with that.

> So UTF-8 data can have the utf8 flag set, but it need not, and everything
> is still all right.

But isn't that what causes my problem? Data comes in as UTF-8 but is not
"seen" by Perl as such. So it gets re-encoded.
That's how I understand it.

> It might help to access your %fdat data via $data =
> Encode::decode_utf8 ($fdat{foo}) ;

Yes but I'd have to do it everywhere %fdat is concerned.

> Decode_utf8 will convert the utf8 data (that Embperl delivers) to the
> correct internal representation. I will fix this in a further
> release Hope this helps

I guess that will solve many things.

I have read many docs about UTF-8 but am still confused. I still don't
understand what decode_utf8 *really* does. For example, what happens if
you do it twice? Like:

$fdat{foo} = decode_utf8 ( decode_utf8 ($fdat{foo}) );

Will it decode it once, then see it's already UTF-8 (because it has
the utf8 flag set) and not do it a second time?

Also, I still don't understand why I seem to be the only one having
problems with UTF-8 :-)

Thanks for taking care of this issue.

Best regards,

JC
