Mailing List Archive

Catalyst Unicode
Hi Everybody,

I noticed that parameters in my Catalyst application are not
automatically decoded to UTF-8, although I use

use Catalyst qw/
...
Unicode::Encoding
...
/;
__PACKAGE__->config(
...
encoding => 'UTF-8',
...
);

I tracked this down to Catalyst::Plugin::Unicode::Encoding::\
_handle_param_unicode_decoding() where I find the line

$enc->decode($value, $CHECK)

that does not seem to have the (my) expected outcome.

Here is a small test script:

#!perl -w
use strict;
use Encode;
my $encoding = Encode::find_encoding("UTF-8");
warn "Encoding: $encoding\n";
my $t = "Ümläuts";
warn "Is utf8: ", Encode::is_utf8($t), " ", utf8::is_utf8($t), "\n";
$encoding->decode($t);
# utf8::decode($t);
warn "Is utf8: ", Encode::is_utf8($t), " ", utf8::is_utf8($t), "\n";
binmode STDOUT, ':utf8';
print "T=$t\n";
------
$ perl enc.pl
Encoding: Encode::utf8=HASH(0x1994cb8)
Is utf8:
Is utf8:
T=Ümläuts

When I replace above '$encoding->decode($t)' with 'utf8::decode($t)'
then it works as expected:

$ perl enc.pl
Encoding: Encode::utf8=HASH(0x1e02cb8)
Is utf8:
Is utf8: 1 1
T=Ümläuts

What am I doing wrong here? Or do I misunderstand what this code should
do? Any hints are highly appreciated. Thanks.

Christian

--
Dr. Christian Lackas, Managing Partner
inviCRO, LLC -- In Imaging Yours
P: +1 617 963 0263, F: +49 2203 9034722, E: lackas@invicro.com
http://www.invicro.com/ http://www.spect-ct.com/

_______________________________________________
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/
Re: Catalyst Unicode
https://metacpan.org/pod/release/JJNAPIORK/Catalyst-Runtime-5.90053/lib/Catalyst/Upgrading.pod#Catalyst::Plugin::Unicode::Encoding-is-now-core


And in your test script, I'm pretty sure that if you are using Unicode chars
in your Perl code you have to "use utf8" in it.

Also, "use warnings" instead of "perl -w".

#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
use Encode;
my $encoding = Encode::find_encoding("UTF-8");
warn "Encoding: $encoding\n";
my $t = "Ümläuts";
warn "Is utf8: ", Encode::is_utf8($t), " ", utf8::is_utf8($t), "\n";
$encoding->decode($t);
# utf8::decode($t);
warn "Is utf8: ", Encode::is_utf8($t), " ", utf8::is_utf8($t), "\n";
binmode STDOUT, ':utf8';
print "T=$t\n";


david@nio:~$ perl foo.pl
Encoding: Encode::utf8=HASH(0x1f22390)
Is utf8: 1 1
Is utf8:
T=Ümläuts

Re: Catalyst Unicode
* Christian Lackas <christian@lackas.net> [140131 06:33]:

Hi Everybody,

sorry, has to be

$t = $encoding->decode($t);

below, of course (and then it works in the test).
That said, in my application I have:

my $file = $c->request->param('file');
warn "is_utf8: ", utf8::is_utf8($file), "\n";

and see that $file has not been decoded to UTF-8, although debug lines
in _handle_param_unicode_decoding() show that the value is passed
through there.

What else could cause this issue here?

Christian

Re: Catalyst Unicode
On 31 January 2014 11:53, Christian Lackas <christian@lackas.net> wrote:
...
> That said, in my application I have:
>
> my $file = $c->request->param('file');
> warn "is_utf8: ", utf8::is_utf8($file), "\n";
>
> and see that $file has not been decoded to UTF-8, although debug lines
> in _handle_param_unicode_decoding() show that the value is passed
> through there.
>
> What else could cause this issue here?

If the string has been decoded *from* UTF-8 to Perl's internal
representation, it's *not* going to be marked as UTF8 internally; it
*shouldn't* be. It's no longer a "UTF8" string but a "Unicode" string,
complete with wide characters. If anything, the internal "UTF8" flag
means "this string needs decoding" rather than "has been decoded".

Re: Catalyst Unicode
* David Schmidt <davewood@gmx.at> [140131 06:53]:

Dear David,

thanks for your prompt reply.
If I use 'use utf8', I already tell Perl that all literals are in UTF-8,
and thus would not need to decode manually at all (as you see in your
output, it already says is_utf8 even before you decode manually).
What I tried to simulate is: I have some text in a variable, and I know
it's UTF-8, but Perl does not know that yet, so I have to decode it.

Please also note the small error I made with not taking the return value
of '$encoding->decode()'.
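
To make that difference concrete, here is a minimal sketch of the corrected test. It writes the UTF-8 bytes as escapes so the result does not depend on how the script file is saved; the variable names are only illustrative.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode ();

# UTF-8 encoded bytes for "Ümläuts", written as escapes so that the
# encoding of this source file does not matter.
my $bytes = "\xC3\x9Cml\xC3\xA4uts";

my $encoding = Encode::find_encoding("UTF-8");

# Return value thrown away: without a CHECK argument the source is
# left alone, so $discarded still holds the 9 original bytes.
my $discarded = $bytes;
$encoding->decode($discarded);

# Return value kept: $chars holds the 7 decoded characters.
my $chars = $encoding->decode($bytes);

# utf8::decode() works in place and returns true on success.
my $in_place = $bytes;
my $ok = utf8::decode($in_place);

printf "discarded: %d, assigned: %d, in place: %d\n",
    length($discarded), length($chars), length($in_place);
```

Comparing the three lengths shows which variables actually hold characters rather than bytes.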

Christian

Re: Catalyst Unicode
* Will Crawford <billcrawford1970@gmail.com> [140131 07:00]:

Dear Will,

first of all, thanks for your input, also.

> > That said, in my application I have:
> > my $file = $c->request->param('file');
> > warn "is_utf8: ", utf8::is_utf8($file), "\n";
> > and see that $file has not been decoded to UTF-8
> If the string has been decoded *from* UTF-8 to Perl's internal
> representation, it's *not* going to be marked as UTF8 internally;

Maybe I am misunderstanding you, however, if I have a text string that I
know contains UTF-8, however, it is not UTF-8 for Perl, yet (is_utf8
false), then I have to use decode() to fix that, right?

This is what Catalyst does (and what I showed in my small example):

# from Catalyst/Plugin/Unicode/Encoding:
Encode::is_utf8( $value ) ?
    $value
    : $enc->decode( $value, $CHECK );

After this, everything is good, is_utf8 is true and I could happily live
ever after. In theory.
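
Outside of Catalyst, that conditional can be tried in isolation. The following is only a sketch: decode_unless_flagged() is a made-up helper name, and the value of $CHECK (croak on malformed input, leave the source untouched) is an assumption, not copied from the plugin.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode ();

# Assumption: die on malformed input and do not modify the source.
my $CHECK = Encode::FB_CROAK | Encode::LEAVE_SRC;
my $enc   = Encode::find_encoding("UTF-8");

# Hypothetical helper wrapping the conditional quoted above.
sub decode_unless_flagged {
    my ($value) = @_;
    return $value unless defined $value;
    return Encode::is_utf8($value)
        ? $value
        : $enc->decode($value, $CHECK);
}

my $raw     = "\xC3\x9Cml\xC3\xA4uts";          # UTF-8 bytes of "Ümläuts"
my $decoded = decode_unless_flagged($raw);      # decoded once
my $again   = decode_unless_flagged($decoded);  # flagged, passed through

printf "raw: %d, decoded: %d, decoded twice: %d\n",
    length($raw), length($decoded), length($again);
```

The second call returns its argument unchanged, because the value is already flagged and the conditional skips decode().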

I am just confused that this decoding (which does work as expected,
contrary to my initial thought) does not reach me in

my $file = $c->request->param('file');
warn "is_utf8: ", utf8::is_utf8($file), "\n";

where my parameter is not UTF8, as I need it.
Does prepare_uploads() of C::P::Unicode::Encoding not process all my
parameters automatically for me?
When I add a debug line in the loop over $c->request->{parameters} there, I
do see my value coming in, and leaving as UTF8.

Christian

Re: Catalyst Unicode
* Will Crawford <billcrawford1970@gmail.com> [2014-01-31 13:05]:
> If the string has been decoded *from* UTF-8 to Perl's internal
> representation, it's *not* going to be marked as UTF8 internally; it
> *shouldn't* be. It's no longer a "UTF8" string but a "Unicode" string,
> complete with wide characters. If anything, the internal "UTF8" flag
> means "this string needs decoding" rather than "has been decoded".

Sorry, this is nonsense. The UTF8 flag means the string is internally
stored as a variable-width integer sequence using the same encoding
scheme as UTF-8, which means it can store characters > 0xFF. If the
UTF8 flag is off, the string is stored as a byte array.

You are correct only insofar as that decoding a string could in theory
yield a string with the UTF8 flag *off*.

Because the UTF8 flag doesn’t mean anything. It only means that the
string can store characters > 0xFF, which only matters to perl
internally, since UTF8=0 strings will be transparently promoted to
UTF8=1 whenever necessary.
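
A small sketch of that transparent promotion, using nothing beyond core Perl (the variable names are just for illustration):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $narrow = "\xFC";       # U+00FC, fits in one byte, UTF8 flag off
my $wide   = "\x{263A}";   # U+263A, cannot be stored without the flag

# Concatenation promotes the narrow string as needed; the characters
# themselves do not change.
my $joined = $narrow . $wide;

printf "narrow flagged: %s\n", utf8::is_utf8($narrow) ? "yes" : "no";
printf "joined flagged: %s\n", utf8::is_utf8($joined) ? "yes" : "no";
printf "joined: length %d, first char U+%04X\n",
    length($joined), ord($joined);
```

The first character of $joined is still U+00FC; only its internal storage changed.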

But Perl can’t tell whether a string is a Unicode string or byte string.
The UTF8 flag is irrelevant.

*You* can tell, because `length` returns 2 for a byte string with a “ü”
represented in UTF-8, and 1 for a Unicode string with the character “ü”.

(But `length` can return 1 for a UTF8=0 string, because the codepoint is
0xFC which can be stored as a single byte just fine; and it can return
2 even for a UTF8=1 string, because the UTF-8 encoded representation of
“ü” is 0xC3 0xBC and it doesn’t matter whether you store that in
a UTF8=0 or UTF8=1 string, it’s still the sequence 0xC3 0xBC.)


Christian:

This also affects you: you should not be looking at `is_utf8`. Instead
you should be looking at whether `length` returns the correct value.
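
As a rough illustration of why `length` is the better check, the sketch below builds all four combinations of content and flag for “ü”, using utf8::upgrade() and utf8::downgrade() to change only the internal storage:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode ();

my $bytes = "\xC3\xBC";                        # UTF-8 encoding of U+00FC
my $chars = Encode::decode("UTF-8", $bytes);   # the single character U+00FC

# Change only the internal storage, not the contents.
my $upgraded_bytes = $bytes;
utf8::upgrade($upgraded_bytes);       # flag on, still 0xC3 0xBC

my $downgraded_chars = $chars;
utf8::downgrade($downgraded_chars);   # flag off, still 0xFC

for my $case (
    [ 'bytes'            => $bytes ],
    [ 'upgraded bytes'   => $upgraded_bytes ],
    [ 'chars'            => $chars ],
    [ 'downgraded chars' => $downgraded_chars ],
) {
    my ($label, $string) = @$case;
    printf "%-17s length=%d flag=%s\n",
        $label, length($string), utf8::is_utf8($string) ? "on" : "off";
}
```

The length follows the content (2 for undecoded bytes, 1 for the decoded character) regardless of the flag.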


Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>

Re: Catalyst Unicode
On Fri, Jan 31, 2014 at 3:58 AM, Will Crawford
<billcrawford1970@gmail.com> wrote:

>
> If the string has been decoded *from* UTF-8 to Perl's internal
> representation, it's *not* going to be marked as UTF8 internally; it
> *shouldn't* be. It's no longer a "UTF8" string but a "Unicode" string,
> complete with wide characters. If anything, the internal "UTF8" flag
> means "this string needs decoding" rather than "has been decoded".
>


$ perl -le 'use Encode; my $chars = decode_utf8( "bytes" ); print
Encode::is_utf8( $chars ) ? "Is flagged utf8\n" : "not flagged\n"; use
Devel::Peek; Dump($chars)'
Is flagged utf8

SV = PV(0x7fb8c10023f0) at 0x7fb8c102b6a8
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x7fb8c0e01170 "bytes"\0 [UTF8 "bytes"]
CUR = 5
LEN = 16

Everything is encoded. The flag tells Perl that its internal
representation is encoded as utf8, so it knows to work with it as utf8
characters (e.g. length() is the length in chars, matching works on chars, etc.)

$ perl -le 'use Encode; my $chars = decode( "latin1", "bytes" ); print
Encode::is_utf8( $chars ) ? "Is flagged utf8\n" : "not flagged\n"; use
Devel::Peek; Dump($chars)'
Is flagged utf8



--
Bill Moseley
moseley@hank.org
Re: Catalyst Unicode
Yes, I got the wrong end of the stick. Sorry (repeatedly :)).
