Mailing List Archive

Url Encoded UTF8 parameters
Hi,

if a URL parameter contains a Unicode character (e.g. www.example.com/?param=%D6lso%DF which stands for param=Ölsoße), the parameter is not correctly parsed as Unicode.



I did the following to test it:

1. Create a new Catalyst App: catalyst.pl UnicodeTest

2. Set Root.pm Controller to correct encoding by adding ‘use utf8;’ at the top of the file

3. Then add the following lines to the Root.pm Controller in the index function (don’t forget to use Data::Dumper):
$c->log->debug(Dumper($c->req->params));
$c->log->debug(Dumper("Ölsoße"));

4. This outputs for the example url: localhost:3000/?param=%D6lso%DF:

[debug] $VAR1 = {
  'param' => "\x{fffd}lso\x{fffd}e"
};

[debug] $VAR1 = '\x{d6}lso\x{df}e';

As you can see, in the first output both non-ASCII characters come out as the same character, \x{fffd}, which is obviously not what it should be: \x{d6}lso\x{df}e



I already tried adding the ‘encoding => 'UTF-8'’ config option, but this didn’t change anything.



Am I missing a specific setting?



My Catalyst version: 5.90097

Perl version: v5.18.2



Thanks for your help!

Stefan
Re: Url Encoded UTF8 parameters
On Sat, Aug 1, 2015 at 6:31 AM, Stefan <maillist@s.profanter.me> wrote:

> Hi,
>
> if a URL parameter contains a Unicode character (e.g.
> www.example.com/?param=%D6lso%DF which stands for param=Ölsoße), the
> parameter is not correctly parsed as Unicode.
>

> 4. This outputs for the example url: localhost:3000/?param=%D6lso%DF:
>
> [debug] $VAR1 = {
>
> 'param' => "\x{fffd}lso\x{fffd}e"
>
> };
>
> [debug] $VAR1 = '\x{d6}lso\x{df}e';
>
>
>
>
>
> As you can see, the first output only contains one equal character:
> \x{fffd} which is obviously not the same as it should be: \x{d6}lso\x{df}e
>

\x{fffd} is the unicode replacement character used by Encode to replace
invalid UTF-8 sequences you are passing in.
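
To see the substitution outside of Catalyst, here is a minimal sketch using only core Perl and Encode. The percent_decode helper is a made-up name for illustration (it is not what Catalyst calls internally); the point is only what decode() does by default with these octets.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn %XX escapes into raw octets (core Perl only).
sub percent_decode {
    my ($value) = @_;
    $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $value;
}

# The octets D6 6C 73 6F DF 65 -- not a valid UTF-8 sequence.
my $octets = percent_decode('%D6lso%DFe');

# With no CHECK argument, decode() substitutes U+FFFD for malformed input.
my $chars = decode('UTF-8', $octets);

# Print the code points in ASCII so the terminal encoding does not matter.
print join(' ', map { sprintf('U+%04X', ord($_)) } split(//, $chars)), "\n";
```

This prints U+FFFD U+006C U+0073 U+006F U+FFFD U+0065, which matches the "\x{fffd}lso\x{fffd}e" in the debug log.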

Try this instead in your browser:

?param=Ölsoße


And then print $c->request->parameters->{param} -- and if you check
Encode::is_utf8( $param ) it should be true, too, indicating the param was
decoded correctly into characters.

Or if you prefer:

perl -le 'use URI::Escape; print uri_escape( "Ölsoße" )'
%C3%96lso%C3%9Fe


so,

?param=%C3%96lso%C3%9Fe


but most likely the browser will turn it back into ?param=Ölsoße


If you really want to say you are using utf8 constant strings (i.e. "use
utf8;"):

$ perl -le 'use utf8; use URI::Escape; use Encode; print uri_escape( encode_utf8( "Ölsoße" ) )'
%C3%96lso%C3%9Fe

or

$ perl -le 'use utf8; use URI::Escape; print uri_escape_utf8( "Ölsoße" )'
%C3%96lso%C3%9Fe


All the same thing.


--
Bill Moseley
moseley@hank.org
Re: Url Encoded UTF8 parameters
BTW -- I wonder about the Catalyst behavior here.

On Sat, Aug 1, 2015 at 10:36 PM, Bill Moseley <moseley@hank.org> wrote:

>
>
> On Sat, Aug 1, 2015 at 6:31 AM, Stefan <maillist@s.profanter.me> wrote:
>
>> Hi,
>>
>> if a URL parameter contains a Unicode character (e.g.
>> www.example.com/?param=%D6lso%DF which stands for param=Ölsoße), the
>> parameter is not correctly parsed as Unicode.
>>
>
One note here -- data over the wire must be encoded into octets. So, all
Unicode characters must be encoded and then decoded when received. (You
can't send "Unicode characters".) UTF-8 is used now (for obvious
reasons). http://tools.ietf.org/html/rfc3986.

You are specifying %D6 -- although the Unicode character is U+00D6, the
UTF-8 octet sequence is 0xC3 0x96. See:
http://www.fileformat.info/info/unicode/char/00D6/index.htm
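
A quick way to check the octets yourself is the sketch below, using core Encode only. The ISO-8859-1 line is there for comparison and is my assumption about where %D6 came from: in Latin-1 the character Ö is the single byte 0xD6.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{D6}";    # U+00D6, LATIN CAPITAL LETTER O WITH DIAERESIS

# Encode the same character two ways and show the resulting octets as hex.
my $utf8_hex   = uc( unpack( 'H*', encode( 'UTF-8',      $char ) ) );
my $latin1_hex = uc( unpack( 'H*', encode( 'ISO-8859-1', $char ) ) );

print "UTF-8:      $utf8_hex\n";      # C396
print "ISO-8859-1: $latin1_hex\n";    # D6
```

So %C3%96 is what a UTF-8 decoder expects for Ö, while a lone %D6 is the start of a two-byte sequence that never completes.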

Unless otherwise instructed, Catalyst uses UTF-8
<https://github.com/perl-catalyst/catalyst-runtime/blob/master/lib/Catalyst/Engine.pm#L579>
as the encoding for decoding query parameters -- query parameters are
decoded from UTF-8 octets to Perl characters.

As your example showed, if you use invalid UTF-8 sequences then
Encode::decode() as used by Catalyst will replace those with the U+FFFD
substitution character
<http://www.fileformat.info/info/unicode/char/fffd/index.htm> "�".

This may or may not be what you want. Personally, I think it's not
correct to silently modify user input. You intended to pass "Ölsoße" but
ended up with "�lso�e" -- is that really the data you would want to
process/store for the request? Seems unlikely.

If "param" is supposed to be passed as textual, UTF-8-encoded octets, and it
isn't, then maybe returning a 400 is a better way of handling that. That
probably would have helped you see what is wrong in this case.

i.e. use "eval { decode( $default_query_encoding, $str, FB_CROAK |
LEAVE_SRC ); }" to catch invalid data and return to the client the "$str"
that failed and why.
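
Roughly, the strict variant could look like the sketch below. It is standalone, not a patch to Catalyst: strict_decode is a hypothetical helper name, and 'UTF-8' is hard-coded where Catalyst would use its configured query encoding.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns (characters, undef) on success,
# or (undef, error message) when the octets are not valid UTF-8.
sub strict_decode {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input instead of
    # substituting U+FFFD; LEAVE_SRC keeps $octets unmodified.
    my $chars = eval { decode( 'UTF-8', $octets, FB_CROAK | LEAVE_SRC ) };
    return ( undef, $@ ) unless defined $chars;
    return ( $chars, undef );
}

my ( $ok,  $ok_err )  = strict_decode("\xC3\x96lso\xC3\x9Fe");   # valid UTF-8
my ( $bad, $bad_err ) = strict_decode("\xD6lso\xDFe");           # invalid

print defined($ok)  ? "first:  valid\n" : "first:  invalid: $ok_err";
print defined($bad) ? "second: valid\n" : "second: invalid: $bad_err";
```

The caller can then decide what to do with the error, e.g. send a 400 with the message, or just set a flag and carry on.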

Of course, it is also possible that you have some query parameters that you
want decoded as UTF-8 and some that might represent something else (a raw
sequence of bytes), and want more manual control. In that case
$c->config->{do_not_decode_query} could be used to bypass the decoding.
But then, you must manually decode() yourself.
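
The manual step might then look something like this sketch. It assumes the parameter values arrive as undecoded octets; the %raw hash, the parameter names and the %is_text list are all made up for illustration and are not part of Catalyst.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical input: undecoded parameter values, as octets.
my %raw = (
    name  => "\xC3\x96lso\xC3\x9Fe",    # UTF-8 encoded text
    token => "\xD6\x00\xDF",            # opaque bytes, not text
);

# Hypothetical list of the parameters that are meant to be text.
my %is_text = ( name => 1 );

my %params;
for my $key ( keys %raw ) {
    # Decode only the text parameters (dies on invalid UTF-8);
    # everything else is passed through as raw bytes.
    $params{$key} = $is_text{$key}
        ? decode( 'UTF-8', $raw{$key}, FB_CROAK | LEAVE_SRC )
        : $raw{$key};
}

printf "name has %d characters, token has %d bytes\n",
    length( $params{name} ), length( $params{token} );
```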

--
Bill Moseley
moseley@hank.org
Re: Url Encoded UTF8 parameters
I'd be interested in having some sort of flag on the request that indicates whether the incoming query was bad. I can't do a die here for legacy reasons.
jnap


On Sunday, August 2, 2015 9:39 AM, Bill Moseley <moseley@hank.org> wrote:


BTW -- I wonder about the Catalyst behavior here.
_______________________________________________
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/