On 2/8/21 2:34, Felipe Gasper wrote:
>
>
>> On Aug 1, 2021, at 10:23 AM, Leon Timmermans <fawaka@gmail.com> wrote:
>>
>> Code is not binary, it is text. E.g.:
>>
>> use 5.010;
>> { no utf8; say "éé" =~ /\N{LATIN SMALL LETTER E WITH ACUTE}/ ? "yes" : "no" };
>> { use utf8; say "éé" =~ /\N{LATIN SMALL LETTER E WITH ACUTE}/ ? "yes" : "no" };
>>
>> The status quo is only reasonable in that 95% of all code is actually ASCII, so it usually doesn't matter.
>
> Code is indeed text, but this is not reasonable:
>
>> perl -Mutf8 -e'print "é"'
> ?
>
> … particularly in contrast to this:
>
>> echo é | perl -Mutf8 -e 'print <>'
> é
>
> … and these:
>
>> node -e 'console.log("é")'
> é
>
>> python -c 'print("é")'
> é
>
>> ruby -e 'puts "é"'
> é
>
>> echo '<?php print "é" ?>' | php
> é
>
>> echo | awk '{print "é"}'
> é
>
>> julia -e'print("é")'
> é
>
>> lua -e'print "é"'
> é
>
>
> For Unicode-aware applications it is indeed useful to auto-decode the strings, but is it really worth making Perl’s “modern default” the exceptionally weird behaviour of making:
>
> perl -E'print "¡Hola, mundo!"'
>
> … *not* print the given text correctly?
>
> It just doesn’t seem a very workable “modern default”. How feasible, instead, would something like the following be:
>
> ------
>
> 1. Devote 2 bits of each SV to storing whether the PV is text or bytes:
>
> 0 0 = unknown
> 0 1 = text
> 1 0 = bytes
> 1 1 = reserved/unused

IMO that's not the right way to approach the problem.

Perl already has PerlIO, which allows transparent encoding/decoding of
data on some IO interfaces, and that support should be extended to
cover all of them.

Otherwise you are asking the programmer to do that translation
explicitly every time some data goes through any builtin doing IO, as in:

  mkdir do_encoding($dirname);

That doesn't make sense at all.
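
To make the contrast concrete, here is a minimal sketch of what the
explicit approach looks like today next to what PerlIO already does for
file handles. It assumes a filesystem that accepts UTF-8 encoded names;
Encode::encode stands in for the hypothetical do_encoding() above, and
the variable names are purely illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

my $base    = tempdir(CLEANUP => 1);
my $dirname = "caf\x{e9}";      # "café" as 4 characters of decoded text

# Filesystem builtins take octets, so the caller has to encode by hand.
# Assumption: the filesystem encoding is UTF-8.
my $dir_octets = encode('UTF-8', $dirname);
mkdir "$base/$dir_octets" or die "mkdir failed: $!";

# File handles already support the transparent approach: the layer
# encodes on the way out, so the program only ever deals with text.
my $path = "$base/$dir_octets/note.txt";
open(my $out, '>:encoding(UTF-8)', $path) or die "open failed: $!";
print {$out} $dirname;
close($out) or die "close failed: $!";

# Reading the file back without a decoding layer shows the octets on disk.
open(my $raw, '<:raw', $path) or die "open failed: $!";
my $octets = do { local $/; <$raw> };
close($raw);

printf "%d characters in, %d octets on disk\n",
    length($dirname), length($octets);
# prints: 4 characters in, 5 octets on disk
```

The print call needs no manual encoding because the layer was declared
once at open time; the mkdir call has no equivalent place to declare it.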

What we need is proper support for the transparent translation of
data between the internal representation and the outside encoding
everywhere. That means:

  a) Adding this translation feature to all the builtins doing IO.

  b) Adding a mechanism so that the developer can configure it (for
     instance, to set the filesystem encoding).

  c) Inferring sane defaults from the environment (UTF-8 has been the
     default encoding on most Linux/Unix systems for the last two
     decades, but Perl still expects Latin-1 on STDIO!).

And regarding (c), that's also why most of your examples above work.
Your terminal sends and expects UTF-8 encoded data, but perl expects
and sends Latin-1. It just happens that in your examples perl is
consistently wrong on both input and output, so the errors cancel out,
but there are myriad other cases where they don't.
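
The cancellation can be reproduced in memory, without a terminal. This
is only a sketch: Encode's ISO-8859-1 encoder is used here to model
what perl does on a handle with no encoding layer, and hex_of() is a
helper made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Helper (hypothetical, for display only): the string's octets in hex.
sub hex_of { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

my $utf8_octets = "\xC3\xA9";   # what a UTF-8 terminal sends for "é"

# No layer on input: each octet is taken as one Latin-1 character.
my $as_read = $utf8_octets;     # two characters, not one
# No layer on output: each character goes back out as one octet,
# so the two mistakes cancel and the terminal shows "é" again.
my $pass_through = encode('ISO-8859-1', $as_read);

# Decoding only the input (which is what "use utf8" does for literals)
# breaks the symmetry: one character U+00E9 leaves as a lone octet.
my $text       = decode('UTF-8', $utf8_octets);
my $latin1_out = encode('ISO-8859-1', $text);

# An :encoding(UTF-8) output layer restores it.
my $utf8_out = encode('UTF-8', $text);

print "pass-through: ", hex_of($pass_through), "\n";  # C3 A9
print "decoded only: ", hex_of($latin1_out),   "\n";  # E9
print "both layers:  ", hex_of($utf8_out),     "\n";  # C3 A9
```

The lone E9 octet in the middle case is not valid UTF-8, which is why a
UTF-8 terminal shows a replacement character for it.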