Mailing List Archive

Re: "use v5.36.0" should imply UTF-8 encoded source Leon Timmermans <fawaka@gmail.com>
On 2/8/21 2:34, Felipe Gasper wrote:
>
>
>> On Aug 1, 2021, at 10:23 AM, Leon Timmermans <fawaka@gmail.com> wrote:
>>
>> Code is not binary, it is text. E.g.:
>>
>> use 5.010;
>> { no utf8; say "éé" =~ /\N{LATIN SMALL LETTER E WITH ACUTE}/ ? "yes" : "no" };
>> { use utf8; say "éé" =~ /\N{LATIN SMALL LETTER E WITH ACUTE}/ ? "yes" : "no" };
>>
>> The status quo is only reasonable in that 95% of all code is actually ASCII, so it usually doesn't matter.
>
> Code is indeed text, but this is not reasonable:
>
>> perl -Mutf8 -e'print "é"'
> ?
>
> … particularly in contrast to this:
>
>> echo é | perl -Mutf8 -e 'print <>'
> é
>
> … and these:
>
>> node -e 'console.log("é")'
> é
>
>> python -c 'print("é")'
> é
>
>> ruby -e 'puts "é"'
> é
>
>> echo '<?php print "é" ?>' | php
> é
>
>> echo | awk '{print "é"}'
> é
>
>> julia -e'print("é")'
> é
>
>> lua -e'print "é"'
> é
>
>
> For Unicode-aware applications it is indeed useful to auto-decode the strings, but is it really worth making Perl’s “modern default” the exceptionally weird behaviour of making:
>
> perl -E'print "¡Hola, mundo!"'
>
> … *not* print the given text correctly?
>
> It just doesn’t seem a very workable “modern default”. How feasible, instead, would something like the following be:
>
> ------
>
> 1. Devote 2 bits of each SV to storing whether the PV is text or bytes:
>
> 0 0 = unknown
> 0 1 = text
> 1 0 = bytes
> 1 1 = reserved/unused

IMO that's not the correct way to approach the problem here.

Perl already has PerlIO that allows transparent encoding/decoding of
data on some IO interfaces, and that support should be expanded to
support all of them.

Otherwise you are asking the programmer to do that translation
explicitly every time some data goes through any builtin doing IO, as in:

mkdir do_encoding($dirname);

That doesn't make sense at all.

What we need is to add proper support for the transparent translation of
data between the internal representation and the outside encoding
everywhere. And that means:

a) Adding this translation feature to all the builtins doing IO
b) Adding a mechanism so that the developer can configure it (for
instance, set filesystem encoding).
c) Infer sane defaults from the environment (utf8 has been the default
encoding in most Linux/Unix systems for the last two decades, but Perl
still expects latin1 from STDIO!)

And regarding (c), that's also why most of your examples above work.
Your terminal sends and expects utf-8 encoded data from perl but perl
expects/sends latin1. It just happens that in your examples it is
consistently wrong, but there are myriads of other cases where it isn't.
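
[Editor's illustration, not part of the original message.] A minimal sketch of the two approaches contrasted above, assuming UTF-8 is the outside encoding. The variable names are hypothetical, and an in-memory filehandle stands in for a real stream. It shows the explicit encoding a programmer has to do today for builtins such as mkdir, next to the transparent encoding a PerlIO layer already provides for filehandles:

```perl
use strict;
use warnings;
use utf8;                       # the literal below is UTF-8 encoded source
use Encode qw(encode);

my $text = "é";                 # one character, U+00E9

# Explicit encoding: the step that would be needed before handing the
# string to a builtin with no encoding layer, e.g. mkdir($bytes).
my $bytes = encode('UTF-8', $text);     # two octets: 0xC3 0xA9

# PerlIO layer: the handle encodes characters transparently on output.
open my $out, '>:encoding(UTF-8)', \my $buffer or die "open: $!";
print {$out} $text;
close $out;

# Both routes end with the same octets.
printf "%vX / %vX\n", $bytes, $buffer;
```

The proposal in this message amounts to giving the other IO builtins the kind of configurable translation that the second half of the sketch gets from the layer.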
Re: "use v5.36.0" should imply UTF-8 encoded source
> On Aug 3, 2021, at 5:01 AM, Salvador Fandiño <sfandino@gmail.com> wrote:
>
> What we need is to add proper support for the transparent translation of data between the internal representation and the outside encoding everywhere.

For that to work Perl needs *some* stronger means of distinguishing text from binary. Ideally IMO that should be in the SV, but maybe other approaches would be better. It doesn’t do to assume every string is text; lots of Perl code does I/O on raw bytes, by design.

> And that means:
>
> a) Adding this translation feature to all the builtins doing IO
> b) Adding a mechanism so that the developer can configure it (for instance, set filesystem encoding).
> c) Infer sane defaults from the environment (utf8 has been the default encoding in most Linux/Unix systems for the last two decades, but Perl still expects latin1 from STDIO!)
>
> And regarding (c), that's also why most of your examples above work. Your terminal sends and expects utf-8 encoded data from perl but perl expects/sends latin1. It just happens that in your examples it is consistently wrong, but there are myriads of other cases where it isn't.

It sounds like you want STDIN/STDOUT/STDERR and @ARGV to be decoded by default, in addition to the source code. TBH that makes more sense to me than the present proposal since it would preserve parity of character encoding for simple programs, though I think it would still sow confusion since other filehandles will remain binary by default:

-----
use utf8;
pipe my $r, my $w;
print {$w} "¡Hola, mundo!"; # oops! I just made mojibake, but nothing told me.
-----
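
[Editor's illustration, not part of the original message.] For comparison, a sketch of what the programmer has to add today to make the pipe round-trip text, assuming both ends agree on UTF-8. `$got` is a name introduced here for illustration:

```perl
use strict;
use warnings;
use utf8;

pipe my $r, my $w or die "pipe: $!";

# Without these two lines the write end emits U+00A1 as the single
# octet 0xA1, which is not valid UTF-8 for whoever reads the pipe.
binmode $w, ':encoding(UTF-8)';   # encode characters on write
binmode $r, ':encoding(UTF-8)';   # decode octets on read

print {$w} "¡Hola, mundo!";
close $w;

my $got = do { local $/; <$r> };  # slurp everything from the read end
close $r;

binmode STDOUT, ':encoding(UTF-8)';
print "$got\n";
```

Nothing warns when the binmode calls are left out, which is the confusion described above: the layers are opt-in per handle.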

-F