Mailing List Archive: "use v5.36.0" should imply UTF-8 encoded source

Re: "use v5.36.0" should imply UTF-8 encoded source

darren at darrenduncan

Jul 31, 2021, 12:33 PM

Post #26 of 66

On 2021-07-31 12:17 p.m., Darren Duncan wrote:
> Now conversely, I don't have a problem with actually waiting until v5.38 to
> fully implement the change IF 5.36 contained some kind of precursor to prepare
> the way, such as that 5.36 would issue warnings for code with a "use 5.36" that
> wasn't valid UTF-8, saying that this code might parse differently under "use
> 5.38". That would let users know in a transitional version what might be a
> problem before it is.

So to clarify, I have a very specific proposal:

1. That a "use 5.36;" will behave the same with respect to the utf8 stuff as
"use 5.34;", but that if the source file / input stream is not entirely valid
UTF-8 under a strict interpretation, the Perl interpreter will issue a warning
saying so and why it matters.

2. That a "use 5.38;", if the source file / input stream is not entirely valid
UTF-8 under a strict interpretation, the Perl interpreter will issue a fatal
error / die saying so and why it matters, and that as a result the parsing has
failed.

So a key thing is that the UTF-8 mode triggered by 5.36/5.38 is strict: it
doesn't use substitution characters or delete characters; it either passes the
input unchanged as valid UTF-8 or it complains. If "use utf8;" already does this
then it's the same, and otherwise it is stricter.

Since this isn't spelled the same as "use utf8;" the new feature doesn't need to
be identical in every way. We don't have to limit ourselves to that, or to
inherit the issue of silent corruption from substitution/deletion being the
implicit operation, if that is what "use utf8;" does.

On a further point, unlike a lot of the other "use" statements, I assume there
is no good reason for a single file to be a mixture of literal encodings. So
having multiple "use encoding" statements in a file, either explicit or implied
by a "use 5.38" etc, should be considered an error, and any occurrence of one
would be expected to describe the entire file and not just the lexical scope it
appears in. Unlike strict/warnings/etc, it's not flipped on or off mid-file.
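
The two-stage policy described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not how the perl interpreter would implement it: the helper names utf8_problem and check_source are hypothetical, and the strictness check is delegated to the core Encode module, whose 'UTF-8' encoding (with the hyphen) is the strict variant.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns undef when $bytes is strictly valid UTF-8,
# otherwise the decoder's error message.
sub utf8_problem {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its argument when CHECK is set
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };
    return $ok ? undef : ($@ || 'unknown decoding error');
}

# Hypothetical policy: warn for a declared 5.36, die for 5.38 or later.
# $declared is a numeric version such as 5.036 or 5.038.
sub check_source {
    my ($bytes, $declared) = @_;
    my $problem = utf8_problem($bytes);
    return 1 unless defined $problem;
    die "Source is not valid UTF-8, parsing failed: $problem"
        if $declared >= 5.038;
    warn "Source is not valid UTF-8 and may parse differently later: $problem"
        if $declared >= 5.036;
    return 0;
}
```

Nothing is substituted or deleted here: the input either decodes cleanly or the caller is told why it did not.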

-- Darren Duncan

Re: "use v5.36.0" should imply UTF-8 encoded source

felipe at felipegasper

Jul 31, 2021, 1:15 PM

Post #27 of 66

> On Jul 31, 2021, at 4:16 AM, Yuki Kimoto <kimoto.yuki@gmail.com> wrote:
>
> 2021-7-31 16:17 Darren Duncan <darren@darrenduncan.net>:
> On 2021-07-30 11:15 p.m., Yuki Kimoto wrote:
> > 2021-7-30 23:46 Ricardo Signes wrote:
> > I propose that "use v5.36.0" should imply that the source code is,
> > subsequently, UTF-8 encoded.
> >
> > At least after v5.38+.
> >
> > It is good to change one by one.
> >
> > I want to see the effect and hear the user experience of "use warnings" in the
> > next release.
>
> I strongly disagree. The warnings and utf8 are unrelated features. These are
> each also minor changes considering they are lexical. Perl interpreter
> development is already moving at a relatively glacial pace, there is no benefit
> and a lot of downside of delaying the utf8 for a year just to see what people
> say after a production with warnings is released. The 5.36 is still about 9
> months away, that is plenty of time for people to give feedback on either that
> or the warnings.

Turning on warnings in the feature bundle will break things that worked under prior feature bundles, but the breakage will be visible and obvious.

Adding an auto-UTF-8-decode to all source text is a much more subtle breakage, and thus much more prone to confuse people. It’s basically the same type of change as making “my $foo = 123” parse the “123” in hex rather than decimal.

The proposal here is basically for “modern Perl” to make strings in the source code unable to be output as they are (integrally, that is). It seems *awfully* likely to confuse people. Even that aside, in, e.g., JavaScript or Python the interpreter could at least tell you, “hey, you’re trying to print a character string, and I don’t know what encoding you want.” Or, “whoa, that’s a byte string, and this output stream encodes to UTF-8.” Perl has no way of doing that.

Perl’s status quo is that all inputs are byte strings, and all outputs are byte strings. This is simple and consistent: until an application willingly interacts with something that needs or gives text strings (e.g., JSON), everything works similarly to “classic” C strings.

When we start worrying about “text”, though, confusion abounds: Perl can’t tell you when you’ve got the wrong “type”, and the language itself doesn’t even implement its own internal abstraction consistently (see CPAN’s Sys::Binmode). And how many interfaces out there neglect to document whether they expect/give encoded/decoded strings? Making “modern Perl” aggravate that further by defaulting to disparate encoding levels--inputs from the source will need encoding to be printed, but inputs from STDIN won’t … ?!?--will add even more “landmines”.

Decoding source code as UTF-8 makes tons of sense, but only *after* the critical first step is taken of teaching Perl to distinguish text from bytes. (I have ideas for how to achieve that, if there are folks here interested in discussing it further.) That way we can change the “modern” default, and, as with warnings, breakages will come with useful error messages that point to where the problem is and how to fix it.

As a side note, this will facilitate other, hugely useful improvements like making it practical to use Windows’s Unicode APIs, preventing double-encode/decode, etc.

Thanks to all who’ve read and considered.

cheers,
-Felipe

Re: "use v5.36.0" should imply UTF-8 encoded source

grinnz at gmail

Jul 31, 2021, 2:18 PM

Post #28 of 66

On Sat, Jul 31, 2021 at 3:33 PM Darren Duncan <darren@darrenduncan.net>
wrote:

> On 2021-07-31 12:17 p.m., Darren Duncan wrote:
> > Now conversely, I don't have a problem with actually waiting until v5.38 to
> > fully implement the change IF 5.36 contained some kind of precursor to prepare
> > the way, such as that 5.36 would issue warnings for code with a "use 5.36" that
> > wasn't valid UTF-8, saying that this code might parse differently under "use
> > 5.38". That would let users know in a transitional version what might be a
> > problem before it is.
>
> So to clarify, I have a very specific proposal:
>
> 1. That a "use 5.36;" will behave the same with respect to the utf8 stuff as
> "use 5.34;", but that if the source file / input stream is not entirely valid
> UTF-8 under a strict interpretation, the Perl interpreter will issue a warning
> saying so and why it matters.
>
> 2. That a "use 5.38;", if the source file / input stream is not entirely valid
> UTF-8 under a strict interpretation, the Perl interpreter will issue a fatal
> error / die saying so and why it matters, and that as a result the parsing has
> failed.
>
> So a key thing is that the UTF-8 mode triggered by 5.36/5.38 is strict, doesn't
> use substitution characters or delete characters, it either passes the input
> unchanged as valid UTF-8 or it complains. If "use utf8;" already does this then
> it's the same, and otherwise it is stricter.
>
> Since this isn't spelled the same as "use utf8;" the new feature doesn't need to
> be identical in every way, we don't have to limit ourselves to that and the
> issues of silent corruption from substitution/deleting being the implicit
> operation, if that is what it used to do.
>
> On a further point, unlike a lot of the other "use" statements, I assume there
> is no good reason for a single file to be a mixture of literal encodings, and so
> having multiple "use encoding" statements in a file, either explicit or implied
> by a "use 5.38" etc, should be considered an error, and any occurrence of one
> would be expected to describe the entire file and not just the lexical scope it
> appears in, unlike strict/warnings/etc, it's not flipped on or off mid-file.
>

You seem to be interpreting the major problem here as "source code which is
not valid or intended as UTF-8". This is not a significant issue and its
failure mode is rather obvious. There isn't a further discussion to be had
there.

The subtle issue is that "use utf8" changes (valid UTF-8) non-ascii literal
strings in the code to have different contents. Literal strings *must* be
used differently depending whether "use utf8" was active where they were
written. Without "use utf8", it's a byte string; with "use utf8", it's a
character string.
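
To make that concrete, here is a small sketch of how the same literal behaves in the two cases. To keep it independent of how this message or a saved file is encoded, the literal is spelled with byte escapes and then decoded with the core utf8::decode(), which should give the same string that "use utf8" produces for a UTF-8 encoded source. The variable names are only for illustration.

```perl
use strict;
use warnings;

# What "é" is without "use utf8": the two UTF-8 bytes, taken as-is.
my $as_bytes = "\xC3\xA9";

# What "é" is with "use utf8": one character, U+00E9.
my $as_chars = $as_bytes;
utf8::decode($as_chars) or die "not valid UTF-8";

printf "length: %d vs %d\n", length($as_bytes), length($as_chars);

# Character operations only see a letter in the decoded string.
my $uc_bytes = uc $as_bytes;    # unchanged: neither byte has an uppercase form
my $uc_chars = uc $as_chars;    # U+00C9, LATIN CAPITAL LETTER E WITH ACUTE

printf "uc changed the byte string? %s\n", $uc_bytes eq $as_bytes ? "no" : "yes";
printf "uc changed the char string? %s\n", $uc_chars eq $as_chars ? "no" : "yes";
```

The two strings need different handling from that point on: the byte string can be printed as it is, while the character string has to be encoded first.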

-Dan

Re: "use v5.36.0" should imply UTF-8 encoded source

darren at darrenduncan

Jul 31, 2021, 2:26 PM

Post #29 of 66

Thank you Felipe, your latest comment here is something I can much more easily
get behind.

In the wider case, I am actually much more of the same mind as you.

I actually consider matters of source text encoding to be very distinct and very
separate from all other matters of syntax.

From a purist perspective I actually believe it is best for any declarations of
source encoding to remain explicit and permanently separate from the "use vN;" etc.

The only reason I generally supported rolling a UTF-8 declaration into "use vN;"
was making it easier for Perl users to avoid extra boilerplate in a common case.

On further thought I think I will downgrade to neutral my level of support on
the proposal.

I have been thinking about these matters a lot for years in the context of my
own independent language/format/etc such as
https://github.com/muldis/Muldis_Object_Notation/blob/master/spec/Muldis_Object_Notation_Syntax_Plain_Text.md
which is in progress.

Because I have been able to design something greenfield, I have further
generalized something Perl and Raku have but many languages don't: the program
source code itself explicitly declares what it is, so it can be interpreted
most reliably as intended by the writer, rather than relying on external
context.

In particular, source code supports 3 very distinct explicit declarations:

1. "script" - What the character encoding of the source is. This is intended to
disambiguate when there is no 100% reliable heuristic to determine it from
analyzing the byte stream itself. Parsers are expected to support UTF-8 (and
hence also ASCII) at an absolute minimum, and others optionally. Also parsers
are in the general case always intended to take their input as octet strings and
tokenizers would take and return octets rather than characters.

2. "syntax" - What concrete syntax or grammar or format applies to the file. In
the sense that say JSON or XML or YAML or SQL or whatever are syntaxes.

3. "model" - What data model applies to the file, loosely what data model type
each literal etc maps to. For example do Integer and Fraction literals map to
distinct types or to the same type.

Now this is designed around static syntaxes where one can completely and
unambiguously parse a source code string without any knowledge of user-defined
operators or whatever. That is in contrast to Perl and Raku, where the parser
itself changes how it interprets things as it goes along based on higher level
user-defined things; my language/format intentionally doesn't do that.

Given that Perl is quite different and has its legacy, what I'm saying above has
very limited applicability to the current Perl discussion, however I feel that
the Perl community can still learn lessons from it.

-- Darren Duncan

On 2021-07-31 1:15 p.m., Felipe Gasper wrote:
> Turning on warnings in the feature bundle will break things that worked under prior feature bundles, but the breakage will be visible and obvious.
>
> Adding an auto-UTF-8-decode to all source text is a much more subtle breakage, and thus much more prone to confuse people. It’s basically the same type of change as making “my $foo = 123” parse the “123” in hex rather than decimal.
>
> The proposal here is basically for “modern Perl” to make strings in the source code unable to be output as they are (integrally, that is). It seems *awfully* likely to confuse people. Even that aside, in, e.g., JavaScript or Python the interpreter could at least tell you, “hey, you’re trying to print a character string, and I don’t know what encoding you want.” Or, “whoa, that’s a byte string, and this output stream encodes to UTF-8.” Perl has no way of doing that.
>
> Perl’s status quo is that all inputs are byte strings, and all outputs are byte strings. This is simple and consistent: until an application willingly interacts with something that needs or gives text strings (e.g., JSON), everything works similarly to “classic” C strings.
>
> When we start worrying about “text”, though, confusion abounds: Perl can’t tell you when you’ve got the wrong “type”, and the language itself doesn’t even implement its own internal abstraction consistently (see CPAN’s Sys::Binmode). And how many interfaces out there neglect to document whether they expect/give encoded/decoded strings? Making “modern Perl” aggravate that further by defaulting to disparate encoding levels--inputs from the source will need encoding to be printed, but inputs from STDIN won’t … ?!?--will add even more “landmines”.
>
> Decoding source code as UTF-8 makes tons of sense, but only *after* the critical first step is taken of teaching Perl to distinguish text from bytes. (I have ideas for how to achieve that, if there are folks here interested in discussing it further.) That way we can change the “modern” default, and, as with warnings, breakages will come with useful error messages that point to where the problem is and how to fix it.
>
> As a side note, this will facilitate other, hugely useful improvements like making it practical to use Windows’s Unicode APIs, preventing double-encode/decode, etc.

Re: "use v5.36.0" should imply UTF-8 encoded source

felipe at felipegasper

Jul 31, 2021, 5:53 PM

Post #30 of 66

> On Jul 31, 2021, at 5:18 PM, Dan Book <grinnz@gmail.com> wrote:
>
> On Sat, Jul 31, 2021 at 3:33 PM Darren Duncan <darren@darrenduncan.net> wrote:
> On 2021-07-31 12:17 p.m., Darren Duncan wrote:
> > Now conversely, I don't have a problem with actually waiting until v5.38 to
> > fully implement the change IF 5.36 contained some kind of precursor to prepare
> > the way, such as that 5.36 would issue warnings for code with a "use 5.36" that
> > wasn't valid UTF-8, saying that this code might parse differently under "use
> > 5.38". That would let users know in a transitional version what might be a
> > problem before it is.
>
> So to clarify, I have a very specific proposal:
>
> 1. That a "use 5.36;" will behave the same with respect to the utf8 stuff as
> "use 5.34;", but that if the source file / input stream is not entirely valid
> UTF-8 under a strict interpretation, the Perl interpreter will issue a warning
> saying so and why it matters.
>
> 2. That a "use 5.38;", if the source file / input stream is not entirely valid
> UTF-8 under a strict interpretation, the Perl interpreter will issue a fatal
> error / die saying so and why it matters, and that as a result the parsing has
> failed.
>
> So a key thing is that the UTF-8 mode triggered by 5.36/5.38 is strict, doesn't
> use substitution characters or delete characters, it either passes the input
> unchanged as valid UTF-8 or it complains. If "use utf8;" already does this then
> it's the same, and otherwise it is stricter.
>
> Since this isn't spelled the same as "use utf8;" the new feature doesn't need to
> be identical in every way, we don't have to limit ourselves to that and the
> issues of silent corruption from substitution/deleting being the implicit
> operation, if that is what it used to do.
>
> On a further point, unlike a lot of the other "use" statements, I assume there
> is no good reason for a single file to be a mixture of literal encodings, and so
> having multiple "use encoding" statements in a file, either explicit or implied
> by a "use 5.38" etc, should be considered an error, and any occurrence of one
> would be expected to describe the entire file and not just the lexical scope it
> appears in, unlike strict/warnings/etc, it's not flipped on or off mid-file.
>
> You seem to be interpreting the major problem here as "source code which is not valid or intended as UTF-8". This is not a significant issue and its failure mode is rather obvious. There isn't a further discussion to be had there.
>
> The subtle issue is that "use utf8" changes (valid UTF-8) non-ascii literal strings in the code to have different contents. Literal strings *must* be used differently depending whether "use utf8" was active where they were written. Without "use utf8", it's a byte string; with "use utf8", it's a character string.

Another way to look at it: the content of the parsed strings actually differs between the two:

my $x = do { no utf8; "éé" };
my $y = do { use utf8; "éé" };

In the above, $x is a sequence of 4 code points (195, 169, 195, 169), whereas $y is a sequence of 2 code points (233, 233). That’s it; there is no other difference between $x and $y. Perl doesn’t know that $x is a “byte string” and $y is a “character string”; it just knows the code points.
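
For anyone who wants to check those numbers, a minimal sketch follows. It assumes the source file is UTF-8 encoded in the "no utf8" / "use utf8" example above; here the two results are written with escapes instead, so the sketch gives the same answer however it is saved. Encode is a core module.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $x = "\xC3\xA9\xC3\xA9";    # "éé" as parsed under "no utf8"
my $y = "\x{E9}\x{E9}";        # "éé" as parsed under "use utf8"

my @x_points = map { ord } split //, $x;
my @y_points = map { ord } split //, $y;

print "x: @x_points\n";    # 195 169 195 169
print "y: @y_points\n";    # 233 233

# The relationship between the two: $x is the UTF-8 encoding of $y.
print "x is the UTF-8 encoding of y\n" if $x eq encode('UTF-8', $y);
```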

This would, I think, easily be the most disruptive, potentially “surprising” change yet introduced to a feature bundle.

-FG

Re: "use v5.36.0" should imply UTF-8 encoded source

Aug 1, 2021, 1:01 AM

Post #31 of 66

From the keyboard of Felipe Gasper [31.07.21,20:53]:

[..]
> Another way to look at it: the content of the parsed strings actually differs between the two:
>
> my $x = do { no utf8; "éé" };
> my $y = do { use utf8; "éé" };
>
> In the above, $x is a sequence of 4 code points (195, 169, 195, 169), whereas $y is a sequence of 2 code points (233, 233). That’s it; there is no other difference between $x and $y. Perl doesn’t know that $x is a “byte string” and $y is a “character string”; it just knows the code points.

This actually depends on the utf8-awareness of the editor used to input
that program text. Entered on a terminal with LANG=en_GB.utf8 via vi, both
$x and $y are a sequence of 4 code points, the latter with the UTF8 flag
set which condenses two code points into chr(233). Why? See explanation
below, and please correct me if I am wrong.

Program written with LANG=en_GB.utf8 and its output piped to less(1):

#!/usr/bin/perl
use 5.10.0;
use Devel::Peek;
my $x = do { no utf8; "éé" };
my $y = do { use utf8; "éé" };
my $z = chr(233) x 2;
$| = 1;
say "\$x: ",$x; Dump $x;
say "\$y: ",$y; Dump $y;
say "\$z: ",$z; Dump $z;
__END__
$x: éé
SV = PV(0x8debc0) at 0x904530
REFCNT = 1
FLAGS = (PADMY,POK,IsCOW,pPOK)
PV = 0x908630 "\303\251\303\251"\0
CUR = 4
LEN = 10
COW_REFCNT = 1
$y: <E9><E9>
SV = PV(0x8dec40) at 0x904548
REFCNT = 1
FLAGS = (PADMY,POK,IsCOW,pPOK,UTF8)
PV = 0x9085d0 "\303\251\303\251"\0 [UTF8 "\x{e9}\x{e9}"]
CUR = 4
LEN = 10
COW_REFCNT = 1
$z: <E9><E9>
SV = PV(0x8dea90) at 0x904b18
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x8f5de0 "\351\351"\0
CUR = 2
LEN = 10

This had me confused all the time: why does a utf8 literal with the UTF8
flag set result in an ISO-8859 sequence? That's because the utf8 feature
was introduced in times when terminals defaulted to some latin-1 variant
and allowed use of UTF-8 which resulted in the appropriate latin-1 string
representation. Now that terminals, editors and such pretty much always
default to using UTF-8, the utf8 pragma is meaningless except for weird cases
in which you want your literals to be treated as latin-1.

Entering the above program text in a terminal with LANG=en_GB.ISO-8859-1
produces the following:

Malformed UTF-8 character (unexpected non-continuation byte 0xe9, immediately after start byte 0xe9) at utf8-iso.pl line 5.
Malformed UTF-8 character (1 byte, need 3, after start byte 0xe9) at utf8-iso.pl line 5.
$x: éé
SV = PV(0x203dbc0) at 0x2063630
REFCNT = 1
FLAGS = (PADMY,POK,IsCOW,pPOK)
PV = 0x20678b0 "\351\351"\0
CUR = 2
LEN = 10
COW_REFCNT = 1
$y: ^@^@
SV = PV(0x203dc40) at 0x2063648
REFCNT = 1
FLAGS = (PADMY,POK,IsCOW,pPOK,UTF8)
PV = 0x2067850 "\0\0"\0 [UTF8 "\x{0}\x{0}"]
CUR = 2
LEN = 10
COW_REFCNT = 1
$z: éé
SV = PV(0x203da90) at 0x2063cf0
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x2054e90 "\351\351"\0
CUR = 2
LEN = 10
$s: éé
SV = PV(0x203dc50) at 0x2063ca8
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x2042360 "\351\351"\0
CUR = 2
LEN = 10

So, to produce "éé" in $y, line 5 should be - in an ISO-8859 or latin-1
environment - properly written as

my $y = do { use utf8; "Ã©Ã©" };

because then the literal is valid UTF-8 expressed in latin-1.
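
Where that odd-looking literal comes from can be sketched with the core Encode module: take the UTF-8 bytes of "éé" and read them back as if they were Latin-1. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text       = "\x{E9}\x{E9}";                       # the two characters "éé"
my $utf8_bytes = encode('UTF-8', $text);               # bytes C3 A9 C3 A9
my $as_latin1  = decode('ISO-8859-1', $utf8_bytes);    # each byte becomes a character

# Four Latin-1 characters: U+00C3 (Ã) and U+00A9 (©), twice over.
my $listing = join ' ', map { sprintf 'U+%04X', ord } split //, $as_latin1;
print "$listing\n";
```

A Latin-1 editor shows exactly those four characters, which is why the literal has to be typed that way for "use utf8" to see valid UTF-8.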

> This would, I think, easily be the most disruptive, potentially “surprising” change yet introduced to a feature bundle.
>
> -FG

I agree. And as said above, the utf8 pragma is useless almost all of the
time and people get its effect backwards, since nowadays most work in a
UTF-8 aware environment.

If you write your programs in a UTF-8 environment and get your output to
the same, perl already does the right thing, no matter whether you output
bytes or characters, because those bytes actually represent valid UTF-8.

In a UTF-8 environment perl already does the right thing reading your
program. Characters vs. bytes gets interesting in regexes, but that's a
well covered area, and then there's substr, chop/chomp for which detection
or explicit declaration of bytes vs. chars makes sense.

0--gg-

--
_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ /
/\_¯/(q /
---------------------------- \__(m.====·.(_("always off the crowd"))."·
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}

Re: "use v5.36.0" should imply UTF-8 encoded source

Aug 1, 2021, 1:24 AM

Post #32 of 66

From the keyboard of shmem [01.08.21,10:01]:

> From the keyboard of Felipe Gasper [31.07.21,20:53]:
>
> [..]
>> Another way to look at it: the content of the parsed strings actually
>> differs between the two:
>>
>> my $x = do { no utf8; "éé" };
>> my $y = do { use utf8; "éé" };
>>
>> In the above, $x is a sequence of 4 code points (195, 169, 195, 169),
>> whereas $y is a sequence of 2 code points (233, 233). That’s it; there is
>> no other difference between $x and $y. Perl doesn’t know that $x is a “byte
>> string” and $y is a “character string”; it just knows the code points.
>
> This actually depends on the utf8-awareness of the editor used to input
> that program text. Entered on a terminal with LANG=en_GB.utf8 via vi, both
> $x and $y are a sequence of 4 code points, the latter with the UTF8 flag
> set which condenses two code points into chr(233). Why? See explanation
> below, and please correct me if I am wrong.

Correcting myself: $y *is* two code points, the internal representation
is 4 bytes. Without the UTF8 flag the internal representation is idem
with code points.

> PV = 0x9085d0 "\303\251\303\251"\0 [UTF8 "\x{e9}\x{e9}"]
code points ---------------------------------^^^^^^^^^^^^

Sorry for my confusion :-P
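
The difference between code points and internal representation can also be reproduced without Devel::Peek. The sketch below uses only core functions (utf8::upgrade, utf8::is_utf8, bytes::length); it is meant as an illustration of the correction above, and ordinary code should not normally need to look at the internal buffer.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling byte semantics

my $z = chr(233) x 2;    # stored as two bytes, E9 E9, UTF8 flag off
my $u = $z;
utf8::upgrade($u);       # same characters, now stored as C3 A9 C3 A9, flag on

print "same string\n" if $z eq $u;
printf "characters:   %d and %d\n", length($z), length($u);
printf "stored bytes: %d and %d\n", bytes::length($z), bytes::length($u);
printf "UTF8 flag:    %s and %s\n",
    (utf8::is_utf8($z) ? "on" : "off"),
    (utf8::is_utf8($u) ? "on" : "off");
```

Both variables hold the same two code points; only the storage differs, which is what the bracketed [UTF8 ...] part of the dump shows.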

0--gg-


Re: "use v5.36.0" should imply UTF-8 encoded source

fawaka at gmail

Aug 1, 2021, 7:23 AM

Post #33 of 66

On Fri, Jul 30, 2021 at 8:46 PM Felipe Gasper <felipe@felipegasper.com>
wrote:

> FWIW I think it’s easier to think of the default I/O mode as “bytes” or
> “native 8-bit encoding” rather than “Latin-1”. In that light it’s easier
> to see the status quo as the more reasonable default: we parse the code as
> bytes, and we print as bytes.
>

Code is not binary, it is text. E.g.:

use 5.010;
{ no utf8; say "éé" =~ /\N{LATIN SMALL LETTER E WITH ACUTE}/ ? "yes" : "no" };
{ use utf8; say "éé" =~ /\N{LATIN SMALL LETTER E WITH ACUTE}/ ? "yes" : "no" };

The status quo is only reasonable in that 95% of all code is actually
ASCII, so it usually doesn't matter.
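
The example above depends on the file being saved as UTF-8. A variant that does not, as a rough sketch: the two strings are spelled with escapes to stand for what the parser produces in each mode (an assumption of this sketch, not part of the original example).

```perl
use strict;
use warnings;
use charnames ':full';    # only needed for \N{...} on perls older than 5.16

my $pattern = qr/\N{LATIN SMALL LETTER E WITH ACUTE}/;

my $no_utf8  = "\xC3\xA9\xC3\xA9";    # "éé" from a UTF-8 source under "no utf8"
my $use_utf8 = "\x{E9}\x{E9}";        # "éé" from a UTF-8 source under "use utf8"

my $no_utf8_result  = $no_utf8  =~ $pattern ? "yes" : "no";
my $use_utf8_result = $use_utf8 =~ $pattern ? "yes" : "no";

print "no utf8:  $no_utf8_result\n";     # no
print "use utf8: $use_utf8_result\n";    # yes
```

Only the decoded string contains U+00E9; the undecoded one contains the code points 195 and 169, so the named character never matches.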

> Changing it so that the (“modern”) default is to decode strings as UTF-8
> but still output them as bytes seems likely to introduce lots of confusion,
> which will either a) discourage adoption of “use v5.36”, or b) discourage
> use of Perl at all:
>
> Anti-Perler: Hey that new Perl script you wrote mangles our CEO’s name.
> Perler: That’s weird … I used the modern defaults … wonder where the bug
> is …
> Anti-Perler: Maybe you should just switch to $otherlang, where this stuff
> doesn’t happen.

TBH I expect the exact opposite to happen.

Leon

Re: "use v5.36.0" should imply UTF-8 encoded source

felipe at felipegasper

Aug 1, 2021, 5:34 PM

Post #34 of 66

> On Aug 1, 2021, at 10:23 AM, Leon Timmermans <fawaka@gmail.com> wrote:
>
> Code is not binary, it is text. E.g.:
>
> use 5.010;
> { no utf8; say "éé" =~ /\N{LATIN SMALL LETTER E WITH ACUTE}/ ? "yes" : "no" };
> { use utf8; say "éé" =~ /\N{LATIN SMALL LETTER E WITH ACUTE}/ ? "yes" : "no" };
>
> The status quo is only reasonable in that 95% of all code is actually ASCII, so it usually doesn't matter.

Code is indeed text, but this is not reasonable:

> perl -Mutf8 -e'print "é"'
?

… particularly in contrast to this:

> echo é | perl -Mutf8 -e 'print <>'
é

… and these:

> node -e 'console.log("é")'
é

> python -c 'print("é")'
é

> ruby -e 'puts "é"'
é

> echo '<?php print "é" ?>' | php
é

> echo | awk '{print "é"}'
é

> julia -e'print("é")'
é

> lua -e'print "é"'
é

For Unicode-aware applications it is indeed useful to auto-decode the strings, but is it really worth making Perl’s “modern default” the exceptionally weird behaviour of making:

perl -E'print "¡Hola, mundo!"'

… *not* print the given text correctly?
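
What actually reaches the terminal in that case can be inspected with in-memory filehandles. This is a sketch only: the string is built with utf8::decode() to stand in for a literal parsed under "use utf8", and the handles are stand-ins for STDOUT with and without an encoding layer.

```perl
use strict;
use warnings;

my $text = "\xC3\xA9";
utf8::decode($text);    # now the single character U+00E9, as under "use utf8"

# Print to in-memory handles so the emitted bytes can be inspected.
open my $raw, '>', \my $raw_out or die $!;
print {$raw} $text;
close $raw;

open my $enc, '>:encoding(UTF-8)', \my $enc_out or die $!;
print {$enc} $text;
close $enc;

my $raw_hex = join ' ', map { sprintf '%02X', ord } split //, $raw_out;
my $enc_hex = join ' ', map { sprintf '%02X', ord } split //, $enc_out;

print "no layer:        $raw_hex\n";    # E9
print ":encoding layer: $enc_hex\n";    # C3 A9
```

Without a layer the handle receives the lone byte E9, which is not valid UTF-8, so a UTF-8 terminal shows a replacement character. Nothing warns about it, because every character fits in one byte.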

It just doesn’t seem a very workable “modern default”. How feasible, instead, would something like the following be:

------

1. Devote 2 bits of each SV to storing whether the PV is text or bytes:

0 0 = unknown
0 1 = text
1 0 = bytes
1 1 = reserved/unused

2. Create string::decode_utf8() and string::encode_utf8() built-ins that access those bits. (Or string::from('UTF-16LE', …) etc.)

3. Under `use experimental 'autoencode'` blocks, teach Perl to auto-encode text strings when printing them (or otherwise sending them to the OS). Such blocks would also imply `use utf8`.

4. Outside such blocks, any operations on the strings reset the bytes/text bits.

Then, once/if that feature works, Perl can *really* up its game: better Windows support, JSON could fail if asked to encode binary or decode text, etc.
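
As a rough way to picture steps 1 to 3, the tagging can be mimicked in pure Perl with a wrapper object. Everything in this sketch is hypothetical: the TaggedString package, its method names, and the for_output helper are illustrations only, and a real implementation would live in the SV flags and the I/O layer rather than in a class. It needs perl 5.14 or later for the package block syntax.

```perl
use strict;
use warnings;

package TaggedString {    # hypothetical; stands in for the two SV bits
    use Encode ();
    use constant { UNKNOWN => 0, TEXT => 1, BYTES => 2 };

    sub new {
        my ($class, $value, $kind) = @_;
        return bless { value => $value, kind => $kind // UNKNOWN() }, $class;
    }
    sub kind  { $_[0]{kind} }
    sub value { $_[0]{value} }

    # Step 2: decode/encode set the tag and refuse to run twice.
    sub decode_utf8 {
        my ($self) = @_;
        die "refusing to decode a text string\n" if $self->{kind} == TEXT();
        my $copy = $self->{value};    # decode() may modify its argument
        my $chars = Encode::decode('UTF-8', $copy, Encode::FB_CROAK());
        return TaggedString->new($chars, TEXT());
    }
    sub encode_utf8 {
        my ($self) = @_;
        die "refusing to encode a byte string\n" if $self->{kind} == BYTES();
        return TaggedString->new(Encode::encode('UTF-8', $self->{value}), BYTES());
    }

    # Step 3: what an auto-encoding print would hand to the OS.
    sub for_output {
        my ($self) = @_;
        return $self->{kind} == TEXT()
            ? $self->encode_utf8->value
            : $self->{value};
    }
}

my $from_stdin  = TaggedString->new("\xC3\xA9", TaggedString::BYTES());
my $from_source = $from_stdin->decode_utf8;    # as if parsed under "use utf8"

# Both produce the same bytes on output, with no manual encode step.
print "same output\n" if $from_stdin->for_output eq $from_source->for_output;
```

With the tag in place, a double decode or double encode becomes an error with a message instead of silent mojibake.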

-F

Re: "use v5.36.0" should imply UTF-8 encoded source

kimoto.yuki at gmail

Aug 1, 2021, 10:11 PM

Post #35 of 66

2021-8-1 9:54 Felipe Gasper <felipe@felipegasper.com> wrote:

>
> Another way to look at it: the content of the parsed strings actually
> differs between the two:
>
> my $x = do { no utf8; "éé" };
> my $y = do { use utf8; "éé" };
>
>
Felipe

I have a question.

I think the question is which default is better in 2021 for general
application users.

The existing code is "no utf8" so it won't break.

In the new code, the generally recommended way is

use strict;
use warnings;
use utf8;

If a user needs the old behavior, they need to write

use v5.xx;
no utf8;

Are you aware that this is a change of default, not a change of internal
representation?

Re: "use v5.36.0" should imply UTF-8 encoded source

kimoto.yuki at gmail

Aug 1, 2021, 10:31 PM

Post #36 of 66

2021-8-1 4:17 Darren Duncan <darren@darrenduncan.net> wrote:

> However I also agree that there is plenty of time to properly have that
> discussion, such that we would know whether or not it is a good idea for
> "use
> vN;" to do something about UTF-8, so that if we agree it is a good idea,
> we can
> implement it for "v5.36", rather than putting it off to "v5.38".
>
>
Darren

It may have been discussed internally among those of you involved.

However, I first saw this only a few days ago on the p5p mailing list.

In other words, that was the first time an outside user could see it.

Perl has a long history of character code confusion. Old code may be left
unattended.

The goals you see are probably the same as mine.

The problem is the process.

People will say, "Oh, before I even knew about it or could discuss it,
"use utf8;" was introduced by default."

I think the Perl core team needs to send a lot of messages to end users
about what we recommend for the proper treatment of Perl character encoding.

Re: "use v5.36.0" should imply UTF-8 encoded source

fawaka at gmail

Aug 2, 2021, 1:12 AM

Post #37 of 66

On Mon, Aug 2, 2021 at 2:35 AM Felipe Gasper <felipe@felipegasper.com>
wrote:

> Code is indeed text, but this is not reasonable:
>
> > perl -Mutf8 -e'print "é"'
> ?
>
> … particularly in contrast to this:
>
> > echo é | perl -Mutf8 -e 'print <>'
> é
>
> … and these:
>
> > node -e 'console.log("é")'
> é
>
> > python -c 'print("é")'
> é
>
> > ruby -e 'puts "é"'
> é
>
> > echo '<?php print "é" ?>' | php
> é
>
> > echo | awk '{print "é"}'
> é
>
> > julia -e'print("é")'
> é
>
> > lua -e'print "é"'
> é
>

I don't think the proposal said anything about -e or -E

Leon

Re: "use v5.36.0" should imply UTF-8 encoded source

darren at darrenduncan

Aug 2, 2021, 1:43 AM

Post #38 of 66

On 2021-08-02 1:12 a.m., Leon Timmermans wrote:
> On Mon, Aug 2, 2021 at 2:35 AM Felipe Gasper wrote:
>
> Code is indeed text, but this is not reasonable:
>
> > perl -Mutf8 -e'print "é"'
> ?
> <snip>
>
> I don't think the proposal said anything about -e or -E

I think it implicitly did. What if someone puts a "use v5.36;" inside the -e or
-E argument string?

-- Darren Duncan

Re: "use v5.36.0" should imply UTF-8 encoded source

felipe at felipegasper

Aug 2, 2021, 3:54 AM

Post #39 of 66

> On Aug 2, 2021, at 4:12 AM, Leon Timmermans <fawaka@gmail.com> wrote:
>
> On Mon, Aug 2, 2021 at 2:35 AM Felipe Gasper <felipe@felipegasper.com> wrote:
> Code is indeed text, but this is not reasonable:
>
> > perl -Mutf8 -e'print "é"'
> ?
>
> … particularly in contrast to this:
>
> > echo é | perl -Mutf8 -e 'print <>'
> é
>
> … and these:
>
> > node -e 'console.log("é")'
> é
>
> > python -c 'print("é")'
> é
>
> > ruby -e 'puts "é"'
> é
>
> > echo '<?php print "é" ?>' | php
> é
>
> > echo | awk '{print "é"}'
> é
>
> > julia -e'print("é")'
> é
>
> > lua -e'print "é"'
> é
>
> I don't think the proposal said anything about -e or -E

1. -E loads the feature bundle. So if the 5.36 feature bundle is to include utf8.pm, so will -E, right?

2. The feature bundle represents a “best, most modern” set of default Perl features. The proposal will make Perl at its “best, most modern” complicate a simple “¡Hola, mundo” with a need to encode--and no indication of what’s wrong if that encoding is missing. Worse, the STDIN-piped variant of that code will still work as usual.

-F

Re: "use v5.36.0" should imply UTF-8 encoded source

felipe at felipegasper

Aug 2, 2021, 4:13 AM

Post #40 of 66

> On Aug 2, 2021, at 1:11 AM, Yuki Kimoto <kimoto.yuki@gmail.com> wrote:
>
>
> 2021-8-1 9:54 Felipe Gasper <felipe@felipegasper.com> wrote:
>
> Another way to look at it: the content of the parsed strings actually differs between the two:
>
> my $x = do { no utf8; "éé" };
> my $y = do { use utf8; "éé" };
>
>
> Felipe
>
> I have a question.
>
> I think the problem is which is the better default in 2021 for general application users.
>
> The existing code is "no utf8" so it won't break.
>
> In the new code, the generally recommended way is
>
> use strict;
> use warnings;
> use utf8;

Recommended by whom? I generally don’t `use utf8`, and $work actually forbids it. The status quo’s consistency (i.e., everything’s a byte string until something explicitly decodes it) far outpaces whatever value I’d get from having `length "é"` return 1 rather than 2.

> If a user needs the old behavior, they need to write
>
> use v5.xx;
> no utf8;
>
> Are you clearly aware that this is a change of default, not a change to the internal representation?

Yup, I know that this would only affect code that does `use 5.36`, or -E at the command line. The former would, by definition, be new code, and the latter is inherently unstable, so there’s no problem with the fact that it’s a behaviour change from default per se.

The problem is that the feature bundles, by definition, represent Perl at its ostensible best, its most modern. This particular proposal would make `perl -E'say "¡Hola, mundo!"'` print mojibake. That seems undesirable in the extreme; no other major language introduces that complexity for such a trivial task, and if one did, it would give some indication of what’s wrong rather than Perl’s “silent failure” approach.

This all said: if the desire is more to be able to use non-ASCII in identifier names (e.g., `sub épée { … }`), could a variant of utf8.pm be created that leaves string literals undecoded but just decodes sub names and the like? *That* would seem a reasonable improvement upon status quo.

-F

Re: "use v5.36.0" should imply UTF-8 encoded source [ In reply to ]

leonerd at leonerd

Aug 2, 2021, 7:23 AM

Post #41 of 66 (1414 views)

On Mon, 2 Aug 2021 06:54:43 -0400
Felipe Gasper <felipe@felipegasper.com> wrote:

> 1. -E loads the feature bundle. So if the 5.36 feature bundle is to
> include utf8.pm, so will -E, right?

We didn't say the :5.36 feature bundle would `use utf8`.

We suggested having `use v5.36` do so.

Compare to the way that `use v5.12` onwards will `use strict`, but even
perl5.34 -E does not:

$ perl -e 'use v5.12; $x = 123'
Global symbol "$x" requires explicit package name (did you forget to
declare "my $x"?) at -e line 1.
Execution of -e aborted due to compilation errors.

$ perl5.34.0 -E '$x = 123; say $x'
123

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/

Re: "use v5.36.0" should imply UTF-8 encoded source [ In reply to ]

felipe at felipegasper

Aug 2, 2021, 7:42 AM

Post #42 of 66 (1414 views)

> On Aug 2, 2021, at 10:23 AM, Paul LeoNerd Evans <leonerd@leonerd.org.uk> wrote:
>
> On Mon, 2 Aug 2021 06:54:43 -0400
> Felipe Gasper <felipe@felipegasper.com> wrote:
>
>> 1. -E loads the feature bundle. So if the 5.36 feature bundle is to
>> include utf8.pm, so will -E, right?
>
> We didn't say the :5.36 feature bundle would `use utf8`.
>
> We suggested having `use v5.36` do so.
>
> Compare to the way that `use v5.12` onwards will `use strict`, but even
> perl5.34 -E does not:
>
> $ perl -e 'use v5.12; $x = 123'
> Global symbol "$x" requires explicit package name (did you forget to
> declare "my $x"?) at -e line 1.
> Execution of -e aborted due to compilation errors.
>
> $ perl5.34.0 -E '$x = 123; say $x'
> 123

Fair enough.

My point is still that this:

-----
use v5.36;
print 'Hello, world!';
-----

… should not be “subtly wrong”.

-F

Re: "use v5.36.0" should imply UTF-8 encoded source [ In reply to ]

Aug 2, 2021, 8:04 AM

Post #43 of 66 (1414 views)

On Fri, 30 Jul 2021 10:45:53 -0400, "Ricardo Signes"
<perl.p5p@rjbs.manxome.org> wrote:

> Porters,
>
> I propose that "use v5.36.0" should imply that the source code is,
> subsequently, UTF-8 encoded.

+1

But the docs and release notes should very clearly state that this is
about the source code itself, as I see many discussions rambling about
the data handling being changed, which is not the case.

> Currently, I advise the following boilerplate:
> use v5.34.0;
> use warnings;
> use utf8;
>
> We're on the cusp of merging warnings in. Next, we merge in utf8.
> This shouldn't break existing programs, only programs that opt to
> change behavior by adding v5.36.0.
>
> With that, the boilerplate could be:
> use v5.36.0;
>
> This doesn't need to load utf8.pm, and could just alter $^H, but:
> whatever.

--
H.Merijn Brand https://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.33 porting perl5 on HP-UX, AIX, and Linux
https://tux.nl/email.html http://qa.perl.org https://www.test-smoke.org

Re: "use v5.36.0" should imply UTF-8 encoded source [ In reply to ]

rabbiveesh at gmail

Aug 2, 2021, 8:17 AM

Post #44 of 66 (1414 views)

>
>
> My point is still that this:
>
> -----
> use v5.36;
> print 'Hello, world!';
> -----
>
> … should not be “subtly wrong”.
>
> -F

Since 5.36 is meant to turn on warnings, this will be explicitly wrong, not
subtly.

Perhaps the "wide character" warning is too unclear, but we can always
improve the text to include a doc link as such.

What compels me more is the following example.
Let's say I'm looking for customers in my database named josé. Easy, I'll
use DBIC:

$customer_rs->search({ name => 'josé' })

But when I run it, I get nothing. That's because the various DBDs will
handle encoding and decoding for you, bc perl is meant to deal with text in
userland.

Had utf8 been turned on, then I would've started with text, not bytes, and
found my customers instead of mojibake (though on the other hand, the non
utf8 is a great way to find double encoded text).

I think this is a more realistic example than printing a string literal,
where the behavior is surprising and conceptually inconsistent.

Re: "use v5.36.0" should imply UTF-8 encoded source [ In reply to ]

grinnz at gmail

Aug 2, 2021, 8:25 AM

Post #45 of 66 (1414 views)

On Mon, Aug 2, 2021 at 11:17 AM Veesh Goldman <rabbiveesh@gmail.com> wrote:

>
>
>>
>> My point is still that this:
>>
>> -----
>> use v5.36;
>> print 'Hello, world!';
>> -----
>>
>> … should not be “subtly wrong”.
>>
>> -F
>
>
> Since 5.36 is meant to turn on warnings, this will be explicitly wrong,
> not subtly.
>
> Perhaps the "wide character" warning is too unclear, but we can always
> improve the text to include a doc link as such.
>
> What compels me more is the following example.
> Let's say I'm looking for customers in my database named josé. Easy, I'll
> use DBIC:
>
> $customer_rs->search({ name => 'josé' })
>
> But when I run it, I get nothing. That's because the various DBDs will
> handle encoding and decoding for you, bc perl is meant to deal with text in
> userland.
>
> Had utf8 been turned on, then I would've started with text, not bytes, and
> found my customers instead of mojibake (though on the other hand, the non
> utf8 is a great way to find double encoded text).
>
> I think this is a more realistic example than printing a string literal,
> where the behavior is surprising and conceptually inconsistent.
>

Yes, this is a tradeoff between interfaces that will expect bytes and
interfaces that will expect characters, as both exist in modern Perl.

STDOUT and STDERR expect bytes unless one does "use open ':std', IO =>
':encoding(UTF-8)';", which changes what those handles expect for every
piece of code that writes to them, so it isn't great. DBI drivers,
Mojolicious interfaces, etc expect characters.
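
As a sketch of the byte-oriented side of that tradeoff: the string below is written with an escape so it is a character string regardless of source encoding, and it is encoded explicitly at the output boundary with the core Encode module instead of changing the layers on STDOUT. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string: U+00A1 is the inverted exclamation mark.
my $text = "\x{A1}Hola, mundo!";

# STDOUT has no encoding layer here, so it expects octets.
# Encode explicitly just before printing.
my $octets = encode('UTF-8', $text);

printf "characters: %d, octets: %d\n", length($text), length($octets);
print $octets, "\n";
```

An interface that expects characters would be handed `$text` instead; the point is only that the caller has to know which kind each interface wants.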

I think it is both true that having "use utf8" in use VERSION will surprise
people, and not having it in use VERSION will continue to surprise people.

I think we can make this step with proper documentation, but we must
understand the concerns Felipe mentions are real.

-Dan

Re: "use v5.36.0" should imply UTF-8 encoded source [ In reply to ]

grinnz at gmail

Aug 2, 2021, 8:27 AM

Post #46 of 66 (1414 views)

On Mon, Aug 2, 2021 at 11:05 AM H.Merijn Brand <perl5@tux.freedom.nl> wrote:

> On Fri, 30 Jul 2021 10:45:53 -0400, "Ricardo Signes"
> <perl.p5p@rjbs.manxome.org> wrote:
>
> > Porters,
> >
> > I propose that "use v5.36.0" should imply that the source code is,
> > subsequently, UTF-8 encoded.
>
> +1
>
> But the docs and release notes should very clearly state that this is
> about the source code itself, as I see many discussions rambling about
> the data handling being changed, which is not the case.
>

The data in a literal non-ASCII string under "use utf8" is different from
the data in an identically-written string without "use utf8"; thus it is
factual that the data handling will change for these strings. That it's
changing because the source code itself was decoded doesn't change the
practical implications.

-Dan

Re: "use v5.36.0" should imply UTF-8 encoded source [ In reply to ]

grinnz at gmail

Aug 2, 2021, 8:29 AM

Post #47 of 66 (1414 views)

On Mon, Aug 2, 2021 at 11:17 AM Veesh Goldman <rabbiveesh@gmail.com> wrote:

>
>
>>
>> My point is still that this:
>>
>> -----
>> use v5.36;
>> print 'Hello, world!';
>> -----
>>
>> … should not be “subtly wrong”.
>>
>> -F
>
>
> Since 5.36 is meant to turn on warnings, this will be explicitly wrong,
> not subtly.
>
> Perhaps the "wide character" warning is too unclear, but we can always
> improve the text to include a doc link as such.
>

This is not always the case; the wide character warning only catches
strings with codepoints over 255, as they unambiguously cannot be byte
strings. Strings whose non-ASCII characters are all 255 or under get
blindly printed, effectively as ISO-8859-1.
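
A small sketch of that behaviour, writing to an in-memory filehandle and collecting warnings in an array (both choices are just for illustration, so nothing depends on the terminal):

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# An in-memory handle with no encoding layer, like a default STDOUT.
open my $fh, '>', \my $buffer or die "cannot open in-memory handle: $!";

print {$fh} "\x{E9}";     # U+00E9 fits in one byte: printed silently
print {$fh} "\x{263A}";   # U+263A is above 255: triggers the warning

close $fh;

print "warnings seen: ", scalar(@warnings), "\n";
print $warnings[0] if @warnings;
```

Only the second print is flagged; the first one goes out as a single byte with no complaint.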

-Dan

Re: "use v5.36.0" should imply UTF-8 encoded source [ In reply to ]

felipe at felipegasper

Aug 2, 2021, 8:31 AM

Post #48 of 66 (1414 views)

> On Aug 2, 2021, at 11:17 AM, Veesh Goldman <rabbiveesh@gmail.com> wrote:
>
>
>
>
>> My point is still that this:
>>
>> -----
>> use v5.36;
>> print 'Hello, world!';
>> -----
>>
>> … should not be “subtly wrong”.
>>
>> -F
>
> Since 5.36 is meant to turn on warnings, this will be explicitly wrong, not subtly.
>
> Perhaps the "wide character" warning is too unclear, but we can always improve the text to include a doc link as such.

There’s no “wide character” warning when there happen to be no wide characters.

>
> What compels me more is the following example.
> Let's say I'm looking for customers in my database named josé. Easy, I'll use DBIC:
>
> $customer_rs->search({ name => 'josé' })
>
> But when I run it, I get nothing. That's because the various DBDs will handle encoding and decoding for you, bc perl is meant to deal with text in userland.

Which DBDs?

- DBD::SQLite is bytes by default, but it has the SvPV bug (i.e., it sends the internal PV to SQLite).

- DBD::mysql is also bytes w/ SvPV bug by default.

(I haven’t tried DBD::Pg.)
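
To illustrate why passing along the internal buffer is a problem (a sketch only, not a reproduction of either driver's code): two strings can be equal at the Perl level while their internal buffers differ. `bytes::length` is used here purely to peek at the size of that internal buffer.

```perl
use strict;
use warnings;
require bytes;   # makes bytes::length() available without enabling the pragma

# The same one-character string, U+00E9, stored two ways.
my $downgraded = "\xE9";
my $upgraded   = "\xE9";
utf8::upgrade($upgraded);   # changes internal storage only, not the value

print "equal as Perl strings\n" if $downgraded eq $upgraded;

printf "internal sizes: %d vs %d\n",
    bytes::length($downgraded), bytes::length($upgraded);
```

Code that reads the raw buffer without checking how the string is stored will see different octets for these two equal strings.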

>
> Had utf8 been turned on, then I would've started with text, not bytes, and found my customers instead of mojibake (though on the other hand, the non utf8 is a great way to find double encoded text).
>
> I think this is a more realistic example than printing a string literal, where the behavior is surprising and conceptually inconsistent.

Why would you query on a string constant? More likely you’ll be accepting $name via some input, in which case you have to decode it. But if you tried it with a constant you may be confused as to why you *didn’t* have to decode it there.
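
A sketch of that asymmetry, with the input simulated as a string of octets (in real code it would come from a socket, a file, or @ARGV) and the literal written with an escape to stand in for what 'josé' would be under `use utf8`:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Simulated external input: the UTF-8 octets for "josé".
my $input_octets = "jos\xC3\xA9";

# Input always has to be decoded by hand. LEAVE_SRC keeps decode()
# from modifying $input_octets; FB_CROAK makes invalid input fatal.
my $name = decode('UTF-8', $input_octets,
                  Encode::FB_CROAK() | Encode::LEAVE_SRC());

# Stand-in for the literal 'josé' as parsed under `use utf8`.
my $literal = "jos\x{E9}";

print "decoded input matches literal\n"   if $name eq $literal;
print "raw input does not match literal\n" if $input_octets ne $literal;
```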

-F

Re: "use v5.36.0" should imply UTF-8 encoded source [ In reply to ]

Aug 2, 2021, 8:32 AM

Post #49 of 66 (1414 views)

On Mon, 2 Aug 2021 11:27:15 -0400, Dan Book <grinnz@gmail.com> wrote:

> On Mon, Aug 2, 2021 at 11:05 AM H.Merijn Brand <perl5@tux.freedom.nl>
> wrote:
>
> > On Fri, 30 Jul 2021 10:45:53 -0400, "Ricardo Signes"
> > <perl.p5p@rjbs.manxome.org> wrote:
> >
> > > Porters,
> > >
> > > I propose that "use v5.36.0" should imply that the source code is,
> > > subsequently, UTF-8 encoded.
> >
> > +1
> >
> > But the docs and release notes should very clearly state that this
> > is about the source code itself, as I see many discussions rambling
> > about the data handling being changed, which is not the case.
> >
>
> The data in a literal non-ASCII string under "use utf8" is different
> from the data in an identically-written string without "use utf8";
> thus it is factual that the data handling will change for these
> strings. That it's changing because the source code itself was
> decoded doesn't change the practical implications.

Yes, but as long as the script itself does not change its own start to
'use v5.36;' (or higher) it will work just as it did ever before.

I think the pros outweigh the cons when well-documented and warned about.

> -Dan

--
H.Merijn Brand https://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.33 porting perl5 on HP-UX, AIX, and Linux
https://tux.nl/email.html http://qa.perl.org https://www.test-smoke.org

Re: "use v5.36.0" should imply UTF-8 encoded source [ In reply to ]

felipe at felipegasper

Aug 2, 2021, 8:46 AM

Post #50 of 66 (1414 views)

> On Aug 2, 2021, at 11:25 AM, Dan Book <grinnz@gmail.com> wrote:
>
> STDOUT and STDERR expect bytes unless one does "use open ':std', IO => ':encoding(UTF-8)';" which changes the assumption of those interfaces so isn't great. DBI drivers, Mojolicious interfaces, etc expect characters.

Of note: DBI drivers, by default, seem more predominantly to expect *bytes*, not characters. At least, SQLite, MySQL, and PostgreSQL do.

-F