Mailing List Archive: "use v5.36.0" should imply UTF-8 encoded source

perl.p5p at rjbs

Jul 30, 2021, 7:45 AM

Post #1 of 66 (1816 views)

Porters,

I propose that "use v5.36.0" should imply that the source code is, subsequently, UTF-8 encoded.

Currently, I advise the following boilerplate:
use v5.34.0;
use warnings;
use utf8;

We're on the cusp of merging warnings in. Next, we merge in utf8. This shouldn't break existing programs, only programs that opt to change behavior by adding v5.36.0.

With that, the boilerplate could be:
use v5.36.0;

This doesn't need to load utf8.pm, and could just alter $^H, but: whatever.

--
rjbs

Re: "use v5.36.0" should imply UTF-8 encoded source

fawaka at gmail

Jul 30, 2021, 7:52 AM

Post #2 of 66 (1816 views)

On Fri, Jul 30, 2021 at 4:46 PM Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

> Porters,
>
> I propose that "use v5.36.0" should imply that the source code is,
> subsequently, UTF-8 encoded.
>
> Currently, I advise the following boilerplate:
>
> use v5.34.0;
> use warnings;
> use utf8;
>
>
> We're on the cusp of merging warnings in. Next, we merge in utf8. This
> shouldn't break existing programs, only programs that opt to change
> behavior by adding v5.36.0.
>
> With that, the boilerplate could be:
>
> use v5.36.0;
>
>
> This doesn't need to load utf8.pm, and could just alter $^H, but:
> whatever.
>
> --
> rjbs
>

I agree.

Re: "use v5.36.0" should imply UTF-8 encoded source

leonerd at leonerd

Jul 30, 2021, 7:57 AM

Post #3 of 66 (1816 views)

On Fri, 30 Jul 2021 10:45:53 -0400
"Ricardo Signes" <perl.p5p@rjbs.manxome.org> wrote:

> Porters,
>
> I propose that "use v5.36.0" should imply that the source code is,
> subsequently, UTF-8 encoded.

With PSC hat on: +1.

(We discussed this in the meeting today - minutes to come out when I
have written them up).

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/

Re: "use v5.36.0" should imply UTF-8 encoded source

grinnz at gmail

Jul 30, 2021, 8:01 AM

Post #4 of 66 (1816 views)

On Fri, Jul 30, 2021 at 10:46 AM Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

> Porters,
>
> I propose that "use v5.36.0" should imply that the source code is,
> subsequently, UTF-8 encoded.
>
> Currently, I advise the following boilerplate:
>
> use v5.34.0;
> use warnings;
> use utf8;
>
>
> We're on the cusp of merging warnings in. Next, we merge in utf8. This
> shouldn't break existing programs, only programs that opt to change
> behavior by adding v5.36.0.
>
> With that, the boilerplate could be:
>
> use v5.36.0;
>
>
> This doesn't need to load utf8.pm, and could just alter $^H, but:
> whatever.
>

+1: https://dev.to/grinnz/perl-7-a-modest-proposal-434m#apply-utf8

But I do think we need to take care that the documentation for this clearly
specifies the change in assumptions this necessitates, namely that
non-ASCII strings in the source will no longer be suitable for directly
printing to byte handles like STDOUT (by default).

-Dan
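
A minimal sketch of the caveat described above, assuming the file is saved as UTF-8 and that STDOUT is an ordinary byte handle with no encoding layer. The helper name `utf8_bytes` is made up for illustration; `encode` comes from the core Encode module.

```perl
use strict;
use warnings;
use utf8;                      # literals below are decoded to characters
use Encode qw(encode);

# Hypothetical helper: turn a character string into UTF-8 bytes
# that are suitable for printing to a byte handle.
sub utf8_bytes {
    my ($text) = @_;
    return encode('UTF-8', $text);
}

my $word = "épée";
printf "characters: %d\n", length($word);              # 4
printf "bytes:      %d\n", length(utf8_bytes($word));  # 6
print utf8_bytes($word), "\n";                         # explicit encode before output
```

Under this assumption, the character string and the byte string are different values, and only the encoded one should go to a byte handle.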

Re: "use v5.36.0" should imply UTF-8 encoded source

scott at perturb

Jul 30, 2021, 8:25 AM

Post #5 of 66 (1816 views)

I support this... and any other good changes that modernize Perl.

Thank you PSC for moving forward on changes like this. Signatures next?

- Scott

On 7/30/2021 7:45 AM, Ricardo Signes wrote:
> Porters,
>
> I propose that "use v5.36.0" should imply that the source code is,
> subsequently, UTF-8 encoded.
>
> Currently, I advise the following boilerplate:
> use v5.34.0;
> use warnings;
> use utf8;
>
> We're on the cusp of merging warnings in. Next, we merge in utf8.
> This shouldn't break existing programs, only programs that opt to
> change behavior by adding v5.36.0.
>
> With that, the boilerplate could be:
> use v5.36.0;
>
> This doesn't need to load utf8.pm, and could just alter $^H, but:
> whatever.
>
> --
> rjbs

Re: "use v5.36.0" should imply UTF-8 encoded source

felipe at felipegasper

Jul 30, 2021, 9:56 AM

Post #6 of 66 (1816 views)

> On Jul 30, 2021, at 10:45 AM, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>
> Porters,
>
> I propose that "use v5.36.0" should imply that the source code is, subsequently, UTF-8 encoded.
>
> Currently, I advise the following boilerplate:
> use v5.34.0;
> use warnings;
> use utf8;
>
>
> We're on the cusp of merging warnings in. Next, we merge in utf8. This shouldn't break existing programs, only programs that opt to change behavior by adding v5.36.0.

FWIW, I think this will regress Perl’s usability.

Probably the worst part about character encoding in Perl is that nothing indicates when you’ve over-encoded or under-encoded. But, at the very least everything right now is consistent by default: source code is parsed as bytes (“Latin-1”), and I/O happens as bytes. Thus, a “minimal-effort” approach to writing Perl will at least minimize the odds of encoding mismatches: you only run into trouble if you explicitly decode/encode.

If `use v5.36` is to disrupt that consistency by making source code UTF-8-decoded but *leaving* I/O as bytes, this seems likely to add another “shin-bumper” to use of Perl that doesn’t happen in languages that type byte strings differently from text strings.

So quick-and-simple things like `print "é"` will now, in “modern” Perl, break, with no indication of where/why until a human being comes along, notices the problem, and puts in the time to debug it.

It’s going to be particularly problematic with stuff like `mkdir "épée"` because now we’re *really* expecting the SvPV bug--where we give the raw PV to the kernel/OS--to stick around.

UTF-8 decoding by default is a fine idea, but until Perl can tell me the difference between a byte string and a character string, I think the change would yield more harm than good.

-FG
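
To make the "parsed as bytes" versus "UTF-8-decoded" distinction concrete, here is a small sketch. It assumes the file itself is saved as UTF-8; the two sub names are invented for illustration.

```perl
use strict;
use warnings;

# Without "use utf8", the literal is the two UTF-8 bytes 0xC3 0xA9.
sub byte_literal_length {
    no utf8;
    return length("é");
}

# With "use utf8", the same literal is decoded to one character, U+00E9.
sub char_literal_length {
    use utf8;
    return length("é");
}

printf "no utf8:  %d\n", byte_literal_length();   # 2
printf "use utf8: %d\n", char_literal_length();   # 1
```

The pragma is lexical, so both behaviours can coexist in one file, which is also what the proposed "use v5.36.0" bundle would rely on.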

Re: "use v5.36.0" should imply UTF-8 encoded source

fawaka at gmail

Jul 30, 2021, 10:48 AM

Post #7 of 66 (1816 views)

On Fri, Jul 30, 2021 at 6:56 PM Felipe Gasper <felipe@felipegasper.com>
wrote:

> FWIW, I think this will regress Perl’s usability.
>
> Probably the worst part about character encoding in Perl is that nothing
> indicates when you’ve over-encoded or under-encoded. But, at the very least
> everything right now is consistent by default: source code is parsed as
> bytes (“Latin-1”), and I/O happens as bytes. Thus, a “minimal-effort”
> approach to writing Perl will at least minimize the odds of encoding
> mismatches: you only run into trouble if you explicitly decode/encode.
>
> If `use v5.36` is to disrupt that consistency by making source code
> UTF-8-decoded but *leaving* I/O as bytes, this seems likely to add another
> “shin-bumper” to use of Perl that doesn’t happen in languages that type
> byte strings differently from text strings.
>
> So quick-and-simple things like `print "é"` will now, in “modern” Perl,
> break, with no indication of where/why until a human being comes along,
> notices the problem, and puts in the time to debug it.
>

It doesn't actually break. PerlIO will try to downgrade that for a
non-:utf8 handle, or upgrade for a :utf8 handle.

> It’s going to be particularly problematic with stuff like `mkdir "épée"`
> because now we’re *really* expecting the SvPV bug--where we give the raw PV
> to the kernel/OS--to stick around.
>

That problem exists with or without this change. That said, I don't think
I've ever seen a hard-coded non-ASCII path in a program, so I don't think
this is much of an issue.

Leon

Re: "use v5.36.0" should imply UTF-8 encoded source

grinnz at gmail

Jul 30, 2021, 10:54 AM

Post #8 of 66 (1816 views)

On Fri, Jul 30, 2021 at 1:48 PM Leon Timmermans <fawaka@gmail.com> wrote:

> On Fri, Jul 30, 2021 at 6:56 PM Felipe Gasper <felipe@felipegasper.com>
> wrote:
>
>> FWIW, I think this will regress Perl’s usability.
>>
>> Probably the worst part about character encoding in Perl is that nothing
>> indicates when you’ve over-encoded or under-encoded. But, at the very least
>> everything right now is consistent by default: source code is parsed as
>> bytes (“Latin-1”), and I/O happens as bytes. Thus, a “minimal-effort”
>> approach to writing Perl will at least minimize the odds of encoding
>> mismatches: you only run into trouble if you explicitly decode/encode.
>>
>> If `use v5.36` is to disrupt that consistency by making source code
>> UTF-8-decoded but *leaving* I/O as bytes, this seems likely to add another
>> “shin-bumper” to use of Perl that doesn’t happen in languages that type
>> byte strings differently from text strings.
>>
>> So quick-and-simple things like `print "é"` will now, in “modern” Perl,
>> break, with no indication of where/why until a human being comes along,
>> notices the problem, and puts in the time to debug it.
>>
>
> It doesn't actually break. PerlIO will try to downgrade that for a
> non-:utf8 handle, or upgrade for a :utf8 handle.
>

Not that it will break in implementation, but in logic. It will print the
ISO-8859-1 bytes instead of how it currently would print the UTF-8 encoded
bytes, since it started as that. (But also string operations on that
UTF-8-encoded string within the code would be wrong, but that doesn't
always matter.)

-Dan

Re: "use v5.36.0" should imply UTF-8 encoded source

felipe at felipegasper

Jul 30, 2021, 10:56 AM

Post #9 of 66 (1816 views)

> On Jul 30, 2021, at 1:48 PM, Leon Timmermans <fawaka@gmail.com> wrote:
>
> On Fri, Jul 30, 2021 at 6:56 PM Felipe Gasper <felipe@felipegasper.com> wrote:
> FWIW, I think this will regress Perl’s usability.
>
> Probably the worst part about character encoding in Perl is that nothing indicates when you’ve over-encoded or under-encoded. But, at the very least everything right now is consistent by default: source code is parsed as bytes (“Latin-1”), and I/O happens as bytes. Thus, a “minimal-effort” approach to writing Perl will at least minimize the odds of encoding mismatches: you only run into trouble if you explicitly decode/encode.
>
> If `use v5.36` is to disrupt that consistency by making source code UTF-8-decoded but *leaving* I/O as bytes, this seems likely to add another “shin-bumper” to use of Perl that doesn’t happen in languages that type byte strings differently from text strings.
>
> So quick-and-simple things like `print "é"` will now, in “modern” Perl, break, with no indication of where/why until a human being comes along, notices the problem, and puts in the time to debug it.
>
> It doesn't actually break. PerlIO will try to downgrade that for a non-:utf8 handle, or upgrade for a :utf8 handle.

It’ll downgrade it, but it won’t encode it, so you’ll get mojibake:

> perl -Mutf8 -e'print "é"'
?

> It’s going to be particularly problematic with stuff like `mkdir "épée"` because now we’re *really* expecting the SvPV bug--where we give the raw PV to the kernel/OS--to stick around.
>
> That problem exists with or without this change. That said, I don't think I've ever seen a hard-coded non-ascii path in a program, I don't think this is much of an issue.

The problem exists, yes, but this change will make the bug that much more painful to fix.

I would wager that folks using Perl in the context of non-Latin languages (Cyrillic, CJK, &c.) will be more likely to hard-code non-ASCII paths. I personally mostly do it for testing. And, of course, the problem pertains not just to filesystem paths, but to any string we give to the kernel (e.g., args to exec()).

-F
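
The "downgrade versus encode" difference can be observed without a terminal by writing to in-memory handles. This is only a sketch: it assumes the file is saved as UTF-8 and that nothing (such as PERL_UNICODE or -C) has changed the default layers. The helper name `bytes_written` is hypothetical.

```perl
use strict;
use warnings;
use utf8;

# Print $text to an in-memory handle opened with $layer
# and return the raw bytes that were written.
sub bytes_written {
    my ($layer, $text) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer or die "open: $!";
    print {$fh} $text;
    close $fh or die "close: $!";
    return $buffer;
}

my $raw     = bytes_written('',                 "é");   # plain byte handle
my $encoded = bytes_written(':encoding(UTF-8)', "é");   # handle with an encoding layer

printf "byte handle:  %vX\n", $raw;       # E9    (downgraded, not encoded)
printf "UTF-8 handle: %vX\n", $encoded;   # C3.A9 (encoded as UTF-8)
```

A UTF-8 terminal that receives the lone byte 0xE9 cannot render it, which is the mojibake shown in the one-liner above.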

Re: "use v5.36.0" should imply UTF-8 encoded source

fawaka at gmail

Jul 30, 2021, 11:27 AM

Post #10 of 66 (1816 views)

On Fri, Jul 30, 2021 at 7:56 PM Felipe Gasper <felipe@felipegasper.com>
wrote:

>
>
> > On Jul 30, 2021, at 1:48 PM, Leon Timmermans <fawaka@gmail.com> wrote:
> >
> > On Fri, Jul 30, 2021 at 6:56 PM Felipe Gasper <felipe@felipegasper.com>
> wrote:
> > FWIW, I think this will regress Perl’s usability.
> >
> > Probably the worst part about character encoding in Perl is that nothing
> indicates when you’ve over-encoded or under-encoded. But, at the very least
> everything right now is consistent by default: source code is parsed as
> bytes (“Latin-1”), and I/O happens as bytes. Thus, a “minimal-effort”
> approach to writing Perl will at least minimize the odds of encoding
> mismatches: you only run into trouble if you explicitly decode/encode.
> >
> > If `use v5.36` is to disrupt that consistency by making source code
> UTF-8-decoded but *leaving* I/O as bytes, this seems likely to add another
> “shin-bumper” to use of Perl that doesn’t happen in languages that type
> byte strings differently from text strings.
> >
> > So quick-and-simple things like `print "é"` will now, in “modern” Perl,
> break, with no indication of where/why until a human being comes along,
> notices the problem, and puts in the time to debug it.
> >
> > It doesn't actually break. PerlIO will try to downgrade that for a
> non-:utf8 handle, or upgrade for a :utf8 handle.
>
> It’ll downgrade it, but it won’t encode it, so you’ll get mojibake:
>
> > perl -Mutf8 -e'print "é"'
> ?
>

It will print mojibake as well if the script is latin-1 encoded. It's
mojibake because the terminal is utf-8, but the IO handle is latin1.

Leon

Re: "use v5.36.0" should imply UTF-8 encoded source

andrew at afresh1

Jul 30, 2021, 11:34 AM

Post #11 of 66 (1816 views)

On Fri, Jul 30, 2021 at 10:45:53AM -0400, Ricardo Signes wrote:
> Porters,
>
> I propose that "use v5.36.0" should imply that the source code is, subsequently, UTF-8 encoded.
>
> Currently, I advise the following boilerplate:
> use v5.34.0;
> use warnings;
> use utf8;

Tom recommends a bit more boilerplate than that, although the only
one (other than fatal warnings) that hasn't been mentioned is:

use open qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8

https://perldoc.perl.org/perlunicook#%E2%84%9E-0:-Standard-preamble

And wow would I like to get rid of the "wide character in output" from
my `perl -E` one-liners (I know, `perl -C -E`, but boilerplate!).

l8rZ,
--
andrew - http://afresh1.com

At the source of every error which is blamed on the computer, you
will find at least two human errors, including the error of blaming
it on the computer.

Re: "use v5.36.0" should imply UTF-8 encoded source

grinnz at gmail

Jul 30, 2021, 11:39 AM

Post #12 of 66 (1816 views)

On Fri, Jul 30, 2021 at 2:34 PM Andrew Hewus Fresh <andrew@afresh1.com>
wrote:

> On Fri, Jul 30, 2021 at 10:45:53AM -0400, Ricardo Signes wrote:
> > Porters,
> >
> > I propose that "use v5.36.0" should imply that the source code is,
> subsequently, UTF-8 encoded.
> >
> > Currently, I advise the following boilerplate:
> > use v5.34.0;
> > use warnings;
> > use utf8;
>
> Tom recommends a bit more boilerplate than that, although the only
> one (other than fatal warnings) that hasn't been mentioned is:
>
> use open qw(:std :encoding(UTF-8)); # undeclared streams in UTF-8
>
> https://perldoc.perl.org/perlunicook#%E2%84%9E-0:-Standard-preamble
>
> And wow would I like to get rid of the "wide character in output" from
> my `perl -E` one-liners (I know, `perl -C -E`, but boilerplate!).
>

This is not feasible because it affects STDIN/STDOUT/STDERR globally and so
will break the assumptions of any code in the process that uses them.
Lexically setting defaults only for handles opened in that scope may
be feasible for use VERSION (the equivalent of the same declaration without
:std), but since such filehandles can still be passed around that is also
something that would need a "risk assessment".

-Dan
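
As a rough illustration of the lexical variant (the declaration without :std), the sketch below writes the same character from two scopes and inspects the bytes on disk. It assumes no PERL_UNICODE or -C setting alters the default layers; the sub names are invented for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a file back as raw bytes, regardless of any default layers.
sub slurp_raw {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "open $path: $!";
    local $/;
    my $bytes = <$in>;
    close $in;
    return $bytes;
}

# No "use open" in this scope: the handle gets the default byte layers.
sub write_default {
    my ($path, $text) = @_;
    open my $out, '>', $path or die "open $path: $!";
    print {$out} $text;
    close $out or die "close $path: $!";
}

# "use open" without :std affects only opens compiled in this block.
sub write_utf8_scope {
    my ($path, $text) = @_;
    use open qw(:encoding(UTF-8));
    open my $out, '>', $path or die "open $path: $!";
    print {$out} $text;
    close $out or die "close $path: $!";
}

my (undef, $path) = tempfile(UNLINK => 1);

write_default($path, "\x{E9}");
printf "default scope: %vX\n", slurp_raw($path);   # E9

write_utf8_scope($path, "\x{E9}");
printf "utf8 scope:    %vX\n", slurp_raw($path);   # C3.A9
```

The handle keeps its layer after it leaves the scope that opened it, which is the "passed around" concern raised above.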

Re: "use v5.36.0" should imply UTF-8 encoded source

grinnz at gmail

Jul 30, 2021, 11:43 AM

Post #13 of 66 (1816 views)

On Fri, Jul 30, 2021 at 2:28 PM Leon Timmermans <fawaka@gmail.com> wrote:

> On Fri, Jul 30, 2021 at 7:56 PM Felipe Gasper <felipe@felipegasper.com>
> wrote:
>
>>
>>
>> > On Jul 30, 2021, at 1:48 PM, Leon Timmermans <fawaka@gmail.com> wrote:
>> >
>> > On Fri, Jul 30, 2021 at 6:56 PM Felipe Gasper <felipe@felipegasper.com>
>> wrote:
>> > FWIW, I think this will regress Perl’s usability.
>> >
>> > Probably the worst part about character encoding in Perl is that
>> nothing indicates when you’ve over-encoded or under-encoded. But, at the
>> very least everything right now is consistent by default: source code is
>> parsed as bytes (“Latin-1”), and I/O happens as bytes. Thus, a
>> “minimal-effort” approach to writing Perl will at least minimize the odds
>> of encoding mismatches: you only run into trouble if you explicitly
>> decode/encode.
>> >
>> > If `use v5.36` is to disrupt that consistency by making source code
>> UTF-8-decoded but *leaving* I/O as bytes, this seems likely to add another
>> “shin-bumper” to use of Perl that doesn’t happen in languages that type
>> byte strings differently from text strings.
>> >
>> > So quick-and-simple things like `print "é"` will now, in “modern” Perl,
>> break, with no indication of where/why until a human being comes along,
>> notices the problem, and puts in the time to debug it.
>> >
>> > It doesn't actually break. PerlIO will try to downgrade that for a
>> non-:utf8 handle, or upgrade for a :utf8 handle.
>>
>> It’ll downgrade it, but it won’t encode it, so you’ll get mojibake:
>>
>> > perl -Mutf8 -e'print "é"'
>> ?
>>
>
> It will print mojibake as well if the script is latin-1 encoded. It's
> mojibake because the terminal is utf-8, but the IO handle is latin1.
>

The difference is in how many people are affected, by orders of magnitude:
far fewer would accidentally run a latin1 script on a utf8 terminal than
would run a utf8 script on a utf8 terminal with "use utf8" and not
understand that they have to encode the output.

-Dan

Re: "use v5.36.0" should imply UTF-8 encoded source

felipe at felipegasper

Jul 30, 2021, 11:46 AM

Post #14 of 66 (1816 views)

> On Jul 30, 2021, at 2:27 PM, Leon Timmermans <fawaka@gmail.com> wrote:
>
> On Fri, Jul 30, 2021 at 7:56 PM Felipe Gasper <felipe@felipegasper.com> wrote:
>
>
> > On Jul 30, 2021, at 1:48 PM, Leon Timmermans <fawaka@gmail.com> wrote:
> >
> > On Fri, Jul 30, 2021 at 6:56 PM Felipe Gasper <felipe@felipegasper.com> wrote:
> > FWIW, I think this will regress Perl’s usability.
> >
> > Probably the worst part about character encoding in Perl is that nothing indicates when you’ve over-encoded or under-encoded. But, at the very least everything right now is consistent by default: source code is parsed as bytes (“Latin-1”), and I/O happens as bytes. Thus, a “minimal-effort” approach to writing Perl will at least minimize the odds of encoding mismatches: you only run into trouble if you explicitly decode/encode.
> >
> > If `use v5.36` is to disrupt that consistency by making source code UTF-8-decoded but *leaving* I/O as bytes, this seems likely to add another “shin-bumper” to use of Perl that doesn’t happen in languages that type byte strings differently from text strings.
> >
> > So quick-and-simple things like `print "é"` will now, in “modern” Perl, break, with no indication of where/why until a human being comes along, notices the problem, and puts in the time to debug it.
> >
> > It doesn't actually break. PerlIO will try to downgrade that for a non-:utf8 handle, or upgrade for a :utf8 handle.
>
> It’ll downgrade it, but it won’t encode it, so you’ll get mojibake:
>
> > perl -Mutf8 -e'print "é"'
> ?
>
> It will print mojibake as well if the script is latin-1 encoded. It's mojibake because the terminal is utf-8, but the IO handle is latin1.

FWIW I think it’s easier to think of the default I/O mode as “bytes” or “native 8-bit encoding” rather than “Latin-1”. In that light it’s easier to see the status quo as the more reasonable default: we parse the code as bytes, and we print as bytes.

Changing it so that the (“modern”) default is to decode strings as UTF-8 but still output them as bytes seems likely to introduce lots of confusion, which will either a) discourage adoption of “use v5.36”, or b) discourage use of Perl at all:

Anti-Perler: Hey that new Perl script you wrote mangles our CEO’s name.
Perler: That’s weird … I used the modern defaults … wonder where the bug is …
Anti-Perler: Maybe you should just switch to $otherlang, where this stuff doesn’t happen.

-F

Re: "use v5.36.0" should imply UTF-8 encoded source

darren at darrenduncan

Jul 30, 2021, 6:34 PM

Post #15 of 66 (1816 views)

On 2021-07-30 7:45 a.m., Ricardo Signes wrote:
> I propose that "use v5.36.0" should imply that the source code is, subsequently,
> UTF-8 encoded.
>
> We're on the cusp of merging warnings in. Next, we merge in utf8. This
> shouldn't break existing programs, only programs that opt to change behavior by
> adding v5.36.0.
>
> With that, the boilerplate could be:
>
> use v5.36.0;
>
> This doesn't need to load utf8.pm, and could just alter $^H, but: whatever.

This gets a +1 from me.

In theory this could be a problem if the source file isn't actually UTF-8
encoded and someone adding that new boilerplate didn't realize this particular
effect.

One thing we could do to help mitigate this is that Perl upon seeing that
boilerplate will do a strict verification of the source file that it is indeed
valid UTF-8 and die with a parsing error if it is not.

I don't know whether "use utf8;" is already strict like that or whether it
instead uses substitution characters or something, but "use v5.36.0;" can be.

On 2021-07-30 11:46 a.m., Felipe Gasper wrote:
> Changing it so that the (“modern”) default is to decode strings as UTF-8 but still output them as bytes seems likely to introduce lots of confusion, which will either a) discourage adoption of “use v5.36”, or b) discourage use of Perl at all:

I don't see a problem here, especially if my strict mode proposal is used.

I see that the encoding of how a program source is interpreted is completely
separate and unrelated to the encoding of other filehandle operations.

It seems entirely appropriate for the source to be taken as UTF-8 but other
filehandles still default to bytes.

-- Darren Duncan
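
The strict check suggested above could look roughly like the following when done outside the interpreter. This is a sketch of the idea only, not how perl validates source; `is_valid_utf8` is a hypothetical helper built on the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return 1 if $bytes is well-formed UTF-8 under Encode's strict
# "UTF-8" encoding, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() with a CHECK value may modify its argument
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# UTF-8 bytes for "naïve" pass; the Latin-1 bytes for the same word do not.
print is_valid_utf8("na\xC3\xAFve") ? "valid\n" : "invalid\n";   # valid
print is_valid_utf8("na\xEFve")     ? "valid\n" : "invalid\n";   # invalid
```

A pure-ASCII file always passes such a check, which fits the observation that ASCII is a subset of UTF-8.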

Re: "use v5.36.0" should imply UTF-8 encoded source

Eirik-Berg.Hanssen at allverden

Jul 30, 2021, 7:15 PM

Post #16 of 66 (1816 views)

On Fri, Jul 30, 2021 at 8:28 PM Leon Timmermans <fawaka@gmail.com> wrote:

> On Fri, Jul 30, 2021 at 7:56 PM Felipe Gasper <felipe@felipegasper.com>
> wrote:
>
>>
>> It’ll downgrade it, but it won’t encode it, so you’ll get mojibake:
>>
>> > perl -Mutf8 -e'print "é"'
>> ?
>>
>
> It will print mojibake as well if the script is latin-1 encoded. It's
> mojibake because the terminal is utf-8, but the IO handle is latin1.
>

In this case there is no "script" other than the command line, in the
terminal. Round-tripping characters from the terminal to the terminal,
broken. Sounds painful.

I'd expect the encoding to be the same for the code as for the standard
handles, unless either is otherwise specified. It would surprise me if a
simple perl -E broke that.

I'm leaning towards thinking that, while there's no problem with lexical,
explicit declarations of source encodings, the default source encoding is
more of a global thing, and to avoid nasty surprises, ought to correspond
to the default encoding of the standard handles.

Eirik

Re: "use v5.36.0" should imply UTF-8 encoded source

grinnz at gmail

Jul 30, 2021, 7:28 PM

Post #17 of 66 (1816 views)

On Fri, Jul 30, 2021 at 10:15 PM Eirik Berg Hanssen <
Eirik-Berg.Hanssen@allverden.no> wrote:

> On Fri, Jul 30, 2021 at 8:28 PM Leon Timmermans <fawaka@gmail.com> wrote:
>
>> On Fri, Jul 30, 2021 at 7:56 PM Felipe Gasper <felipe@felipegasper.com>
>> wrote:
>>
>>>
>>> It’ll downgrade it, but it won’t encode it, so you’ll get mojibake:
>>>
>>> > perl -Mutf8 -e'print "é"'
>>> ?
>>>
>>
>> It will print mojibake as well if the script is latin-1 encoded. It's
>> mojibake because the terminal is utf-8, but the IO handle is latin1.
>>
>
> In this case there is no "script" other than the command line, in the
> terminal. Round-tripping characters from the terminal to the terminal,
> broken. Sounds painful.
>
> I'd expect the encoding to be the same for the code as for the standard
> handles, unless either is otherwise specified. It would surprise me if a
> simple perl -E broke that.
>
> I'm leaning towards thinking that, while there's no problem with
> lexical, explicit declarations of source encodings, the default source
> encoding is more of a global thing, and to avoid nasty surprises, ought to
> correspond to the default encoding of the standard handles.
>

This isn't the "default", it's the entire function of "use utf8" and only
applies to that lexical scope.

-Dan

Re: "use v5.36.0" should imply UTF-8 encoded source

Eirik-Berg.Hanssen at allverden

Jul 30, 2021, 8:03 PM

Post #18 of 66 (1816 views)

On Sat, Jul 31, 2021 at 4:29 AM Dan Book <grinnz@gmail.com> wrote:

> On Fri, Jul 30, 2021 at 10:15 PM Eirik Berg Hanssen <
> Eirik-Berg.Hanssen@allverden.no> wrote:
>
>> On Fri, Jul 30, 2021 at 8:28 PM Leon Timmermans <fawaka@gmail.com> wrote:
>>
>>> On Fri, Jul 30, 2021 at 7:56 PM Felipe Gasper <felipe@felipegasper.com>
>>> wrote:
>>>
>>>>
>>>> It’ll downgrade it, but it won’t encode it, so you’ll get mojibake:
>>>>
>>>> > perl -Mutf8 -e'print "é"'
>>>> ?
>>>>
>>>
>>> It will print mojibake as well if the script is latin-1 encoded. It's
>>> mojibake because the terminal is utf-8, but the IO handle is latin1.
>>>
>>
>> In this case there is no "script" other than the command line, in the
>> terminal. Round-tripping characters from the terminal to the terminal,
>> broken. Sounds painful.
>>
>> I'd expect the encoding to be the same for the code as for the standard
>> handles, unless either is otherwise specified. It would surprise me if a
>> simple perl -E broke that.
>>
>> I'm leaning towards thinking that, while there's no problem with
>> lexical, explicit declarations of source encodings, the default source
>> encoding is more of a global thing, and to avoid nasty surprises, ought to
>> correspond to the default encoding of the standard handles.
>>
>
> This isn't the "default", it's the entire function of "use utf8" and only
> applies to that lexical scope.
>

It is the "default" in the sense of "what you get in the absence of
explicit declarations like use utf8 and no utf8".

(Is there a better word?)

Eirik

Re: "use v5.36.0" should imply UTF-8 encoded source

grinnz at gmail

Jul 30, 2021, 8:06 PM

Post #19 of 66 (1816 views)

On Fri, Jul 30, 2021 at 11:03 PM Eirik Berg Hanssen <
Eirik-Berg.Hanssen@allverden.no> wrote:

> On Sat, Jul 31, 2021 at 4:29 AM Dan Book <grinnz@gmail.com> wrote:
>
>> On Fri, Jul 30, 2021 at 10:15 PM Eirik Berg Hanssen <
>> Eirik-Berg.Hanssen@allverden.no> wrote:
>>
>>> On Fri, Jul 30, 2021 at 8:28 PM Leon Timmermans <fawaka@gmail.com>
>>> wrote:
>>>
>>>> On Fri, Jul 30, 2021 at 7:56 PM Felipe Gasper <felipe@felipegasper.com>
>>>> wrote:
>>>>
>>>>>
>>>>> It’ll downgrade it, but it won’t encode it, so you’ll get mojibake:
>>>>>
>>>>> > perl -Mutf8 -e'print "é"'
>>>>> ?
>>>>>
>>>>
>>>> It will print mojibake as well if the script is latin-1 encoded. It's
>>>> mojibake because the terminal is utf-8, but the IO handle is latin1.
>>>>
>>>
>>> In this case there is no "script" other than the command line, in the
>>> terminal. Round-tripping characters from the terminal to the terminal,
>>> broken. Sounds painful.
>>>
>>> I'd expect the encoding to be the same for the code as for the
>>> standard handles, unless either is otherwise specified. It would surprise
>>> me if a simple perl -E broke that.
>>>
>>> I'm leaning towards thinking that, while there's no problem with
>>> lexical, explicit declarations of source encodings, the default source
>>> encoding is more of a global thing, and to avoid nasty surprises, ought to
>>> correspond to the default encoding of the standard handles.
>>>
>>
>> This isn't the "default", it's the entire function of "use utf8" and only
>> applies to that lexical scope.
>>
>
> It is the "default" in the sense of "what you get in the absence of
> explicit declarations like use utf8 and no utf8".
>
> (Is there a better word?)
>

In the absence of explicit declarations, the source code is bytes, the same
as the standard handles; so I'm not sure what your point is.

-Dan

Re: "use v5.36.0" should imply UTF-8 encoded source

felipe at felipegasper

Jul 30, 2021, 8:08 PM

Post #20 of 66 (1816 views)

> On Jul 30, 2021, at 22:29, Dan Book <grinnz@gmail.com> wrote:
>
>> On Fri, Jul 30, 2021 at 10:15 PM Eirik Berg Hanssen <Eirik-Berg.Hanssen@allverden.no> wrote:
>
>>> On Fri, Jul 30, 2021 at 8:28 PM Leon Timmermans <fawaka@gmail.com> wrote:
>>
>>>> On Fri, Jul 30, 2021 at 7:56 PM Felipe Gasper <felipe@felipegasper.com> wrote:
>>>>
>>>> It’ll downgrade it, but it won’t encode it, so you’ll get mojibake:
>>>>
>>>> > perl -Mutf8 -e'print "é"'
>>>> ?
>>>
>>> It will print mojibake as well if the script is latin-1 encoded. It's mojibake because the terminal is utf-8, but the IO handle is latin1.
>>
>> In this case there is no "script" other than the command line, in the terminal. Round-tripping characters from the terminal to the terminal, broken. Sounds painful.
>>
>> I'd expect the encoding to be the same for the code as for the standard handles, unless either is otherwise specified. It would surprise me if a simple perl -E broke that.
>>
>> I'm leaning towards thinking that, while there's no problem with lexical, explicit declarations of source encodings, the default source encoding is more of a global thing, and to avoid nasty surprises, ought to correspond to the default encoding of the standard handles.
>
> This isn't the "default", it's the entire function of "use utf8" and only applies to that lexical scope.

It’s not “default”, but feature bundles are a sort of “quasi-default”, as in, “modern Perl starts with a feature bundle.” They usually just enable useful stuff, breaking “outlier”-type things (like custom functions named say()).

If “modern Perl” heightens the requirement to juggle text vs. byte strings, providing no lifeline to help distinguish one from the other, that seems to me to make the language harder to use, not easier.

-F

Re: "use v5.36.0" should imply UTF-8 encoded source

kimoto.yuki at gmail

Jul 30, 2021, 11:15 PM

Post #21 of 66 (1816 views)

2021-7-30 23:46 Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

> Porters,
>
> I propose that "use v5.36.0" should imply that the source code is,
> subsequently, UTF-8 encoded.
>
> Currently, I advise the following boilerplate:
>
> use v5.34.0;
> use warnings;
> use utf8;
>
>
> We're on the cusp of merging warnings in. Next, we merge in utf8. This
> shouldn't break existing programs, only programs that opt to change
> behavior by adding v5.36.0.
>
> With that, the boilerplate could be:
>
> use v5.36.0;
>
>
> This doesn't need to load utf8.pm, and could just alter $^H, but:
> whatever.
>
> --
> rjbs
>

At least after v5.38+.

It is good to change one by one.

I want to see the effect and hear the user experience of "use warnings" in
the next release.

My intuition is that there is a lot of code that hasn't been UTF-8 yet.

Over the next year or two, we should send messages to end users a lot about
using "use utf8" and writing source code in UTF-8.

Re: "use v5.36.0" should imply UTF-8 encoded source

darren at darrenduncan

Jul 31, 2021, 12:17 AM

Post #22 of 66 (1816 views)

On 2021-07-30 11:15 p.m., Yuki Kimoto wrote:
> 2021-7-30 23:46 Ricardo Signes wrote:
> I propose that "use v5.36.0" should imply that the source code is,
> subsequently, UTF-8 encoded.
>
> At least after v5.38+.
>
> It is good to change one by one.
>
> I want to see the effect and hear the user experience of "use warnings" in the
> next release.

I strongly disagree. Warnings and utf8 are unrelated features, and each is a
minor change considering they are lexical. Perl interpreter development is
already moving at a relatively glacial pace; there is no benefit and a lot of
downside to delaying utf8 for a year just to see what people say after a
production release with warnings comes out. 5.36 is still about 9 months away,
which is plenty of time for people to give feedback on either that or the
warnings.

> My intuition is that there is a lot of code that hasn't been UTF-8 yet.

A tiny minority most likely. My intuition is that the vast majority of code is
already UTF-8, either because it is plain ASCII, which is a proper subset of
UTF-8, or it was written in the last 15+ years when the gradually increasing
amount of non-ASCII literals would have been done in the Unicode way.
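
To make the ASCII-subset point concrete, here is a minimal sketch (not from the thread) that compiles the same source bytes with and without "use utf8" and compares the length of a string literal. The hash and variable names are made up for illustration, and it assumes the unicode_eval feature is not enabled in the enclosing scope, since that feature makes string eval ignore "use utf8" declarations.

```perl
use strict;
use warnings;
# Note: no "use v5.16" or later here. That would enable the unicode_eval
# feature, under which string eval ignores "use utf8" declarations.

# Source snippets as raw bytes. "\xC3\xA9" is the UTF-8 encoding of U+00E9.
my %source = (
    ascii => q{length("cafe")},
    utf8  => qq{length("caf\xC3\xA9")},
);

my %result;
for my $name (sort keys %source) {
    my $bytes_len = eval "no utf8; $source{$name}";   # literal read as bytes
    my $chars_len = eval "use utf8; $source{$name}";  # literal read as UTF-8
    $result{$name} = [ $bytes_len, $chars_len ];
    print "$name: without utf8 = $bytes_len, with utf8 = $chars_len\n";
}
```

The pure-ASCII snippet should report the same length both ways, which is why such code would be unaffected by the proposal; only the snippet with a non-ASCII literal changes meaning.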

> Over the next year or two, we should send messages to end users a lot about
> using "use utf8" and writing source code in UTF-8.

We should definitely message a lot, but if folding in the utf8 is reasonable as
a feature, we should know this in time to include it in 5.36 or never.

-- Darren Duncan

Re: "use v5.36.0" should imply UTF-8 encoded source [ In reply to ]

kimoto.yuki at gmail

Jul 31, 2021, 1:16 AM

Post #23 of 66 (1816 views)

2021-7-31 16:17 Darren Duncan <darren@darrenduncan.net>:

> On 2021-07-30 11:15 p.m., Yuki Kimoto wrote:
> > 2021-7-30 23:46 Ricardo Signes wrote:
> > I propose that "use v5.36.0" should imply that the source code is,
> > subsequently, UTF-8 encoded.
> >
> > At least after v5.38+.
> >
> > It is good to change one by one.
> >
> > I want to see the effect and hear the user experience of "use warnings"
> > in the next release.
>
> I strongly disagree. The warnings and utf8 are unrelated features. These
> are each also minor changes considering they are lexical. Perl interpreter
> development is already moving at a relatively glacial pace, there is no
> benefit and a lot of downside of delaying the utf8 for a year just to see
> what people say after a production with warnings is released. The 5.36 is
> still about 9 months away, that is plenty of time for people to give
> feedback on either that or the warnings.
>
>
On this mailing list, Felipe has a strong interest in this topic and
disagrees.

I personally think that the "use utf8;" and the internal representation of
Perl strings are independent things.

However, I feel that we need a little more time to think about it.

We need a conversation where Felipe isn't overwhelmed by the opinions of
others.

Re: "use v5.36.0" should imply UTF-8 encoded source [ In reply to ]

leonerd at leonerd

Jul 31, 2021, 2:23 AM

Post #24 of 66 (1816 views)

On Fri, 30 Jul 2021 18:34:23 -0700
Darren Duncan <darren@darrenduncan.net> wrote:

> In theory this could be a problem if the source file isn't actually
> UTF-8 encoded and someone adding that new boilerplate didn't realize
> this particular effect.

I think there's a wider point to be made here.

Up until recently, `use VERSION` didn't really have much interesting
effect besides declaring what version of perl was required, so perhaps
authors have got used to being able to fairly trivially change the
number they put there.

Currently, it activates numbered feature bundles and turns on `use
strict`; we're about to have it turn on `warnings`, discussing `utf8`,
and I can see a future in which we gain more `strict` flags that one
day get turned on too. All of this suggests that people should be more
careful when updating the `use VERSION` declaration of existing code.
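
As a small illustration of why bumping the number is not a no-op, the following sketch (not from the thread) compiles the same assignment to an undeclared variable under two different `use VERSION` lines. The helper name `compile_error` and the variable names are invented for this example; it relies on the documented behaviour that `use v5.12` or later enables `strict`, and it assumes the file itself has no explicit `use strict` or `no strict`.

```perl
# Deliberately no "use strict" in this file, so that each string eval
# starts from Perl's default (non-strict) state.

# Returns the error for a snippet, or '' if it compiled and ran.
sub compile_error {
    my ($code) = @_;
    my $ok  = eval "$code; 1";
    my $err = $@;
    return $ok ? '' : $err;
}

my $assign = '$undeclared_total = 1';   # package variable, never declared

my $err_510 = compile_error("use v5.10; $assign");
my $err_512 = compile_error("use v5.12; $assign");

print "use v5.10: ", ($err_510 ? "rejected" : "accepted"), "\n";
print "use v5.12: ", ($err_512 ? "rejected" : "accepted"), "\n";
```

The only difference between the two snippets is the version number, yet the second one should fail to compile because strictures are now in effect.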

That's not to say that we, perl core, need to do anything about that,
beyond making some better messaging. We need to convey the point that
when writing new code, feel free to pick a nicely late version that has
whatever features you need, but be more careful when updating older
code to bump that version number upwards as it may have unintended
effects.

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/

Re: "use v5.36.0" should imply UTF-8 encoded source [ In reply to ]

darren at darrenduncan

Jul 31, 2021, 12:17 PM

Post #25 of 66 (1816 views)

On 2021-07-31 1:16 a.m., Yuki Kimoto wrote:
> 2021-7-31 16:17 Darren Duncan:
>
> On 2021-07-30 11:15 p.m., Yuki Kimoto wrote:
> > 2021-7-30 23:46 Ricardo Signes wrote:
> > I propose that "use v5.36.0" should imply that the source code is,
> > subsequently, UTF-8 encoded.
> >
> > At least after v5.38+.
> >
> > It is good to change one by one.
> >
> > I want to see the effect and hear the user experience of "use warnings"
> > in the next release.
>
> I strongly disagree. The warnings and utf8 are unrelated features. These are
> each also minor changes considering they are lexical. Perl interpreter
> development is already moving at a relatively glacial pace, there is no benefit
> and a lot of downside of delaying the utf8 for a year just to see what people
> say after a production with warnings is released. The 5.36 is still about 9
> months away, that is plenty of time for people to give feedback on either that
> or the warnings.
>
> On this mailing list, Felipe has a strong interest in this topic and disagrees.
>
> I personally think that the "use utf8;" and the internal representation of Perl
> strings are independent things.
>
> However, I feel that We need a little more time to think about it.
>
> We need a conversation where Felipe isn't overwhelmed by the opinions of others.

I agree that we need that conversation.

However I also agree that there is plenty of time to properly have that
discussion, such that we would know whether or not it is a good idea for "use
vN;" to do something about UTF-8, so that if we agree it is a good idea, we can
implement it for "v5.36", rather than putting it off to "v5.38".

What I object to is not properly starting the discussion until after Perl 5.36
is released stable, as you seemed to be saying. If it was already early 2022 by
the time the subject came up, that's one thing, but it's only July of 2021.

Now conversely, I don't have a problem with actually waiting until v5.38 to
fully implement the change IF 5.36 contained some kind of precursor to prepare
the way. For example, 5.36 could issue warnings for code with a "use v5.36"
that wasn't valid UTF-8, saying that this code might parse differently under
"use v5.38". That would let users know in a transitional version what might be
a problem before it is.
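
A rough sketch of what such a transitional check could look like, written as an ordinary script rather than as anything in the perl core. The function `utf8_transition_warning` and its message text are hypothetical, and the version match is deliberately simplified to a few spellings of 5.36.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper, not part of perl: returns a warning message for source
# bytes that declare "use v5.36" but are not well-formed UTF-8, else nothing.
sub utf8_transition_warning {
    my ($bytes) = @_;

    # Simplified match for "use v5.36", "use v5.36.0" or "use 5.036".
    return unless $bytes =~ /^\s*use\s+(?:v5\.36(?:\.\d+)?|5\.036)\s*;/m;

    # FB_CROAK makes decode die on malformed input; LEAVE_SRC keeps $bytes intact.
    my $valid = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return if $valid;

    return 'source declares "use v5.36" but is not valid UTF-8; '
         . 'it might parse differently under a later "use VERSION"';
}

my $latin1 = qq{use v5.36.0;\nmy \$name = "caf\xE9";\n};      # lone 0xE9 byte
my $utf8   = qq{use v5.36.0;\nmy \$name = "caf\xC3\xA9";\n};  # U+00E9 as UTF-8

for my $src ($latin1, $utf8) {
    my $msg = utf8_transition_warning($src);
    print defined $msg ? "warn: $msg\n" : "ok\n";
}
```

Pure-ASCII files pass this check trivially, since ASCII is valid UTF-8, so only sources with non-UTF-8 high-bit bytes would be flagged.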

But the key thing is we would have already had some kind of proper discussion on
where we want to end up, and it's just a two-phase rollout. But I feel that
waiting until a bundled use warnings is deployed before even asking about utf8
is wrong.

-- Darren Duncan