
tightening up source code encoding semantics
Porters,

This is a long email which ends up with me mostly spitballing an idea or two about how to improve our handling of source code encoding. Sorry?

I've been talking with Karl about source::encoding, utf8, and related topics. We got talking about whether "no source::encoding" made sense. Meanwhile, Paul was posting about disallowing downgrade from utf8. Then Karl asked about bytes.pm.

I think the whole situation could do with another round of "Yeah, but what would the best world be?" I will start by saying, "In the best world, bytes.pm would not exist." But it does, and I think we can generally allow it to continue to … do what it does. I will not refer to bytes.pm again in this email.

The big question is, how are we to allow Perl source code to be encoded? I think there are a few options worth mentioning:
* ASCII only, all non-ASCII must be represented by escape sequences
* UTF-8 only, all non-ASCII data must be represented by escape sequences
* bytes only, all bytes read from source become the corresponding codepoint in the source (this is sometimes described as "It's Latin-1 by default", which has been a contentious claim over the years)
* a mixture of bytes and UTF-8
The option we've given, for years, is the last one. We start in bytes mode. "use utf8" indicates that the source document is in UTF-8. When utf8 leaves effect, either because its scope ends or because of "no utf8", we return to bytes mode. This is pretty terrible, in my opinion. What's one's editor to make of this?
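For illustration, a minimal sketch of that scoping behaviour, assuming the
file itself is saved as UTF-8 (so "ß" is the two octets C3 9F on disk):

use strict;
use warnings;
use feature 'say';

{
    use utf8;           # source octets in this scope are decoded as UTF-8
    say length "GROß";  # 4 -- C3 9F become the single character U+00DF
}
{
    no utf8;            # back to bytes mode: each source octet is a codepoint
    say length "GROß";  # 5 -- C3 and 9F are taken as two separate characters
}

The editor, of course, sees the same octets in both blocks.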

If we imagine that the reader can correctly swap between reading bytes and UTF-8 at scope boundaries (which I think I've seen recent evidence that it cannot reliably do), this may be a technically sustainable position. I think it's a *bad* position, though.

"The source is bytes" is a bad position and always has been one, with the *possible* exception of string literals. Unfortunately, we have relatively terrible failure modes around non-ASCII outside of string literals.

*Program:*
<<GROß;
foo
GROß

*Output:*
Can't find string terminator "GRO" anywhere before EOF at - line 1.

I think what we really want is to say *either* "This program has stupid legacy behavior" *or* "this program is encoded in UTF-8". Then we want to strongly, *strongly* encourage the second option. You may want to cry out, now, "I thought you said months ago that we wouldn't force everyone to use UTF-8 encoded source!" I am not quite contradicting myself.

Remember, fellow porter, that ASCII encoded data is a subset of UTF-8 encoded data. Once the source is declared to be in UTF-8, it's much less of a problem to say "specifically, entirely codepoints 0-127 except in scopes where that restriction is lifted." I think the problem with "no utf8" is not that it lets you disallow Japanese text, but that it switches back to bytes mode.

The whole thing makes me think that we want source::encoding (or something like it) to say "this document is UTF-8" and optionally "but only ASCII characters." Once that's said, it can't be undone. There is no "no source::encoding", only a switch to ASCII or not. Ideally, this would be the natural state of the program, but given the "the boilerplate should be a single line" doctrine, I think this is what we want implied by "use v5.x".

This gets us back to the "use v5.x should imply ascii encoding", but further to, "and you can't switch it off". I'd say something like:
* you must declare source encoding before any non-ASCII byte is encountered
* you must declare source encoding at the outermost lexical scope in a file, if you are to declare it at all
--
rjbs
Re: tightening up source code encoding semantics
On Mon, Feb 21, 2022 at 9:56 PM Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

> Porters,
>
> This is a long email which ends up with me mostly spitballing an idea or
> two about how to improve our handling of source code encoding. Sorry?
>
> I've been talking with Karl about source::encoding, utf8, and related
> topics. We got talking about whether "no source::encoding" made sense.
> Meanwhile, Paul was posting about disallowing downgrade from utf8. Then
> Karl asked about bytes.pm.
>
> I think the whole situation could do with another round of "Yeah, but what
> would the best world be?" I will start by saying, "In the best world,
> bytes.pm would not exist." But it does, and I think we can generally
> allow it to continue to … do what it does. I will not refer to bytes.pm
> again in this email.
>
> The big question is, how are we to allow Perl source code to be encoded?
> I think there are a few options worth mentioning:
>
> - ASCII only, all non-ASCII must be represented by escape sequences
> - UTF-8 only, all non-ASCII data must be represented by escape
> sequences
> - bytes only, all bytes read from source become the corresponding
> codepoint in the source (this is sometimes described as "It's Latin-1 by
> default", which has been a contentious claim over the years)
> - a mixture of bytes and UTF-8
>
>
Is the second one meant to just be "UTF-8 only" with no caveats? UTF-8
without non-ASCII data is ... just ASCII.

-Dan
Re: tightening up source code encoding semantics
On 22-02-2022 at 03:55, Ricardo Signes wrote:

> The whole thing makes me think that we want source::encoding (or
> something like it) to say "this document is UTF-8" and optionally "but
> only ASCII characters."  Once that's said, it can't be undone.  There
> is no "no source::encoding", only a switch to ASCII or not.  Ideally,
> this would be the natural state of the program, but given the "the
> boilerplate should be a single line" doctrine, I think this is what we
> want implied by "use v5.x".
>

I would not call it "UTF8 but only ASCII characters". That is confusing
as hell. Call it what it is, ASCII.


> you must declare source encoding before any non-ASCII byte is encountered
>
> * you must declare source encoding at the outermost lexical scope in
> a file, if you are to declare it at all
>
>

This makes so much sense. Would it make sense to continue the current
behaviour, but print a deprecation warning when non-ASCII is encountered
without an explicit or implicit 'use utf8'? (implicit, because this
opens the way for instance for use v5.040 to imply use utf8).


But how much existing code does it break? That is still the big problem
with this proposal: it breaks currently perfectly valid Perl code. Or
did I miss something?


M4
Re: tightening up source code encoding semantics
Hi there,

On Mon, 21 Feb 2022, Ricardo Signes wrote:

> ... Once the source is declared to be in UTF-8, it's much less of a
> problem to say "specifically, entirely codepoints 0-127 except in
> scopes where that restriction is lifted."

I'd be pleased to be able to do that. Maybe something like the

use stricter;

that I mentioned?

> ... I think the problem with "no utf8" is not that it lets you
> disallow Japanese text, but that it switches back to bytes mode.

That's awful.

> ...
> This gets us back to the "use v5.x should imply ascii encoding", but
> further to, "and you can't switch it off" ... something like:
>
> * you must declare source encoding before any non-ASCII byte is encountered

I could live with that. I'd be happy to.

> * you must declare source encoding at the outermost lexical scope in
> a file, if you are to declare it at all

I haven't thought through implications but I think I like that too.

--

73,
Ged.
Re: tightening up source code encoding semantics
> On Feb 21, 2022, at 21:55, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>
> The whole thing makes me think that we want source::encoding (or something like it) to say "this document is UTF-8" and optionally "but only ASCII characters." Once that's said, it can't be undone. There is no "no source::encoding", only a switch to ASCII or not. Ideally, this would be the natural state of the program, but given the "the boilerplate should be a single line" doctrine, I think this is what we want implied by "use v5.x".

IMO “modern Perl” should require source code to be valid UTF-8, full stop. There seems little reason why Rik’s “GROß” example should fail.

Auto-decoding of string literals is a separate, more problematic question. Valid use cases exist either way. If Perl reliably differentiated between decoded/text and non-decoded/byte strings, auto-decode would be a sensible default, but that’s not where we are. As I wrote months ago, auto-decode makes `print "hello"` subtly wrong, which will frustrate a neophyte’s already-thorny first encounter with character encoding in Perl.

For context: cPanel’s internal rule is “all strings are byte strings unless you really need text”. It’s rare that we need Unicode semantics, and forgoing both decode and encode steps all but eliminates that class of bugs for us. (FWIW I think this would actually serve many Perl applications besides cPanel better than the decode/encode workflow.)

Requiring source code to be valid UTF-8, but *not* auto-decoding literals, would solve the “GROß” problem while still avoiding the print-hello-is-subtly-wrong awkwardness.
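To make the contrast concrete, a hedged sketch, assuming a UTF-8-saved
file, a UTF-8 terminal, and no :encoding layer on STDOUT:

use warnings;
use feature 'say';

{
    # With auto-decoded literals (what "use utf8" does today):
    use utf8;
    say length "GROß";  # 4 -- a character string
    say "☃";            # "Wide character in say" warning on a handle with no
                        # :encoding layer; the kind of surprise described above
}

# Under "source must be valid UTF-8, but literals are not auto-decoded",
# the same literal stays octets, as it does today without "use utf8":
say length "GROß";      # 5 -- G, R, O, 0xC3, 0x9F
say "GROß";             # the raw octets round-trip unchanged to the terminal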

-F
Re: tightening up source code encoding semantics
On Tue, 22 Feb 2022 08:45:52 -0500, Felipe Gasper <felipe@felipegasper.com> wrote:

> > On Feb 21, 2022, at 21:55, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
> >
> > The whole thing makes me think that we want source::encoding (or
> > something like it) to say "this document is UTF-8" and optionally
> > "but only ASCII characters." Once that's said, it can't be undone.
> > There is no "no source::encoding", only a switch to ASCII or not.
> > Ideally, this would be the natural state of the program, but given
> > the "the boilerplate should be a single line" doctrine, I think
> > this is what we want implied by "use v5.x".
>
> IMO “modern Perl” should require source code to be valid UTF-8, full
> stop.

Respectfully disagree. There are still plenty of environments where
entering UTF-8 in source files is a real problem, and entering
ISO-8859-1 is easy.

> There seems little reason why Rik’s “GROß” example should fail.

What if the .pl is in iso-8859? Currently it fails just like Rik's example

$ cat test.pl
#!env perl

use 5.18.3;
use warnings;

say <<GROß;
Hello Iso-8859-1
GROß

$ dump test.pl
[DUMP 0.6.01]

00000000 23 21 2F 70 72 6F 2F 62 69 6E 2F 70 65 72 6C 0A #!/pro/bin/perl.
00000010 0A 75 73 65 20 35 2E 31 38 2E 33 3B 0A 75 73 65 .use 5.18.3;.use
00000020 20 77 61 72 6E 69 6E 67 73 3B 0A 0A 73 61 79 20 warnings;..say
00000030 3C 3C 47 52 4F DF 3B 0A 48 65 6C 6C 6F 20 49 73 <<GRO.;.Hello Is
00000040 6F 2D 38 38 35 39 2D 31 0A 47 52 4F DF 0A o-8859-1.GRO..

$ perl test.pl
Can't find string terminator "GRO" anywhere before EOF at test.pl line 6.


Anyway, heredoc separators should be quoted (your opinion may differ)

$ cat test.pl
#!env perl

use 5.18.3;
use warnings;

say <<"GROß";
Hello Iso-8859-1
GROß

$ dump test.pl
[DUMP 0.6.01]

00000000 23 21 2F 70 72 6F 2F 62 69 6E 2F 70 65 72 6C 0A #!/pro/bin/perl.
00000010 0A 75 73 65 20 35 2E 31 38 2E 33 3B 0A 75 73 65 .use 5.18.3;.use
00000020 20 77 61 72 6E 69 6E 67 73 3B 0A 0A 73 61 79 20 warnings;..say
00000030 3C 3C 22 47 52 4F DF 22 3B 0A 48 65 6C 6C 6F 20 <<"GRO.";.Hello
00000040 49 73 6F 2D 38 38 35 39 2D 31 0A 47 52 4F DF 0A Iso-8859-1.GRO..

$ perl test.pl
Hello Iso-8859-1


That all said, I would not object to moving to UTF-8 as in almost every
case where I would use this, "\x{..}", "\x{....}", and "\N{.....}"
would be the correct approach

--
H.Merijn Brand https://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.33 porting perl5 on HP-UX, AIX, and Linux
https://tux.nl/email.html http://qa.perl.org https://www.test-smoke.org
Re: tightening up source code encoding semantics
> On Feb 22, 2022, at 09:00, H.Merijn Brand <perl5@tux.freedom.nl> wrote:
>
> On Tue, 22 Feb 2022 08:45:52 -0500, Felipe Gasper <felipe@felipegasper.com> wrote:
>
>>> On Feb 21, 2022, at 21:55, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>>>
>>> The whole thing makes me think that we want source::encoding (or
>>> something like it) to say "this document is UTF-8" and optionally
>>> "but only ASCII characters." Once that's said, it can't be undone.
>>> There is no "no source::encoding", only a switch to ASCII or not.
>>> Ideally, this would be the natural state of the program, but given
>>> the "the boilerplate should be a single line" doctrine, I think
>>> this is what we want implied by "use v5.x".
>>
>> IMO “modern Perl” should require source code to be valid UTF-8, full
>> stop.
>
> Respectfully disagree. There are still plenty environments where
> entering UTF-8 in source files is a real problem, and entering
> ISO-8859-1 is easy.

Sorry, I meant to write: modern Perl should, by *default*, require Perl source code to be valid UTF-8. Perl should, of course, still be able to work in single-byte contexts.

>
> What if the .pl is in iso-8859? Currently it fails just like Rik's example

While I think `<<GROß` in UTF-8 should work, it would surprise me if the benefit from teaching Perl to parse the same heredoc in other encodings as well justified the additional development & maintenance effort.

That said, I’m an Anglophone, so I may not perceive such benefits as readily as, say, a continental European.

> That all said, I would not object to moving to UTF-8 as in almost every
> case where I would use this, "\x{..}", "\x{....}", and "\N{.....}"
> would be the correct approach

I’m not sure what you mean. \N{.....} et al. generally create Unicode strings, not their UTF-8-encoded variants.

I do frequently need to type curly-quotes (“”, and ‘’) via keyboard shortcuts. I’d find it irksome if an official preference were established in Perl for me to write something like: \N{LEFT_CURLY_DOUBLE_QUOTE}Hello\N{RIGHT_CURLY_DOUBLE_QUOTE} (or whatever their real Unicode names are) rather than just option-[, Hello, and shift-option-[. In fact, such would break our Locale::Maketext-based localization tools, which parse source code to extract strings to send to translators.

-F
Re: tightening up source code encoding semantics
On Tue, 22 Feb 2022 09:37:14 -0500, Felipe Gasper <felipe@felipegasper.com> wrote:

> > On Feb 22, 2022, at 09:00, H.Merijn Brand <perl5@tux.freedom.nl> wrote:
> >
> > On Tue, 22 Feb 2022 08:45:52 -0500, Felipe Gasper <felipe@felipegasper.com> wrote:
> >
> [...]
> >>
> >> IMO “modern Perl” should require source code to be valid UTF-8, full
> >> stop.
> >
> > Respectfully disagree. There are still plenty environments where
> > entering UTF-8 in source files is a real problem, and entering
> > ISO-8859-1 is easy.
>
> Sorry, I meant to write: modern Perl should, by *default*, require
> Perl source code to be valid UTF-8. Perl should, of course, still be
> able to work in single-byte contexts.
>
> > What if the .pl is in iso-8859? Currently it fails just like Rik's
> > example
>
> While I think `<<GROß>` in UTF-8 should work, it would surprise me if
> the benefit from teaching Perl to parse the same heredoc in other
> encodings as well justified the additional development & maintenance
> effort.
>
> That said, I’m an Anglophone, so I may not perceive such benefits as
> readily as, say, a continental European.
>
> > That all said, I would not object to moving to UTF-8 as in almost
> > every case where I would use this, "\x{..}", "\x{....}", and
> > "\N{.....}" would be the correct approach
>
> I’m not sure what you mean. \N{.....} et al. generally create Unicode
> strings, not their UTF-8-encoded variants.

I personally never use hardcoded non-ASCII characters in perl source
code if I can prevent it. If I need iso-8859-1, I use \x{..} inside
double quotes. If I need utf-8, I use "\x{....}" and/or "\N{....}".
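For instance (the strings here are just examples; \N{...} names have not
needed an explicit "use charnames" since perl 5.16):

my $latin1  = "Queensr\x{FF}che";                       # U+00FF via \x{..}
my $unicode = "snowman: \x{2603}";                      # U+2603 via \x{....}
my $named   = "caf\N{LATIN SMALL LETTER E WITH ACUTE}"; # U+00E9 via \N{...}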

> I do frequently need to type curly-quotes (“”, and ‘’) via keyboard
> shortcuts. I’d find it irksome if an official preference were
> established in Perl for me to write something like:
> \N{LEFT_CURLY_DOUBLE_QUOTE}Hello\N{RIGHT_CURLY_DOUBLE_QUOTE} (or

I *loathe* special quotes in error messages and source code: they are
just a cause of confusion and errors. If a C compiler or other tool tells
me that <special-quote>file-name<special-quote> contains an error,
double-clicking on the filename most often includes the quotation, and
these quotations are seldom recognized by my shells. Don't tell me to
use a different shell then; using "intelligent" quotes in
error messages sucks. Period.

> whatever their real Unicode names are) rather than just option-[,
> Hello, and shift-option-[. In fact, such would break our
> Locale::Maketext-based localization tools, which parse source code to
> extract strings to send to translators.

I know a lot of people love localization, but I try to stay away from
that as much as possible, as it complicates finding the cause of
problems. The original message - most often in English - is much easier
to google than some form of translation.

> -F

--
H.Merijn Brand https://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.33 porting perl5 on HP-UX, AIX, and Linux
https://tux.nl/email.html http://qa.perl.org https://www.test-smoke.org
Re: tightening up source code encoding semantics
On Tue, Feb 22, 2022, at 1:55 AM, Dan Book wrote:
> On Mon, Feb 21, 2022 at 9:56 PM Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>> The big question is, how are we to allow Perl source code to be encoded? I think there are a few options worth mentioning:
>> * ASCII only, all non-ASCII must be represented by escape sequences
>> * UTF-8 only, all non-ASCII data must be represented by escape sequences
>> * bytes only, all bytes read from source become the corresponding codepoint in the source (this is sometimes described as "It's Latin-1 by default", which has been a contentious claim over the years)
>> * a mixture of bytes and UTF-8
>
> Is the second one meant to just be "UTF-8 only" with no caveats? UTF-8 without non-ASCII data is ... just ASCII.

The second one was very poorly written, I must've seen something shiny while writing it.

The second one should've been "all non-Unicode or non-textual data", not "non-ASCII data". That is:
my $string = "Queensrÿche"; # source contains UTF-8, $string contains U+00FF
my $buffer = "Queensr\xC3\xBFche"; # string, meant to contain UTF-8 (not text), does

--
rjbs
RE: tightening up source code encoding semantics
* you must declare source encoding before any non-ASCII byte is encountered
will the following simple programs continue to work, given that "t-u8.pl" contains correct utf-8?
vad@bonitah:~/sdb1$ perl -w
print "??????\n";
??????
vad@bonitah:~/sdb1$ perl -w t-u8.pl
??????
Press any key to continue...
This is a very simple and reasonable thing to do (AKA DWIM)


Re: tightening up source code encoding semantics
On Tue, Feb 22, 2022, at 12:41 PM, Konovalov, Vadim wrote:
> * you must declare source encoding before any non-ASCII byte is encountered
> will the following simple programs continue to work, given that “t-u8.pl” contains correct utf-8?
> vad@bonitah:~/sdb1$ perl -w
> print "??????\n";
> ??????
> vad@bonitah:~/sdb1$ perl -w t-u8.pl
> ??????
> Press any key to continue...
> This is a very simple and reasonable thing to do (AKA DWIM)

Yes. I think I should've been clearer:
* If source encoding is declared at all, it must be before the first non-ASCII byte.

In my imaginary world, this program is okay:
say "?? ????????";

This program is okay:
use source::encoding 'utf8';
say "?? ????????";

This program is not:
say "??????";
use source::encoding 'utf8';
say "?? ????????";

--
rjbs
Re: tightening up source code encoding semantics
On Tue, Feb 22, 2022 at 4:30 PM Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

> On Tue, Feb 22, 2022, at 12:41 PM, Konovalov, Vadim wrote:
>
>
> - you must declare source encoding before any non-ASCII byte is
> encountered
>
> will the following simple programs continue to work, given that “t-u8.pl”
> contains correct utf-8?
>
> vad@bonitah:~/sdb1$ perl -w
>
> print "??????\n";
>
> ??????
>
> vad@bonitah:~/sdb1$ perl -w t-u8.pl
>
> ??????
>
> Press any key to continue...
>
> This is a very simple and reasonable thing to do (AKA DWIM)
>
>
> Yes. I think I should've been clearer:
>
> - If source encoding is declared at all, it must be before the first
> non-ASCII byte.
>
>
> In my imaginary world, this program is okay:
>
> say "?? ????????";
>
>
> This program is okay:
>
> use source::encoding 'utf8';
> say "?? ????????";
>
>
> This program is not:
>
> say "??????";
> use source::encoding 'utf8';
> say "?? ????????";
>
>
>
Sounds very reasonable.

-Dan
Re: tightening up source code encoding semantics
On Tue, 22 Feb 2022 16:30:04 -0500, "Ricardo Signes" <perl.p5p@rjbs.manxome.org> wrote:

> On Tue, Feb 22, 2022, at 12:41 PM, Konovalov, Vadim wrote:
> > * you must declare source encoding before any non-ASCII byte is encountered
> > will the following simple programs continue to work, given that “t-u8.pl” contains correct utf-8?
> > vad@bonitah:~/sdb1$ perl -w
> > print "??????\n";
> > ??????
> > vad@bonitah:~/sdb1$ perl -w t-u8.pl
> > ??????
> > Press any key to continue...
> > This is a very simple and reasonable thing to do (AKA DWIM)
>
> Yes. I think I should've been clearer:
> * If source encoding is declared at all, it must be before the first non-ASCII byte.

+1

> In my imaginary world, this program is okay:
> say "?? ????????";
>
> This program is okay:
> use source::encoding 'utf8';
> say "?? ????????";
>
> This program is not:
> say "??????";
> use source::encoding 'utf8';
> say "?? ????????";

--
H.Merijn Brand https://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.33 porting perl5 on HP-UX, AIX, and Linux
https://tux.nl/email.html http://qa.perl.org https://www.test-smoke.org
Re: tightening up source code encoding semantics
On 2/21/22 19:55, Ricardo Signes wrote:
> Porters,
>
> This is a long email which ends up with me mostly spitballing an idea or
> two about how to improve our handling of source code encoding.  Sorry?
>
> I've been talking with Karl about source::encoding, utf8, and related
> topics.  We got talking about whether "no source::encoding" made sense.
> Meanwhile, Paul was posting about disallowing downgrade from utf8.  Then
> Karl asked about bytes.pm.
>
> I think the whole situation could do with another round of "Yeah, but
> what would the best world be?"  I will start by saying, "In the best
> world, bytes.pm would not exist."  But it does, and I think we can
> generally allow it to continue to … do what it does.  I will not refer
> to bytes.pm again in this email.
>
> The big question is, how are we to allow Perl source code to be
> encoded?  I think there are a few options worth mentioning:
>
> * ASCII only, all non-ASCII must be represented by escape sequences
> * UTF-8 only, all non-ASCII data must be represented by escape sequences
> * bytes only, all bytes read from source become the corresponding
> codepoint in the source (this is sometimes described as "It's
> Latin-1 by default", which has been a contentious claim over the years)
> * a mixture of bytes and UTF-8
>
> The option we've given, for years, is the last one.  We start in bytes
> mode.  "use utf8" indicates that the source document is in UTF-8.  When
> utf8 leaves effect, either because its scope ends or because of "no
> utf8", we return to bytes mode.  This is pretty terrible, in my
> opinion.  What's one's editor to make of this?
>
> If we imagine that the reader can correctly swap between reading bytes
> and UTF-8 at scope boundaries (which I think I've seen recent evidence
> that it cannot reliably do), this may be a technically sustainable
> position.  I think it's a /bad/ position, though.
>
> "The source is bytes" is a bad position and always has been one, with
> the /possible/ exception of string literals.  Unfortunately, we have
> relatively terrible failure modes around non-ASCII outside of string
> literals.
>
> *Program:*
>
> <<GROß;
> foo
> GROß
>
>
> *Output:*
>
> Can't find string terminator "GRO" anywhere before EOF at - line 1.
>
>
> I think what we really want is to say /either/ "This program has stupid
> legacy behavior" /or/ "this program is encoded in UTF-8".  Then we want
> to strongly, /strongly/ encourage the second option.  You may want to
> cry out, now, "I thought you said months ago that we wouldn't force
> everyone to use UTF-8 encoded source!"  I am not quite contradicting myself.
>
> Remember, fellow porter, that ASCII encoded data is a subset of UTF-8
> encoded data.  Once the source is declared to be in UTF-8, it's much
> less of a problem to say "specifically, entirely codepoints 0-127 except
> in scopes where that restriction is lifted."  I think the problem with
> "no utf8" is not that it lets you disallow Japanese text, but that it
> switches back to bytes mode.
>
> The whole thing makes me think that we want source::encoding (or
> something like it) to say "this document is UTF-8" and optionally "but
> only ASCII characters."  Once that's said, it can't be undone.  There is
> no "no source::encoding", only a switch to ASCII or not.  Ideally, this
> would be the natural state of the program, but given the "the
> boilerplate should be a single line" doctrine, I think this is what we
> want implied by "use v5.x".
>
> This gets us back to the "use v5.x should imply ascii encoding", but
> further to, "and you can't switch it off".  I'd say something like:
>
> * you must declare source encoding before any non-ASCII byte is
> encountered
> * you must declare source encoding at the outermost lexical scope in a
> file, if you are to declare it at all
>
> --
> rjbs
>

An option to think about is that it's possible to pretty reliably guess
the encoding upon encountering the first line containing non-ASCII.
Pod::Simple does this successfully and the choices are UTF-8 vs Windows
CP1252, which is quite a bit harder to distinguish from UTF-8 than our
alternative, Latin1. There have been no reports of problems with its
technique since I beefed it up some years ago.

The confusables for the Latin1 vs UTF-8 case all look like a Latin1
letter or the multiplication sign or division sign, followed by one or
more Latin1 punctuation/symbols or C1 controls. If you look at their
graphics, they all look like mojibake. Hence I'm confident, even
without the Pod::Simple experience, that it is extremely unlikely we
would guess wrong.

Here's how it could work.

You wouldn't need an encoding declaration in your file unless
1) the very unlikely case where we guessed wrong
2) you want to forbid non-ASCII in your file, as the original email
thread discussed.

Absent such a declaration, Perl would parse the file like it does today.
When it encounters the first line containing a non-ASCII, it would
make its guess, and if the guess is UTF-8, raise a warning, if enabled.

'no utf8' would be the way to say "Don't guess UTF-8". It would throw
an error if we had already seen what we took as UTF-8.

'use ascii' (however it is spelled) would cause an error to be thrown if
a non-ASCII is encountered within its scope.

'no ascii' would be a no-op outside the scope of 'use ascii'. Otherwise
it would restore the behavior to whatever it was when the 'use ascii'
was encountered.

I believe the only existing programs this scenario would affect are ones
that (most likely, unsafely) mix UTF-8 and Latin1.

An advantage is that a 'use utf8' would no longer be required in almost
all circumstances.
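For concreteness, a rough sketch of the guess step described above,
applied to the first line containing non-ASCII (illustrative only; the
sub name is invented here, and the real check would live in the tokenizer):

sub guess_encoding_of_line {
    my ($octets) = @_;
    return 'ascii' unless $octets =~ /[^\x00-\x7F]/;   # nothing to guess
    my $copy = $octets;
    # utf8::decode() returns true only for well-formed UTF-8 octets
    return utf8::decode($copy) ? 'utf-8' : 'latin-1';
}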
Re: tightening up source code encoding semantics
On Wed, Feb 23, 2022 at 9:18 AM Karl Williamson <public@khwilliamson.com>
wrote:

> On 2/21/22 19:55, Ricardo Signes wrote:
> > Porters,
> >
> > This is a long email which ends up with me mostly spitballing an idea or
> > two about how to improve our handling of source code encoding. Sorry?
> >
> > I've been talking with Karl about source::encoding, utf8, and related
> > topics. We got talking about whether "no source::encoding" made sense.
> > Meanwhile, Paul was posting about disallowing downgrade from utf8. Then
> > Karl asked about bytes.pm.
> >
> > I think the whole situation could do with another round of "Yeah, but
> > what would the best world be?" I will start by saying, "In the best
> > world, bytes.pm would not exist." But it does, and I think we can
> > generally allow it to continue to … do what it does. I will not refer
> > to bytes.pm again in this email.
> >
> > The big question is, how are we to allow Perl source code to be
> > encoded? I think there are a few options worth mentioning:
> >
> > * ASCII only, all non-ASCII must be represented by escape sequences
> > * UTF-8 only, all non-ASCII data must be represented by escape
> sequences
> > * bytes only, all bytes read from source become the corresponding
> > codepoint in the source (this is sometimes described as "It's
> > Latin-1 by default", which has been a contentious claim over the
> years)
> > * a mixture of bytes and UTF-8
> >
> > The option we've given, for years, is the last one. We start in bytes
> > mode. "use utf8" indicates that the source document is in UTF-8. When
> > utf8 leaves effect, either because its scope ends or because of "no
> > utf8", we return to bytes mode. This is pretty terrible, in my
> > opinion. What's one's editor to make of this?
> >
> > If we imagine that the reader can correctly swap between reading bytes
> > and UTF-8 at scope boundaries (which I think I've seen recent evidence
> > that it cannot reliably do), this may be a technically sustainable
> > position. I think it's a /bad/ position, though.
> >
> > "The source is bytes" is a bad position and always has been one, with
> > the /possible/ exception of string literals. Unfortunately, we have
> > relatively terrible failure modes around non-ASCII outside of string
> > literals.
> >
> > *Program:*
> >
> > <<GROß;
> > foo
> > GROß
> >
> >
> > *Output:*
> >
> > Can't find string terminator "GRO" anywhere before EOF at - line 1.
> >
> >
> > I think what we really want is to say /either/ "This program has stupid
> > legacy behavior" /or/ "this program is encoded in UTF-8". Then we want
> > to strongly, /strongly/ encourage the second option. You may want to
> > cry out, now, "I thought you said months ago that we wouldn't force
> > everyone to use UTF-8 encoded source!" I am not quite contradicting
> myself.
> >
> > Remember, fellow porter, that ASCII encoded data is a subset of UTF-8
> > encoded data. Once the source is declared to be in UTF-8, it's much
> > less of a problem to say "specifically, entirely codepoints 0-127 except
> > in scopes where that restriction is lifted." I think the problem with
> > "no utf8" is not that it lets you disallow Japanese text, but that it
> > switches back to bytes mode.
> >
> > The whole thing makes me think that we want source::encoding (or
> > something like it) to say "this document is UTF-8" and optionally "but
> > only ASCII characters." Once that's said, it can't be undone. There is
> > no "no source::encoding", only a switch to ASCII or not. Ideally, this
> > would be the natural state of the program, but given the "the
> > boilerplate should be a single line" doctrine, I think this is what we
> > want implied by "use v5.x".
> >
> > This gets us back to the "use v5.x should imply ascii encoding", but
> > further to, "and you can't switch it off". I'd say something like:
> >
> > * you must declare source encoding before any non-ASCII byte is
> > encountered
> > * you must declare source encoding at the outermost lexical scope in a
> > file, if you are to declare it at all
> >
> > --
> > rjbs
> >
>
> An option to think about is that it's possible to pretty reliably guess
> the encoding upon encountering the first line containing non-ASCII.
> Pod::Simple does this successfully and the choices are UTF-8 vs Windows
> CP1252, which is quite a bit harder to distinguish from UTF-8 than our
> alternative, Latin1. There have been no reports of problems with its
> technique since I beefed it up some years ago.
>
> The confusables for the Latin1 vs UTF-8 case all look like a Latin1
> letter or the multiplication sign or division sign, followed by one or
> more Latin1 punctuation/symbols or C1 controls. If you look at their
> graphics, they all look like mojibake. Hence I'm confident, even
> without the Pod::Simple experience, that it is extremely unlikely we
> would guess wrong.
>
> Here's how it could work.
>
> You wouldn't need an encoding declaration in your file unless
> 1) the very unlikely case where we guessed wrong
> 2) you want to forbid non-ASCII in your file, as the original email
> thread discussed.
>
> Absent such a declaration, Perl would parse the file like it does today.
> When it encounters the first line containing a non-ASCII, it would
> make its guess, and if the guess is UTF-8, raise a warning, if enabled.
>
> 'no utf8' would be the way to say "Don't guess UTF-8'. It would throw
> an error if we had already seen what we took as UTF-8.
>
> 'use ascii' (however it is spelled) would cause an error to be thrown if
> a non-ASCII is encountered within its scope.
>
> 'no ascii' would be a no-op outside the scope of 'use ascii'. Otherwise
> it would restore the behavior to whatever it was when the 'use ascii'
> was encountered.
>
> I believe the only existing programs this scenario would effect are ones
> that (most likely, unsafely) mix UTF-8 and Latin1.
>
> An advantage is that a 'use utf8' would no longer be required in almost
> all circumstances.
>

In my opinion, changing program behavior based on a pretty reliable guess
rather than a declaration would be a mistake.

-Dan
Re: tightening up source code encoding semantics
Hi there,

On Wed, 23 Feb 2022, Dan Book wrote:

> In my opinion, changing program behavior based on a pretty reliable guess
> rather than a declaration would be a mistake.

+1

"Pretty reliable" == "rarely experienced problem, difficult to diagnose".

--

73,
Ged.
Re: tightening up source code encoding semantics
On Thu, 24 Feb 2022, 00:15 G.W. Haywood via perl5-porters, <
perl5-porters@perl.org> wrote:

> Hi there,
>
> On Wed, 23 Feb 2022, Dan Book wrote:
>
> > In my opinion, changing program behavior based on a pretty reliable guess
> > rather than a declaration would be a mistake.
>
> +1
>
> "Pretty reliable" == "rarely experienced problem, difficult to diagnose".
>

Yeah, I don't know that I agree. Karl is an expert in these matters, and
pretty conservative about his opinions. If he says pretty reliable I would
assume he is being conservative and actually means "very robust" and not
argue unless I had clear data to contradict him. I would not base my
rejection on opinions and intuition. It's perfectly possible that what he
proposes is reliable and capable of detecting things that could cause
trouble.

I'd like to hear more before we just dismiss his proposal.

Yves
Re: tightening up source code encoding semantics
2022-2-23 23:18 Karl Williamson <public@khwilliamson.com> wrote:

> On 2/21/22 19:55, Ricardo Signes wrote:
> > Porters,
> >
> > This is a long email which ends up with me mostly spitballing an idea or
> > two about how to improve our handling of source code encoding. Sorry?
> >
> > I've been talking with Karl about source::encoding, utf8, and related
> > topics. We got talking about whether "no source::encoding" made sense.
> > Meanwhile, Paul was posting about disallowing downgrade from utf8. Then
> > Karl asked about bytes.pm.
> >
> > I think the whole situation could do with another round of "Yeah, but
> > what would the best world be?" I will start by saying, "In the best
> > world, bytes.pm would not exist." But it does, and I think we can
> > generally allow it to continue to … do what it does. I will not refer
> > to bytes.pm again in this email.
> >
> > The big question is, how are we to allow Perl source code to be
> > encoded? I think there are a few options worth mentioning:
> >
> > * ASCII only, all non-ASCII must be represented by escape sequences
> > * UTF-8 only, all non-ASCII data must be represented by escape
> sequences
> > * bytes only, all bytes read from source become the corresponding
> > codepoint in the source (this is sometimes described as "It's
> > Latin-1 by default", which has been a contentious claim over the
> years)
> > * a mixture of bytes and UTF-8
> >
> > The option we've given, for years, is the last one. We start in bytes
> > mode. "use utf8" indicates that the source document is in UTF-8. When
> > utf8 leaves effect, either because its scope ends or because of "no
> > utf8", we return to bytes mode. This is pretty terrible, in my
> > opinion. What's one's editor to make of this?
> >
> > If we imagine that the reader can correctly swap between reading bytes
> > and UTF-8 at scope boundaries (which I think I've seen recent evidence
> > that it cannot reliably do), this may be a technically sustainable
> > position. I think it's a /bad/ position, though.
> >
> > "The source is bytes" is a bad position and always has been one, with
> > the /possible/ exception of string literals. Unfortunately, we have
> > relatively terrible failure modes around non-ASCII outside of string
> > literals.
> >
> > *Program:*
> >
> > <<GROß;
> > foo
> > GROß
> >
> >
> > *Output:*
> >
> > Can't find string terminator "GRO" anywhere before EOF at - line 1.
> >
> >
> > I think what we really want is to say /either/ "This program has stupid
> > legacy behavior" /or/ "this program is encoded in UTF-8". Then we want
> > to strongly, /strongly/ encourage the second option. You may want to
> > cry out, now, "I thought you said months ago that we wouldn't force
> > everyone to use UTF-8 encoded source!" I am not quite contradicting
> myself.
> >
> > Remember, fellow porter, that ASCII encoded data is a subset of UTF-8
> > encoded data. Once the source is declared to be in UTF-8, it's much
> > less of a problem to say "specifically, entirely codepoints 0-127 except
> > in scopes where that restriction is lifted." I think the problem with
> > "no utf8" is not that it lets you disallow Japanese text, but that it
> > switches back to bytes mode.
> >
> > The whole thing makes me think that we want source::encoding (or
> > something like it) to say "this document is UTF-8" and optionally "but
> > only ASCII characters." Once that's said, it can't be undone. There is
> > no "no source::encoding", only a switch to ASCII or not. Ideally, this
> > would be the natural state of the program, but given the "the
> > boilerplate should be a single line" doctrine, I think this is what we
> > want implied by "use v5.x".
> >
> > This gets us back to the "use v5.x should imply ascii encoding", but
> > further to, "and you can't switch it off". I'd say something like:
> >
> > * you must declare source encoding before any non-ASCII byte is
> > encountered
> > * you must declare source encoding at the outermost lexical scope in a
> > file, if you are to declare it at all
> >
> > --
> > rjbs
> >
>
> An option to think about is that it's possible to pretty reliably guess
> the encoding upon encountering the first line containing non-ASCII.
> Pod::Simple does this successfully and the choices are UTF-8 vs Windows
> CP1252, which is quite a bit harder to distinguish from UTF-8 than our
> alternative, Latin1. There have been no reports of problems with its
> technique since I beefed it up some years ago.
>
> The confusables for the Latin1 vs UTF-8 case all look like a Latin1
> letter or the multiplication sign or division sign, followed by one or
> more Latin1 punctuation/symbols or C1 controls. If you look at their
> graphics, they all look like mojibake. Hence I'm confident, even
> without the Pod::Simple experience, that it is extremely unlikely we
> would guess wrong.
>
> Here's how it could work.
>
> You wouldn't need an encoding declaration in your file unless
> 1) the very unlikely case where we guessed wrong
> 2) you want to forbid non-ASCII in your file, as the original email
> thread discussed.
>
> Absent such a declaration, Perl would parse the file like it does today.
> When it encounters the first line containing a non-ASCII, it would
> make its guess, and if the guess is UTF-8, raise a warning, if enabled.
>
> 'no utf8' would be the way to say "Don't guess UTF-8'. It would throw
> an error if we had already seen what we took as UTF-8.
>
> 'use ascii' (however it is spelled) would cause an error to be thrown if
> a non-ASCII is encountered within its scope.
>
> 'no ascii' would be a no-op outside the scope of 'use ascii'. Otherwise
> it would restore the behavior to whatever it was when the 'use ascii'
> was encountered.
>
> I believe the only existing programs this scenario would effect are ones
> that (most likely, unsafely) mix UTF-8 and Latin1.
>
> An advantage is that a 'use utf8' would no longer be required in almost
> all circumstances.
>

My understanding:

In most cases, the Perl tokenizer can guess correctly whether the source
code is written in Latin-1 or UTF-8.

Only source code that mixes UTF-8 and Latin-1 can actually cause problems.

If users want to make sure it is UTF-8, use "use utf8".

If users want to make sure it is ASCII, use "use ascii".
Re: tightening up source code encoding semantics
Hi there,

On Thu, 24 Feb 2022, demerphq wrote:
> On Thu, 24 Feb 2022, 00:15 G.W. Haywood via perl5-porters wrote:
>> On Wed, 23 Feb 2022, Dan Book wrote:
>>
>>> In my opinion, changing program behavior based on a pretty reliable guess
>>> rather than a declaration would be a mistake.
>>
>> +1
>>
>> "Pretty reliable" == "rarely experienced problem, difficult to diagnose".
>>
>
> Yeah, I don't know that I agree. Karl is an expert in these matters, and
> pretty conservative about his opinions. If he says pretty reliable I would
> assume he is being conservative and actually means "very robust" and not
> argue unless I had clear data to contradict him. i would not base my
> rejection on opinions and intuition. It's perfectly possible that what he
> proposes is reliable and capable of detecting things that could cause
> trouble.
>
> I'd like to hear more before we just dismiss his proposal.

If we're talking about things with the potential to negatively affect
people all over the planet for years to come, it's either reliable or
it isn't. If an expletive is appropriate here, then it invites more,
later, of the kind more commonly experienced with software issues.

I'm thinking if it's more unlikely to get it wrong than it is that say
twelve fortuitous cosmic rays will accidentally empty my bank account,
then that will probably do. But probabilities are notoriously tricky.
Guesstimates for example of hash collision resistance have turned out
to be significantly overestimated, e.g. by 2^15+ for MD5 and SHA-1.

If we can delete the intensifier, consider me persuaded and move on.

--

73,
Ged.
Re: tightening up source code encoding semantics
On 2/23/22 09:14, G.W. Haywood via perl5-porters wrote:
> Hi there,
>
> On Wed, 23 Feb 2022, Dan Book wrote:
>
>> In my opinion, changing program behavior based on a pretty reliable guess
>> rather than a declaration would be a mistake.
>
> +1
>
> "Pretty reliable" == "rarely experienced problem, difficult to diagnose".
>

I believe you guys don't grasp the proposal

No perl program would silently change behavior from the existing
baseline as a result of this proposal.

Please let that soak in.

The proposal does not lead to hard to diagnose issues, because it will
warn at compilation time if it chooses a different interpretation than
the current one.

The advantage it has over the original proposal in this thread is that it
is more lenient: many fewer programs would have to change as a result, and
in almost all instances 'use utf8' would not be required in a program,
which has been a promise in our documentation for a long time.

I should have used a stronger term than 'pretty reliable'. It is well
known that the likelihood of UTF-8 being confused with most other
encodings goes down quite fast as the number of non-ASCII characters in
a string increases. I have never seen a value claimed, but I wouldn't
be surprised if it weren't exponential.

The syntax of UTF-8 consists of a start byte consisting of 2 or more
initial 1 bits followed by a 0, and then any pattern of bits. That
means at least the first three bits are fixed. You have 110xxxxx or
1110xxxx or 11110xxx, etc.

The start byte is followed by some number of continuation bytes, each of
which begins with '10', and then any pattern of 6 bits.

But the constraint is that the number of bytes in a single UTF-8
character is the number of leading set bits in its start byte. That
means that a character is constrained both in its bit patterns, and
length. A following character must be an ASCII one with the leading bit
0, or another sequence of bytes following the constraints I gave above.
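Those structural rules can be written down directly. A sketch of a strict
well-formedness check over the source octets (illustrative only; the
tokenizer would do this in C, and this pattern also rejects surrogates and
overlong forms):

sub octets_are_well_formed_utf8 {
    my ($octets) = @_;
    return $octets =~ m{ \A (?:
          [\x00-\x7F]                         # ASCII, leading bit 0
        | [\xC2-\xDF]         [\x80-\xBF]     # 110xxxxx 10xxxxxx
        | \xE0 [\xA0-\xBF]    [\x80-\xBF]     # 1110xxxx, overlongs excluded
        | [\xE1-\xEC\xEE\xEF] [\x80-\xBF]{2}  # 1110xxxx 10xxxxxx 10xxxxxx
        | \xED [\x80-\x9F]    [\x80-\xBF]     # exclude UTF-16 surrogates
        | \xF0 [\x90-\xBF]    [\x80-\xBF]{2}  # 11110xxx, overlongs excluded
        | [\xF1-\xF3]         [\x80-\xBF]{3}
        | \xF4 [\x80-\x8F]    [\x80-\xBF]{2}  # cap at U+10FFFF
    )* \z }x;
}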

All the bytes, of course, could individually be Latin1 ones. It turns
out that all the possible start bytes are one of:
a) multiplication sign ×
b) division sign ÷
c) Any one of 62 Latin 1 letters like ø

These must be followed by a sequence of 1 or more characters that are C1
controls or Latin 1 punctuation or symbols, like ¥.

A three byte UTF-8 character must end with two Latin1 controls or
symbols in a row.

The programs that are most likely to fool this proposal are those with a
single character that could be two Latin1 bytes, or a single UTF-8 one.
Attached is a printout of all such ones that don't involve a
non-printable C1 control. Find some that are at all likely to be the
single character in a file that would confuse the algorithm in the proposal.
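A sketch of how one could enumerate those two-byte confusables (this is
not the attached program, just an illustration of the same idea):

use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';  # so both interpretations display properly

for my $lead (0xC2 .. 0xDF) {         # every two-byte UTF-8 start byte
    for my $cont (0xA0 .. 0xBF) {     # printable continuations; skips C1 controls
        my $as_latin1 = chr($lead) . chr($cont);                       # two Latin-1 characters
        my $as_utf8   = chr( (($lead & 0x1F) << 6) | ($cont & 0x3F) ); # one decoded character
        printf "%02X %02X  %s  <->  %s\n", $lead, $cont, $as_latin1, $as_utf8;
    }
}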

Yes, one could have a letter followed by a superscript digit; that makes
some sense. But would it be the only such sequence of bytes in the
file, and every other byte is either ASCII or of the same form? It's
very unlikely, but even if that is the case you would be warned at
compilation time.
Re: tightening up source code encoding semantics
On Sat, Feb 26, 2022 at 11:57 PM Karl Williamson <public@khwilliamson.com>
wrote:

> On 2/23/22 09:14, G.W. Haywood via perl5-porters wrote:
> > Hi there,
> >
> > On Wed, 23 Feb 2022, Dan Book wrote:
> >
> >> In my opinion, changing program behavior based on a pretty reliable
> guess
> >> rather than a declaration would be a mistake.
> >
> > +1
> >
> > "Pretty reliable" == "rarely experienced problem, difficult to diagnose".
> >
>
> I believe you guys don't grasp the proposal
>
> No perl program would silently change behavior from the existing
> baseline as a result of this proposal.


How would interpreting characters in a different encoding not silently
change behavior?

I concur with the rest of your message but I am familiar with the
reliability of the guess, and still believe it is a mistake in any such
widespread application.

-Dan
Re: tightening up source code encoding semantics
On Sun, 27 Feb 2022 at 05:57, Karl Williamson <public@khwilliamson.com>
wrote:

> On 2/23/22 09:14, G.W. Haywood via perl5-porters wrote:
> > Hi there,
> >
> > On Wed, 23 Feb 2022, Dan Book wrote:
> >
> >> In my opinion, changing program behavior based on a pretty reliable
> guess
> >> rather than a declaration would be a mistake.
> >
> > +1
> >
> > "Pretty reliable" == "rarely experienced problem, difficult to diagnose".
> >
>
> I believe you guys don't grasp the proposal
>
> No perl program would silently change behavior from the existing
> baseline as a result of this proposal.
>
> Please let that soak in.
>
> The proposal does not lead to hard to diagnose issues, because it will
> warn at compilation time if it chooses a different interpretation than
> the current one.
>

I think this is a key point. If this logic would choose to interpret things
in a strange way the developer would be informed.


>
> The advantage it has over the original one in this thread is it is more
> lenient; many fewer programs would have to change as a result; and in
> almost all instances 'use utf8' would not be required in a program,
> which has been a promise in our documentation for a long time.
>
> I should have used a stronger term than 'pretty reliable'. It is well
> known that the likelihood of UTF-8 being confused with most other
> encodings goes down quite fast as the number of non-ASCII characters in
> a string increases. I have never seen a value claimed, but I wouldn't
> be surprised if it weren't exponential.
>
> The syntax of UTF-8 consists of a start byte consisting of 2 or more
> initial 1 bits followed by a 0, and then any pattern of bits. That
> means at least the first three bits are fixed. You have 110xxxxx or
> 1110xxxx or 11110xxx, etc.
>
> The start byte is followed by some number of continuation bytes, each of
> which begins with '10', and then any pattern of 6 bits.
>
> But the constraint is that the number of bytes in a single UTF-8
> character is the number of leading set bits in its start byte. That
> means that a character is constrained both in its bit patterns, and
> length. A following character must be an ASCII one with the leading bit
> 0, or another sequence of bytes following the constraints I gave above.
>
> All the bytes, of course, could individually be Latin1 ones. It turns
> out that all the possible start bytes are one of:
> a) multiplication sign ×
> b) division sign ÷
> c) Any one of 62 Latin 1 letters like ø
>
> These must be followed by a sequence of 1 or more characters that are C1
> controls or Latin 1 punctuation or symbols, like ¥.
>
> A three byte UTF-8 character must end with two Latin1 controls or
> symbols in a row.
>
> The programs that are most like to fool this proposal are those with
> single character that could be two Latin1 bytes, or a single UTF-8 one.
> Attached is a printout of all such ones that don't involve a
> non-printable C1 control. Find some that at are at all likely to be the
> single character in a file that would confuse the algorithm in the
> proposal.
>

If I understand you correctly, what you are saying is that if you detected
*any* case where the code could not be reliably and correctly processed we
would throw an exception. And if we detected anything that simply couldn't
be valid UTF-8 we would know the file does not contain utf8.

So for instance what you are saying is that if someone wrote "×" (with the
× being the single octet \xD7) you instantly know that this is not utf8.
Similarly if someone wrote '÷' (with the ÷ being the single octet \xF7), we
would also know that this is not utf8. And so on and so forth.

$ perl -MData::Dumper -MEncode=encode_utf8 -wle'for my $str ("\xF7\x27",
qq(\xD7")) { my $s= $str; utf8::decode($s) or printf "can not decode <%s>
|%s| %s\n",encode_utf8($s),join(" ", map { unpack("H*",$_) } split //,
$s),Data::Dumper::qquote($s);} '
can not decode <÷'> |f7 27| "\367'"
can not decode <×"> |d7 22| "\327\""

This lines up with code I wrote for my previous job, "recurse_decode_utf8",
which pretty much universally replaced decode_utf8, since multiply encoding
data as UTF-8 is a very common occurrence when interoperating with older
MySQL and DBI/DBD::MySQL versions, and other encoding-agnostic remote
systems like memcached and redis and whatnot. I think I saw one bug report
related to it doing it wrong, and indeed it was a two-byte sequence.

> Yes, one could have a letter followed by a superscript digit; that makes
> some sense. But would it be the only such sequence of bytes in the
> file, and every other byte is either ASCII or of the same form? It's
> very unlikely, but even if that is the case you would be warned at
> compilation time.


This makes perfect sense to me. The design of utf8 makes these kinds of
things really easy to do, and as you say, once you have more than a very
small number of characters the chance of an error essentially goes to zero.
If a file contained that few characters we could warn that our detection
was being confused and that it was falling back to the old interpretation.
I guess then those scripts *would* need a pragma, is that correct? So it
doesn't totally do away with the need for the pragma, and those who didn't
trust the heuristic to do the right thing could still use it.

I like it a lot!

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
Re: tightening up source code encoding semantics
Hi there,

On Sun, 27 Feb 2022, Dan Book wrote:
> On Sat, Feb 26, 2022 Karl Williamson wrote:
>> On 2/23/22, G.W. Haywood via perl5-porters wrote:
>>> On Wed, 23 Feb 2022, Dan Book wrote:
>>>
>>>> In my opinion, changing program behavior based on a pretty reliable guess
>>>> rather than a declaration would be a mistake.
>>>
>>> +1
>>>
>>> "Pretty reliable" == "rarely experienced problem, difficult to diagnose".
>>>
>>
>> I believe you guys don't grasp the proposal

To make sure I've grasped it, I've just been over the thread again.

I'm pretty sure I've grasped it. :)

>> No perl program would silently change behavior from the existing
>> baseline as a result of this proposal.
>
> How would interpreting characters in a different encoding not silently
> change behavior?
>
> I concur with the rest of your message but I am familiar with the
> reliability of the guess, and still believe it is a mistake in any such
> widespread application.

I too remain of the opinion that this is a bridge too far. I don't want
to get into nit-picking so I won't start digging holes, but I do think
that there are things mentioned in the proposal that merit more discussion.
Here are two:

On Wed, 23 Feb 2022 Karl Williamson wrote:

> I believe the only existing programs this scenario would effect are
> ones that (most likely, unsafely) mix UTF-8 and Latin1.

This was almost an aside in the discussion but it seems to me that
it's one of the more important issues. Isn't there a case for catching
potentially unsafe usage in some way if it isn't being caught already?

> An advantage is that a 'use utf8' would no longer be required in
> almost all circumstances.

At this point in our history I see this as a disadvantage, but I admit
I'm probably numbered amongst the dinosaurs of the coding fraternity.
Is a consensus to be had?

--

73,
Ged.
Re: tightening up source code encoding semantics
On Sun, 27 Feb 2022 at 12:01, G.W. Haywood via perl5-porters <
perl5-porters@perl.org> wrote:

> > An advantage is that a 'use utf8' would no longer be required in
> > almost all circumstances.
>
> At this point in our history I see this as a disadvantage, but I admit
> I'm probably numbered amongst the dinosaurs of the coding fraternity.
> Is a consensus to be had?
>

I don't understand, why would it be a disadvantage? Nothing is stopping you
or others from including a pragma (I say this generically because I am not
clear on which pragma it would be), but why is it bad to *not* need one
99.999% of the time? Especially if we can detect that you forgot it when
you DO need it?

I could understand it would be a disadvantage if sometimes omitting it
would produce a negative outcome and you wouldn't realize it, but Karl has
explained that wouldn't be an issue.

cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
Re: tightening up source code encoding semantics
Hi there,

On Sun, 27 Feb 2022, demerphq wrote:
> On Sun, 27 Feb 2022 at 12:01, G.W. Haywood via perl5-porters wrote:
>> On Wed, Feb 23, 2022 at 9:18 AM Karl Williamson wrote:
>>>
>>> An advantage is that a 'use utf8' would no longer be required in
>>> almost all circumstances.
>>
>> At this point in our history I see this as a disadvantage, but I admit
>> I'm probably numbered amongst the dinosaurs of the coding fraternity.
>> Is a consensus to be had?
>
> I dont understand, why would it be a disadvantage? ...

Well of the points raised I don't think this is the most important,
but I almost don't understand how anyone would see it as an advantage.
I don't want to sound cranky but suppose we decided that you won't have
to write

use strict;

any more? Would that be a similar advantage?

Like I said, I'm a dinosaur. I'd like to feel that if it does NOT say

use utf8;

then there's a whole trash-can full of worms into which I won't have
to get up to the elbows and which frankly I would dread because it can
waste such a lot of time while contributing not one groat. To try to
amplify I wrote a paragraph here, but it seemed rather too much like a
rant so I tried to put it another way. That seemed like a rant too so
I'll leave it at that and just ask again if there's any consensus - my
reason for making the point in the first place. Is there?

--

73,
Ged.
Re: tightening up source code encoding semantics [ In reply to ]
On 2/26/22 22:25, Dan Book wrote:
> How would interpreting characters in a different encoding not silently
> change behavior?

Because, as I said, if it chooses a different encoding than what it
currently would do, it raises a compilation warning.
Re: tightening up source code encoding semantics [ In reply to ]
On Sun, Feb 27, 2022 at 1:22 PM Karl Williamson <public@khwilliamson.com>
wrote:

> On 2/26/22 22:25, Dan Book wrote:
> > How would interpreting characters in a different encoding not silently
> > change behavior?
>
> Because, as I said, if it chooses a different encoding than what it
> currently would do, it raises a compilation warning.
>

Ah, I see now. You were referring to the "silently" component; I was speaking
of the behavior change regardless of whether it's silent (of course, the
warning is better than nothing).

-Dan
Re: tightening up source code encoding semantics [ In reply to ]
On 2/27/22 04:01, G.W. Haywood via perl5-porters wrote:
> Hi there,
>
> On Sun, 27 Feb 2022, Dan Book wrote:
>> On Sat, Feb 26, 2022 Karl Williamson wrote:
>>> On 2/23/22, G.W. Haywood via perl5-porters wrote:
>>>> On Wed, 23 Feb 2022, Dan Book wrote:
>>>>
>>>>> In my opinion, changing program behavior based on a pretty reliable
>>>>> guess
>>>>> rather than a declaration would be a mistake.
>>>>
>>>> +1
>>>>
>>>> "Pretty reliable" == "rarely experienced problem, difficult to
>>>> diagnose".
>>>>
>>>
>>> I believe you guys don't grasp the proposal
>
> To make sure I've grasped it, I've just been over the thread again.
>
> I'm pretty sure I've grasped it. :)
>
>>> No perl program would silently change behavior from the existing
>>> baseline as a result of this proposal.
>>
>> How would interpreting characters in a different encoding not silently
>> change behavior?
>>
>> I concur with the rest of your message but I am familiar with the
>> reliability of the guess, and still believe it is a mistake in any such
>> widespread application.
>
> I too remain of the opinion that this is a bridge too far.  I don't want
> to get into nit-picking, so I won't start digging holes, but I do think
> that some things mentioned in the proposal merit more discussion.
> Here are two:
>
> On Wed, 23 Feb 2022 Karl Williamson wrote:
>
>> I believe the only existing programs this scenario would affect are
>> ones that (most likely, unsafely) mix UTF-8 and Latin1.
>
> This was almost an aside in the discussion but it seems to me that
> it's one of the more important issues.  Isn't there a case for catching
> potentially unsafe usage in some way if it isn't being caught already?

Please read the original post on this thread. This whole thread is
about trying to prevent unsafe usage. My proposal would do this with
less churn to existing code than the proposal in that original post.
>
>> An advantage is that a 'use utf8' would no longer be required in
>> almost all circumstances.
>
> At this point in our history I see this as a disadvantage, but I admit
> I'm probably numbered amongst the dinosaurs of the coding fraternity.
> Is a consensus to be had?
>

I can't answer that alone.
Re: tightening up source code encoding semantics [ In reply to ]
On 2/27/22 11:30, Dan Book wrote:
> On Sun, Feb 27, 2022 at 1:22 PM Karl Williamson <public@khwilliamson.com> wrote:
>
> On 2/26/22 22:25, Dan Book wrote:
> > How would interpreting characters in a different encoding not
> silently
> > change behavior?
>
> Because, as I said, if it chooses  a different encoding than what it
> currently would do, it raises a compilation warning.
>
>
> Ah I see now. You were referring to the silently component; I was
> speaking on the behavior change regardless of whether it's silent (of
> course, the warning is better than not).
>
> -Dan

Then I don't get your objection. It appears you don't feel the warning
is good enough. Do you consider warnings in general to be good enough?
If so, what makes a warning not good enough, and why this one in particular?
Re: tightening up source code encoding semantics [ In reply to ]
Hi there,

On Sun, 27 Feb 2022, Karl Williamson wrote:
> On 2/27/22 04:01, G.W. Haywood via perl5-porters wrote:
>> On Sat, Feb 26, 2022 Karl Williamson wrote:
>>>
>>> I believe you guys don't grasp the proposal
>>
>> ... just been over the thread again. ... merit more discussion.
>>
>> On Wed, 23 Feb 2022 Karl Williamson wrote:
>>
>>> I believe the only existing programs this scenario would affect are
>>> ones that (most likely, unsafely) mix UTF-8 and Latin1.
>>
>> This was almost an aside in the discussion but it seems to me that
>> it's one of the more important issues.  Isn't there a case for catching
>> potentially unsafe usage in some way if it isn't being caught already?
>
> Please read the original post on this thread. This whole thread is about
> trying to prevent unsafe usage. ...

Well, as I said, I re-read the thread before posting. I'm afraid that to
me the OP reads more like a sermon than a clear statement of the problem
and proposed solutions, but it nevertheless strikes some chords here.

Preventing unsafe usage doesn't seem to be contentious; it's just how
it's attempted that might be. I guess my main worry is that it looks
like there's an awful lot of tinkering going on, and that might cause
some ripples. It also looks like a lot of effort is being dissipated
because developers bound themselves hand and foot before setting out.

While I can't honestly say I'd like it very much, I'd be comfortable
with somebody saying

"Welcome to Perl 7. The source is UTF8".

I'm a lot less comfortable with "As of Perl 5.36.8 your sources will
need to ... because 0.N% of programs have mixed UTF-8 and Latin1, and
most of these probably do it unsafely". Sure there are ways to shoot
yourself in the foot. Lots of them. Is this one a serious problem?
"Doctor, it hurts when I do this ..."

FWIW I'd be fine with nothing but ASCII in code for the rest of my
days, but I don't want to add to the understandable frustration so I'm
out of this now.

--

73,
Ged.
Re: tightening up source code encoding semantics [ In reply to ]
On 2/21/22 19:55, Ricardo Signes wrote:
> Porters,
>
> This is a long email which ends up with me mostly spitballing an idea or
> two about how to improve our handling of source code encoding.  Sorry?
>

https://github.com/Perl/perl5/issues/11334 is affected by this proposal
Re: tightening up source code encoding semantics [ In reply to ]
On Sat, Feb 26, 2022, at 23:56, Karl Williamson wrote:
> [ things about how automatic detection could work ]

I will restate, tersely, what I think Karl said. I hope Karl can then say "yes, that's right [or close enough]" or "no."
* if the choices are Latin-1 or UTF-8, it is possible to predict with high confidence which one a line of input is
* we can use this to avoid having to declare the encoding
* if encoding is declared, and is at odds with what is detected, a warning (or error) could be issued
So, first off: is that about right?

Next: I think this still requires that the program says "my source should be decoded at all". I *do* agree with the assertion that we can "guess" whether input is UTF-8 or Latin-1, but that's not the only relevant question. Imagine this program:
#!/usr/bin/perl
use v5.36;
my $str1 = "??????";
say $str1;

Right now, no matter what content is actually in that string literal, the same bytes that were in the source will be sent to stdout. Imagine that we say "We can detect that the string is UTF-8 bytes, so we decode the bytes in the string literal so that $str1 contains the Unicode codepoints encoded in it." When we print that string, we will get a wide string warning, and we will deserve it. This, more or less, is why this proposal ended up existing rather than the previous one to make "use vX" enable utf8.

It was Felipe G., I believe, who said that users would end up more confused when the [lack of] automatic filehandle discipline didn't match the implicit source decoding. I think that claim was correct. I think we'd do users a disservice if we built strings by decoding the source literals based on encoding detection — not because the detection will be wrong, but because right now there is a bytes-in/bytes-out expectation.
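
To make that concrete, here is a minimal sketch of the failure mode, which is the same one you can get today by opting in with "use utf8" and then not setting a handle layer. The string content is only illustrative, and it assumes a terminal expecting UTF-8:
#!/usr/bin/perl
use v5.36;
use utf8;              # declare: the source is UTF-8, so literals are codepoint strings

my $str1 = "αβγ";      # three codepoints (six octets in the file)
say length $str1;      # 3
say $str1;             # warns "Wide character in say": STDOUT has no :encoding layer

binmode STDOUT, ':encoding(UTF-8)';
say $str1;             # encoded on the way out; no warning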

Karl: Please tell me if you think I am way off base, here.

I *do* think this all leads to a more exciting possibility, though!

We *could* automatically detect source encoding, but forbid non-ASCII in string literals without declaration. This would allow non-ASCII syntax freely, but would require users clarify that they know their literals will be decoded into codepoint strings rather than octet strings. (If I wanted to keep banging the "adverbs on quote-like operators" drum, I would say that we could easily do this on a per-literal basis that way.) I think the problem we're seeing here is the conflation of text and buffer types in Perl 5, and I feel like we're finding a nice way to smoosh the lump under the carpet into one place, but I don't think we can eliminate it just yet.

--
rjbs
Re: tightening up source code encoding semantics [ In reply to ]
On Fri, Jun 17, 2022 at 9:59 PM Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

> On Sat, Feb 26, 2022, at 23:56, Karl Williamson wrote:
>
> [ things about how automatic detection could work ]
>
>
> I will restate, tersely, what I think Karl said. I hope Karl can then say
> "yes, that's right [or close enough]" or "no."
>
> - if the choices are Latin-1 or UTF-8, it is possible to predict with
> high confidence which one a line of input is
> - we can use this to avoid having to declare the encoding
> - if encoding is declared, and is at odds with what is detected, a
> warning (or error) could be issued
>
> So, first off: is that about right?
>
> Next: I think this still requires that the program says "my source should
> be decoded at all". I *do* agree with the assertion that we can "guess"
> whether input is UTF-8 or Latin-1, but that's not the only relevant
> question. Imagine this program:
>
> #!/usr/bin/perl
> use v5.36;
> my $str1 = "??????";
> say $str1;
>
>
> Right now, no matter what content is actually in that string literal, the
> same bytes that were in the source will be sent to stdout. Imagine that we
> say "We can detect that the string is UTF-8 bytes, so we decode the bytes
> in the string literal so that $str1 contains the Unicode codepoints encoded
> in it." When we print that string, we will get a wide string warning, and
> we will deserve it. This, more or less, is why this proposal ended up
> existing rather than the previous one to make "use vX" enable utf8.
>
> It was Felipe G., I believe, who said that users would end up more
> confused when the [lack of] automatic filehandle discipline didn't match
> the implicit source decoding. I think that claim was correct. I think
> we'd do users a disservice if we built strings by decoding the source
> literals based on encoding detection — not because the detection will be
> wrong, but because right now there is a bytes-in/bytes-out expectation.
>
> Karl: Please tell me if you think I am way off base, here.
>
> I *do* think this all leads to a more exciting possibility, though!
>
> We *could* automatically detect source encoding, but forbid non-ASCII in
> string literals without declaration. This would allow non-ASCII syntax
> freely, but would require users clarify that they know their literals will
> be decoded into codepoint strings rather than octet strings. (If I wanted
> to keep banging the "adverbs on quote-like operators" drum, I would say
> that we could easily do this on a per-literal basis that way.) I think the
> problem we're seeing here is the conflation of text and buffer types in
> Perl 5, and I feel like we're finding a nice way to smoosh the lump under
> the carpet into one place, but I don't think we can eliminate it just yet.
>

Due to the wide variety of uses for bytes in source code, I continue to
think any attempt at autodetection that would change the behavior of the
program is a mistake.

-Dan
Re: tightening up source code encoding semantics [ In reply to ]
On 6/17/22 21:58, Ricardo Signes wrote:

>
> Next:  I think this still requires that the program says "my source
> should be decoded at all".

Should there be a "not" after "should" in the above?
Re: tightening up source code encoding semantics [ In reply to ]
On Sat, Jun 18, 2022, at 06:58, James E Keenan wrote:
> On 6/17/22 21:58, Ricardo Signes wrote:
>
> > Next: I think this still requires that the program says "my source should be decoded at all".
>
> Should there be a "not" after "should" in the above?

No.

Right now, if you have a literal string which, in the source, is UTF-8 encoded text, the string in perl land will be the UTF-8 bytes. If we want it to instead be the codepoints those bytes encode, this should be declared.
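
A minimal sketch of the distinction (it assumes the snippet is saved in a file encoded as UTF-8, and the é is just an example character):
use strict;
use warnings;

my $octets = "é";              # no declaration yet: the two octets 0xC3 0xA9
print length($octets), "\n";   # 2

use utf8;                      # from here on, the parser treats the source as UTF-8
my $chars = "é";               # one codepoint, U+00E9
print length($chars), "\n";    # 1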

--
rjbs
Re: tightening up source code encoding semantics [ In reply to ]
On 6/17/22 19:58, Ricardo Signes wrote:
> On Sat, Feb 26, 2022, at 23:56, Karl Williamson wrote:
>> [ things about how automatic detection could work ]
>
> I will restate, tersely, what I think Karl said.  I hope Karl can then
> say "yes, that's right [or close enough]" or "no."
>
> * if the choices are Latin-1 or UTF-8, it is possible to predict with
> high confidence which one a line of input is
> * we can use this to avoid having to declare the encoding
> * if encoding is declared, and is at odds with what is detected, a
> warning (or error) could be issued
>
> So, first off: is that about right?

Yes. But we could also issue a warning if no encoding is declared and
we decided that it is utf8, so that any time the current behavior would
change, a warning would be raised.
>
> Next:  I think this still requires that the program says "my source
> should be decoded at all".  I /do/ agree with the assertion that we can
> "guess" whether input is UTF-8 or Latin-1, but that's not the only
> relevant question.  Imagine this program:
>
> #!/usr/bin/perl
> use v5.36;
> my $str1 = "??????";
> say $str1;
>
>
> Right now, no matter what content is actually in that string literal,
> the same bytes that were in the source will be sent to stdout.  Imagine
> that we say "We can detect that the string is UTF-8 bytes, so we decode
> the bytes in the string literal so that $str1 contains the Unicode
> codepoints encoded in it."  When we print that string, we will get a
> wide string warning, and we will deserve it.  This, more or less, is why
> this proposal ended up existing rather than the previous one to make
> "use vX" enable utf8.
>
> It was Felipe G., I believe, who said that users would end up more
> confused when the [lack of] automatic filehandle discipline didn't match
> the implicit source decoding.  I think that claim was correct.  I think
> we'd do users a disservice if we built strings by decoding the source
> literals based on encoding detection — not because the detection will be
> wrong, but because right now there is a bytes-in/bytes-out expectation.
>
> Karl:  Please tell me if you think I am way off base, here.

I think I finally understand the issue here; and no, you're on base.

But I will beat the drum again against ever using the word 'decode' or
its variants. It is impossible to decode. Everything is always
encoded as something. You can switch encodings, but you can't decode.
I suppose it's clear if you say decode to X. But it doesn't make sense
to decode to an encoding. I presume that what is meant is to decode to
Perl's internal format, but Perl has multiple different internal
formats. So when people use the word 'decode', I don't know what they
actually mean. And I suspect they don't either.
>
> I /do/ think this all leads to a more exciting possibility, though!
>
> We /could/ automatically detect source encoding, but forbid non-ASCII in
> string literals without declaration.  This would allow non-ASCII syntax
> freely, but would require users clarify that they know their literals
> will be decoded into codepoint strings rather than octet strings.  (If I
> wanted to keep banging the "adverbs on quote-like operators" drum, I
> would say that we could easily do this on a per-literal basis that
> way.)  I think the problem we're seeing here is the conflation of text
> and buffer types in Perl 5, and I feel like we're finding a nice way to
> smoosh the lump under the carpet into one place, but I don't think we
> can eliminate it just yet.

This sounds reasonable to me.
>
> --
> rjbs
Re: tightening up source code encoding semantics [ In reply to ]
On Tue, Jun 21, 2022, at 12:17, Karl Williamson wrote:
> > Karl: Please tell me if you think I am way off base, here.
>
> I think I finally understand the issue here; and no you're on base.

Okay, good. I will revisit the proposal and try to make sure we get to something we all/both think is good.

> But I will beat the drum again against ever using the word 'decode' or its variants. It is impossible to decode.

Because I value easy communication with you, I will try to avoid it. But I think it's often clear to me what it means: the "decode" operation maps from a sequence of bytes to a sequence of codepoints.

Obviously the codepoints have to be represented in the computer memory as bytes, but logically in the program they are now treated as codepoints. So when I say "should we decode the source?" I mean "should the compiler decode the source text so that the variables formed out of its literals are codepoint sequences rather than byte sequences."

I don't think people mean "to decode is to transcode into Perl's internal byte format." They mean "to decode is to transform from a byte sequence (in a known encoding and repertoire) to a codepoint sequence." (Well, some people mean that. Some people don't know what they mean. That's just how people are…)
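
In Encode terms, the operation I mean is just this (a trivial sketch):
use Encode qw(decode);

my $octets = "\xC3\xA9";                 # two bytes: the UTF-8 encoding of U+00E9
my $text   = decode('UTF-8', $octets);   # one codepoint: "\x{E9}"

printf "%d octets in, %d codepoint out\n", length($octets), length($text);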

So: while I don't think it's meaningless or impossible to talk about, I will gladly concede that you find it distracting, and try to be more verbose.

--
rjbs
Re: tightening up source code encoding semantics [ In reply to ]
Porters,

Okay, I'm replying to this thread, but sort of starting anew.

What we want to avoid:
* runtime encoding bugs that could be compile time
* more boilerplate than should be necessary
* breaking old code (apart from the quite outré)
* being locked into ASCII "plus maybe Latin-1" forever for the non-literal source code
We've been through a lot of options, which I will not recount here (sorry). One key pair of points:
* we can with high confidence know whether a document is UTF-8 (a rough sketch of such a check follows these two points)
* we can't know what existing programs do with strings, so we can't detect encoding bugs at compile time (or, really, even at run time)
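
This is not the proposed implementation, just a rough sketch of what that first point could mean in code (octets that decode as strict UTF-8 and contain at least one non-ASCII byte are called UTF-8; anything that fails strict decoding is called Latin-1; everything else is plain ASCII):
use Encode ();

sub guess_source_encoding {
    my ($octets) = @_;
    my $copy = $octets;    # decoding with FB_CROAK may modify the buffer it is given
    my $is_utf8 = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    return 'Latin-1' if !$is_utf8;

    # NB: some Latin-1 texts also happen to be valid UTF-8, which is why this
    # is "high confidence" rather than certainty.
    return $octets =~ /[^\x00-\x7F]/ ? 'UTF-8' : 'ASCII';
}
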
So, here's my new run at the problem (don't reply until you get to the end of this email, okay?):
* the goal state is that Perl programs are always encoded as UTF-8 text files
* literal strings are byte strings (sequences of the octets found in the source document)
* under "use utf8", literal strings are text strings (sequences of codepoints represented by the octets in the source)
* because the source document must be valid Unicode text, a source document of the bytes \x22 \xFF \x22 is *not* legal, because it is not legal UTF-8
* right now, the default is that any byte sequence is legal in the source document, so the program \x22\xFF\x22 is legal, and produces a string whose only element is chr(\xFF) (demonstrated in the sketch just after this list)
* this should be rejected at read time, because the source document should be UTF-8
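
Here is the sketch referenced above: today's default behavior, demonstrated with a string eval so that the raw 0xFF byte can be embedded without saving a separate file.
use strict;
use warnings;

my $source = qq{"\xFF"};      # the three bytes \x22 \xFF \x22, as a piece of source text
my $value  = eval $source;    # legal today: the source is read as bytes
printf "length %d, ord 0x%X\n", length($value), ord($value);   # length 1, ord 0xFF
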
I'm going to stop the bullet list there. Here's what I think: I am describing what, I think, is the *right* state of affairs for Perl 5 if we don't introduce a Str v. Buf type distinction. On the other hand, I imagine the road to getting there: programs saved as Latin-1 encoded files with non-ASCII literals (to say nothing of variable names) need to be warned about, re-encoded, and so on. Is this worth it? I don't know, and I don't know the extent of the work required, but it feels like "surely a bunch."

*So I want to go back toward the original proposal.*

We should have something like "use ascii;" that says "this source code must be entirely in ASCII". If you say "use utf8", it overrides "use ascii". We aim to turn "use ascii" on in v5.x.0. Then users can't accidentally write in an undeclared encoding. You can't wonder what the behavior of `"????"` is, octet/codepoint-wise, because it is a compile error unless you declared "use utf8". You can't declare "It's UTF-8, including non-ASCII codepoints in the source, but the literal strings are octet strings." Too bad. We could have a qb{...} someday.
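
To be clear about what "use ascii" would enforce, here is a rough standalone sketch of the same check, scanning a file for any non-ASCII byte. The pragma itself does not exist; this is only the rule it would apply:
use strict;
use warnings;

my $file = shift // $0;
open my $fh, '<:raw', $file or die "can't open $file: $!";
while (my $line = <$fh>) {
    if ($line =~ /([^\x00-\x7F])/) {
        die sprintf "%s: non-ASCII byte 0x%02X on line %d; declare 'use utf8' or use an escape\n",
            $file, ord($1), $.;
    }
}
print "$file is pure ASCII\n";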

I think we have various technical means to provide a "Perl, but with coherent source code encoding semantics" better than that, but I don't think we have the will, and I don't know whether we *should*, relative to our "don't break running code" goals.

So: should we have "use ascii", or something else, or nothing?

--
rjbs
Re: tightening up source code encoding semantics [ In reply to ]
On 2022-7-16 1:10, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

I'm going to stop the bullet list there. Here's what I think: I am
> describing what, I think, is the *right* state of affairs for Perl 5 if
> we don't introduce a Str v. Buf type distinction. On the other hand, I
> imagine the road to getting there: programs saved as Latin-1 encoded files
> with non-ASCII literals (to say nothing of variable names) need to be
> warned about, re-encoded, and so on. Is this worth it? I don't know, and
> I don't know the extent of the work required, but it feels like "surely a
> bunch."
>

I think the user whose code needs to be re-encoded wants to know

1. "What code needs to be changed?"
2. "How do I convert the code?"
3. "Is there a tool that can convert automatically?"