Mailing List Archive

tightening up source code encoding semantics
Porters,

This is a long email which ends up with me mostly spitballing an idea or two about how to improve our handling of source code encoding. Sorry?

I've been talking with Karl about source::encoding, utf8, and related topics. We got talking about whether "no source::encoding" made sense. Meanwhile, Paul was posting about disallowing downgrade from utf8. Then Karl asked about bytes.pm.

I think the whole situation could do with another round of "Yeah, but what would the best world be?" I will start by saying, "In the best world, bytes.pm would not exist." But it does, and I think we can generally allow it to continue to … do what it does. I will not refer to bytes.pm again in this email.

The big question is, how are we to allow Perl source code to be encoded? I think there are a few options worth mentioning:
* ASCII only, all non-ASCII must be represented by escape sequences
* UTF-8 only, all non-ASCII data must be represented by escape sequences
* bytes only, all bytes read from source become the corresponding codepoint in the source (this is sometimes described as "It's Latin-1 by default", which has been a contentious claim over the years)
* a mixture of bytes and UTF-8
The option we've given, for years, is the last one. We start in bytes mode. "use utf8" indicates that the source document is in UTF-8. When utf8 leaves effect, either because its scope ends or because of "no utf8", we return to bytes mode. This is pretty terrible, in my opinion. What's one's editor to make of this?
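
To make that concrete, here's a minimal sketch (my own illustration, assuming the file itself is saved as UTF-8, with a hypothetical literal):

my $a = "café";      # bytes mode: the two bytes C3 A9 are two codepoints, length 5
{
    use utf8;
    my $b = "café";  # the same two bytes now decode to the single codepoint U+00E9
}
my $c = "café";      # scope ended, back to bytes mode: two codepoints again

The same byte sequence means different things depending on where in the file it appears.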

If we imagine that the reader can correctly swap between reading bytes and UTF-8 at scope boundaries (which I think I've seen recent evidence that it cannot reliably do), this may be a technically sustainable position. I think it's a *bad* position, though.

"The source is bytes" is a bad position and always has been one, with the *possible* exception of string literals. Unfortunately, we have relatively terrible failure modes around non-ASCII outside of string literals.

*Program:*
<<GROß;
foo
GROß

*Output:*
Can't find string terminator "GRO" anywhere before EOF at - line 1.

I think what we really want is to say *either* "This program has stupid legacy behavior" *or* "this program is encoded in UTF-8". Then we want to strongly, *strongly* encourage the second option. You may want to cry out, now, "I thought you said months ago that we wouldn't force everyone to use UTF-8 encoded source!" I am not quite contradicting myself.

Remember, fellow porter, that ASCII encoded data is a subset of UTF-8 encoded data. Once the source is declared to be in UTF-8, it's much less of a problem to say "specifically, entirely codepoints 0-127 except in scopes where that restriction is lifted." I think the problem with "no utf8" is not that it lets you disallow Japanese text, but that it switches back to bytes mode.

The whole thing makes me think that we want source::encoding (or something like it) to say "this document is UTF-8" and optionally "but only ASCII characters." Once that's said, it can't be undone. There is no "no source::encoding", only a switch to ASCII or not. Ideally, this would be the natural state of the program, but given the "the boilerplate should be a single line" doctrine, I think this is what we want implied by "use v5.x".

This gets us back to the "use v5.x should imply ascii encoding", but further to, "and you can't switch it off". I'd say something like:
* you must declare source encoding before any non-ASCII byte is encountered
* you must declare source encoding at the outermost lexical scope in a file, if you are to declare it at all
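
A sketch of how those two rules would play out (illustrative only; the pragma spelling is whatever we end up with):

# fine: declared at the outermost scope, before any non-ASCII byte
use source::encoding 'utf8';

sub demo {
    # not fine under rule two: declared inside a nested scope
    use source::encoding 'ascii';
    ...
}
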
--
rjbs
Re: tightening up source code encoding semantics [ In reply to ]
On Mon, Feb 21, 2022 at 9:56 PM Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

> Porters,
>
> This is a long email which ends up with me mostly spitballing an idea or
> two about how to improve our handling of source code encoding. Sorry?
>
> I've been talking with Karl about source::encoding, utf8, and related
> topics. We got talking about whether "no source::encoding" made sense.
> Meanwhile, Paul was posting about disallowing downgrade from utf8. Then
> Karl asked about bytes.pm.
>
> I think the whole situation could do with another round of "Yeah, but what
> would the best world be?" I will start by saying, "In the best world,
> bytes.pm would not exist." But it does, and I think we can generally
> allow it to continue to … do what it does. I will not refer to bytes.pm
> again in this email.
>
> The big question is, how are we to allow Perl source code to be encoded?
> I think there are a few options worth mentioning:
>
> - ASCII only, all non-ASCII must be represented by escape sequences
> - UTF-8 only, all non-ASCII data must be represented by escape
> sequences
> - bytes only, all bytes read from source become the corresponding
> codepoint in the source (this is sometimes described as "It's Latin-1 by
> default", which has been a contentious claim over the years)
> - a mixture of bytes and UTF-8
>
>
Is the second one meant to just be "UTF-8 only" with no caveats? UTF-8
without non-ASCII data is ... just ASCII.

-Dan
Re: tightening up source code encoding semantics [ In reply to ]
Op 22-02-2022 om 03:55 schreef Ricardo Signes:

> The whole thing makes me think that we want source::encoding (or
> something like it) to say "this document is UTF-8" and optionally "but
> only ASCII characters."  Once that's said, it can't be undone.  There
> is no "no source::encoding", only a switch to ASCII or not.  Ideally,
> this would be the natural state of the program, but given the "the
> boilerplate should be a single line" doctrine, I think this is what we
> want implied by "use v5.x".
>

I would not call it "UTF8 but only ASCII characters". That is confusing
as hell. Call it what it is, ASCII.


> you must declare source encoding before any non-ASCII byte is encountered
>
> * you must declare source encoding at the outermost lexical scope in
> a file, if you are to declare it at all
>
>

This makes so much sense. Would it make sense to continue the current
behaviour, but print a deprecation warning when non-ASCII is encountered
without an explicit or implicit 'use utf8'? (Implicit, because this
opens the way, for instance, for use v5.040 to imply use utf8.)


But how much existing code does it break? That is still the big problem
with this proposal: it breaks currently perfectly valid Perl code. Or
did I miss something?


M4
Re: tightening up source code encoding semantics [ In reply to ]
Hi there,

On Mon, 21 Feb 2022, Ricardo Signes wrote:

> ... Once the source is declared to be in UTF-8, it's much less of a
> problem to say "specifically, entirely codepoints 0-127 except in
> scopes where that restriction is lifted."

I'd be pleased to be able to do that. Maybe something like the

use stricter;

that I mentioned?

> ... I think the problem with "no utf8" is not that it lets you
> disallow Japanese text, but that it switches back to bytes mode.

That's awful.

> ...
> This gets us back to the "use v5.x should imply ascii encoding", but
> further to, "and you can't switch it off" ... something like:
>
> * you must declare source encoding before any non-ASCII byte is encountered

I could live with that. I'd be happy to.

> * you must declare source encoding at the outermost lexical scope in
> a file, if you are to declare it at all

I haven't thought through the implications, but I think I like that too.

--

73,
Ged.
Re: tightening up source code encoding semantics [ In reply to ]
> On Feb 21, 2022, at 21:55, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>
> The whole thing makes me think that we want source::encoding (or something like it) to say "this document is UTF-8" and optionally "but only ASCII characters." Once that's said, it can't be undone. There is no "no source::encoding", only a switch to ASCII or not. Ideally, this would be the natural state of the program, but given the "the boilerplate should be a single line" doctrine, I think this is what we want implied by "use v5.x".

IMO “modern Perl” should require source code to be valid UTF-8, full stop. There seems little reason why Rik’s “GROß” example should fail.

Auto-decoding of string literals is a separate, more problematic question. Valid use cases exist either way. If Perl reliably differentiated between decoded/text and non-decoded/byte strings, auto-decode would be a sensible default, but that’s not where we are. As I wrote months ago, auto-decode makes `print "hello"` subtly wrong, which will frustrate a neophyte’s already-thorny first encounter with character encoding in Perl.

For context: cPanel’s internal rule is “all strings are byte strings unless you really need text”. It’s rare that we need Unicode semantics, and forgoing both decode and encode steps all but eliminates that class of bugs for us. (FWIW I think this would actually serve many Perl applications besides cPanel better than the decode/encode workflow.)

Requiring source code to be valid UTF-8, but *not* auto-decoding literals, would solve the “GROß” problem while still avoiding the print-hello-is-subtly-wrong awkwardness.
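
To make the distinction concrete (a sketch of my own, assuming a UTF-8 encoded source file):

my $s = "GROß";    # the parser accepts the bytes 47 52 4F C3 9F
print length $s;   # without auto-decode: 5 -- ß stays as the two bytes C3 9F
                   # with auto-decode (today's "use utf8"): 4 -- ß is the single U+00DF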

-F
Re: tightening up source code encoding semantics [ In reply to ]
On Tue, 22 Feb 2022 08:45:52 -0500, Felipe Gasper <felipe@felipegasper.com> wrote:

> > On Feb 21, 2022, at 21:55, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
> >
> > The whole thing makes me think that we want source::encoding (or
> > something like it) to say "this document is UTF-8" and optionally
> > "but only ASCII characters." Once that's said, it can't be undone.
> > There is no "no source::encoding", only a switch to ASCII or not.
> > Ideally, this would be the natural state of the program, but given
> > the "the boilerplate should be a single line" doctrine, I think
> > this is what we want implied by "use v5.x".
>
> IMO “modern Perl” should require source code to be valid UTF-8, full
> stop.

Respectfully disagree. There are still plenty of environments where
entering UTF-8 in source files is a real problem, and entering
ISO-8859-1 is easy.

> There seems little reason why Rik’s “GROß” example should fail.

What if the .pl is in iso-8859-1? Currently it fails just like Rik's example:

$ cat test.pl
#!env perl

use 5.18.3;
use warnings;

say <<GROß;
Hello Iso-8859-1
GROß

$ dump test.pl
[DUMP 0.6.01]

00000000 23 21 2F 70 72 6F 2F 62 69 6E 2F 70 65 72 6C 0A #!/pro/bin/perl.
00000010 0A 75 73 65 20 35 2E 31 38 2E 33 3B 0A 75 73 65 .use 5.18.3;.use
00000020 20 77 61 72 6E 69 6E 67 73 3B 0A 0A 73 61 79 20 warnings;..say
00000030 3C 3C 47 52 4F DF 3B 0A 48 65 6C 6C 6F 20 49 73 <<GRO.;.Hello Is
00000040 6F 2D 38 38 35 39 2D 31 0A 47 52 4F DF 0A o-8859-1.GRO..

$ perl test.pl
Can't find string terminator "GRO" anywhere before EOF at test.pl line 6.


Anyway, heredoc separators should be quoted (your opinion may differ)

$ cat test.pl
#!env perl

use 5.18.3;
use warnings;

say <<"GROß";
Hello Iso-8859-1
GROß

$ dump test.pl
[DUMP 0.6.01]

00000000 23 21 2F 70 72 6F 2F 62 69 6E 2F 70 65 72 6C 0A #!/pro/bin/perl.
00000010 0A 75 73 65 20 35 2E 31 38 2E 33 3B 0A 75 73 65 .use 5.18.3;.use
00000020 20 77 61 72 6E 69 6E 67 73 3B 0A 0A 73 61 79 20 warnings;..say
00000030 3C 3C 22 47 52 4F DF 22 3B 0A 48 65 6C 6C 6F 20 <<"GRO.";.Hello
00000040 49 73 6F 2D 38 38 35 39 2D 31 0A 47 52 4F DF 0A Iso-8859-1.GRO..

$ perl test.pl
Hello Iso-8859-1


That all said, I would not object to moving to UTF-8, as in almost every
case where I would use this, "\x{..}", "\x{....}", and "\N{.....}"
would be the correct approach.

--
H.Merijn Brand https://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.33 porting perl5 on HP-UX, AIX, and Linux
https://tux.nl/email.html http://qa.perl.org https://www.test-smoke.org
Re: tightening up source code encoding semantics [ In reply to ]
> On Feb 22, 2022, at 09:00, H.Merijn Brand <perl5@tux.freedom.nl> wrote:
>
> On Tue, 22 Feb 2022 08:45:52 -0500, Felipe Gasper <felipe@felipegasper.com> wrote:
>
>>> On Feb 21, 2022, at 21:55, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>>>
>>> The whole thing makes me think that we want source::encoding (or
>>> something like it) to say "this document is UTF-8" and optionally
>>> "but only ASCII characters." Once that's said, it can't be undone.
>>> There is no "no source::encoding", only a switch to ASCII or not.
>>> Ideally, this would be the natural state of the program, but given
>>> the "the boilerplate should be a single line" doctrine, I think
>>> this is what we want implied by "use v5.x".
>>
>> IMO “modern Perl” should require source code to be valid UTF-8, full
>> stop.
>
> Respectfully disagree. There are still plenty environments where
> entering UTF-8 in source files is a real problem, and entering
> ISO-8859-1 is easy.

Sorry, I meant to write: modern Perl should, by *default*, require Perl source code to be valid UTF-8. Perl should, of course, still be able to work in single-byte contexts.

>
> What if the .pl is in iso-8859? Currently it fails just like Rik's example

While I think `<<GROß` in UTF-8 should work, it would surprise me if the benefit from teaching Perl to parse the same heredoc in other encodings as well justified the additional development & maintenance effort.

That said, I’m an Anglophone, so I may not perceive such benefits as readily as, say, a continental European.

> That all said, I would not object to moving to UTF-8 as in almost every
> case where I would use this, "\x{..}", "\x{....}", and "\N{.....}"
> would be the correct approach

I’m not sure what you mean. \N{.....} et al. generally create Unicode strings, not their UTF-8-encoded variants.

I do frequently need to type curly-quotes (“”, and ‘’) via keyboard shortcuts. I’d find it irksome if an official preference were established in Perl for me to write something like: \N{LEFT_CURLY_DOUBLE_QUOTE}Hello\N{RIGHT_CURLY_DOUBLE_QUOTE} (or whatever their real Unicode names are) rather than just option-[, Hello, and shift-option-[. In fact, such would break our Locale::Maketext-based localization tools, which parse source code to extract strings to send to translators.

-F
Re: tightening up source code encoding semantics [ In reply to ]
On Tue, 22 Feb 2022 09:37:14 -0500, Felipe Gasper <felipe@felipegasper.com> wrote:

> > On Feb 22, 2022, at 09:00, H.Merijn Brand <perl5@tux.freedom.nl> wrote:
> >
> > On Tue, 22 Feb 2022 08:45:52 -0500, Felipe Gasper <felipe@felipegasper.com> wrote:
> >
> [...]
> >>
> >> IMO “modern Perl” should require source code to be valid UTF-8, full
> >> stop.
> >
> > Respectfully disagree. There are still plenty environments where
> > entering UTF-8 in source files is a real problem, and entering
> > ISO-8859-1 is easy.
>
> Sorry, I meant to write: modern Perl should, by *default*, require
> Perl source code to be valid UTF-8. Perl should, of course, still be
> able to work in single-byte contexts.
>
> > What if the .pl is in iso-8859? Currently it fails just like Rik's
> > example
>
> While I think `<<GROß>` in UTF-8 should work, it would surprise me if
> the benefit from teaching Perl to parse the same heredoc in other
> encodings as well justified the additional development & maintenance
> effort.
>
> That said, I’m an Anglophone, so I may not perceive such benefits as
> readily as, say, a continental European.
>
> > That all said, I would not object to moving to UTF-8 as in almost
> > every case where I would use this, "\x{..}", "\x{....}", and
> > "\N{.....}" would be the correct approach
>
> I’m not sure what you mean. \N{.....} et al. generally create Unicode
> strings, not their UTF-8-encoded variants.

I personally never use hardcoded non-ASCII characters in perl source
code if I can prevent it. If I need iso-8859-1, I use \x{..} inside
double quotes. If I need utf-8, I use "\x{....}" and/or "\N{....}".
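
For example (illustrative):

my $latin1  = "Gr\x{FC}n";                               # ü without a literal non-ASCII byte
my $unicode = "caf\N{LATIN SMALL LETTER E WITH ACUTE}";  # é by name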

> I do frequently need to type curly-quotes (“”, and ‘’) via keyboard
> shortcuts. I’d find it irksome if an official preference were
> established in Perl for me to write something like:
> \N{LEFT_CURLY_DOUBLE_QUOTE}Hello\N{RIGHT_CURLY_DOUBLE_QUOTE} (or

I *loathe* special quotes in error messages and source code: they are
just a cause of confusion and errors. If a C compiler or other tool tells
me that <special-quote>file-name<special-quote> contains an error,
double-clicking the filename most often includes the quotation, and
these quotation marks are seldom recognized by my shells. And don't tell
me to use a different shell: using "intelligent" quotes in
error messages sucks. Period.

> whatever their real Unicode names are) rather than just option-[,
> Hello, and shift-option-[. In fact, such would break our
> Locale::Maketext-based localization tools, which parse source code to
> extract strings to send to translators.

I know a lot of people love localization, but I try to stay away from
that as much as possible, as it complicates finding the cause of
problems. The original - most often an English - message is much easier
to google than some form of translation.

> -F

--
H.Merijn Brand https://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.33 porting perl5 on HP-UX, AIX, and Linux
https://tux.nl/email.html http://qa.perl.org https://www.test-smoke.org
Re: tightening up source code encoding semantics [ In reply to ]
On Tue, Feb 22, 2022, at 1:55 AM, Dan Book wrote:
> On Mon, Feb 21, 2022 at 9:56 PM Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:__
>> The big question is, how are we to allow Perl source code to be encoded? I think there are a few options worth mentioning:
>> * ASCII only, all non-ASCII must be represented by escape sequences
>> * UTF-8 only, all non-ASCII data must be represented by escape sequences
>> * bytes only, all bytes read from source become the corresponding codepoint in the source (this is sometimes described as "It's Latin-1 by default", which has been a contentious claim over the years)
>> * a mixture of bytes and UTF-8
>
> Is the second one meant to just be "UTF-8 only" with no caveats? UTF-8 without non-ASCII data is ... just ASCII.

The second one was very poorly written; I must've seen something shiny while writing it.

The second one should've been "all non-Unicode or non-textual data", not "non-ASCII data". That is:
my $string = "Queensrÿche"; # source contains UTF-8, $string contains U+00FF
my $buffer = "Queensr\xC3\xBFche"; # string, meant to contain UTF-8 (not text), does

--
rjbs
RE: tightening up source code encoding semantics [ In reply to ]
> * you must declare source encoding before any non-ASCII byte is encountered

Will the following simple programs continue to work, given that "t-u8.pl" contains correct utf-8?

vad@bonitah:~/sdb1$ perl -w
print "??????\n";
??????
vad@bonitah:~/sdb1$ perl -w t-u8.pl
??????
Press any key to continue...
This is a very simple and reasonable thing to do (AKA DWIM)


Re: tightening up source code encoding semantics [ In reply to ]
On Tue, Feb 22, 2022, at 12:41 PM, Konovalov, Vadim wrote:
> * you must declare source encoding before any non-ASCII byte is encountered
> will the following simple programs continue to work, given that “t-u8.pl” contains correct utf-8?
> vad@bonitah:~/sdb1$ perl -w
> print "??????\n";
> ??????
> vad@bonitah:~/sdb1$ perl -w t-u8.pl
> ??????
> Press any key to continue...
> This is a very simple and reasonable thing to do (AKA DWIM)

Yes. I think I should've been clearer:
* If source encoding is declared at all, it must be before the first non-ASCII byte.

In my imaginary world, this program is okay:
say "?? ????????";

This program is okay:
use source::encoding 'utf8';
say "?? ????????";

This program is not:
say "??????";
use source::encoding 'utf8';
say "?? ????????";

--
rjbs
Re: tightening up source code encoding semantics [ In reply to ]
On Tue, Feb 22, 2022 at 4:30 PM Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

> On Tue, Feb 22, 2022, at 12:41 PM, Konovalov, Vadim wrote:
>
>
> - you must declare source encoding before any non-ASCII byte is
> encountered
>
> will the following simple programs continue to work, given that “t-u8.pl”
> contains correct utf-8?
>
> vad@bonitah:~/sdb1$ perl -w
>
> print "??????\n";
>
> ??????
>
> vad@bonitah:~/sdb1$ perl -w t-u8.pl
>
> ??????
>
> Press any key to continue...
>
> This is a very simple and reasonable thing to do (AKA DWIM)
>
>
> Yes. I think I should've been clearer:
>
> - If source encoding is declared at all, it must be before the first
> non-ASCII byte.
>
>
> In my imaginary world, this program is okay:
>
> say "?? ????????";
>
>
> This program is okay:
>
> use source::encoding 'utf8';
> say "?? ????????";
>
>
> This program is not:
>
> say "??????";
> use source::encoding 'utf8';
> say "?? ????????";
>
>
>
Sounds very reasonable.

-Dan
Re: tightening up source code encoding semantics [ In reply to ]
On Tue, 22 Feb 2022 16:30:04 -0500, "Ricardo Signes" <perl.p5p@rjbs.manxome.org> wrote:

> On Tue, Feb 22, 2022, at 12:41 PM, Konovalov, Vadim wrote:
> > * you must declare source encoding before any non-ASCII byte is encountered
> > will the following simple programs continue to work, given that “t-u8.pl” contains correct utf-8?
> > vad@bonitah:~/sdb1$ perl -w
> > print "??????\n";
> > ??????
> > vad@bonitah:~/sdb1$ perl -w t-u8.pl
> > ??????
> > Press any key to continue...
> > This is a very simple and reasonable thing to do (AKA DWIM)
>
> Yes. I think I should've been clearer:
> * If source encoding is declared at all, it must be before the first non-ASCII byte.

+1

> In my imaginary world, this program is okay:
> say "?? ????????";
>
> This program is okay:
> use source::encoding 'utf8';
> say "?? ????????";
>
> This program is not:
> say "??????";
> use source::encoding 'utf8';
> say "?? ????????";

--
H.Merijn Brand https://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.33 porting perl5 on HP-UX, AIX, and Linux
https://tux.nl/email.html http://qa.perl.org https://www.test-smoke.org
Re: tightening up source code encoding semantics [ In reply to ]
On 2/21/22 19:55, Ricardo Signes wrote:
> Porters,
>
> This is a long email which ends up with me mostly spitballing an idea or
> two about how to improve our handling of source code encoding.  Sorry?
>
> I've been talking with Karl about source::encoding, utf8, and related
> topics.  We got talking about whether "no source::encoding" made sense.
> Meanwhile, Paul was posting about disallowing downgrade from utf8.  Then
> Karl asked about bytes.pm.
>
> I think the whole situation could do with another round of "Yeah, but
> what would the best world be?"  I will start by saying, "In the best
> world, bytes.pm would not exist."  But it does, and I think we can
> generally allow it to continue to … do what it does.  I will not refer
> to bytes.pm again in this email.
>
> The big question is, how are we to allow Perl source code to be
> encoded?  I think there are a few options worth mentioning:
>
> * ASCII only, all non-ASCII must be represented by escape sequences
> * UTF-8 only, all non-ASCII data must be represented by escape sequences
> * bytes only, all bytes read from source become the corresponding
> codepoint in the source (this is sometimes described as "It's
> Latin-1 by default", which has been a contentious claim over the years)
> * a mixture of bytes and UTF-8
>
> The option we've given, for years, is the last one.  We start in bytes
> mode.  "use utf8" indicates that the source document is in UTF-8.  When
> utf8 leaves effect, either because its scope ends or because of "no
> utf8", we return to bytes mode.  This is pretty terrible, in my
> opinion.  What's one's editor to make of this?
>
> If we imagine that the reader can correctly swap between reading bytes
> and UTF-8 at scope boundaries (which I think I've seen recent evidence
> that it cannot reliably do), this may be a technically sustainable
> position.  I think it's a /bad/ position, though.
>
> "The source is bytes" is a bad position and always has been one, with
> the /possible/ exception of string literals.  Unfortunately, we have
> relatively terrible failure modes around non-ASCII outside of string
> literals.
>
> *Program:*
>
> <<GROß;
> foo
> GROß
>
>
> *Output:*
>
> Can't find string terminator "GRO" anywhere before EOF at - line 1.
>
>
> I think what we really want is to say /either/ "This program has stupid
> legacy behavior" /or/ "this program is encoded in UTF-8".  Then we want
> to strongly, /strongly/ encourage the second option.  You may want to
> cry out, now, "I thought you said months ago that we wouldn't force
> everyone to use UTF-8 encoded source!"  I am not quite contradicting myself.
>
> Remember, fellow porter, that ASCII encoded data is a subset of UTF-8
> encoded data.  Once the source is declared to be in UTF-8, it's much
> less of a problem to say "specifically, entirely codepoints 0-127 except
> in scopes where that restriction is lifted."  I think the problem with
> "no utf8" is not that it lets you disallow Japanese text, but that it
> switches back to bytes mode.
>
> The whole thing makes me think that we want source::encoding (or
> something like it) to say "this document is UTF-8" and optionally "but
> only ASCII characters."  Once that's said, it can't be undone.  There is
> no "no source::encoding", only a switch to ASCII or not.  Ideally, this
> would be the natural state of the program, but given the "the
> boilerplate should be a single line" doctrine, I think this is what we
> want implied by "use v5.x".
>
> This gets us back to the "use v5.x should imply ascii encoding", but
> further to, "and you can't switch it off".  I'd say something like:
>
> * you must declare source encoding before any non-ASCII byte is
> encountered
> * you must declare source encoding at the outermost lexical scope in a
> file, if you are to declare it at all
>
> --
> rjbs
>

An option to think about is that it's possible to pretty reliably guess
the encoding upon encountering the first line containing non-ASCII.
Pod::Simple does this successfully and the choices are UTF-8 vs Windows
CP1252, which is quite a bit harder to distinguish from UTF-8 than our
alternative, Latin1. There have been no reports of problems with its
technique since I beefed it up some years ago.

The confusables for the Latin1 vs UTF-8 case all look like a Latin1
letter or the multiplication sign or division sign, followed by one or
more Latin1 punctuation/symbols or C1 controls. If you look at their
graphics, they all look like mojibake. Hence I'm confident, even
without the Pod::Simple experience, that it is extremely unlikely we
would guess wrong.

Here's how it could work.

You wouldn't need an encoding declaration in your file unless:
1) we guessed wrong (the very unlikely case), or
2) you want to forbid non-ASCII in your file, as the original email
thread discussed.

Absent such a declaration, Perl would parse the file like it does today.
When it encounters the first line containing a non-ASCII, it would
make its guess, and if the guess is UTF-8, raise a warning, if enabled.
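
A minimal sketch of the kind of check I mean (illustrative Perl, not the actual tokenizer code, which would live in C):

# Given the raw bytes of the first line containing non-ASCII,
# guess UTF-8 only if those bytes decode cleanly as UTF-8.
sub looks_like_utf8 {
    my ($raw) = @_;
    return 0 unless $raw =~ /[^\x00-\x7F]/;  # all ASCII: nothing to guess yet
    my $copy = $raw;
    return utf8::decode($copy) ? 1 : 0;      # true only for well-formed UTF-8
}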

'no utf8' would be the way to say "Don't guess UTF-8". It would throw
an error if we had already seen what we took as UTF-8.

'use ascii' (however it is spelled) would cause an error to be thrown if
a non-ASCII is encountered within its scope.

'no ascii' would be a no-op outside the scope of 'use ascii'. Otherwise
it would restore the behavior to whatever it was when the 'use ascii'
was encountered.

I believe the only existing programs this scenario would affect are ones
that (most likely, unsafely) mix UTF-8 and Latin1.

An advantage is that a 'use utf8' would no longer be required in almost
all circumstances.
Re: tightening up source code encoding semantics [ In reply to ]
On Wed, Feb 23, 2022 at 9:18 AM Karl Williamson <public@khwilliamson.com>
wrote:

> On 2/21/22 19:55, Ricardo Signes wrote:
> > Porters,
> >
> > This is a long email which ends up with me mostly spitballing an idea or
> > two about how to improve our handling of source code encoding. Sorry?
> >
> > I've been talking with Karl about source::encoding, utf8, and related
> > topics. We got talking about whether "no source::encoding" made sense.
> > Meanwhile, Paul was posting about disallowing downgrade from utf8. Then
> > Karl asked about bytes.pm.
> >
> > I think the whole situation could do with another round of "Yeah, but
> > what would the best world be?" I will start by saying, "In the best
> > world, bytes.pm would not exist." But it does, and I think we can
> > generally allow it to continue to … do what it does. I will not refer
> > to bytes.pm again in this email.
> >
> > The big question is, how are we to allow Perl source code to be
> > encoded? I think there are a few options worth mentioning:
> >
> > * ASCII only, all non-ASCII must be represented by escape sequences
> > * UTF-8 only, all non-ASCII data must be represented by escape
> sequences
> > * bytes only, all bytes read from source become the corresponding
> > codepoint in the source (this is sometimes described as "It's
> > Latin-1 by default", which has been a contentious claim over the
> years)
> > * a mixture of bytes and UTF-8
> >
> > The option we've given, for years, is the last one. We start in bytes
> > mode. "use utf8" indicates that the source document is in UTF-8. When
> > utf8 leaves effect, either because its scope ends or because of "no
> > utf8", we return to bytes mode. This is pretty terrible, in my
> > opinion. What's one's editor to make of this?
> >
> > If we imagine that the reader can correctly swap between reading bytes
> > and UTF-8 at scope boundaries (which I think I've seen recent evidence
> > that it cannot reliably do), this may be a technically sustainable
> > position. I think it's a /bad/ position, though.
> >
> > "The source is bytes" is a bad position and always has been one, with
> > the /possible/ exception of string literals. Unfortunately, we have
> > relatively terrible failure modes around non-ASCII outside of string
> > literals.
> >
> > *Program:*
> >
> > <<GROß;
> > foo
> > GROß
> >
> >
> > *Output:*
> >
> > Can't find string terminator "GRO" anywhere before EOF at - line 1.
> >
> >
> > I think what we really want is to say /either/ "This program has stupid
> > legacy behavior" /or/ "this program is encoded in UTF-8". Then we want
> > to strongly, /strongly/ encourage the second option. You may want to
> > cry out, now, "I thought you said months ago that we wouldn't force
> > everyone to use UTF-8 encoded source!" I am not quite contradicting
> myself.
> >
> > Remember, fellow porter, that ASCII encoded data is a subset of UTF-8
> > encoded data. Once the source is declared to be in UTF-8, it's much
> > less of a problem to say "specifically, entirely codepoints 0-127 except
> > in scopes where that restriction is lifted." I think the problem with
> > "no utf8" is not that it lets you disallow Japanese text, but that it
> > switches back to bytes mode.
> >
> > The whole thing makes me think that we want source::encoding (or
> > something like it) to say "this document is UTF-8" and optionally "but
> > only ASCII characters." Once that's said, it can't be undone. There is
> > no "no source::encoding", only a switch to ASCII or not. Ideally, this
> > would be the natural state of the program, but given the "the
> > boilerplate should be a single line" doctrine, I think this is what we
> > want implied by "use v5.x".
> >
> > This gets us back to the "use v5.x should imply ascii encoding", but
> > further to, "and you can't switch it off". I'd say something like:
> >
> > * you must declare source encoding before any non-ASCII byte is
> > encountered
> > * you must declare source encoding at the outermost lexical scope in a
> > file, if you are to declare it at all
> >
> > --
> > rjbs
> >
>
> An option to think about is that it's possible to pretty reliably guess
> the encoding upon encountering the first line containing non-ASCII.
> Pod::Simple does this successfully and the choices are UTF-8 vs Windows
> CP1252, which is quite a bit harder to distinguish from UTF-8 than our
> alternative, Latin1. There have been no reports of problems with its
> technique since I beefed it up some years ago.
>
> The confusables for the Latin1 vs UTF-8 case all look like a Latin1
> letter or the multiplication sign or division sign, followed by one or
> more Latin1 punctuation/symbols or C1 controls. If you look at their
> graphics, they all look like mojibake. Hence I'm confident, even
> without the Pod::Simple experience, that it is extremely unlikely we
> would guess wrong.
>
> Here's how it could work.
>
> You wouldn't need an encoding declaration in your file unless
> 1) the very unlikely case where we guessed wrong
> 2) you want to forbid non-ASCII in your file, as the original email
> thread discussed.
>
> Absent such a declaration, Perl would parse the file like it does today.
> When it encounters the first line containing a non-ASCII, it would
> make its guess, and if the guess is UTF-8, raise a warning, if enabled.
>
> 'no utf8' would be the way to say "Don't guess UTF-8'. It would throw
> an error if we had already seen what we took as UTF-8.
>
> 'use ascii' (however it is spelled) would cause an error to be thrown if
> a non-ASCII is encountered within its scope.
>
> 'no ascii' would be a no-op outside the scope of 'use ascii'. Otherwise
> it would restore the behavior to whatever it was when the 'use ascii'
> was encountered.
>
> I believe the only existing programs this scenario would effect are ones
> that (most likely, unsafely) mix UTF-8 and Latin1.
>
> An advantage is that a 'use utf8' would no longer be required in almost
> all circumstances.
>

In my opinion, changing program behavior based on a pretty reliable guess
rather than a declaration would be a mistake.

-Dan
Re: tightening up source code encoding semantics [ In reply to ]
Hi there,

On Wed, 23 Feb 2022, Dan Book wrote:

> In my opinion, changing program behavior based on a pretty reliable guess
> rather than a declaration would be a mistake.

+1

"Pretty reliable" == "rarely experienced problem, difficult to diagnose".

--

73,
Ged.
Re: tightening up source code encoding semantics [ In reply to ]
On Thu, 24 Feb 2022, 00:15 G.W. Haywood via perl5-porters, <
perl5-porters@perl.org> wrote:

> Hi there,
>
> On Wed, 23 Feb 2022, Dan Book wrote:
>
> > In my opinion, changing program behavior based on a pretty reliable guess
> > rather than a declaration would be a mistake.
>
> +1
>
> "Pretty reliable" == "rarely experienced problem, difficult to diagnose".
>

Yeah, I don't know that I agree. Karl is an expert in these matters, and
pretty conservative about his opinions. If he says pretty reliable I would
assume he is being conservative and actually means "very robust" and not
argue unless I had clear data to contradict him. I would not base my
rejection on opinions and intuition. It's perfectly possible that what he
proposes is reliable and capable of detecting things that could cause
trouble.

I'd like to hear more before we just dismiss his proposal.

Yves
Re: tightening up source code encoding semantics [ In reply to ]
2022-2-23 23:18 Karl Williamson <public@khwilliamson.com> wrote:

> On 2/21/22 19:55, Ricardo Signes wrote:
> > Porters,
> >
> > This is a long email which ends up with me mostly spitballing an idea or
> > two about how to improve our handling of source code encoding. Sorry?
> >
> > I've been talking with Karl about source::encoding, utf8, and related
> > topics. We got talking about whether "no source::encoding" made sense.
> > Meanwhile, Paul was posting about disallowing downgrade from utf8. Then
> > Karl asked about bytes.pm.
> >
> > I think the whole situation could do with another round of "Yeah, but
> > what would the best world be?" I will start by saying, "In the best
> > world, bytes.pm would not exist." But it does, and I think we can
> > generally allow it to continue to … do what it does. I will not refer
> > to bytes.pm again in this email.
> >
> > The big question is, how are we to allow Perl source code to be
> > encoded? I think there are a few options worth mentioning:
> >
> > * ASCII only, all non-ASCII must be represented by escape sequences
> > * UTF-8 only, all non-ASCII data must be represented by escape
> sequences
> > * bytes only, all bytes read from source become the corresponding
> > codepoint in the source (this is sometimes described as "It's
> > Latin-1 by default", which has been a contentious claim over the
> years)
> > * a mixture of bytes and UTF-8
> >
> > The option we've given, for years, is the last one. We start in bytes
> > mode. "use utf8" indicates that the source document is in UTF-8. When
> > utf8 leaves effect, either because its scope ends or because of "no
> > utf8", we return to bytes mode. This is pretty terrible, in my
> > opinion. What's one's editor to make of this?
> >
> > If we imagine that the reader can correctly swap between reading bytes
> > and UTF-8 at scope boundaries (which I think I've seen recent evidence
> > that it cannot reliably do), this may be a technically sustainable
> > position. I think it's a /bad/ position, though.
> >
> > "The source is bytes" is a bad position and always has been one, with
> > the /possible/ exception of string literals. Unfortunately, we have
> > relatively terrible failure modes around non-ASCII outside of string
> > literals.
> >
> > *Program:*
> >
> > <<GROß;
> > foo
> > GROß
> >
> >
> > *Output:*
> >
> > Can't find string terminator "GRO" anywhere before EOF at - line 1.
> >
> >
> > I think what we really want is to say /either/ "This program has stupid
> > legacy behavior" /or/ "this program is encoded in UTF-8". Then we want
> > to strongly, /strongly/ encourage the second option. You may want to
> > cry out, now, "I thought you said months ago that we wouldn't force
> > everyone to use UTF-8 encoded source!" I am not quite contradicting
> myself.
> >
> > Remember, fellow porter, that ASCII encoded data is a subset of UTF-8
> > encoded data. Once the source is declared to be in UTF-8, it's much
> > less of a problem to say "specifically, entirely codepoints 0-127 except
> > in scopes where that restriction is lifted." I think the problem with
> > "no utf8" is not that it lets you disallow Japanese text, but that it
> > switches back to bytes mode.
> >
> > The whole thing makes me think that we want source::encoding (or
> > something like it) to say "this document is UTF-8" and optionally "but
> > only ASCII characters." Once that's said, it can't be undone. There is
> > no "no source::encoding", only a switch to ASCII or not. Ideally, this
> > would be the natural state of the program, but given the "the
> > boilerplate should be a single line" doctrine, I think this is what we
> > want implied by "use v5.x".
> >
> > This gets us back to the "use v5.x should imply ascii encoding", but
> > further to, "and you can't switch it off". I'd say something like:
> >
> > * you must declare source encoding before any non-ASCII byte is
> > encountered
> > * you must declare source encoding at the outermost lexical scope in a
> > file, if you are to declare it at all
> >
> > --
> > rjbs
> >
>
> An option to think about is that it's possible to pretty reliably guess
> the encoding upon encountering the first line containing non-ASCII.
> Pod::Simple does this successfully and the choices are UTF-8 vs Windows
> CP1252, which is quite a bit harder to distinguish from UTF-8 than our
> alternative, Latin1. There have been no reports of problems with its
> technique since I beefed it up some years ago.
>
> The confusables for the Latin1 vs UTF-8 case all look like a Latin1
> letter or the multiplication sign or division sign, followed by one or
> more Latin1 punctuation/symbols or C1 controls. If you look at their
> graphics, they all look like mojibake. Hence I'm confident, even
> without the Pod::Simple experience, that it is extremely unlikely we
> would guess wrong.
>
> Here's how it could work.
>
> You wouldn't need an encoding declaration in your file unless
> 1) the very unlikely case where we guessed wrong
> 2) you want to forbid non-ASCII in your file, as the original email
> thread discussed.
>
> Absent such a declaration, Perl would parse the file like it does today.
> When it encounters the first line containing a non-ASCII, it would
> make its guess, and if the guess is UTF-8, raise a warning, if enabled.
>
> 'no utf8' would be the way to say "Don't guess UTF-8'. It would throw
> an error if we had already seen what we took as UTF-8.
>
> 'use ascii' (however it is spelled) would cause an error to be thrown if
> a non-ASCII is encountered within its scope.
>
> 'no ascii' would be a no-op outside the scope of 'use ascii'. Otherwise
> it would restore the behavior to whatever it was when the 'use ascii'
> was encountered.
>
> I believe the only existing programs this scenario would effect are ones
> that (most likely, unsafely) mix UTF-8 and Latin1.
>
> An advantage is that a 'use utf8' would no longer be required in almost
> all circumstances.
>

My understanding:

In most cases, the Perl tokenizer can guess correctly whether the source
code is written in latin1 or UTF-8.

Only source code that mixes UTF-8 and latin-1 can actually cause
problems.

If users want to make sure it is UTF-8, use "use utf8".

If users want to make sure it is ASCII, use "use ascii".
Re: tightening up source code encoding semantics [ In reply to ]
Hi there,

On Thu, 24 Feb 2022, demerphq wrote:
> On Thu, 24 Feb 2022, 00:15 G.W. Haywood via perl5-porters wrote:
>> On Wed, 23 Feb 2022, Dan Book wrote:
>>
>>> In my opinion, changing program behavior based on a pretty reliable guess
>>> rather than a declaration would be a mistake.
>>
>> +1
>>
>> "Pretty reliable" == "rarely experienced problem, difficult to diagnose".
>>
>
> Yeah, I don't know that I agree. Karl is an expert in these matters, and
> pretty conservative about his opinions. If he says pretty reliable I would
> assume he is being conservative and actually means "very robust" and not
> argue unless I had clear data to contradict him. i would not base my
> rejection on opinions and intuition. It's perfectly possible that what he
> proposes is reliable and capable of detecting things that could cause
> trouble.
>
> I'd like to hear more before we just dismiss his proposal.

If we're talking about things with the potential to negatively affect
people all over the planet for years to come, it's either reliable or
it isn't. If an expletive is appropriate here, then it invites more,
later, of the kind more commonly experienced with software issues.

I'm thinking that if it's less likely to get it wrong than it is that,
say, twelve fortuitous cosmic rays will accidentally empty my bank
account, then that will probably do. But probabilities are notoriously
tricky. Guesstimates of hash collision resistance, for example, have
turned out to be significantly overestimated, e.g. by 2^15+ for MD5 and SHA-1.

If we can delete the intensifier, consider me persuaded and move on.

--

73,
Ged.
Re: tightening up source code encoding semantics [ In reply to ]
On 2/23/22 09:14, G.W. Haywood via perl5-porters wrote:
> Hi there,
>
> On Wed, 23 Feb 2022, Dan Book wrote:
>
>> In my opinion, changing program behavior based on a pretty reliable guess
>> rather than a declaration would be a mistake.
>
> +1
>
> "Pretty reliable" == "rarely experienced problem, difficult to diagnose".
>

I believe you guys don't grasp the proposal.

No perl program would silently change behavior from the existing
baseline as a result of this proposal.

Please let that soak in.

The proposal does not lead to hard-to-diagnose issues, because it will
warn at compilation time if it chooses a different interpretation than
the current one.

The advantage it has over the original one in this thread is that it is
more lenient; many fewer programs would have to change as a result; and
in almost all instances 'use utf8' would not be required in a program,
which has been a promise in our documentation for a long time.

I should have used a stronger term than 'pretty reliable'. It is well
known that the likelihood of UTF-8 being confused with most other
encodings goes down quite fast as the number of non-ASCII characters in
a string increases. I have never seen a value claimed, but I wouldn't
be surprised if it were exponential.

The syntax of UTF-8 consists of a start byte consisting of 2 or more
initial 1 bits followed by a 0, and then any pattern of bits. That
means at least the first three bits are fixed. You have 110xxxxx or
1110xxxx or 11110xxx, etc.

The start byte is followed by some number of continuation bytes, each of
which begins with '10', and then any pattern of 6 bits.

But the constraint is that the number of bytes in a single UTF-8
character is the number of leading set bits in its start byte. That
means that a character is constrained both in its bit patterns and in
its length. A following character must be an ASCII one with the leading bit
0, or another sequence of bytes following the constraints I gave above.
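
Spelled out as a pattern over the raw bytes (a simplified sketch; it ignores the overlong and surrogate refinements):

my $utf8_char = qr/
      [\x00-\x7F]                    # ASCII, leading bit 0
    | [\xC2-\xDF] [\x80-\xBF]        # 110xxxxx 10xxxxxx
    | [\xE0-\xEF] [\x80-\xBF]{2}     # 1110xxxx then two continuation bytes
    | [\xF0-\xF4] [\x80-\xBF]{3}     # 11110xxx then three continuation bytes
/x;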

All the bytes, of course, could individually be Latin1 ones. It turns
out that all the possible start bytes are one of:
a) multiplication sign ×
b) division sign ÷
c) Any one of 62 Latin 1 letters like ø

These must be followed by a sequence of 1 or more characters that are C1
controls or Latin 1 punctuation or symbols, like ¥.

A three byte UTF-8 character must end with two Latin1 controls or
symbols in a row.

The programs that are most likely to fool this proposal are those with a
single character that could be either two Latin1 bytes or a single UTF-8
character. Attached is a printout of all such ones that don't involve a
non-printable C1 control. Find some that are at all likely to be the
single character in a file that would confuse the algorithm in the proposal.

Yes, one could have a letter followed by a superscript digit; that makes
some sense. But would it be the only such sequence of bytes in the
file, with every other byte either ASCII or of the same form? It's
very unlikely, but even if that were the case you would be warned at
compilation time.
Re: tightening up source code encoding semantics [ In reply to ]
On Sat, Feb 26, 2022 at 11:57 PM Karl Williamson <public@khwilliamson.com>
wrote:

> On 2/23/22 09:14, G.W. Haywood via perl5-porters wrote:
> > Hi there,
> >
> > On Wed, 23 Feb 2022, Dan Book wrote:
> >
> >> In my opinion, changing program behavior based on a pretty reliable
> guess
> >> rather than a declaration would be a mistake.
> >
> > +1
> >
> > "Pretty reliable" == "rarely experienced problem, difficult to diagnose".
> >
>
> I believe you guys don't grasp the proposal
>
> No perl program would silently change behavior from the existing
> baseline as a result of this proposal.


How would interpreting characters in a different encoding not silently
change behavior?

I concur with the rest of your message but I am familiar with the
reliability of the guess, and still believe it is a mistake in any such
widespread application.

-Dan
Re: tightening up source code encoding semantics [ In reply to ]
On Sun, 27 Feb 2022 at 05:57, Karl Williamson <public@khwilliamson.com>
wrote:

> On 2/23/22 09:14, G.W. Haywood via perl5-porters wrote:
> > Hi there,
> >
> > On Wed, 23 Feb 2022, Dan Book wrote:
> >
> >> In my opinion, changing program behavior based on a pretty reliable
> guess
> >> rather than a declaration would be a mistake.
> >
> > +1
> >
> > "Pretty reliable" == "rarely experienced problem, difficult to diagnose".
> >
>
> I believe you guys don't grasp the proposal
>
> No perl program would silently change behavior from the existing
> baseline as a result of this proposal.
>
> Please let that soak in.
>
> The proposal does not lead to hard to diagnose issues, because it will
> warn at compilation time if it chooses a different interpretation than
> the current one.
>

I think this is a key point. If this logic were to interpret things
in a strange way, the developer would be informed.


>
> The advantage it has over the original one in this thread is it is more
> lenient; many fewer programs would have to change as a result; and in
> almost all instances 'use utf8' would not be required in a program,
> which has been a promise in our documentation for a long time.
>
> I should have used a stronger term than 'pretty reliable'. It is well
> known that the likelihood of UTF-8 being confused with most other
> encodings goes down quite fast as the number of non-ASCII characters in
> a string increases. I have never seen a value claimed, but I wouldn't
> be surprised if it weren't exponential.
>
> The syntax of UTF-8 consists of a start byte consisting of 2 or more
> initial 1 bits followed by a 0, and then any pattern of bits. That
> means at least the first three bits are fixed. You have 110xxxxx or
> 1110xxxx or 11110xxx, etc.
>
> The start byte is followed by some number of continuation bytes, each of
> which begins with '10', and then any pattern of 6 bits.
>
> But the constraint is that the number of bytes in a single UTF-8
> character is the number of leading set bits in its start byte. That
> means that a character is constrained both in its bit patterns, and
> length. A following character must be an ASCII one with the leading bit
> 0, or another sequence of bytes following the constraints I gave above.
>
> All the bytes, of course, could individually be Latin1 ones. It turns
> out that all the possible start bytes are one of:
> a) multiplication sign ×
> b) division sign ÷
> c) Any one of 62 Latin 1 letters like ø
>
> These must be followed by a sequence of 1 or more characters that are C1
> controls or Latin 1 punctuation or symbols, like ¥.
>
> A three byte UTF-8 character must end with two Latin1 controls or
> symbols in a row.
>
> The programs that are most like to fool this proposal are those with
> single character that could be two Latin1 bytes, or a single UTF-8 one.
> Attached is a printout of all such ones that don't involve a
> non-printable C1 control. Find some that at are at all likely to be the
> single character in a file that would confuse the algorithm in the
> proposal.
>

If I understand you correctly, what you are saying is that if you detected
*any* case where the code could not be reliably and correctly processed we
would throw an exception. And if we detected anything that simply couldn't
be valid UTF-8 we would know the file does not contain utf8.

So for instance what you are saying is that if someone wrote "×" (with the
× being the single octet \xD7) you instantly know that this is not utf8.
Similarly if someone wrote '÷' (with the ÷ being the single octet \xF7), we
would also know that this is not utf8. And so on and so forth.

$ perl -MData::Dumper -MEncode=encode_utf8 -wle'for my $str ("\xF7\x27",
qq(\xD7")) { my $s= $str; utf8::decode($s) or printf "can not decode <%s>
|%s| %s\n",encode_utf8($s),join(" ", map { unpack("H*",$_) } split //,
$s),Data::Dumper::qquote($s);} '
can not decode <÷'> |f7 27| "\367'"
can not decode <×"> |d7 22| "\327\""

This lines up with code I wrote for my previous job, "recurse_decode_utf8",
which pretty much universally replaced decode_utf8, since multiply-encoding
data as UTF-8 is a very common occurrence when interoperating with older
MySQL and DBI/DBD::MySQL versions, and with other encoding-agnostic remote
systems like memcached and redis and whatnot. I think I saw one bug report
related to it doing it wrong, and indeed it was a two-byte sequence.

> Yes, one could have a letter followed by a superscript digit; that makes
> some sense. But would it be the only such sequence of bytes in the
> file, and every other byte is either ASCII or of the same form? It's
> very unlikely, but even if that is the case you would be warned at
> compilation time.


This makes perfect sense to me. The design of utf8 makes these kinds of
things really easy to do, and as you say, once you have more than a very
small number of characters the chance of an error essentially goes to zero.
If a file contained that few characters we could warn that our detection
was being confused and that it was falling back to the old interpretation.
I guess those scripts *would* then need a pragma, is that correct? So it
doesn't totally do away with the need for the pragma, and those who didn't
trust the heuristic to do the right thing could still use it.

I like it a lot!

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
Re: tightening up source code encoding semantics [ In reply to ]
Hi there,

On Sun, 27 Feb 2022, Dan Book wrote:
> On Sat, Feb 26, 2022 Karl Williamson wrote:
>> On 2/23/22, G.W. Haywood via perl5-porters wrote:
>>> On Wed, 23 Feb 2022, Dan Book wrote:
>>>
>>>> In my opinion, changing program behavior based on a pretty reliable guess
>>>> rather than a declaration would be a mistake.
>>>
>>> +1
>>>
>>> "Pretty reliable" == "rarely experienced problem, difficult to diagnose".
>>>
>>
>> I believe you guys don't grasp the proposal

To make sure I've grasped it, I've just been over the thread again.

I'm pretty sure I've grasped it. :)

>> No perl program would silently change behavior from the existing
>> baseline as a result of this proposal.
>
> How would interpreting characters in a different encoding not silently
> change behavior?
>
> I concur with the rest of your message but I am familiar with the
> reliability of the guess, and still believe it is a mistake in any such
> widespread application.

I too remain of the opinion that this is a bridge too far. I don't want
to get into nit-picking so I won't start digging holes, but I do think
that there are things mentioned in the proposal that merit more discussion.
Here are two:

On Wed, 23 Feb 2022 Karl Williamson wrote:

> I believe the only existing programs this scenario would effect are
> ones that (most likely, unsafely) mix UTF-8 and Latin1.

This was almost an aside in the discussion but it seems to me that
it's one of the more important issues. Isn't there a case for catching
potentially unsafe usage in some way if it isn't being caught already?

> An advantage is that a 'use utf8' would no longer be required in
> almost all circumstances.

At this point in our history I see this as a disadvantage, but I admit
I'm probably numbered amongst the dinosaurs of the coding fraternity.
Is a consensus to be had?

--

73,
Ged.
Re: tightening up source code encoding semantics [ In reply to ]
On Sun, 27 Feb 2022 at 12:01, G.W. Haywood via perl5-porters <
perl5-porters@perl.org> wrote:

> > An advantage is that a 'use utf8' would no longer be required in
> > almost all circumstances.
>
> At this point in our history I see this as a disadvantage, but I admit
> I'm probably numbered amongst the dinosaurs of the coding fraternity.
> Is a consensus to be had?
>

I don't understand: why would it be a disadvantage? Nothing is stopping you
or others from including a pragma (I say this generically because I am not
clear on which pragma it would be), but why is it bad to *not* need one
99.999% of the time? Especially if we can detect that you forgot it when
you DO need it?

I could understand it would be a disadvantage if sometimes omitting it
would produce a negative outcome and you wouldn't realize it, but Karl has
explained that wouldn't be an issue.

cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
Re: tightening up source code encoding semantics [ In reply to ]
Hi there,

On Sun, 27 Feb 2022, demerphq wrote:
> On Sun, 27 Feb 2022 at 12:01, G.W. Haywood via perl5-porters wrote:
>> On Wed, Feb 23, 2022 at 9:18 AM Karl Williamson wrote:
>>>
>>> An advantage is that a 'use utf8' would no longer be required in
>>> almost all circumstances.
>>
>> At this point in our history I see this as a disadvantage, but I admit
>> I'm probably numbered amongst the dinosaurs of the coding fraternity.
>> Is a consensus to be had?
>
> I dont understand, why would it be a disadvantage? ...

Well, of the points raised I don't think this is the most important,
but I almost don't understand how anyone would see it as an advantage.
I don't want to sound cranky, but suppose we decided that you won't have
to write

use strict;

any more? Would that be a similar advantage?

Like I said, I'm a dinosaur. I'd like to feel that if it does NOT say

use utf8;

then there's a whole trash-can full of worms into which I won't have
to get in up to the elbows, and which frankly I would dread because it can
waste such a lot of time while contributing not one groat. To try to
amplify I wrote a paragraph here, but it seemed rather too much like a
rant so I tried to put it another way. That seemed like a rant too so
I'll leave it at that and just ask again if there's any consensus - my
reason for making the point in the first place. Is there?

--

73,
Ged.
