Mailing List Archive: "use v5.36.0" should imply ASCII source

perl.p5p at rjbs

Aug 6, 2021, 8:22 AM

Post #1 of 50

Porters,

I recently posted the suggestion <http://markmail.org/message/wywgcbwhu2nhykxc> that "use v5.36.0" should imply "use utf8", which led to a pretty large thread in which Felipe Gasper repeatedly said "This is going to make things worse, not better." I spent a lot of time grumbling about this to myself, figuring out exactly how to rebut this, and then deciding that I tentatively, partly, agreed with him.

We want each improvement to be a ratcheting up in language usability, when possible, rather than "we made things worse so we could make them better." At present, because we don't (and can't) know whether a string is text or bytes, we don't (and can't) automatically encode it when it hits a bytestream. We also don't know reliably whether a given output handle is already expecting to do that encoding for us.

I am 100% certain that adding "use utf8" to the feature bundle would be better *for me*, but I already have a pretty strong grasp of the I/O model of Perl. I'm not sure it's better enough for everybody.

At the PSC, we had a long talk about this, and another proposal was made:

We introduce a new stricture, which I'll call "source_encoding". Under "use strict 'source_encoding'", the compiler will raise an exception when the source contains non-ASCII content unless the utf8 pragma is in effect. The error raised can drive the programmer to documentation explaining the various trade-offs. That is: you can turn on utf8 and deal with how this affects your I/O, or you can disable the stricture, or you can restate your non-ASCII content as ASCII by using escaping constructs.

I'm not *sure* this is an improvement, but I think it is. This prevents the "I forgot to add utf8 and so only discovered after runtime that I have doubly-encoded my output" bug.

--
rjbs
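
To make the double-encoding bug described above concrete, here is a minimal sketch. It is an illustration only, not part of the proposal: the variable names and the hex_bytes helper are invented for this example, encode() stands in for an :encoding(UTF-8) output layer, and the non-ASCII literal is written with escapes so that the snippet itself stays ASCII.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# Without "use utf8", an e-acute saved as UTF-8 in the source file
# reaches perl as two separate bytes.
my $without_utf8 = "\xC3\xA9";

# Under "use utf8" the same literal is one character, U+00E9. The
# escaped forms below give that result from pure-ASCII source.
my $with_utf8 = "\x{E9}";
my $escaped   = "\N{U+00E9}";

# A handle with an :encoding(UTF-8) layer encodes whatever it is given;
# encode() stands in for that layer here.
my $double = encode('UTF-8', $without_utf8);
my $single = encode('UTF-8', $with_utf8);

print "without utf8: ", hex_bytes($double), "\n";   # C3 83 C2 A9
print "with utf8:    ", hex_bytes($single), "\n";   # C3 A9
print "escape matches utf8 literal\n" if $escaped eq $with_utf8;
```

The first line of output shows four bytes where two were intended, which is the doubly-encoded output the stricture is meant to catch at compile time. The escaped form is the third option listed above: restating the content as ASCII.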

Re: "use v5.36.0" should imply ASCII source

grinnz at gmail

Aug 6, 2021, 8:45 AM

Post #2 of 50

On Fri, Aug 6, 2021 at 11:23 AM Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

> Porters,
>
> I recently posted the suggestion
> <http://markmail.org/message/wywgcbwhu2nhykxc> that "use v5.36.0" should
> imply "use utf8", which led to a pretty large thread in which Felipe Gasper
> repeatedly said "This is going to make things worse, not better." I spent
> a lot of time grumbling about this to myself, figuring out exactly how to
> rebut this, and then deciding that I tentatively, partly, agreed with him.
>
> We want each improvement to be a ratcheting up in language usability, when
> possible, rather than "we made things worse so we could make them better."
> At present, because we don't (and can't) know whether a string is text or
> bytes, we don't (and can't) automatically encode it when it hits a
> bytestream. We also don't know reliably whether a given output handle is
> already expecting to do that encoding for us.
>
> I am 100% certain that adding "use utf8" to the feature bundle would be
> better *for me*, but I already have a pretty strong grasp of the I/O
> model of Perl. I'm not sure it's better enough for everybody.
>
> At the PSC, we had a long talk about this, and another proposal was made:
>
> We introduce a new stricture, which I'll call "source_encoding". Under
> "use strict 'source_encoding'", the compiler will raise an exception when
> the source contains non-ASCII content unless the utf8 pragma is in effect.
> The error raised can drive the programmer to documentation explaining the
> various trade-offs. That is: you can turn on utf8 and deal with how this
> affects your I/O, or you can disable the stricture, or you can restate your
> non-ASCII content as ASCII by using escaping constructs.
>
> I'm not *sure* this is an improvement, but I think it is. This prevents
> the "I forgot to add utf8 and so only discovered after runtime that I have
> doubly-encoded my output" bug.
>

FWIW, this is roughly what was suggested by Zefram as part of his proposal
for utf8-by-default, phrased as
"deprecate the presence of non-ASCII bytes anywhere in a source file other
than in the scope of "use utf8".".
https://www.nntp.perl.org/group/perl.perl5.porters/2017/10/msg246838.html

-Dan

Re: "use v5.36.0" should imply ASCII source

felipe at felipegasper

Aug 6, 2021, 9:02 AM

Post #3 of 50

> On Aug 6, 2021, at 11:22 AM, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>
> Porters,
>
> I recently posted the suggestion that "use v5.36.0" should imply "use utf8", which led to a pretty large thread in which Felipe Gasper repeatedly said "This is going to make things worse, not better." I spent a lot of time grumbling about this to myself, figuring out exactly how to rebut this, and then deciding that I tentatively, partly, agreed with him.
>
> We want each improvement to be a ratcheting up in language usability, when possible, rather than "we made things worse so we could make them better." At present, because we don't (and can't) know whether a string is text or bytes, we don't (and can't) automatically encode it when it hits a bytestream. We also don't know reliably whether a given output handle is already expecting to do that encoding for us.
>
> I am 100% certain that adding "use utf8" to the feature bundle would be better for me, but I already have a pretty strong grasp of the I/O model of Perl. I'm not sure it's better enough for everybody.
>
> At the PSC, we had a long talk about this, and another proposal was made:
>
> We introduce a new stricture, which I'll call "source_encoding". Under "use strict 'source_encoding'", the compiler will raise an exception when the source contains non-ASCII content unless the utf8 pragma is in effect. The error raised can drive the programmer to documentation explaining the various trade-offs. That is: you can turn on utf8 and deal with how this affects your I/O, or you can disable the stricture, or you can restate your non-ASCII content as ASCII by using escaping constructs.
>
> I'm not sure this is an improvement, but I think it is. This prevents the "I forgot to add utf8 and so only discovered after runtime that I have doubly-encoded my output" bug.

This seems reasonable. It encourages decoding of UTF-8 characters while still allowing `print "hello world"` to be correct in modern Perl.

-FG

Re: "use v5.36.0" should imply ASCII source

Aug 6, 2021, 9:12 AM

Post #4 of 50

"Ricardo Signes" <perl.p5p@rjbs.manxome.org> wrote:
>At the PSC, we had a long talk about this, and another proposal was made:
>
>We introduce a new stricture, which I'll call "source_encoding". Under
>"use strict 'source_encoding'", the compiler will raise an exception when
>the source contains non-ASCII content unless the utf8 pragma is in effect.
>The error raised can drive the programmer to documentation explaining
>the various trade-offs. That is: you can turn on utf8 and deal with how
>this affects your I/O, or you can disable the stricture, or you can
>restate your non-ASCII content as ASCII by using escaping constructs.
>
>I'm not *sure* this is an improvement, but I think it is. This prevents
>the "I forgot to add utf8 and so only discovered after runtime that
>I have doubly-encoded my output" bug.

+1 - for me that's a big improvement over "turn utf8 on automatically":
the latter would have been a reason for me to avoid "use <version>", while
this would be a reason to use it.

Hugo

Re: "use v5.36.0" should imply ASCII source

darren at darrenduncan

Aug 6, 2021, 10:34 AM

Post #5 of 50

On 2021-08-06 8:22 a.m., Ricardo Signes wrote:
> At the PSC, we had a long talk about this, and another proposal was made:
>
> We introduce a new stricture, which I'll call "source_encoding". Under "use
> strict 'source_encoding'", the compiler will raise an exception when the source
> contains non-ASCII content unless the utf8 pragma is in effect. The error
> raised can drive the programmer to documentation explaining the various
> trade-offs. That is: you can turn on utf8 and deal with how this affects your
> I/O, or you can disable the stricture, or you can restate your non-ASCII content
> as ASCII by using escaping constructs.
>
> I'm not /sure/ this is an improvement, but I think it is. This prevents the "I
> forgot to add utf8 and so only discovered after runtime that I have
> doubly-encoded my output" bug.

+1

Personally I feel that this change is a great improvement, assuming I understand
it right.

So just to be clear, when you say ASCII, you mean pure 7-bit ASCII, which is a
proper subset of both UTF-8 and all the Latin encodings, and thus any source
files written in that will "just work" in both the most common Unicode AND
non-Unicode environments.

Would your new stricture, turned on as part of use 5.36, then fail every source
file that has any octet with a 1 in the 8th bit when that file doesn't also have
an explicit declaration of source encoding?

Because that is what I would expect given what you said.

For my part, I expressly designed my portable data format MUON
https://github.com/muldis/Muldis_Object_Notation/blob/master/spec/Muldis_Object_Notation_Syntax_Plain_Text.md
so that the non-7-bit-ASCII character repertoire is forbidden literally in a
file except within quoted character string literals, and so one can parse
everything outside the quoted strings, the actual document structure, completely
without even having to know what the encoding is (it can be done in binary
mode), at least between UTF-8 vs Latin etc (and even for encodings that aren't),
and decoding the inside of strings is deferrable.

-- Darren Duncan
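
The check asked about above, any octet with the 8th bit set, can be sketched in a few lines. This is only an illustration of the idea, not how the stricture would be implemented inside perl; non_ascii_lines is a hypothetical name, and the function expects the raw bytes of the file rather than decoded text.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 1-based numbers of the lines that
# contain any octet with the high bit set.
sub non_ascii_lines {
    my ($source_bytes) = @_;
    my @hits;
    my $line_no = 0;
    for my $line (split /\n/, $source_bytes, -1) {
        $line_no++;
        push @hits, $line_no if $line =~ /[^\x00-\x7F]/;
    }
    return @hits;
}

# Line 1 holds a UTF-8 encoded e-diaeresis; line 2 is pure ASCII.
my $source = "my \$name = 'Zo\xC3\xAB';\nprint \"hello world\\n\";\n";
print "non-ASCII on line(s): @{[ non_ascii_lines($source) ]}\n";
```

Because the test is on bytes, it gives the same answer whether the file is UTF-8 or one of the Latin encodings, which matches the point that 7-bit ASCII is a subset of both.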

Re: "use v5.36.0" should imply ASCII source

Aug 6, 2021, 11:03 AM

Post #6 of 50

I think this might be reasonable, but I'm not certain about how we
would want it to interact with Pod.

Pod is often interleaved with code, and more likely to include names,
and thus, UTF-8 characters. The way to declare Pod content as UTF-8 is
separate from how it is declared for the source. Should users be
required to include both a 'use utf8;' and '=encoding UTF-8' if they
want to include UTF-8 characters in their documentation, even if their
code is pure ASCII?

On Fri, Aug 6, 2021 at 5:23 PM Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>
> Porters,
>
> I recently posted the suggestion that "use v5.36.0" should imply "use utf8", which led to a pretty large thread in which Felipe Gasper repeatedly said "This is going to make things worse, not better." I spent a lot of time grumbling about this to myself, figuring out exactly how to rebut this, and then deciding that I tentatively, partly, agreed with him.
>
> We want each improvement to be a ratcheting up in language usability, when possible, rather than "we made things worse so we could make them better." At present, because we don't (and can't) know whether a string is text or bytes, we don't (and can't) automatically encode it when it hits a bytestream. We also don't know reliably whether a given output handle is already expecting to do that encoding for us.
>
> I am 100% certain that adding "use utf8" to the feature bundle would be better for me, but I already have a pretty strong grasp of the I/O model of Perl. I'm not sure it's better enough for everybody.
>
> At the PSC, we had a long talk about this, and another proposal was made:
>
> We introduce a new stricture, which I'll call "source_encoding". Under "use strict 'source_encoding'", the compiler will raise an exception when the source contains non-ASCII content unless the utf8 pragma is in effect. The error raised can drive the programmer to documentation explaining the various trade-offs. That is: you can turn on utf8 and deal with how this affects your I/O, or you can disable the stricture, or you can restate your non-ASCII content as ASCII by using escaping constructs.
>
> I'm not sure this is an improvement, but I think it is. This prevents the "I forgot to add utf8 and so only discovered after runtime that I have doubly-encoded my output" bug.
>
> --
> rjbs

Re: "use v5.36.0" should imply ASCII source

grinnz at gmail

Aug 6, 2021, 11:15 AM

Post #7 of 50

On Fri, Aug 6, 2021 at 2:04 PM Graham Knop <haarg@haarg.org> wrote:

> I think this might be reasonable, but I'm not certain about how we
> would want it to interact with Pod.
>
> Pod is often interleaved with code, and more likely to include names,
> and thus, UTF-8 characters. The way to declare Pod content as UTF-8 is
> separate from how it is declared for the source. Should users be
> required to include both a 'use utf8;' and '=encoding UTF-8' if they
> want to include UTF-8 characters in their documentation, even if their
> code is pure ASCII?
>

It seems reasonable to me for this restriction to ignore POD, and only
apply to things "use utf8" applies to (so including __DATA__/__END__).
Though I'm not sure how that interacts internally.

-Dan
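
A rough sketch of what "ignore POD" could look like as a standalone check follows. It rests on assumptions: Pod blocks are recognised line by line (a line starting with "=" plus a letter opens a block, "=cut" closes it), which is cruder than what the perl parser does, and non_ascii_code_lines is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: report lines with high-bit octets, but skip
# everything inside Pod blocks. Pod detection is an approximation.
sub non_ascii_code_lines {
    my ($source_bytes) = @_;
    my (@hits, $in_pod);
    my $line_no = 0;
    for my $line (split /\n/, $source_bytes, -1) {
        $line_no++;
        if ($in_pod) {
            $in_pod = 0 if $line =~ /^=cut\b/;   # block ends here
            next;
        }
        if ($line =~ /^=[A-Za-z]/) {             # block starts here
            $in_pod = 1;
            next;
        }
        push @hits, $line_no if $line =~ /[^\x00-\x7F]/;
    }
    return @hits;
}

# ASCII code followed by Pod that contains a UTF-8 encoded name.
my $source = join "\n",
    'print "hello world\n";',
    '',
    '=head1 AUTHOR',
    '',
    "Zo\xC3\xAB",
    '',
    '=cut',
    '';
my @hits = non_ascii_code_lines($source);
print @hits ? "non-ASCII code on line(s): @hits\n" : "code is pure ASCII\n";
```

Under this variant a file with ASCII-only code and UTF-8 documentation would pass without "use utf8", which is the case raised in the question being answered.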

Re: "use v5.36.0" should imply ASCII source

leonerd at leonerd

Aug 6, 2021, 11:23 AM

Post #8 of 50

On Fri, 6 Aug 2021 20:03:48 +0200
Graham Knop <haarg@haarg.org> wrote:

> Pod is often interleaved with code, and more likely to include names,
> and thus, UTF-8 characters. The way to declare Pod content as UTF-8 is
> separate from how it is declared for the source. Should users be
> required to include both a 'use utf8;' and '=encoding UTF-8' if they
> want to include UTF-8 characters in their documentation, even if their
> code is pure ASCII?

We discussed this very question. The trouble is it's a quickly-slippery
slope. If you allow non-ASCII in POD, do you allow it in comments? The
same justification - it's common to write people's (non-English) names
in comments just as well as POD. Should it be allowed there?

This quickly leads to another weird entry in a future "Perl Quirks"
document 10 years down the line, where users complain that the rules of
non-ASCII are hard to guess and subtle and anyway PPR doesn't do it
right and also there are bugs in the parser and ...

It's far easier for everyone - implementation and users alike - to give
a very simple rule:

After `use VERSION>=5.36` and until any `use utf8` there must be no
non-ASCII bytes whatsoever.

Yes this does lead to an annoying dual declaration of both `use utf8`
and `=encoding UTF-8` - perhaps that can be helped in some way?

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/
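
The simple rule stated above can be sketched as a small line-based state machine. This is an approximation under stated assumptions: it only recognises the "use v5.NN" spelling of the version declaration, it ignores lexical scope, and first_violation is a hypothetical name rather than anything perl provides.

```perl
use strict;
use warnings;

# Hypothetical checker: return the 1-based number of the first line
# that breaks the rule, or nothing if the source is acceptable.
sub first_violation {
    my ($source_bytes) = @_;
    my $restricted = 0;
    my $line_no    = 0;
    for my $line (split /\n/, $source_bytes, -1) {
        $line_no++;
        if ($line =~ /^\s*use\s+v5\.(\d+)/ && $1 >= 36) {
            $restricted = 1;          # use VERSION >= 5.36 seen
        }
        elsif ($line =~ /^\s*use\s+utf8\b/) {
            $restricted = 0;          # use utf8 lifts the restriction
        }
        return $line_no if $restricted && $line =~ /[^\x00-\x7F]/;
    }
    return;
}

my $ok  = "use v5.36;\nuse utf8;\nmy \$n = 'Zo\xC3\xAB';\n";
my $bad = "use v5.36;\nmy \$n = 'Zo\xC3\xAB';\n";

print defined(first_violation($ok)) ? "violation\n" : "clean\n";
print "first violation on line ", first_violation($bad), "\n";
```

Note that the checker makes no exception for comments or Pod, which is the point of the rule: there is only one condition to remember.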

Re: "use v5.36.0" should imply ASCII source

grinnz at gmail

Aug 6, 2021, 11:58 AM

Post #9 of 50

On Fri, Aug 6, 2021 at 2:23 PM Paul "LeoNerd" Evans <leonerd@leonerd.org.uk>
wrote:

> On Fri, 6 Aug 2021 20:03:48 +0200
> Graham Knop <haarg@haarg.org> wrote:
>
> > Pod is often interleaved with code, and more likely to include names,
> > and thus, UTF-8 characters. The way to declare Pod content as UTF-8 is
> > separate from how it is declared for the source. Should users be
> > required to include both a 'use utf8;' and '=encoding UTF-8' if they
> > want to include UTF-8 characters in their documentation, even if their
> > code is pure ASCII?
>
> We discussed this very question. The trouble is it's a quickly-slippery
> slope. If you allow non-ASCII in POD, do you allow it in comments? The
> same justification - it's common to write people's (non-English) names
> in comments just as well as POD. Should it be allowed there?
>
> This quickly leads to another weird entry in a future "Perl Quirks"
> document 10 years down the line, where users complain that the rules of
> non-ASCII are hard to guess and subtle and anyway PPR doesn't do it
> right and also there are bugs in the parser and ...
>
> It's far easier for everyone - implementation and users alike - to give
> a very simple rule:
>
> After `use VERSION>=5.36` and until any `use utf8` there must be no
> non-ASCII bytes whatsoever.
>
> Yes this does lead to an annoying dual declaration of both `use utf8`
> and `=encoding UTF-8` - perhaps that can be helped in some way?
>

I don't think it's comparable. Comments are parsed by the perl interpreter,
but POD is not, except to find the end of the POD.

-Dan

RE: "use v5.36.0" should imply ASCII source

wolf-dietrich_moeller at t-online

Aug 6, 2021, 12:52 PM

Post #10 of 50

Ricardo Signes wrote:

> We introduce a new stricture, which I'll call "source_encoding". Under
> "use strict 'source_encoding'", the compiler will raise an exception when
> the source contains non-ASCII content unless the utf8 pragma is in effect.
> The error raised can drive the programmer to documentation explaining the
> various trade-offs. That is: you can turn on utf8 and deal with how this
> affects your I/O, or you can disable the stricture, or you can restate your
> non-ASCII content as ASCII by using escaping constructs.

Question: If I understand correctly, this would be turned on just by "use
strict", even without "use v5.36". As many existing programs contain "use
strict", does this mean that all existing sources using Latin1 would fail?
I can imagine, that many programs use "use strict" (as its use is strongly
recommended, seeing the other discussion about turning it on with "use
v5.36").

Regards Wolf

Re: "use v5.36.0" should imply ASCII source

perl.p5p at rjbs

Aug 6, 2021, 3:05 PM

Post #11 of 50

On Fri, Aug 6, 2021, at 3:52 PM, Wolf-Dietrich Moeller (Munchen) wrote:
> Question: If I understand correctly, this would be turned on just by "use
> strict", even without "use v5.36".

No, that would not be the case.

--
rjbs

Re: "use v5.36.0" should imply ASCII source

perigrin at prather

Aug 6, 2021, 5:01 PM

Post #12 of 50

> On Aug 6, 2021, at 2:58 PM, Dan Book <grinnz@gmail.com> wrote:
>
> I don't think it's comparable. Comments are parsed by the perl interpreter, but POD is not, except to find the end of the POD.
>

Additionally comments don’t as far as I know have a way to declare their encoding.

-Chris

Re: "use v5.36.0" should imply ASCII source

darren at darrenduncan

Aug 6, 2021, 11:45 PM

Post #13 of 50

On 2021-08-06 5:01 p.m., Chris Prather wrote:
>> On Aug 6, 2021, at 2:58 PM, Dan Book <grinnz@gmail.com> wrote:
>>
>> I don't think it's comparable. Comments are parsed by the perl interpreter, but POD is not, except to find the end of the POD.
>
> Additionally comments don’t as far as I know have a way to declare their encoding.

Question: Is there ever a real life scenario where a single source file is not
entirely the same encoding? Can you reasonably have a single file containing
Perl code and POD where the Perl code is one character encoding and the POD is
another one? -- Darren Duncan

Re: "use v5.36.0" should imply ASCII source

darren at darrenduncan

Aug 6, 2021, 11:51 PM

Post #14 of 50

On 2021-08-06 11:45 p.m., Darren Duncan wrote:
> On 2021-08-06 5:01 p.m., Chris Prather wrote:
>>> On Aug 6, 2021, at 2:58 PM, Dan Book <grinnz@gmail.com> wrote:
>>>
>>> I don't think it's comparable. Comments are parsed by the perl interpreter,
>>> but POD is not, except to find the end of the POD.
>>
>> Additionally comments don’t as far as I know have a way to declare their
>> encoding.
>
> Question: Is there ever a real life scenario where a single source file is not
> entirely the same encoding? Can you reasonably have a single file containing
> Perl code and POD where the Perl code is one character encoding and the POD is
> another one? -- Darren Duncan

Or even putting aside the POD, is there any real scenario where a single file
consisting of only Perl code is in multiple encodings at once?

If there is a "use utf8;" anywhere in a Perl file, would it not be reasonable to
interpret that it is describing the entire file and not just the portion of the
file below that statement?

Perhaps a reasonable design would be that if a file contains a UTF-8 declaration
anywhere in it, the entire file is treated as such, both the part above and the
part below that declaration. And if multiple conflicting encoding declarations
exist for the current file, that is an error.

Basically the encoding declaration can be something that is scanned for in
advance of the regular parsing, or if done inline, encountering such would cause
the parser to restart at the beginning if the declaration is different than what
the parser was doing up to that point, effectively the whole file is affected by
that declaration either way.

-- Darren Duncan

Re: "use v5.36.0" should imply ASCII source

Aug 7, 2021, 4:46 AM

Post #15 of 50

On Fri, Aug 06, 2021 at 11:51:01PM -0700, Darren Duncan wrote:
> Perhaps a reasonable design would be that if a file contains a UTF-8
> declaration anywhere in it, the entire file is treated as such, both the
> part above and the part below that declaration. And if multiple conflicting
> encoding declarations exist for the current file, that is an error.

How could that possibly work? If the 'use utf8' appears halfway through
the source file, does that retrospectively invalidate everything parsed so
far?

--
Monto Blanco... scorchio!

Re: "use v5.36.0" should imply ASCII source

Aug 7, 2021, 4:53 AM

Post #16 of 50

On Fri, Aug 06, 2021 at 11:45:21PM -0700, Darren Duncan wrote:
> Question: Is there ever a real life scenario where a single source file is
> not entirely the same encoding? Can you reasonably have a single file
> containing Perl code and POD where the Perl code is one character encoding
> and the POD is another one? -- Darren Duncan

Mixed utf8/non-utf8 may be rare, but even fully utf8 source files will
still require 2 separate declarations:

a 'use utf8' for the perl parser, and a '=encoding utf8' for a pod parser,
which won't see or understand the 'use utf8'.

More generally, I think croaking on non-ascii in the src file is a fine
use for 'use v5.36'. I'm less clear whether it should croak on pod too. Perhaps it
should croak on ord() > 0x7f in a pod section unless an '=encoding' has
been seen?

--
Little fly, thy summer's play my thoughtless hand
has terminated with extreme prejudice.
(with apologies to William Blake)
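
For reference, a file carrying both declarations might look like the sketch below. The name in it is made up for the example, and the binmode line is one possible way of dealing with the I/O consequences of "use utf8", not something the proposal prescribes.

```perl
use strict;
use warnings;
use utf8;                             # for the perl parser: source is UTF-8

binmode STDOUT, ':encoding(UTF-8)';   # encode text on the way out

my $author = "Zoë";                   # one string of three characters
print "Written by $author\n";

=encoding UTF-8

=head1 AUTHOR

Zoë (a made-up name for this example)

=cut
```

The "use utf8" line is read only by the perl parser and the "=encoding" line only by Pod parsers, so dropping either one leaves the other half of the file undeclared.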

Re: "use v5.36.0" should imply ASCII source

Aug 7, 2021, 5:01 AM

Post #17 of 50

Op 07-08-2021 om 08:51 schreef Darren Duncan:
> Or even putting aside the POD, is there any real scenario where a
> single file consisting of only Perl code is in multiple encodings at
> once?

Probably not, but ...

>
> If there is a "use utf8;" anywhere in a Perl file, would it not be
> reasonable to interpret that it is describing the entire file and not
> just the portion of the file below that statement?

... I don't feel that is reasonable. It's not how "use" works in
general. "Use" affects the file being parsed from that point on. If you
want it to affect the whole file, put it as the first statement.

But I don't get what you want to achieve, what problem you want to solve
with this solution?

HTH,

M4

Re: "use v5.36.0" should imply ASCII source

darren at darrenduncan

Aug 7, 2021, 5:07 AM

Post #18 of 50

On 2021-08-07 4:46 a.m., Dave Mitchell wrote:
> On Fri, Aug 06, 2021 at 11:51:01PM -0700, Darren Duncan wrote:
>> Perhaps a reasonable design would be that if a file contains a UTF-8
>> declaration anywhere in it, the entire file is treated as such, both the
>> part above and the part below that declaration. And if multiple conflicting
>> encoding declarations exist for the current file, that is an error.
>
> How could that possibly work? If the 'use utf8' appears halfway through
> the source file, does that retrospectively invalidate everything parsed so
> far?

It would if the parser was so far treating everything parsed so far as something
other than UTF-8. As said in my post, the parser would restart at the beginning
of the file and treat it as UTF-8. But this would only need to happen if the
parser kept track of whether it saw any high bits so far, and if it didn't, it
knows it only saw ASCII which is also valid UTF-8 and it can skip the restart.
-- Darren Duncan
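
The restart condition described above can be written down as a small decision function. This sketches the suggestion only; perl does not work this way, the "use utf8" detection is a plain regex, and needs_restart is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a parser following the suggestion above
# would have to restart when it reaches "use utf8", that is, if it had
# already seen a high-bit octet before the declaration.
sub needs_restart {
    my ($source_bytes) = @_;
    my $saw_high_bit = 0;
    for my $line (split /\n/, $source_bytes, -1) {
        return $saw_high_bit if $line =~ /^\s*use\s+utf8\b/;
        $saw_high_bit = 1 if $line =~ /[^\x00-\x7F]/;
    }
    return 0;   # no declaration found, so nothing to restart for
}

my $late  = needs_restart("my \$n = 'Zo\xC3\xAB';\nuse utf8;\n");
my $ascii = needs_restart("my \$n = 'Zoe';\nuse utf8;\n");

print "non-ASCII before declaration: ", ($late  ? "restart" : "no restart"), "\n";
print "ASCII before declaration:     ", ($ascii ? "restart" : "no restart"), "\n";
```

The second case is the shortcut mentioned above: if only ASCII has been seen so far, what was parsed is already valid UTF-8 and no restart is needed.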

Re: "use v5.36.0" should imply ASCII source

darren at darrenduncan

Aug 7, 2021, 5:15 AM

Post #19 of 50

On 2021-08-07 5:01 a.m., Martijn Lievaart wrote:
> Op 07-08-2021 om 08:51 schreef Darren Duncan:
>> Or even putting aside the POD, is there any real scenario where a single file
>> consisting of only Perl code is in multiple encodings at once?
>
> Probably not, but ...
>
>> If there is a "use utf8;" anywhere in a Perl file, would it not be reasonable
>> to interpret that it is describing the entire file and not just the portion of
>> the file below that statement?
>
> ... I don't feel that is reasonable. It's not how "use" works in general. "Use"
> affects the file being parsed from that point on. If you want it to affect the
> whole file, put it as the first statement.
>
> But I don't get what you want to achieve, what problem you want to solve with
> this solution?

I propose that if we don't want to explicitly support mixed encodings then
explicit encoding declarations are a special case (that can be clearly
documented) where their effect should be retroactive to describe the whole file,
because logically that's the only thing that makes sense (for a non mixed
encoding file, declaring any part of it as UTF-8 is logically saying the whole
file is UTF-8), even if it does happen to have the form of a "use" statement. --
Darren Duncan

Re: "use v5.36.0" should imply ASCII source

grinnz at gmail

Aug 7, 2021, 8:55 AM

Post #20 of 50

On Sat, Aug 7, 2021 at 2:45 AM Darren Duncan <darren@darrenduncan.net>
wrote:

> On 2021-08-06 5:01 p.m., Chris Prather wrote:
> >> On Aug 6, 2021, at 2:58 PM, Dan Book <grinnz@gmail.com> wrote:
> >>
> >> I don't think it's comparable. Comments are parsed by the perl
> interpreter, but POD is not, except to find the end of the POD.
> >
> > Additionally comments don’t as far as I know have a way to declare their
> encoding.
>
> Question: Is there ever a real life scenario where a single source file
> is not
> entirely the same encoding? Can you reasonably have a single file
> containing
> Perl code and POD where the Perl code is one character encoding and the
> POD is
> another one? -- Darren Duncan
>

Yes. POD parsers and the perl interpreter do not read the same parts of the
file, ever. Thus they must each indicate to their corresponding parsers
what encoding they contain.

-Dan

Re: "use v5.36.0" should imply ASCII source

public at khwilliamson

Aug 7, 2021, 10:12 AM

Post #21 of 50

On 8/7/21 5:53 AM, Dave Mitchell wrote:
> On Fri, Aug 06, 2021 at 11:45:21PM -0700, Darren Duncan wrote:
>> Question: Is there ever a real life scenario where a single source file is
>> not entirely the same encoding? Can you reasonably have a single file
>> containing Perl code and POD where the Perl code is one character encoding
>> and the POD is another one? -- Darren Duncan
>
> Mixed utf8/non-utf8 may be rare, but even fully utf8 source files will
> still require 2 separate declarations:
>
> a 'use utf8' for the perl parser, and a '=encoding utf8' for a pod parser,
> which won't see or understand the 'use utf8'.
>
> More generally, I think croaking on non-ascii in the src file is a fine
> use for 'use v5.36'. I'm less clear whether it should croak on pod too. Perhaps it
> should croak on ord() > 0x7f in a pod section unless an '=encoding' has
> been seen?
>
>
>

Pod is currently assumed to be CP1252 or UTF-8 unless an =encoding line
is specified.

Re: "use v5.36.0" should imply ASCII source

public at khwilliamson

Aug 7, 2021, 10:44 AM

Post #22 of 50

On 8/7/21 11:12 AM, Karl Williamson wrote:
> On 8/7/21 5:53 AM, Dave Mitchell wrote:
>> On Fri, Aug 06, 2021 at 11:45:21PM -0700, Darren Duncan wrote:
>>> Question: Is there ever a real life scenario where a single source
>>> file is
>>> not entirely the same encoding? Can you reasonably have a single file
>>> containing Perl code and POD where the Perl code is one character
>>> encoding
>>> and the POD is another one? -- Darren Duncan
>>
>> Mixed utf8/non-utf8 may be rare, but even fully utf8 source files will
>> still require 2 separate declarations:
>>
>> a 'use utf8' for the perl parser, and a '=encoding utf8' for a pod
>> parser,
>> which won't see or understand the 'use utf8'.
>>
>> More generally, I think croaking on non-ascii in the src file is a fine
>> use for 'use v5.36'. I'm less clear whether it should croak on pod too.
>> Perhaps it
>> should croak on ord() > 0x7f in a pod section unless an '=encoding' has
>> been seen?
>>
>>
>>
>
> Pod is currently assumed to be CP1252 or UTF-8 unless an =encoding line
> is specified.
>

More specifically, Pod::Simple doesn't require you to use an =encoding
line. If a non-ASCII character is found, it will automatically insert
either a cp1252 or utf8 line for you.
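
The kind of guess described above can be approximated as follows. This is not Pod::Simple's actual code: guess_pod_encoding is a hypothetical name, the sketch checks the whole text rather than only part of it, and the 'ASCII' answer for text without high-bit octets is a choice made for this example. It relies on utf8::decode, which returns false when the bytes are not well-formed UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: guess how undeclared Pod bytes are encoded.
sub guess_pod_encoding {
    my ($pod_bytes) = @_;
    return 'ASCII' unless $pod_bytes =~ /[^\x00-\x7F]/;
    my $copy = $pod_bytes;               # utf8::decode works in place
    return utf8::decode($copy) ? 'UTF-8' : 'CP1252';
}

print guess_pod_encoding("Zo\xC3\xAB"), "\n";   # well-formed UTF-8
print guess_pod_encoding("Zo\xEB"), "\n";       # lone high byte
print guess_pod_encoding("Zoe"), "\n";          # no high-bit octets
```

A guess like this can be wrong for CP1252 text that happens to form valid UTF-8 sequences, which is one reason an explicit =encoding line remains the safer option.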

Re: "use v5.36.0" should imply ASCII source

darren at darrenduncan

Aug 7, 2021, 2:43 PM

Post #23 of 50

On 2021-08-07 8:55 a.m., Dan Book wrote:
> On Sat, Aug 7, 2021 at 2:45 AM Darren Duncan wrote:
> Question: Is there ever a real life scenario where a single source file is not
> entirely the same encoding? Can you reasonably have a single file containing
> Perl code and POD where the Perl code is one character encoding and the POD is
> another one? -- Darren Duncan
>
> Yes. POD parsers and the perl interpreter do not read the same parts of the
> file, ever. Thus they must each indicate to their corresponding parsers what
> encoding they contain.

The need to dual-declare is important to know but it's not the question I asked.

Is there ever a real life scenario where a single physical text file uses one
character encoding for one range of octets in the file and a different character
encoding for a different range of octets in the file?

I'm not aware of any text editor that when asked to save a document to disk
would not be using the same character encoding to write out the entire file.

So I would think it would be a very contrived situation for a text file to exist
physically that isn't all one encoding.

And then the question is whether we would consider it valid for such a file to
exist and that we would try to support it as a non-corrupted file.

-- Darren Duncan

Re: "use v5.36.0" should imply ASCII source

grinnz at gmail

Aug 7, 2021, 2:52 PM

Post #24 of 50

On Sat, Aug 7, 2021 at 5:44 PM Darren Duncan <darren@darrenduncan.net>
wrote:

> On 2021-08-07 8:55 a.m., Dan Book wrote:
> > On Sat, Aug 7, 2021 at 2:45 AM Darren Duncan wrote:
> > Question: Is there ever a real life scenario where a single source
> file is not
> > entirely the same encoding? Can you reasonably have a single file
> containing
> > Perl code and POD where the Perl code is one character encoding and
> the POD is
> > another one? -- Darren Duncan
> >
> > Yes. POD parsers and the perl interpreter do not read the same parts of
> the
> > file, ever. Thus they must each indicate to their corresponding parsers
> what
> > encoding they contain.
>
> The need to dual-declare is important to know but it's not the question I
> asked.
>
> Is there ever a real life scenario where a single physical text file uses
> one
> character encoding for one range of octets in the file and a different
> character
> encoding for a different range of octets in the file?
>
> I'm not aware of any text editor that when asked to save a document to
> disk
> would not be using the same character encoding to write out the entire
> file.
>
> So I would think it would be a very contrived situation for a text file to
> exist
> physically that isn't all one encoding.
>
> And then the question is whether we would consider it valid for such a
> file to
> exist and that we would try to support it as a non-corrupted file.
>

It's not a matter of support. It just doesn't matter to the perl
interpreter what the encoding of POD is, and vice versa.

-Dan

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

darren at darrenduncan

Aug 7, 2021, 3:36 PM

Post #25 of 50 (2150 views)

On 2021-08-07 2:52 p.m., Dan Book wrote:
> It's not a matter of support. It just doesn't matter to the perl interpreter
> what the encoding of POD is, and vice versa.

That's fair and reasonable. But my question is broader than Perl code vs POD.

Is it reasonable to support either of these next 2 scenarios?

1. A file contains only Perl code and no POD, and one subset of that file has a
different character encoding than a different subset.

2. A file contains only POD and no Perl code, and one subset of that file has a
different character encoding than a different subset.

-- Darren Duncan

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

grinnz at gmail

Aug 7, 2021, 3:40 PM

Post #26 of 50 (1945 views)

On Sat, Aug 7, 2021 at 6:38 PM Darren Duncan <darren@darrenduncan.net>
wrote:

> On 2021-08-07 2:52 p.m., Dan Book wrote:
> > It's not a matter of support. It just doesn't matter to the perl interpreter
> > what the encoding of POD is, and vice versa.
>
> That's fair and reasonable. But my question is broader than Perl code vs POD.
>
> Is it reasonable to support either of these next 2 scenarios?
>
> 1. A file contains only Perl code and no POD, and one subset of that file has a
> different character encoding than a different subset.
>
> 2. A file contains only POD and no Perl code, and one subset of that file has a
> different character encoding than a different subset.
>

I might suggest starting a new thread if you want to discuss these
possibilities and their implications; they have no bearing on the
implementation of the proposed feature. FWIW, I do not think the second
scenario is currently supported by any POD parser.

-Dan

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

dilfridge at gentoo

Aug 8, 2021, 4:15 AM

Post #27 of 50 (1945 views)

> At the PSC, we had a long talk about this, and another proposal was made:
>
> We introduce a new stricture, which I'll call "source_encoding". Under "use strict 'source_encoding'", the compiler will raise an exception when the source contains non-ASCII content unless the utf8 pragma is in effect. The error raised can drive the programmer to documentation explaining the various trade-offs. That is: you can turn on utf8 and deal with how this affects your I/O, or you can disable the stricture, or you can restate your non-ASCII content as ASCII by using escaping constructs.
>

This somehow feels like a step backwards.

Nearly every modern Linux installation uses a Unicode locale by default nowadays; I haven't come across a text file in Latin-1 (or similar) encoding for months...

--
Andreas K. Hüttel
dilfridge@gentoo.org
Gentoo Linux developer
(council, toolchain, base-system, perl, libreoffice)

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

felipe at felipegasper

Aug 8, 2021, 5:17 AM

Post #28 of 50 (1945 views)

> On Aug 8, 2021, at 7:15 AM, Andreas K. Huettel <dilfridge@gentoo.org> wrote:
>
>> At the PSC, we had a long talk about this, and another proposal was made:
>>
>> We introduce a new stricture, which I'll call "source_encoding". Under "use strict 'source_encoding'", the compiler will raise an exception when the source contains non-ASCII content unless the utf8 pragma is in effect. The error raised can drive the programmer to documentation explaining the various trade-offs. That is: you can turn on utf8 and deal with how this affects your I/O, or you can disable the stricture, or you can restate your non-ASCII content as ASCII by using escaping constructs.
>>
>
> This somehow feels like a step backwards.
>
> Nearly every modern Linux installation uses a unicode locale by default nowadays, I haven't come across a text file in latin1 (or similar) encoding for months...

Nearly every modern programming language also differentiates between text and binary. Alas, Perl doesn’t do this.

The language’s maintainers feel--reasonably, I think--that text in source code should be decoded. The fact that “é” in UTF-8 Perl source code is two characters (i.e., code points) by default is weird and counterintuitive. The problem is that utf8.pm’s auto-decoding behaviour imposes a requirement to encode manually, which is *really* weird/counterintuitive: it would “subtly invalidate” a simple “hello, world” implementation in “modern” Perl, which invalidity would only “bite” when there are >127 code points involved, which is, again, further weird/counterintuitive.
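To make the point above concrete, here is a minimal sketch that is not from the thread. It spells the bytes out with escapes so that it behaves the same however the file is saved, and it assumes the core Encode module; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The two bytes a UTF-8 editor writes for "é" (U+00E9). Without "use utf8",
# this is what a literal "é" in the source amounts to.
my $undecoded = "\xC3\xA9";

# What the same literal amounts to when "use utf8" is in effect.
my $decoded = decode('UTF-8', $undecoded);

printf "undecoded: %d code points\n", length $undecoded;
printf "decoded:   %d code point\n",  length $decoded;

# Decoded text has to be encoded again by hand before it is written to a
# handle that has no encoding layer.
my $for_output = encode('UTF-8', $decoded);
printf "round trip matches original bytes: %s\n",
    $for_output eq $undecoded ? "yes" : "no";
```

The last step is the manual encoding that a plain "hello, world" would silently start to need once its literals are decoded.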

So, it’s a mess. The best fix here would be to teach Perl to track which strings are decoded and which aren’t. Perl would gain copiously therefrom, but it’s not easy to do. For now it’s at least reasonable to require, in “modern” Perl, that either:

a) Source code remain all-ASCII.

or

b) Perl’s auto-decoding mode be enabled (explicitly).

This will require that folks like myself, who desire “modernity” but for whom Perl’s status quo is actually useful and desirable (because $work almost never cares about strings’ Unicode content), find some workaround, but at least it’s a conspicuous change that won’t “surprise” anyone.

-FG

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

Aug 8, 2021, 5:52 AM

Post #29 of 50 (1945 views)

Op 07-08-2021 om 14:15 schreef Darren Duncan:
> On 2021-08-07 5:01 a.m., Martijn Lievaart wrote:
>> Op 07-08-2021 om 08:51 schreef Darren Duncan:
>>> Or even putting aside the POD, is there any real scenario where a
>>> single file consisting of only Perl code is in multiple encodings at
>>> once?
>>
>> Probably not, but ...
>>
>>> If there is a "use utf8;" anywhere in a Perl file, would it not be
>>> reasonable to interpret that it is describing the entire file and
>>> not just the portion of the file below that statement?
>>
>> ... I don't feel that is reasonable. It's not how "use" works in
>> general. "Use" affects the file being parsed from that point on. If
>> you want it to affect the whole file, put it as the first statement.
>>
>> But I don't get what you want to achieve, what problem you want to
>> solve with this solution?
>
> I propose that if we don't want to explicitly support mixed encodings
> then explicit encoding declarations are a special case (that can be
> clearly documented) where their effect should be retroactive to
> describe the whole file, because logically that's the only thing that
> makes sense (for a non mixed encoding file, declaring any part of it
> as UTF-8 is logically saying the whole file is UTF-8), even if it does
> happen to have the form of a "use" statement. -- Darren Duncan

Well, there actually is one mixed encoding that makes sense, ASCII up
until the 'use utf8', Unicode after that. I would assume this is the
mental model most people have.

M4

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

darren at darrenduncan

Aug 8, 2021, 11:07 AM

Post #30 of 50 (1945 views)

On 2021-08-08 5:52 a.m., Martijn Lievaart wrote:
> Op 07-08-2021 om 14:15 schreef Darren Duncan:
>> On 2021-08-07 5:01 a.m., Martijn Lievaart wrote:
>>> Op 07-08-2021 om 08:51 schreef Darren Duncan:
>>>> Or even putting aside the POD, is there any real scenario where a single
>>>> file consisting of only Perl code is in multiple encodings at once?
>>>
>>> Probably not, but ...
>>>
>>>> If there is a "use utf8;" anywhere in a Perl file, would it not be
>>>> reasonable to interpret that it is describing the entire file and not just
>>>> the portion of the file below that statement?
>>>
>>> ... I don't feel that is reasonable. It's not how "use" works in general.
>>> "Use" affects the file being parsed from that point on. If you want it to
>>> affect the whole file, put it as the first statement.
>>>
>>> But I don't get what you want to achieve, what problem you want to solve with
>>> this solution?
>>
>> I propose that if we don't want to explicitly support mixed encodings then
>> explicit encoding declarations are a special case (that can be clearly
>> documented) where their effect should be retroactive to describe the whole
>> file, because logically that's the only thing that makes sense (for a non
>> mixed encoding file, declaring any part of it as UTF-8 is logically saying the
>> whole file is UTF-8), even if it does happen to have the form of a "use"
>> statement. -- Darren Duncan
>
> Well, there actually is one mixed encoding that makes sense, ASCII up until the
> 'use utf8', Unicode after that. I would assume this is the mental model most
> people have.

Yes, that is trivially the case. I was more concerned about mutually
incompatible encodings, such as non-ASCII Latin1 characters plus non-ASCII UTF-8
characters in the same file. -- Darren Duncan

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

perl.p5p at rjbs

Aug 8, 2021, 3:50 PM

Post #31 of 50 (1945 views)

On Sat, Aug 7, 2021, at 11:55 AM, Dan Book wrote:
> Yes. POD parsers and the perl interpreter do not read the same parts of the file, ever. Thus they must each indicate to their corresponding parsers what encoding they contain.

Assuming you mean this exactly as written, I don't believe this is true.

use v5.34.0;
use warnings;

my $string = <<'END';
=encoding utf8

This is løvely døcumentation.

=cut
END

say $string;

Then…

dinah:~$ perl demo.pl
=encoding utf8

This is løvely døcumentation.

=cut

dinah:~$ pod2text demo.pl
This is løvely døcumentation.

It's all a muddle.

--
rjbs

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

grinnz at gmail

Aug 8, 2021, 3:54 PM

Post #32 of 50 (1945 views)

On Sun, Aug 8, 2021 at 6:52 PM Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

> On Sat, Aug 7, 2021, at 11:55 AM, Dan Book wrote:
>
> Yes. POD parsers and the perl interpreter do not read the same parts of
> the file, ever. Thus they must each indicate to their corresponding parsers
> what encoding they contain.
>
>
> Assuming you mean this exactly as written, I don't believe this is true.
>
> use v5.34.0;
> use warnings;
>
> my $string = <<'END';
> =encoding utf8
>
> This is løvely døcumentation.
>
> =cut
> END
>
> say $string;
>
>
> Then…
>
> dinah:~$ perl demo.pl
> =encoding utf8
>
> This is løvely døcumentation.
>
> =cut
>
> dinah:~$ pod2text demo.pl
> This is løvely døcumentation.
>
>
> It's all a muddle.
>

Yes, I worded it imprecisely. But it remains that even when abusing the
parsers to see each other's components, they still only follow their own
encoding declarations.

-Dan

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

david at cantrell

Aug 12, 2021, 7:48 AM

Post #33 of 50 (1943 views)

On Fri, Aug 06, 2021 at 11:45:21PM -0700, Darren Duncan wrote:
> On 2021-08-06 5:01 p.m., Chris Prather wrote:
> >>On Aug 6, 2021, at 2:58 PM, Dan Book <grinnz@gmail.com> wrote:
> >>
> >>I don't think it's comparable. Comments are parsed by the perl
> >>interpreter, but POD is not, except to find the end of the POD.
> >
> >Additionally comments don't as far as I know have a way to declare their
> >encoding.
> Question: Is there ever a real life scenario where a single source file is
> not entirely the same encoding?

Sure. Some code, in utf8, and then a binary blob in __DATA__ which the
utf8 code reads and parses.

It's not good practice, but it's what you had to do to easily distribute
data in a CPAN distribution before File::ShareDir::Install existed.

--
David Cantrell | top google result for "topless karaoke murders"

When a man is tired of London, he is tired of life
-- Samuel Johnson

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

grinnz at gmail

Aug 12, 2021, 8:08 AM

Post #34 of 50 (1943 views)

On Thu, Aug 12, 2021 at 11:04 AM David Cantrell <david@cantrell.org.uk>
wrote:

> On Fri, Aug 06, 2021 at 11:45:21PM -0700, Darren Duncan wrote:
> > On 2021-08-06 5:01 p.m., Chris Prather wrote:
> > >>On Aug 6, 2021, at 2:58 PM, Dan Book <grinnz@gmail.com> wrote:
> > >>
> > >>I don't think it's comparable. Comments are parsed by the perl
> > >>interpreter, but POD is not, except to find the end of the POD.
> > >
> > >Additionally comments don't as far as I know have a way to declare their
> > >encoding.
> > Question: Is there ever a real life scenario where a single source file is
> > not entirely the same encoding?
>
> Sure. Some code, in utf8, and then a binary blob in __DATA__ which the
> utf8 code reads and parses.
>
> It's not good practice, but it's what you had to do to easily distribute
> data in a CPAN distribution before File::ShareDir::Install existed.
>

Note this will currently break because the filehandle *is* shared between
code and DATA, unlike with POD. "use utf8" applies to both. But I would
consider the use case of non-textual data in DATA exceedingly rare.

-Dan

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

Aug 16, 2021, 5:00 AM

Post #35 of 50 (1943 views)

On Fri, Aug 6, 2021 at 5:23 PM Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>
> Porters,
>
> I recently posted the suggestion that "use v5.36.0" should imply "use utf8", which led to a pretty large thread in which Felipe Gasper repeatedly said "This is going to make things worse, not better." I spent a lot of time grumbling about this to myself, figuring out exactly how to rebut this, and then deciding that I tentatively, partly, agreed with him.
>
> We want each improvement to be a ratcheting up in language usability, when possible, rather than "we made things worse so we could make them better." At present, because we don't (and can't) know whether a string is text or bytes, we don't (and can't) automatically encode it when it hits a bytestream. We also don't know reliably whether a given output handle is already expecting to do that encoding for us.
>
> I am 100% certain that adding "use utf8" to the feature bundle would be better for me, but I already have a pretty strong grasp of the I/O model of Perl. I'm not sure it's better enough for everybody.
>
> At the PSC, we had a long talk about this, and another proposal was made:
>
> We introduce a new stricture, which I'll call "source_encoding". Under "use strict 'source_encoding'", the compiler will raise an exception when the source contains non-ASCII content unless the utf8 pragma is in effect. The error raised can drive the programmer to documentation explaining the various trade-offs. That is: you can turn on utf8 and deal with how this affects your I/O, or you can disable the stricture, or you can restate your non-ASCII content as ASCII by using escaping constructs.
>
> I'm not sure this is an improvement, but I think it is. This prevents the "I forgot to add utf8 and so only discovered after runtime that I have doubly-encoded my output" bug.
>
> --
> rjbs

After thinking about this again, I had another idea.

The reason implying 'use utf8' is a problem is because of the impact
it has on string semantics. Maybe we can just have it not impact
string semantics. Make 'use v5.36.0;' decode the source as UTF-8, but
store string literals as byte strings rather than characters. The
strings would still be required to be UTF-8 encoded, but would be
stored with the utf8 flag off. This would allow using UTF-8 encoded
content in comments, Pod, or even in function names, but would not
create the confusion with strings and IO.

This seems possibly hard to document, which may indicate that it is a
terrible idea.

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

Aug 16, 2021, 5:05 AM

Post #36 of 50 (1943 views)

On Fri, Aug 6, 2021 at 8:23 PM Paul "LeoNerd" Evans
<leonerd@leonerd.org.uk> wrote:
> It's far easier for everyone - implementation and users alike - to give
> a very simple rule:
>
> After `use VERSION>=5.36` and until any `use utf8` there must be no
> non-ASCII bytes whatsoever.

This would need the addition of "unless they are after an __END__
marker". I think it will inevitably make the rules for UTF-8 Pod
confusing.
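As a rough illustration of the rule with that exception added, here is a line-based sketch. It is only an approximation: the real check would live in the tokenizer, and the helper name first_non_ascii_line is hypothetical. It ignores quoting, heredocs and POD entirely.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the 1-based number of the first line holding
# a non-ASCII byte before any "use utf8" or "__END__" line, or nothing if
# there is no such line.
sub first_non_ascii_line {
    my ($source) = @_;
    my $line_no = 0;
    for my $line (split /\n/, $source) {
        $line_no++;
        last if $line =~ /^\s*use\s+utf8\b/;    # rule stops applying here
        last if $line =~ /^__END__\s*$/;        # the exception suggested above
        return $line_no if $line =~ /[^\x00-\x7F]/;
    }
    return;
}

my $bad = "use v5.36;\nmy \$s = \"\xC3\xA9\";\n";
my $hit = first_non_ascii_line($bad);
print defined $hit ? "non-ASCII byte on line $hit\n" : "source is clean\n";
```

Even in this toy form the rule needs two separate stopping conditions, which hints at how the documentation would have to read.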

>
> Yes this does lead to an annoying dual declaration of both `use utf8`
> and `=encoding UTF-8` - perhaps that can be helped in some way?

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

felipe at felipegasper

Aug 16, 2021, 5:51 AM

Post #37 of 50 (1943 views)

> On Aug 16, 2021, at 8:00 AM, Graham Knop <haarg@haarg.org> wrote:
>
> After thinking about this again, I had another idea.
>
> The reason implying 'use utf8' is a problem is because of the impact
> it has on string semantics. Maybe we can just have it not impact
> string semantics. Make 'use v5.36.0;' decode the source as UTF-8, but
> store string literals as byte strings rather than characters. The
> strings would still be required to be UTF-8 encoded, but would be
> stored with the utf8 flag off. This would allow using UTF-8 encoded
> content in comments, Pod, or even in function names, but would not
> create the confusion with strings and IO.

I thought of this sometime back, but more in the context of adding flexibility to utf8.pm:

{
    use utf8 decode => 'no_strings';    # What Graham envisions
    my $foo = "é";                      # 2 code points
}

{
    use utf8 decode => 'all';           # status quo
    my $foo = "é";                      # 1 code point
}

I personally would think decode=no_strings could be added to the feature bundle with little trouble. The use case for leaving strings undecoded doesn’t seem to apply for things besides strings.

-F

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

Aug 16, 2021, 6:27 AM

Post #38 of 50 (1943 views)

On Mon, 16 Aug 2021 08:51:30 -0400, Felipe Gasper
<felipe@felipegasper.com> wrote:

> > On Aug 16, 2021, at 8:00 AM, Graham Knop <haarg@haarg.org> wrote:
> >
> > After thinking about this again, I had another idea.
> >
> > The reason implying 'use utf8' is a problem is because of the impact
> > it has on string semantics. Maybe we can just have it not impact
> > string semantics. Make 'use v5.36.0;' decode the source as UTF-8,
> > but store string literals as byte strings rather than characters.
> > The strings would still be required to be UTF-8 encoded, but would
> > be stored with the utf8 flag off. This would allow using UTF-8
> > encoded content in comments, Pod, or even in function names, but
> > would not create the confusion with strings and IO.
>
> I thought of this sometime back, but more in the context of adding
> flexibility to utf8.pm:
>
> {
> use utf8 decode => 'no_strings'; # What Graham envisions
> my $foo = "é"; # 2 code points
> }
>
> {
> use utf8 decode => 'all'; # status quo
> my $foo = "é"; # 1 code point
> }
>
> I personally would think decode=no_strings could be added to the
> feature bundle with little trouble. The use case for leaving strings
> undecoded doesn’t seem to apply for things besides strings.

In that vein, to ease porting from older ISO-encoded source files:

{ use utf8 decode => 'no_strings'; # What Graham envisions
my $foo = "é"; # 2 code points
}
{ use utf8 decode => 'all'; # status quo
my $foo = "é"; # 1 code point
}
{ use utf8 convert => "utf-8"; # or convert ISO => "UTF-8"
my $foo = "é"; # This ISO-8859-1 é will be upgraded to UTF-8
} # 1 codepoint

If well-documented and completely lexical, the path forward is
extremely easy and fast, and it will trigger coders to make their code
more 2021+. Note that a lot of software was written in times where
editors did not have a clue about multibyte encodings and (Windows)
people still used Alt-234 and the like to enter diacriticals.

> -F

--
H.Merijn Brand https://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.33 porting perl5 on HP-UX, AIX, and Linux
https://tux.nl/email.html http://qa.perl.org https://www.test-smoke.org

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

kimoto.yuki at gmail

Aug 16, 2021, 9:19 PM

Post #39 of 50 (1943 views)

2021-8-7 0:23 Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

>
> I'm not *sure* this is an improvement, but I think it is. This prevents
> the "I forgot to add utf8 and so only discovered after runtime that I have
> doubly-encoded my output" bug.
>
>
Is it okay that our consensus is to write Perl source code using UTF-8 in
the future?

As a first step, this prevents unpredictable Latin-1 bugs:

- Under "use feature 'source_encoding';" the source must be ASCII only (ASCII
is a subset of UTF-8).
- Under "use feature 'source_encoding'; use utf8;" the source must be UTF-8.

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

david at cantrell

Aug 17, 2021, 2:28 AM

Post #40 of 50 (1943 views)

On Thu, Aug 12, 2021 at 11:08:48AM -0400, Dan Book wrote:
> On Thu, Aug 12, 2021 at 11:04 AM David Cantrell <david@cantrell.org.uk>
> wrote:
> > On Fri, Aug 06, 2021 at 11:45:21PM -0700, Darren Duncan wrote:
> > > Question: Is there ever a real life scenario where a single
> > > source file is not entirely the same encoding?
> > Sure. Some code, in utf8, and then a binary blob in __DATA__ which the
> > utf8 code reads and parses.
> >
> > It's not good practice, but it's what you had to do to easily distribute
> > data in a CPAN distribution before File::ShareDir::Install existed.
> Note this will currently break because the filehandle *is* shared between
> code and DATA, unlike with POD. "use utf8" applies to both. But I would
> consider the use case of non-textual data in DATA exceedingly rare.

I stopped doing it a few years ago. I think I've seen it in test suites
for some image-processing modules in the past.

--
David Cantrell | Godless Liberal Elitist

You can't spell AWESOME without ME!

RE: "use v5.36.0" should imply ASCII source [ In reply to ]

Vadim.Konovalov at dell

Aug 17, 2021, 3:13 AM

Post #41 of 50 (1943 views)

From: David Cantrell

> On Thu, Aug 12, 2021 at 11:08:48AM -0400, Dan Book wrote:
> > On Thu, Aug 12, 2021 at 11:04 AM David Cantrell wrote:
> > > On Fri, Aug 06, 2021 at 11:45:21PM -0700, Darren Duncan wrote:
> > > > Question: Is there ever a real life scenario where a single
> > > > source file is not entirely the same encoding?
> > > Sure. Some code, in utf8, and then a binary blob in __DATA__ which
> > > the
> > > utf8 code reads and parses.
> > >
> > > It's not good practice, but it's what you had to do to easily
> > > distribute data in a CPAN distribution before File::ShareDir::Install existed.
> > Note this will currently break because the filehandle *is* shared
> > between code and DATA, unlike with POD. "use utf8" applies to both.
> > But I would consider the use case of non-textual data in DATA exceedingly rare.
>
> I stopped doing it a few years ago. I think I've seen it in test suites for
> some image-processing modules in the past.

My use case for binary __DATA__ was keeping Compress::Zlib-compressed data
there and uncompressing it at run time. I used this in production back in the
5.8.8 era, and it was rather nice to my taste.

I'm not using this technique anymore, though.


Re: "use v5.36.0" should imply ASCII source [ In reply to ]

Aug 18, 2021, 3:39 AM

Post #42 of 50 (1942 views)

On Sat, Aug 07, 2021 at 02:01:14PM +0200, Martijn Lievaart wrote:
> Op 07-08-2021 om 08:51 schreef Darren Duncan:

> > If there is a "use utf8;" anywhere in a Perl file, would it not be
> > reasonable to interpret that it is describing the entire file and not
> > just the portion of the file below that statement?
>
>
> ... I don't feel that is reasonable. It's not how "use" works in general.
> "Use" affects the file being parsed from that point on. If you want it to
> affect the whole file, put it as the first statement.

It's not just "not reasonable" - it's not possible.

The perl parser can't restart.

Things like reading source code from a pipe (or terminal) could be worked
around (with sufficient buffering), but that's not the fatal problem here.

The problem is that all actions of the parser happen immediately, and
are committed to the symbol table as they are done. So for starters:

sub foo {
    # I am good
}

sub bar {
    1 2 3;
    # I have a syntax error
}

will generate a definition for &foo before it even starts parsing bar,
and the failure to parse bar won't delete the definition of &foo.

So if you wrap the above in an eval (or similar - a require or do in an
eval) and trap the error, you still get &foo.
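A small self-contained sketch of that experiment follows. It is an illustration rather than code from this message: the subs are given trivial bodies and compiled through a string eval, and the variable names are made up.

```perl
use strict;
use warnings;

# Source text with one good sub and one sub containing a syntax error.
my $source = <<'END_SOURCE';
sub foo { return "I am good" }
sub bar { 1 2 3; }
END_SOURCE

# Compile it inside a string eval and trap the failure.
my $compiled = eval "$source; 1";
my $error    = $@;

print "compilation ", ($compiled ? "succeeded" : "failed"), "\n";
print "foo is ", (defined &foo ? "defined" : "not defined"), "\n";
```

If the parser behaves as described, the eval fails on bar while foo stays installed in the symbol table.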

BEGIN blocks give many many more ways for globally visible side effects to
happen immediately.

So there simply isn't a way to rewind and redo the parse of a file, because
the parsing of a file is not a transaction that ultimately commits or
rolls back - it's kind of AutoCommit, potentially at a
statement-by-statement granularity.

Nicholas Clark

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

perl.p5p at rjbs

Oct 3, 2021, 11:56 AM

Post #43 of 50 (1857 views)

On Mon, Aug 16, 2021, at 8:00 AM, Graham Knop wrote:
> On Fri, Aug 6, 2021 at 5:23 PM Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>> At the PSC, we had a long talk about this, and another proposal was made:
>>
>> We introduce a new stricture, which I'll call "source_encoding". Under "use strict 'source_encoding'", the compiler will raise an exception when the source contains non-ASCII content unless the utf8 pragma is in effect. The error raised can drive the programmer to documentation explaining the various trade-offs. That is: you can turn on utf8 and deal with how this affects your I/O, or you can disable the stricture, or you can restate your non-ASCII content as ASCII by using escaping constructs.
>
> After thinking about this again, I had another idea.
>
> The reason implying 'use utf8' is a problem is because of the impact it has on string semantics. Maybe we can just have it not impact string semantics. Make 'use v5.36.0;' decode the source as UTF-8, but store string literals as byte strings rather than characters. The strings would still be required to be UTF-8 encoded, but would be stored with the utf8 flag off. This would allow using UTF-8 encoded content in comments, Pod, or even in function names, but would not create the confusion with strings and IO.

I said I'd write a reply to this and I didn't. *Mea culpa*.

I think there are two big questions, here:

*ONE:* What's the end state we'd like to get to?

*TWO:* What's a good next step, keeping in mind that we might not ever get past that next step?

My take is this: The end state I'd like is that strings are in one of three states: declared text, declared bytes, unknown. Semantics exist for how to combine these and deal with I/O discipline. The source code is Unicode and string literals are assumed to be text. A new string literal syntax exists for byte strings, like `qb"..."`.

For my money, a useful next step is that we encourage people to opt in to "source code is Unicode and string literals are text." This means that the programmer is then responsible for thinking about how this will affect their I/O. That concern is already there; we're just pushing around the complexity like a lump under the rug. I think this push is a good one. It lets us enable non-ASCII syntax, and it's pretty well understood. Also, we already have something for qb"..." in the form of "do { use bytes; qq{...} }" but we could probably add a qb, too, if we needed it.

--
rjbs

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

grinnz at gmail

Oct 3, 2021, 12:47 PM

Post #44 of 50 (1857 views)

On Sun, Oct 3, 2021 at 2:57 PM Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

> On Mon, Aug 16, 2021, at 8:00 AM, Graham Knop wrote:
>
> On Fri, Aug 6, 2021 at 5:23 PM Ricardo Signes <perl.p5p@rjbs.manxome.org>
> wrote:
>
> At the PSC, we had a long talk about this, and another proposal was made:
>
> We introduce a new stricture, which I'll call "source_encoding". Under
> "use strict 'source_encoding'", the compiler will raise an exception when
> the source contains non-ASCII content unless the utf8 pragma is in effect.
> The error raised can drive the programmer to documentation explaining the
> various trade-offs. That is: you can turn on utf8 and deal with how this
> affects your I/O, or you can disable the stricture, or you can restate your
> non-ASCII content as ASCII by using escaping constructs.
>
>
> After thinking about this again, I had another idea.
>
> The reason implying 'use utf8' is a problem is because of the impact it
> has on string semantics. Maybe we can just have it not impact string
> semantics. Make 'use v5.36.0;' decode the source as UTF-8, but store string
> literals as byte strings rather than characters. The strings would still be
> required to be UTF-8 encoded, but would be stored with the utf8 flag off.
> This would allow using UTF-8 encoded content in comments, Pod, or even in
> function names, but would not create the confusion with strings and IO.
>
>
> I said I'd write a reply to this and I didn't. *Mea culpa*.
>
> I think there are two big questions, here:
>
> *ONE:* What's the end state we'd like to get to?
>
> *TWO:* What's a good next step, keeping in mind that we might not ever
> get past that next step?
>
> My take is this: The end state I'd like is that strings are in one of
> three states: declared text, declared bytes, unknown. Semantics exist for
> how to combine these and deal with I/O discipline. The source code is
> Unicode and string literals are assumed to be text. A new string literal
> syntax exists for byte strings, like qb"...".
>
> For my money, a useful next step is that we encourage people to opt-in to
> "source code is unicode and string literals are text." This means that the
> programmer is then responsible for thinking about how this will affect
> their I/O. That concern is already there, we're just pushing around the
> complexity like a lump under the rug. I think this push is a good one. It
> lets us enable non-ASCII syntax, and it's pretty well understood. Also, we
> already have something for qb"...." in the form of "do { use bytes; qq{...}
> }" but we could probably add a qb, too, if we needed it.
>

"use bytes" is an abstraction breakage, not an interface, so I would prefer
the qb alternative, unless and until "use bytes" did nothing other than
what "no utf8" currently does (but that could be an alternative for your
suggestion).

I agree very much with the end state proposed. I like the proposed next
step but I don't know how we get there. Even spreading understanding of the
current semantics is an uphill battle; too many people just don't
understand encoding, and that has to be baked into our approach. I think it
is possible, but not easy, to sufficiently document a new assumption for
whatever shape this feature may take. It's problematic that making
"assumption failures" reliably obvious when they occur is difficult to
impossible, ironically the sort of problem we are trying to fix here. I
don't have a conclusion here except that the most useful option won't
necessarily be the most expected (nor is the current state).

-Dan

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

kimoto.yuki at gmail

Oct 4, 2021, 1:45 AM

Post #45 of 50 (1857 views)

2021-10-4 3:57 Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

>
> *ONE:* What's the end state we'd like to get to?
>
>
I have a question.

echo -e '１' | perl -p -E 's/\d/1/'

The '１' in the echo argument is a Japanese fullwidth digit one, encoded as UTF-8. The output I would like is ASCII 1.

Current Output(UTF-8 １)

１

Ideal Output(ASCII 1)

1

Ideally, do you want this to work on UNIX/Linux systems?

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

felipe at felipegasper

Oct 4, 2021, 6:22 AM

Post #46 of 50 (1857 views)

> On Oct 4, 2021, at 4:45 AM, Yuki Kimoto <kimoto.yuki@gmail.com> wrote:
>
>
> 2021-10-4 3:57 Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>
> ONE: What's the end state we'd like to get to?
>
>
> I have a question.
>
> echo -e '１' | perl -p -E 's/\d/1/'
>
> '１' of echo argument is Japanese UTF-8. Output is ASCII 1.
>
> Current Output(UTF-8 １)
>
> １
>
> Ideal Output(ASCII 1)
>
> 1
>
> Do you want this to work ideally in the UNIX/Linux system?

For that to happen you would pass the `-CIO` flag to perl, which causes STDIN & STDOUT to automatically decode/encode UTF-8.

The one-liner as-is outputs "\xef\xbc\x91" (U+FF11 in UTF-8) instead of ASCII 1 because those 3 bytes are what Perl receives on STDIN, and nothing is decoding those to U+FF11. Your s/\d/1/ only works on *digits*, and none of U+00EF, U+00BC, or U+0091 is. So no change happens.
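
To illustrate the difference, here is a minimal sketch. It assumes the input is exactly the three UTF-8 bytes of U+FF11 and uses the core Encode module for the decoding that `-CIO` would otherwise do on the handles; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The three UTF-8 bytes of U+FF11 (FULLWIDTH DIGIT ONE), as read from an
# undecoded STDIN.
my $bytes = "\xef\xbc\x91";

# Undecoded: Perl sees three characters (U+00EF, U+00BC, U+0091).
# None of them is a digit, so the substitution does nothing.
(my $raw = $bytes) =~ s/\d/1/;

# Decoded: Perl sees one character, U+FF11, which \d matches.
my $text = decode('UTF-8', $bytes);
(my $fixed = $text) =~ s/\d/1/;

printf "undecoded input changed: %s\n", ($raw eq $bytes ? "no" : "yes");
printf "decoded input becomes:   %s\n", encode('UTF-8', $fixed);
```

With `-CIO` the decode and encode steps happen on STDIN and STDOUT themselves, so the one-liner needs no explicit Encode calls.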

-FG

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

kimoto.yuki at gmail

Oct 5, 2021, 1:25 AM

Post #47 of 50 (1857 views)

2021-10-4 22:21 Felipe Gasper <felipe@felipegasper.com> wrote:

>
> > On Oct 4, 2021, at 4:45 AM, Yuki Kimoto <kimoto.yuki@gmail.com> wrote:
> >
> >
> > 2021-10-4 3:57 Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
> >
> > ONE: What's the end state we'd like to get to?
> >
> >
> > I have a question.
> >
> > echo -e '１' | perl -p -E 's/\d/1/'
> >
> > '１' of echo argument is Japanese UTF-8. Output is ASCII 1.
> >
> > Current Output(UTF-8 １)
> >
> > １
> >
> > Ideal Output(ASCII 1)
> >
> > 1
> >
> > Do you want this to work ideally in the UNIX/Linux system?
>
> For that to happen you would pass the `-CIO` flag to perl, which causes
> STDIN & STDOUT to automatically decode/encode UTF-8.
>
> The one-liner as-is outputs "\xef\xbc\x91" (U+FF11 in UTF-8) instead of
> ASCII 1 because those 3 bytes are what Perl receives on STDIN, and nothing
> is decoding those to U+FF11. Your s/\d/1/ only works on *digits*, and none
> of U+00EF, U+00BC, or U+0091 is. So no change happens.
>
> -FG

I understand: to get that result, I can use the -CIO flag. I will spend
some time learning these flags.

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

kimoto.yuki at gmail

Oct 6, 2021, 11:06 PM

Post #48 of 50 (1856 views)

Ric asks "ONE: What's the end state we'd like to get to?".

I'm thinking about the goal.

1. Enable utf8 by default under "use vX" in the future.

2. A one-liner can also be written in the same way as a normal Perl
program that uses "use utf8", "Encode::decode" and "Encode::encode".

Point 2 needs a little more description.

----------------------------------------------------
The source is UTF-8, and input strings are decoded from UTF-8 (arguments (A),
stdin (I), input file streams (i)),

and output strings are encoded to UTF-8 (stdout (O), stderr (E), output file
streams (o)).

In a one-liner, I need to write it the following way. -CSAD is the same as -CIOEAio.

echo -e '???' | perl -Mutf8 -CSAD -p -e 's/\d\w?/1ai/'

Input

???

Output

1ai

The replacement is successful as expected.

I want to be able to write this more easily, for example with a --utf8 option.

echo -e '???' | perl --utf8 -p -e 's/\d\w?/1ai/'
----------------------------------------------------------------------------------
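
For comparison, here is a rough sketch of what a script has to do by hand to get the effect of `-Mutf8 -CSAD`. It is an approximation, not a definition of those switches: `decode_args` is a hypothetical helper name, and in-memory handles stand in for STDIN and STDOUT so that the sketch is self-contained. In a real script the same layer could be applied with `binmode(STDIN, ':encoding(UTF-8)')` and `binmode(STDOUT, ':encoding(UTF-8)')`.

```perl
use strict;
use warnings;
use utf8;                     # like -Mutf8: the source is UTF-8 text
use Encode qw(decode);

# Like -CA: decode command-line arguments from UTF-8.
# (decode_args is a hypothetical helper name.)
sub decode_args { return map { decode('UTF-8', $_) } @_ }

# Stand-ins for STDIN and STDOUT: in-memory handles with a UTF-8 layer,
# similar in effect to -CIO on the real handles.
my $stdin_bytes  = "\xef\xbc\x91\n";    # U+FF11 encoded as UTF-8, plus newline
my $stdout_bytes = '';
open my $in,  '<:encoding(UTF-8)', \$stdin_bytes  or die "open in: $!";
open my $out, '>:encoding(UTF-8)', \$stdout_bytes or die "open out: $!";

while (my $line = <$in>) {
    $line =~ s/\d/1/;         # \d sees one decoded character, not three bytes
    print {$out} $line;
}
close $out or die "close out: $!";

# The UTF-8 bytes of U+3042 become a single character.
my @args = decode_args("\xe3\x81\x82");
printf "output bytes: %s", $stdout_bytes;
printf "argument length in characters: %d\n", length $args[0];
```

This is the boilerplate that a single switch such as the suggested --utf8 would hide.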

I think this is independent of the topic of the string flag.

What do you think?

2021-10-5 17:25 Yuki Kimoto <kimoto.yuki@gmail.com> wrote:

>
>
> 2021-10-4 22:21 Felipe Gasper <felipe@felipegasper.com> wrote:
>
>>
>> > On Oct 4, 2021, at 4:45 AM, Yuki Kimoto <kimoto.yuki@gmail.com> wrote:
>> >
>> >
>> > 2021-10-4 3:57 Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>> >
>> > ONE: What's the end state we'd like to get to?
>> >
>> >
>> > I have a question.
>> >
>> > echo -e '１' | perl -p -E 's/\d/1/'
>> >
>> > '１' of echo argument is Japanese UTF-8. Output is ASCII 1.
>> >
>> > Current Output(UTF-8 １)
>> >
>> > １
>> >
>> > Ideal Output(ASCII 1)
>> >
>> > 1
>> >
>> > Do you want this to work ideally in the UNIX/Linux system?
>>
>> For that to happen you would pass the `-CIO` flag to perl, which causes
>> STDIN & STDOUT to automatically decode/encode UTF-8.
>>
>> The one-liner as-is outputs "\xef\xbc\x91" (U+FF11 in UTF-8) instead of
>> ASCII 1 because those 3 bytes are what Perl receives on STDIN, and nothing
>> is decoding those to U+FF11. Your s/\d/1/ only works on *digits*, and none
>> of U+00EF, U+00BC, or U+0091 is. So no change happens.
>>
>> -FG
>
>
> I understand: to get that result, I can use the -CIO flag. I will spend
> some time learning these flags.
>
>

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

perl.p5p at rjbs

Nov 21, 2021, 1:00 PM

Post #49 of 50 (1595 views)

On Sun, Oct 3, 2021, at 2:56 PM, Ricardo Signes wrote:
> *ONE:* What's the end state we'd like to get to?
>
> *TWO:* What's a good next step, keeping in mind that we might not ever get past that next step?
>
> My take is this: The end state I'd like is that strings are in one of three states: declared text, declared bytes, unknown. Semantics exist for how to combine these and deal with I/O discipline. The source code is Unicode and string literals are assumed to be text. A new string literal syntax exists for byte strings, like `qb"..."`.
>
> For my money, a useful next step is that we encourage people to opt-in to "source code is unicode and string literals are text." This means that the programmer is then responsible for thinking about how this will affect their I/O. That concern is already there, we're just pushing around the complexity like a lump under the rug. I think this push is a good one. It lets us enable non-ASCII syntax, and it's pretty well understood. Also, we already have something for qb"...." in the form of "do { use bytes; qq{...} }" but we could probably add a qb, too, if we needed it.

I want to bump this thread, noting: I filed a draft RFC <https://github.com/Perl/RFCs/pull/5> on this, and I think it's good to move forward. (I think we can defer the question of "what utf8 do you get with *use utf8*", and of making that consistent, to future consideration. I don't think there's a practical argument to be made that we should keep its current weirdness.)

I do think that creating improvements for non-ASCII syntax is a compelling step we can take in the near future, but for now, I would like to still have source encoding as a pragma like this, which can be made ASCII by default under use vX.

--
rjbs

Re: "use v5.36.0" should imply ASCII source [ In reply to ]

kimoto.yuki at gmail

Nov 21, 2021, 3:57 PM

Post #50 of 50 (1595 views)

2021-11-22 6:00 Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

>
> My take is this: The end state I'd like is that strings are in one of
> three states: declared text, declared bytes, unknown. Semantics exist for
> how to combine these and deal with I/O discipline. The source code is
> Unicode and string literals are assumed to be text. A new string literal
> syntax exists for byte strings, like qb"...".
>
I think a flag for text is needed, instead of the confusing and often
misused utf8::is_utf8.

if (is_text($text)) {
    say Encode::encode('UTF-8', $text);
}
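
As a small illustration of why utf8::is_utf8 is easy to misuse as a "this is text" test: the same string can be stored with or without the internal flag, and the two copies still compare equal. The sketch uses only utf8::upgrade and utf8::is_utf8, which exist today; is_text above is a suggested function, not an existing one.

```perl
use strict;
use warnings;

my $native   = "caf\xe9";    # four characters, stored as native 8-bit bytes
my $upgraded = $native;
utf8::upgrade($upgraded);    # same four characters, stored internally as UTF-8

# The flag differs, but the strings are equal in every way a program can see.
printf "flag on native copy:   %s\n", utf8::is_utf8($native)   ? "on"  : "off";
printf "flag on upgraded copy: %s\n", utf8::is_utf8($upgraded) ? "on"  : "off";
printf "equal as strings:      %s\n", $native eq $upgraded     ? "yes" : "no";
```

The flag describes internal storage only, so it cannot tell decoded text apart from bytes.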

> For my money, a useful next step is that we encourage people to opt-in to
> "source code is unicode and string literals are text." This means that the
> programmer is then responsible for thinking about how this will affect
> their I/O. That concern is already there, we're just pushing around the
> complexity like a lump under the rug. I think this push is a good one. It
> lets us enable non-ASCII syntax, and it's pretty well understood. Also, we
> already have something for qb"...." in the form of "do { use bytes; qq{...}
> }" but we could probably add a qb, too, if we needed it.
>
>
I agree with this.

use v5.40;
# Text (a decoded string). The literal is interpreted as UTF-8.
my $text = "abcde";

# Bytes, if you need faster access by index (qb is the proposed syntax)
my $bytes = qb"abcde";
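
Since qb"..." is only proposed, here is a minimal sketch of the same text/bytes split using what exists today, assuming the core Encode module. The two hiragana characters are written as escapes so the example does not depend on the source encoding.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Text: two characters (U+3042 and U+3044).
my $text = "\x{3042}\x{3044}";

# Bytes: the UTF-8 encoding of that text, three octets per character.
my $bytes = encode('UTF-8', $text);

printf "text length:  %d characters\n", length $text;
printf "bytes length: %d octets\n",     length $bytes;

# Decoding the bytes gives back the original text.
my $round_trip = decode('UTF-8', $bytes);
printf "round trip ok: %s\n", $round_trip eq $text ? "yes" : "no";
```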