Mailing List Archive: "use v5.36.0" should imply ASCII source

perl.p5p at rjbs

Aug 6, 2021, 8:22 AM

Post #1 of 50 (2145 views)

Porters,

I recently posted the suggestion <http://markmail.org/message/wywgcbwhu2nhykxc> that "use v5.36.0" should imply "use utf8", which led to a pretty large thread in which Felipe Gasper repeatedly said "This is going to make things worse, not better." I spent a lot of time grumbling about this to myself, figuring out exactly how to rebut this, and then deciding that I tentatively, partly, agreed with him.

We want each improvement to be a ratcheting up in language usability, when possible, rather than "we made things worse so we could make them better." At present, because we don't (and can't) know whether a string is text or bytes, we don't (and can't) automatically encode it when it hits a bytestream. We also don't know reliably whether a given output handle is already expecting to do that encoding for us.

I am 100% certain that adding "use utf8" to the feature bundle would be better *for me*, but I already have a pretty strong grasp of the I/O model of Perl. I'm not sure it's better enough for everybody.

At the PSC, we had a long talk about this, and another proposal was made:

We introduce a new stricture, which I'll call "source_encoding". Under "use strict 'source_encoding'", the compiler will raise an exception when the source contains non-ASCII content unless the utf8 pragma is in effect. The error raised can drive the programmer to documentation explaining the various trade-offs. That is: you can turn on utf8 and deal with how this affects your I/O, or you can disable the stricture, or you can restate your non-ASCII content as ASCII by using escaping constructs.
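
As a small illustration of the third option, restating non-ASCII content with escapes, here is a minimal sketch. The variable names are invented for the example; the escape forms themselves are ordinary Perl.

```perl
use strict;
use warnings;

# Three ASCII-only spellings of the single character U+00E9 (e with acute).
# None of them needs "use utf8", because the source bytes are all ASCII.
my $hex_escape   = "\x{E9}";
my $named_escape = "\N{U+E9}";
my $from_chr     = chr(0xE9);

# All three produce the same one-character string.
if ($hex_escape eq $named_escape && $named_escape eq $from_chr) {
    print "all three spellings agree, length ", length($hex_escape), "\n";
}
```

Because the file stays pure ASCII, it would pass the proposed stricture whether or not utf8 is enabled.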

I'm not *sure* this is an improvement, but I think it is. This prevents the "I forgot to add utf8 and so only discovered after runtime that I have doubly-encoded my output" bug.
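
The double-encoding bug can be sketched as follows. To keep the example itself ASCII, the bytes perl would see for a UTF-8 "e with acute" literal are written with escapes; `hex_bytes` is a helper invented here, and `Encode::encode` stands in for what an `:encoding(UTF-8)` output layer would do.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

# Without "use utf8", a literal e-acute saved as UTF-8 reaches perl as two
# separate characters, one per source byte.
my $without_utf8 = "\xC3\xA9";

# With "use utf8", the same literal is one character, U+00E9.
my $with_utf8 = "\x{E9}";

# Encoding on output treats every character it is given as text.
my $double  = encode('UTF-8', $without_utf8);
my $correct = encode('UTF-8', $with_utf8);

print "without utf8: ", hex_bytes($double),  "\n";   # C3 83 C2 A9
print "with utf8:    ", hex_bytes($correct), "\n";   # C3 A9
```

The four-byte result is the mojibake that only shows up at runtime; the stricture would instead stop at compile time on the non-ASCII literal.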

--
rjbs

Re: "use v5.36.0" should imply ASCII source

grinnz at gmail

Aug 6, 2021, 8:45 AM

Post #2 of 50

On Fri, Aug 6, 2021 at 11:23 AM Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

> Porters,
>
> I recently posted the suggestion
> <http://markmail.org/message/wywgcbwhu2nhykxc> that "use v5.36.0" should
> imply "use utf8", which led to a pretty large thread in which Felipe Gasper
> repeatedly said "This is going to make things worse, not better." I spent
> a lot of time grumbling about this to myself, figuring out exactly how to
> rebut this, and then deciding that I tentatively, partly, agreed with him.
>
> We want each improvement to be a ratcheting up in language usability, when
> possible, rather than "we made things worse so we could make them better."
> At present, because we don't (and can't) know whether a string is text or
> bytes, we don't (and can't) automatically encode it when it hits a
> bytestream. We also don't know reliably whether a given output handle is
> already expecting to do that encoding for us.
>
> I am 100% certain that adding "use utf8" to the feature bundle would be
> better *for me*, but I already have a pretty strong grasp of the I/O
> model of Perl. I'm not sure it's better enough for everybody.
>
> At the PSC, we had a long talk about this, and another proposal was made:
>
> We introduce a new stricture, which I'll call "source_encoding". Under
> "use strict 'source_encoding'", the compiler will raise an exception when
> the source contains non-ASCII content unless the utf8 pragma is in effect.
> The error raised can drive the programmer to documentation explaining the
> various trade-offs. That is: you can turn on utf8 and deal with how this
> affects your I/O, or you can disable the stricture, or you can restate your
> non-ASCII content as ASCII by using escaping constructs.
>
> I'm not *sure* this is an improvement, but I think it is. This prevents
> the "I forgot to add utf8 and so only discovered after runtime that I have
> doubly-encoded my output" bug.
>

FWIW, this is roughly what was suggested by Zefram as part of his proposal
for utf8-by-default, phrased as
"deprecate the presence of non-ASCII bytes anywhere in a source file other
than in the scope of "use utf8".".
https://www.nntp.perl.org/group/perl.perl5.porters/2017/10/msg246838.html

-Dan

Re: "use v5.36.0" should imply ASCII source

felipe at felipegasper

Aug 6, 2021, 9:02 AM

Post #3 of 50

> On Aug 6, 2021, at 11:22 AM, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>
> Porters,
>
> I recently posted the suggestion that "use v5.36.0" should imply "use utf8", which led to a pretty large thread in which Felipe Gasper repeatedly said "This is going to make things worse, not better." I spent a lot of time grumbling about this to myself, figuring out exactly how to rebut this, and then deciding that I tentatively, partly, agreed with him.
>
> We want each improvement to be a ratcheting up in language usability, when possible, rather than "we made things worse so we could make them better." At present, because we don't (and can't) know whether a string is text or bytes, we don't (and can't) automatically encode it when it hits a bytestream. We also don't know reliably whether a given output handle is already expecting to do that encoding for us.
>
> I am 100% certain that adding "use utf8" to the feature bundle would be better for me, but I already have a pretty strong grasp of the I/O model of Perl. I'm not sure it's better enough for everybody.
>
> At the PSC, we had a long talk about this, and another proposal was made:
>
> We introduce a new stricture, which I'll call "source_encoding". Under "use strict 'source_encoding'", the compiler will raise an exception when the source contains non-ASCII content unless the utf8 pragma is in effect. The error raised can drive the programmer to documentation explaining the various trade-offs. That is: you can turn on utf8 and deal with how this affects your I/O, or you can disable the stricture, or you can restate your non-ASCII content as ASCII by using escaping constructs.
>
> I'm not sure this is an improvement, but I think it is. This prevents the "I forgot to add utf8 and so only discovered after runtime that I have doubly-encoded my output" bug.

This seems reasonable. It encourages decoding of UTF-8 characters while still allowing `print "hello world"` to be correct in modern Perl.

-FG

Re: "use v5.36.0" should imply ASCII source

Aug 6, 2021, 9:12 AM

Post #4 of 50

"Ricardo Signes" <perl.p5p@rjbs.manxome.org> wrote:
>At the PSC, we had a long talk about this, and another proposal was made:
>
>We introduce a new stricture, which I'll call "source_encoding". Under
>"use strict 'source_encoding'", the compiler will raise an exception when
>the source contains non-ASCII content unless the utf8 pragma is in effect.
>The error raised can drive the programmer to documentation explaining
>the various trade-offs. That is: you can turn on utf8 and deal with how
>this affects your I/O, or you can disable the stricture, or you can
>restate your non-ASCII content as ASCII by using escaping constructs.
>
>I'm not *sure* this is an improvement, but I think it is. This prevents
>the "I forgot to add utf8 and so only discovered after runtime that
>I have doubly-encoded my output" bug.

+1 - for me that's a big improvement over "turn utf8 on automatically":
the latter would have been a reason for me to avoid "use <version>", while
this would be a reason to use it.

Hugo

Re: "use v5.36.0" should imply ASCII source

darren at darrenduncan

Aug 6, 2021, 10:34 AM

Post #5 of 50

On 2021-08-06 8:22 a.m., Ricardo Signes wrote:
> At the PSC, we had a long talk about this, and another proposal was made:
>
> We introduce a new stricture, which I'll call "source_encoding". Under "use
> strict 'source_encoding'", the compiler will raise an exception when the source
> contains non-ASCII content unless the utf8 pragma is in effect. The error
> raised can drive the programmer to documentation explaining the various
> trade-offs. That is: you can turn on utf8 and deal with how this affects your
> I/O, or you can disable the stricture, or you can restate your non-ASCII content
> as ASCII by using escaping constructs.
>
> I'm not /sure/ this is an improvement, but I think it is. This prevents the "I
> forgot to add utf8 and so only discovered after runtime that I have
> doubly-encoded my output" bug.

+1

Personally I feel that this change is a great improvement, assuming I understand
it right.

So just to be clear, when you say ASCII, you mean pure 7-bit ASCII, which is a
proper subset of both UTF-8 and all the Latin encodings, and thus any source
files written in that will "just work" in both the most common Unicode AND
non-Unicode environments.

Would your new stricture, enabled as part of use 5.36, then fail every source
file that has any octet with a 1 in the 8th bit when that file doesn't also have
an explicit declaration of source encoding?

Because that is what I would expect given what you said.
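
To make the "1 in the 8th bit" test concrete, here is a minimal sketch. The function name and sample strings are invented for the example, and it checks raw octets only, with no knowledge of pragmas.

```perl
use strict;
use warnings;

# True if every octet in the byte string has its 8th bit clear (7-bit ASCII).
sub is_seven_bit_ascii {
    my ($bytes) = @_;
    return $bytes !~ /[\x80-\xFF]/;
}

my $ascii  = 'print "hello world\n";';     # same bytes in ASCII, Latin-1 and UTF-8
my $latin1 = "my \$s = \"caf\xE9\";";      # e-acute as one Latin-1 octet
my $utf8   = "my \$s = \"caf\xC3\xA9\";";  # e-acute as two UTF-8 octets

for my $case ([ascii => $ascii], [latin1 => $latin1], [utf8 => $utf8]) {
    my ($label, $bytes) = @$case;
    printf "%-6s %s\n", $label, is_seven_bit_ascii($bytes) ? 'accepted' : 'rejected';
}
```

Only the first sample is accepted; the Latin-1 and UTF-8 samples are both rejected, because the check does not need to know which encoding the high octets were meant to be.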

For my part, I expressly designed my portable data format MUON
https://github.com/muldis/Muldis_Object_Notation/blob/master/spec/Muldis_Object_Notation_Syntax_Plain_Text.md
so that the non-7-bit-ASCII character repertoire is forbidden literally in a
file except within quoted character string literals, and so one can parse
everything outside the quoted strings, the actual document structure, completely
without even having to know what the encoding is (it can be done in binary
mode), at least between UTF-8 vs Latin etc (and even for encodings that aren't),
and decoding the inside of strings is deferrable.

-- Darren Duncan

Re: "use v5.36.0" should imply ASCII source

Aug 6, 2021, 11:03 AM

Post #6 of 50

I think this might be reasonable, but I'm not certain about how we
would want it to interact with Pod.

Pod is often interleaved with code, and more likely to include names,
and thus, UTF-8 characters. The way to declare Pod content as UTF-8 is
separate from how it is declared for the source. Should users be
required to include both a 'use utf8;' and '=encoding UTF-8' if they
want to include UTF-8 characters in their documentation, even if their
code is pure ASCII?

On Fri, Aug 6, 2021 at 5:23 PM Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>
> Porters,
>
> I recently posted the suggestion that "use v5.36.0" should imply "use utf8", which led to a pretty large thread in which Felipe Gasper repeatedly said "This is going to make things worse, not better." I spent a lot of time grumbling about this to myself, figuring out exactly how to rebut this, and then deciding that I tentatively, partly, agreed with him.
>
> We want each improvement to be a ratcheting up in language usability, when possible, rather than "we made things worse so we could make them better." At present, because we don't (and can't) know whether a string is text or bytes, we don't (and can't) automatically encode it when it hits a bytestream. We also don't know reliably whether a given output handle is already expecting to do that encoding for us.
>
> I am 100% certain that adding "use utf8" to the feature bundle would be better for me, but I already have a pretty strong grasp of the I/O model of Perl. I'm not sure it's better enough for everybody.
>
> At the PSC, we had a long talk about this, and another proposal was made:
>
> We introduce a new stricture, which I'll call "source_encoding". Under "use strict 'source_encoding'", the compiler will raise an exception when the source contains non-ASCII content unless the utf8 pragma is in effect. The error raised can drive the programmer to documentation explaining the various trade-offs. That is: you can turn on utf8 and deal with how this affects your I/O, or you can disable the stricture, or you can restate your non-ASCII content as ASCII by using escaping constructs.
>
> I'm not sure this is an improvement, but I think it is. This prevents the "I forgot to add utf8 and so only discovered after runtime that I have doubly-encoded my output" bug.
>
> --
> rjbs

Re: "use v5.36.0" should imply ASCII source

grinnz at gmail

Aug 6, 2021, 11:15 AM

Post #7 of 50

On Fri, Aug 6, 2021 at 2:04 PM Graham Knop <haarg@haarg.org> wrote:

> I think this might be reasonable, but I'm not certain about how we
> would want it to interact with Pod.
>
> Pod is often interleaved with code, and more likely to include names,
> and thus, UTF-8 characters. The way to declare Pod content as UTF-8 is
> separate from how it is declared for the source. Should users be
> required to include both a 'use utf8;' and '=encoding UTF-8' if they
> want to include UTF-8 characters in their documentation, even if their
> code is pure ASCII?
>

It seems reasonable to me for this restriction to ignore POD, and only
apply to things "use utf8" applies to (so including __DATA__/__END__).
Though I'm not sure how that interacts internally.
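
A rough sketch of a check that ignores POD might look like the following. It is only an approximation written for illustration: `non_ascii_outside_pod` is an invented name, and the line-based POD detection is an assumption, not how the interpreter is implemented.

```perl
use strict;
use warnings;

# Crude approximation of POD skipping: a line starting with "=" and a letter
# opens a POD block, and a line starting with "=cut" closes it.
sub non_ascii_outside_pod {
    my ($source_bytes) = @_;
    my $in_pod = 0;
    for my $line (split /\n/, $source_bytes) {
        if ($line =~ /^=cut\b/)    { $in_pod = 0; next }
        if ($line =~ /^=[A-Za-z]/) { $in_pod = 1; next }
        next if $in_pod;
        return 1 if $line =~ /[\x80-\xFF]/;   # high-bit octet in code or data
    }
    return 0;
}

my $pod_only  = "print \"hi\\n\";\n\n=head1 AUTHOR\n\nAndr\xC3\xA9\n\n=cut\n";
my $in_string = "my \$n = \"Andr\xC3\xA9\";\n";

print "pod only:  ", non_ascii_outside_pod($pod_only),  "\n";   # 0
print "in string: ", non_ascii_outside_pod($in_string), "\n";   # 1
```

Lines after `__END__` or `__DATA__` are not treated specially here, so non-POD content there is still checked, in line with the suggestion above.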

-Dan

Re: "use v5.36.0" should imply ASCII source

leonerd at leonerd

Aug 6, 2021, 11:23 AM

Post #8 of 50

On Fri, 6 Aug 2021 20:03:48 +0200
Graham Knop <haarg@haarg.org> wrote:

> Pod is often interleaved with code, and more likely to include names,
> and thus, UTF-8 characters. The way to declare Pod content as UTF-8 is
> separate from how it is declared for the source. Should users be
> required to include both a 'use utf8;' and '=encoding UTF-8' if they
> want to include UTF-8 characters in their documentation, even if their
> code is pure ASCII?

We discussed this very question. The trouble is it's a quickly-slippery
slope. If you allow non-ASCII in POD, do you allow it in comments? The
same justification - it's common to write people's (non-English) names
in comments just as well as POD. Should it be allowed there?

This quickly leads to another weird entry in a future "Perl Quirks"
document 10 years down the line, where users complain that the rules of
non-ASCII are hard to guess and subtle and anyway PPR doesn't do it
right and also there are bugs in the parser and ...

It's far easier for everyone - implementation and users alike - to give
a very simple rule:

After `use VERSION>=5.36` and until any `use utf8` there must be no
non-ASCII bytes whatsoever.
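
The simple rule can be sketched as a line-by-line scan. This is a toy model under stated assumptions: `first_violation` is an invented name, pragmas are assumed to sit at the start of a line in the `use v5.NN` spelling, and real lexical scoping is ignored.

```perl
use strict;
use warnings;

# Returns the 1-based number of the first offending line, or undef if none.
sub first_violation {
    my ($source_bytes) = @_;
    my ($stricture, $utf8, $line_no) = (0, 0, 0);
    for my $line (split /\n/, $source_bytes) {
        $line_no++;
        $stricture = 1 if $line =~ /^\s*use\s+v5\.(\d+)/ && $1 >= 36;
        $utf8      = 1 if $line =~ /^\s*use\s+utf8\b/;
        # Everything counts: code, comments and POD alike.
        return $line_no if $stricture && !$utf8 && $line =~ /[\x80-\xFF]/;
    }
    return;
}

my $src = "use v5.36;\n# caf\xC3\xA9\nuse utf8;\n# caf\xC3\xA9\n";
print first_violation($src) // 'none', "\n";   # line 2 is flagged
```

The same comment on line 4 is fine because `use utf8` has been seen by then, which is the whole of the rule.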

Yes this does lead to an annoying dual declaration of both `use utf8`
and `=encoding UTF-8` - perhaps that can be helped in some way?

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/

Re: "use v5.36.0" should imply ASCII source

grinnz at gmail

Aug 6, 2021, 11:58 AM

Post #9 of 50

On Fri, Aug 6, 2021 at 2:23 PM Paul "LeoNerd" Evans <leonerd@leonerd.org.uk>
wrote:

> On Fri, 6 Aug 2021 20:03:48 +0200
> Graham Knop <haarg@haarg.org> wrote:
>
> > Pod is often interleaved with code, and more likely to include names,
> > and thus, UTF-8 characters. The way to declare Pod content as UTF-8 is
> > separate from how it is declared for the source. Should users be
> > required to include both a 'use utf8;' and '=encoding UTF-8' if they
> > want to include UTF-8 characters in their documentation, even if their
> > code is pure ASCII?
>
> We discussed this very question. The trouble is it's a quickly-slippery
> slope. If you allow non-ASCII in POD, do you allow it in comments? The
> same justification - it's common to write people's (non-English) names
> in comments just as well as POD. Should it be allowed there?
>
> This quickly leads to another weird entry in a future "Perl Quirks"
> document 10 years down the line, where users complain that the rules of
> non-ASCII are hard to guess and subtle and anyway PPR doesn't do it
> right and also there are bugs in the parser and ...
>
> It's far easier for everyone - implementation and users alike - to give
> a very simple rule:
>
> After `use VERSION>=5.36` and until any `use utf8` there must be no
> non-ASCII bytes whatsoever.
>
> Yes this does lead to an annoying dual declaration of both `use utf8`
> and `=encoding UTF-8` - perhaps that can be helped in some way?
>

I don't think it's comparable. Comments are parsed by the perl interpreter,
but POD is not, except to find the end of the POD.

-Dan

RE: "use v5.36.0" should imply ASCII source

wolf-dietrich_moeller at t-online

Aug 6, 2021, 12:52 PM

Post #10 of 50

Ricardo Signes wrote:

> We introduce a new stricture, which I'll call "source_encoding". Under
> "use strict 'source_encoding'", the compiler will raise an exception when
> the source contains non-ASCII content unless the utf8 pragma is in effect.
> The error raised can drive the programmer to documentation explaining the
> various trade-offs. That is: you can turn on utf8 and deal with how this
> affects your I/O, or you can disable the stricture, or you can restate your
> non-ASCII content as ASCII by using escaping constructs.

Question: If I understand correctly, this would be turned on just by "use
strict", even without "use v5.36". As many existing programs contain "use
strict", does this mean that all existing sources using Latin1 would fail?
I can imagine, that many programs use "use strict" (as its use is strongly
recommended, seeing the other discussion about turning it on with "use
v5.36").

Regards Wolf

Re: "use v5.36.0" should imply ASCII source

perl.p5p at rjbs

Aug 6, 2021, 3:05 PM

Post #11 of 50

On Fri, Aug 6, 2021, at 3:52 PM, Wolf-Dietrich Moeller (Munchen) wrote:
> Question: If I understand correctly, this would be turned on just by "use
> strict", even without "use v5.36".

No, that would not be the case.

--
rjbs

Re: "use v5.36.0" should imply ASCII source

perigrin at prather

Aug 6, 2021, 5:01 PM

Post #12 of 50

> On Aug 6, 2021, at 2:58 PM, Dan Book <grinnz@gmail.com> wrote:
>
> I don't think it's comparable. Comments are parsed by the perl interpreter, but POD is not, except to find the end of the POD.
>

Additionally comments don’t as far as I know have a way to declare their encoding.

-Chris

Re: "use v5.36.0" should imply ASCII source

darren at darrenduncan

Aug 6, 2021, 11:45 PM

Post #13 of 50

On 2021-08-06 5:01 p.m., Chris Prather wrote:
>> On Aug 6, 2021, at 2:58 PM, Dan Book <grinnz@gmail.com> wrote:
>>
>> I don't think it's comparable. Comments are parsed by the perl interpreter, but POD is not, except to find the end of the POD.
>
> Additionally comments don’t as far as I know have a way to declare their encoding.

Question: Is there ever a real life scenario where a single source file is not
entirely the same encoding? Can you reasonably have a single file containing
Perl code and POD where the Perl code is one character encoding and the POD is
another one? -- Darren Duncan

Re: "use v5.36.0" should imply ASCII source

darren at darrenduncan

Aug 6, 2021, 11:51 PM

Post #14 of 50

On 2021-08-06 11:45 p.m., Darren Duncan wrote:
> On 2021-08-06 5:01 p.m., Chris Prather wrote:
>>> On Aug 6, 2021, at 2:58 PM, Dan Book <grinnz@gmail.com> wrote:
>>>
>>> I don't think it's comparable. Comments are parsed by the perl interpreter,
>>> but POD is not, except to find the end of the POD.
>>
>> Additionally comments don’t as far as I know have a way to declare their
>> encoding.
>
> Question: Is there ever a real life scenario where a single source file is not
> entirely the same encoding? Can you reasonably have a single file containing
> Perl code and POD where the Perl code is one character encoding and the POD is
> another one? -- Darren Duncan

Or even putting aside the POD, is there any real scenario where a single file
consisting of only Perl code is in multiple encodings at once?

If there is a "use utf8;" anywhere in a Perl file, would it not be reasonable to
interpret that it is describing the entire file and not just the portion of the
file below that statement?

Perhaps a reasonable design would be that if a file contains a UTF-8 declaration
anywhere in it, the entire file is treated as such, both the part above and the
part below that declaration. And if multiple conflicting encoding declarations
exist for the current file, that is an error.

Basically the encoding declaration can be something that is scanned for in
advance of the regular parsing. Or, if done inline, encountering such would cause
the parser to restart at the beginning if the declaration is different than what
the parser was doing up to that point. Effectively the whole file is affected by
that declaration either way.

-- Darren Duncan

Re: "use v5.36.0" should imply ASCII source

Aug 7, 2021, 4:46 AM

Post #15 of 50

On Fri, Aug 06, 2021 at 11:51:01PM -0700, Darren Duncan wrote:
> Perhaps a reasonable design would be that if a file contains a UTF-8
> declaration anywhere in it, the entire file is treated as such, both the
> part above and the part below that declaration. And if multiple conflicting
> encoding declarations exist for the current file, that is an error.

How could that possibly work? If the 'use utf8' appears halfway through
the source file, does that retrospectively invalidate everything parsed so
far?

--
Monto Blanco... scorchio!

Re: "use v5.36.0" should imply ASCII source

Aug 7, 2021, 4:53 AM

Post #16 of 50

On Fri, Aug 06, 2021 at 11:45:21PM -0700, Darren Duncan wrote:
> Question: Is there ever a real life scenario where a single source file is
> not entirely the same encoding? Can you reasonably have a single file
> containing Perl code and POD where the Perl code is one character encoding
> and the POD is another one? -- Darren Duncan

Mixed utf8/non-utf8 may be rare, but even fully utf8 source files will
still require 2 separate declarations:

a 'use utf8' for the perl parser, and a '=encoding utf8' for a pod parser,
which won't see or understand the 'use utf8'.
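
For illustration, a fully UTF-8 file carrying both declarations might look like the sketch below. The name and the variable are made up for the example, and the file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;    # for the perl parser: string literals below are decoded as UTF-8

my $author = "André";

# With "use utf8" the literal is 5 characters; without it perl would see 6.
binmode STDOUT, ':encoding(UTF-8)';
print "$author has ", length($author), " characters\n";

=encoding UTF-8

=head1 AUTHOR

André

=cut
```

The perl parser skips everything from `=encoding` to `=cut`, and a POD parser reads only that part, so neither declaration is visible to the other tool.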

More generally, I think croaking on non-ascii in the src file is a fine
use for 'use v5.36'. I'm less clear whether it should croak on pod too. Perhaps it
should croak on ord() > 0x7f in a pod section unless an '=encoding' has
been seen?

--
Little fly, thy summer's play my thoughtless hand
has terminated with extreme prejudice.
(with apologies to William Blake)

Re: "use v5.36.0" should imply ASCII source

Aug 7, 2021, 5:01 AM

Post #17 of 50

Op 07-08-2021 om 08:51 schreef Darren Duncan:
> Or even putting aside the POD, is there any real scenario where a
> single file consisting of only Perl code is in multiple encodings at
> once?

Probably not, but ...

>
> If there is a "use utf8;" anywhere in a Perl file, would it not be
> reasonable to interpret that it is describing the entire file and not
> just the portion of the file below that statement?

... I don't feel that is reasonable. It's not how "use" works in
general. "Use" affects the file being parsed from that point on. If you
want it to affect the whole file, put it as the first statement.

But I don't get what you want to achieve, what problem you want to solve
with this solution?

HTH,

M4

Re: "use v5.36.0" should imply ASCII source

darren at darrenduncan

Aug 7, 2021, 5:07 AM

Post #18 of 50

On 2021-08-07 4:46 a.m., Dave Mitchell wrote:
> On Fri, Aug 06, 2021 at 11:51:01PM -0700, Darren Duncan wrote:
>> Perhaps a reasonable design would be that if a file contains a UTF-8
>> declaration anywhere in it, the entire file is treated as such, both the
>> part above and the part below that declaration. And if multiple conflicting
>> encoding declarations exist for the current file, that is an error.
>
> How could that possibly work? If the 'use utf8' appears halfway through
> the source file, does that retrospectively invalidate everything parsed so
> far?

It would if the parser was so far treating everything parsed so far as something
other than UTF-8. As said in my post, the parser would restart at the beginning
of the file and treat it as UTF-8. But this would only need to happen if the
parser kept track of whether it saw any high bits so far, and if it didn't, it
knows it only saw ASCII which is also valid UTF-8 and it can skip the restart.
-- Darren Duncan
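
The "skip the restart if only ASCII was seen" idea can be sketched like this. It is a toy model only: `needs_restart` is an invented name, and finding `use utf8` with a regex is an assumption that stands in for real parsing.

```perl
use strict;
use warnings;

# Returns 1 if a high-bit octet appears before the first "use utf8" line,
# meaning the text parsed so far would have to be re-read as UTF-8.
sub needs_restart {
    my ($source_bytes) = @_;
    return 0 unless $source_bytes =~ /^\s*use\s+utf8\b/m;
    my $before = substr($source_bytes, 0, $-[0]);   # text ahead of the pragma
    return $before =~ /[\x80-\xFF]/ ? 1 : 0;
}

my $late_pragma  = "my \$x = \"caf\xC3\xA9\";\nuse utf8;\n";
my $early_pragma = "my \$x = 1;\nuse utf8;\nmy \$y = \"caf\xC3\xA9\";\n";

print "late pragma:  ", needs_restart($late_pragma),  "\n";   # 1
print "early pragma: ", needs_restart($early_pragma), "\n";   # 0
```

In the second case everything before the pragma is ASCII, which is also valid UTF-8, so nothing parsed so far would change.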

Re: "use v5.36.0" should imply ASCII source

darren at darrenduncan

Aug 7, 2021, 5:15 AM

Post #19 of 50

On 2021-08-07 5:01 a.m., Martijn Lievaart wrote:
> Op 07-08-2021 om 08:51 schreef Darren Duncan:
>> Or even putting aside the POD, is there any real scenario where a single file
>> consisting of only Perl code is in multiple encodings at once?
>
> Probably not, but ...
>
>> If there is a "use utf8;" anywhere in a Perl file, would it not be reasonable
>> to interpret that it is describing the entire file and not just the portion of
>> the file below that statement?
>
> ... I don't feel that is reasonable. It's not how "use" works in general. "Use"
> affects the file being parsed from that point on. If you want it to affect the
> whole file, put it as the first statement.
>
> But I don't get what you want to achieve, what problem you want to solve with
> this solution?

I propose that if we don't want to explicitly support mixed encodings then
explicit encoding declarations are a special case (that can be clearly
documented) where their effect should be retroactive to describe the whole file,
because logically that's the only thing that makes sense (for a non mixed
encoding file, declaring any part of it as UTF-8 is logically saying the whole
file is UTF-8), even if it does happen to have the form of a "use" statement. --
Darren Duncan

Re: "use v5.36.0" should imply ASCII source

grinnz at gmail

Aug 7, 2021, 8:55 AM

Post #20 of 50

On Sat, Aug 7, 2021 at 2:45 AM Darren Duncan <darren@darrenduncan.net>
wrote:

> On 2021-08-06 5:01 p.m., Chris Prather wrote:
> >> On Aug 6, 2021, at 2:58 PM, Dan Book <grinnz@gmail.com> wrote:
> >>
> >> I don't think it's comparable. Comments are parsed by the perl
> interpreter, but POD is not, except to find the end of the POD.
> >
> > Additionally comments don’t as far as I know have a way to declare their
> encoding.
>
> Question: Is there ever a real life scenario where a single source file
> is not
> entirely the same encoding? Can you reasonably have a single file
> containing
> Perl code and POD where the Perl code is one character encoding and the
> POD is
> another one? -- Darren Duncan
>

Yes. POD parsers and the perl interpreter do not read the same parts of the
file, ever. Thus they must each indicate to their corresponding parsers
what encoding they contain.

-Dan

Re: "use v5.36.0" should imply ASCII source

public at khwilliamson

Aug 7, 2021, 10:12 AM

Post #21 of 50

On 8/7/21 5:53 AM, Dave Mitchell wrote:
> On Fri, Aug 06, 2021 at 11:45:21PM -0700, Darren Duncan wrote:
>> Question: Is there ever a real life scenario where a single source file is
>> not entirely the same encoding? Can you reasonably have a single file
>> containing Perl code and POD where the Perl code is one character encoding
>> and the POD is another one? -- Darren Duncan
>
> Mixed utf8/non-utf8 my be rare, but even fully utf8 source files will
> still require 2 separate declarations:
>
> a 'use utf8' for the perl parser, and a '=encoding utf8' for a pod parser,
> which won't see or understand the 'use utf8'.
>
> More generally, I think croaking on non-ascii in the src file is a fine
> use for 'use v5.36'. I'm less clear whether it croak on pod too. Perhaps it
> should croak on ord() > 0x7f in a pod section unless an '=encoding' has
> been seen?
>
>
>

Pod is currently assumed to be CP1252 or UTF-8 unless an =encoding line
is specified.

Re: "use v5.36.0" should imply ASCII source

public at khwilliamson

Aug 7, 2021, 10:44 AM

Post #22 of 50

On 8/7/21 11:12 AM, Karl Williamson wrote:
> On 8/7/21 5:53 AM, Dave Mitchell wrote:
>> On Fri, Aug 06, 2021 at 11:45:21PM -0700, Darren Duncan wrote:
>>> Question: Is there ever a real life scenario where a single source
>>> file is
>>> not entirely the same encoding? Can you reasonably have a single file
>>> containing Perl code and POD where the Perl code is one character
>>> encoding
>>> and the POD is another one? -- Darren Duncan
>>
>> Mixed utf8/non-utf8 my be rare, but even fully utf8 source files will
>> still require 2 separate declarations:
>>
>> a 'use utf8' for the perl parser, and a '=encoding utf8' for a pod
>> parser,
>> which won't see or understand the 'use utf8'.
>>
>> More generally, I think croaking on non-ascii in the src file is a fine
>> use for 'use v5.36'. I'm less clear whether it croak on pod too.
>> Perhaps it
>> should croak on ord() > 0x7f in a pod section unless an '=encoding' has
>> been seen?
>>
>>
>>
>
> Pod is currently assumed to be CP1252 or UTF-8 unless an =encoding line
> is specified
>

More specifically, Pod::Simple doesn't require you to use an =encoding
line. If a non-ASCII character is found, it will automatically insert
either a cp1252 or utf8 line for you.
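
A guess of this kind can be sketched as follows. This is not Pod::Simple's code: `guess_pod_encoding` is an invented helper, and testing the whole buffer for well-formed UTF-8 is a simplification chosen for the example.

```perl
use strict;
use warnings;

# Guess an encoding for POD bytes that carry no =encoding line.
sub guess_pod_encoding {
    my ($pod_bytes) = @_;
    return 'ASCII' unless $pod_bytes =~ /[\x80-\xFF]/;
    my $copy = $pod_bytes;                 # utf8::decode works in place
    return utf8::decode($copy) ? 'UTF-8' : 'CP1252';
}

print guess_pod_encoding("=head1 AUTHOR\n\nAndre\n"),         "\n";  # ASCII
print guess_pod_encoding("=head1 AUTHOR\n\nAndr\xC3\xA9\n"),  "\n";  # UTF-8
print guess_pod_encoding("=head1 AUTHOR\n\nAndr\xE9\n"),      "\n";  # CP1252
```

A lone `\xE9` is not well-formed UTF-8, so the last sample falls back to the single-byte guess.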

Re: "use v5.36.0" should imply ASCII source

darren at darrenduncan

Aug 7, 2021, 2:43 PM

Post #23 of 50

On 2021-08-07 8:55 a.m., Dan Book wrote:
> On Sat, Aug 7, 2021 at 2:45 AM Darren Duncan wrote:
> Question: Is there ever a real life scenario where a single source file is not
> entirely the same encoding? Can you reasonably have a single file containing
> Perl code and POD where the Perl code is one character encoding and the POD is
> another one? -- Darren Duncan
>
> Yes. POD parsers and the perl interpreter do not read the same parts of the
> file, ever. Thus they must each indicate to their corresponding parsers what
> encoding they contain.

The need to dual-declare is important to know but it's not the question I asked.

Is there ever a real life scenario where a single physical text file uses one
character encoding for one range of octets in the file and a different character
encoding for a different range of octets in the file?

I'm not aware of any text editor that when asked to save a document to disk
would not be using the same character encoding to write out the entire file.

So I would think it would be a very contrived situation for a text file to exist
physically that isn't all one encoding.

And then the question is whether we would consider it valid for such a file to
exist and that we would try to support it as a non-corrupted file.

-- Darren Duncan

Re: "use v5.36.0" should imply ASCII source

grinnz at gmail

Aug 7, 2021, 2:52 PM

Post #24 of 50

On Sat, Aug 7, 2021 at 5:44 PM Darren Duncan <darren@darrenduncan.net>
wrote:

> On 2021-08-07 8:55 a.m., Dan Book wrote:
> > On Sat, Aug 7, 2021 at 2:45 AM Darren Duncan wrote:
> > Question: Is there ever a real life scenario where a single source
> file is not
> > entirely the same encoding? Can you reasonably have a single file
> containing
> > Perl code and POD where the Perl code is one character encoding and
> the POD is
> > another one? -- Darren Duncan
> >
> > Yes. POD parsers and the perl interpreter do not read the same parts of
> the
> > file, ever. Thus they must each indicate to their corresponding parsers
> what
> > encoding they contain.
>
> The need to dual-declare is important to know but its not the question I
> asked.
>
> Is there ever a real life scenario where a single physical text file uses
> one
> character encoding for one range of octets in the file and a different
> character
> encoding for a different range of octets in the file.
>
> I'm not aware of any text editor that when asked to save a document to
> disk
> would not be using the same character encoding to write out the entire
> file.
>
> So I would think it would be a very contrived situation for a text file to
> exist
> physically that isn't all one encoding.
>
> And then the question is whether we would consider it valid for such a
> file to
> exist and that we would try to support it as a non-corrupted file.
>

It's not a matter of support. It just doesn't matter to the perl
interpreter what the encoding of POD is, and vice versa.

-Dan

Re: "use v5.36.0" should imply ASCII source

darren at darrenduncan

Aug 7, 2021, 3:36 PM

Post #25 of 50

On 2021-08-07 2:52 p.m., Dan Book wrote:
> It's not a matter of support. It just doesn't matter to the perl interpreter
> what the encoding of POD is, and vice versa.

That's fair and reasonable. But my question is broader than Perl code vs POD.

Is it reasonable to support either of these next 2 scenarios?

1. A file contains only Perl code and no POD, and one subset of that file has a
different character encoding than a different subset.

2. A file contains only POD and no Perl code, and one subset of that file has a
different character encoding than a different subset.

-- Darren Duncan