Mailing List Archive

Re: "use v5.36.0" should imply ASCII source
On Sat, Aug 7, 2021 at 6:38 PM Darren Duncan <darren@darrenduncan.net>
wrote:

> On 2021-08-07 2:52 p.m., Dan Book wrote:
> > It's not a matter of support. It just doesn't matter to the perl
> interpreter
> > what the encoding of POD is, and vice versa.
>
> That's fair and reasonable. But my question is broader than Perl code vs
> POD.
>
> Is it reasonable to support either of these next 2 scenarios?
>
> 1. A file contains only Perl code and no POD, and one subset of that file
> has a
> different character encoding than a different subset.
>
> 2. A file contains only POD and no Perl code, and one subset of that file
> has a
> different character encoding than a different subset.
>

I might suggest starting a new thread if you want to discuss these
possibilities and their implications; they have no bearing on the
implementation of the proposed feature. FWIW, I do not think the second
scenario is currently supported by any POD parser.

-Dan
Re: "use v5.36.0" should imply ASCII source
> At the PSC, we had a long talk about this, and another proposal was made:
>
> We introduce a new stricture, which I'll call "source_encoding". Under "use strict 'source_encoding'", the compiler will raise an exception when the source contains non-ASCII content unless the utf8 pragma is in effect. The error raised can drive the programmer to documentation explaining the various trade-offs. That is: you can turn on utf8 and deal with how this affects your I/O, or you can disable the stricture, or you can restate your non-ASCII content as ASCII by using escaping constructs.
>

This somehow feels like a step backwards.

Nearly every modern Linux installation uses a Unicode locale by default nowadays; I haven't come across a text file in latin1 (or similar) encoding for months...

--
Andreas K. Hüttel
dilfridge@gentoo.org
Gentoo Linux developer
(council, toolchain, base-system, perl, libreoffice)
Re: "use v5.36.0" should imply ASCII source
> On Aug 8, 2021, at 7:15 AM, Andreas K. Huettel <dilfridge@gentoo.org> wrote:
>
>> At the PSC, we had a long talk about this, and another proposal was made:
>>
>> We introduce a new stricture, which I'll call "source_encoding". Under "use strict 'source_encoding'", the compiler will raise an exception when the source contains non-ASCII content unless the utf8 pragma is in effect. The error raised can drive the programmer to documentation explaining the various trade-offs. That is: you can turn on utf8 and deal with how this affects your I/O, or you can disable the stricture, or you can restate your non-ASCII content as ASCII by using escaping constructs.
>>
>
> This somehow feels like a step backwards.
>
> Nearly every modern Linux installation uses a unicode locale by default nowadays, I haven't come across a text file in latin1 (or similar) encoding for months...

Nearly every modern programming language also differentiates between text and binary. Alas, Perl doesn’t do this.

The language’s maintainers feel--reasonably, I think--that text in source code should be decoded. The fact that “é” in UTF-8 Perl source code is two characters (i.e., code points) by default is weird and counterintuitive. The problem is that utf8.pm’s auto-decoding behaviour imposes a requirement to encode manually, which is *really* weird/counterintuitive: it would “subtly invalidate” a simple “hello, world” implementation in “modern” Perl, which invalidity would only “bite” when there are >127 code points involved, which is, again, further weird/counterintuitive.
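The point above about “é” can be sketched in a few lines. This is a minimal illustration using only the core Encode module; the two UTF-8 bytes are written as escapes so the file stays ASCII, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 encoding of U+00E9 ("e" with acute), written with escapes so that
# this file stays pure ASCII. This is what a UTF-8 source literal holds when
# the utf8 pragma is not in effect.
my $undecoded = "\xC3\xA9";
print length($undecoded), "\n";    # prints 2: one code point per byte

# Decoding (which "use utf8" does for literals) yields a single code point.
my $decoded = decode('UTF-8', $undecoded);
print length($decoded), "\n";      # prints 1

# The flip side: decoded text has to be encoded again before it is written
# to a byte stream that has no encoding layer.
my $encoded = encode('UTF-8', $decoded);
print length($encoded), "\n";      # prints 2
```

The last step is the manual encoding requirement the paragraph describes: it only makes a visible difference once code points above 127 are involved.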

So, it’s a mess. The best fix here would be to teach Perl to track which strings are decoded and which aren’t. Perl would gain copiously therefrom, but it’s not easy to do. For now it’s at least reasonable to require, in “modern” Perl, that either:

a) Source code remain all-ASCII.

or

b) Perl’s auto-decoding mode be enabled (explicitly).

This will require that folks like myself, who desire “modernity” but for whom Perl’s status quo is actually useful and desirable (because $work almost never cares about strings’ Unicode content), find some workaround, but at least it’s a conspicuous change that won’t “surprise” anyone.

-FG
Re: "use v5.36.0" should imply ASCII source
On 07-08-2021 at 14:15, Darren Duncan wrote:
> On 2021-08-07 5:01 a.m., Martijn Lievaart wrote:
>> On 07-08-2021 at 08:51, Darren Duncan wrote:
>>> Or even putting aside the POD, is there any real scenario where a
>>> single file consisting of only Perl code is in multiple encodings at
>>> once?
>>
>> Probably not, but ...
>>
>>> If there is a "use utf8;" anywhere in a Perl file, would it not be
>>> reasonable to interpret that it is describing the entire file and
>>> not just the portion of the file below that statement?
>>
>> ... I don't feel that is reasonable. It's not how "use" works in
>> general. "Use" affects the file being parsed from that point on. If
>> you want it to affect the whole file, put it as the first statement.
>>
>> But I don't get what you want to achieve, what problem you want to
>> solve with this solution?
>
> I propose that if we don't want to explicitly support mixed encodings
> then explicit encoding declarations are a special case (that can be
> clearly documented) where their effect should be retroactive to
> describe the whole file, because logically that's the only thing that
> makes sense (for a non mixed encoding file, declaring any part of it
> as UTF-8 is logically saying the whole file is UTF-8), even if it does
> happen to have the form of a "use" statement. -- Darren Duncan


Well, there actually is one mixed encoding that makes sense, ASCII up
until the 'use utf8', Unicode after that. I would assume this is the
mental model most people have.
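
That "from this point on" behaviour can be sketched as follows, assuming the file is saved as UTF-8 (the variable names are illustrative). Non-ASCII content before the pragma is accepted today, but it is read as bytes; the same literal after the pragma is decoded.

```perl
use strict;
use warnings;

# Assumes this file is saved as UTF-8.
my $before = "é";    # utf8 pragma not yet in effect: two bytes, two code points

use utf8;            # affects only the source text that follows

my $after = "é";     # decoded by the parser: one code point

print length($before), " ", length($after), "\n";    # prints "2 1"
```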


M4
Re: "use v5.36.0" should imply ASCII source
On 2021-08-08 5:52 a.m., Martijn Lievaart wrote:
> On 07-08-2021 at 14:15, Darren Duncan wrote:
>> On 2021-08-07 5:01 a.m., Martijn Lievaart wrote:
>>> On 07-08-2021 at 08:51, Darren Duncan wrote:
>>>> Or even putting aside the POD, is there any real scenario where a single
>>>> file consisting of only Perl code is in multiple encodings at once?
>>>
>>> Probably not, but ...
>>>
>>>> If there is a "use utf8;" anywhere in a Perl file, would it not be
>>>> reasonable to interpret that it is describing the entire file and not just
>>>> the portion of the file below that statement?
>>>
>>> ... I don't feel that is reasonable. It's not how "use" works in general.
>>> "Use" affects the file being parsed from that point on. If you want it to
>>> affect the whole file, put it as the first statement.
>>>
>>> But I don't get what you want to achieve, what problem you want to solve with
>>> this solution?
>>
>> I propose that if we don't want to explicitly support mixed encodings then
>> explicit encoding declarations are a special case (that can be clearly
>> documented) where their effect should be retroactive to describe the whole
>> file, because logically that's the only thing that makes sense (for a non
>> mixed encoding file, declaring any part of it as UTF-8 is logically saying the
>> whole file is UTF-8), even if it does happen to have the form of a "use"
>> statement. -- Darren Duncan
>
> Well, there actually is one mixed encoding that makes sense, ASCII up until the
> 'use utf8', Unicode after that. I would assume this is the mental model most
> people have.

Yes, that is trivially the case. I was more concerned about mutually
incompatible encodings, such as non-ASCII Latin1 characters plus non-ASCII UTF-8
characters in the same file. -- Darren Duncan
Re: "use v5.36.0" should imply ASCII source
On Sat, Aug 7, 2021, at 11:55 AM, Dan Book wrote:
> Yes. POD parsers and the perl interpreter do not read the same parts of the file, ever. Thus they must each indicate to their corresponding parsers what encoding they contain.

Assuming you mean this exactly as written, I don't believe this is true.

use v5.34.0;
use warnings;

my $string = <<'END';
=encoding utf8

This is løvely døcumentation.

=cut
END

say $string;

Then…

dinah:~$ perl demo.pl
=encoding utf8

This is løvely døcumentation.

=cut

dinah:~$ pod2text demo.pl
This is løvely døcumentation.

It's all a muddle.

--
rjbs
Re: "use v5.36.0" should imply ASCII source
On Sun, Aug 8, 2021 at 6:52 PM Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

> On Sat, Aug 7, 2021, at 11:55 AM, Dan Book wrote:
>
> Yes. POD parsers and the perl interpreter do not read the same parts of
> the file, ever. Thus they must each indicate to their corresponding parsers
> what encoding they contain.
>
>
> Assuming you mean this exactly as written, I don't believe this is true.
>
> use v5.34.0;
> use warnings;
>
> my $string = <<'END';
> =encoding utf8
>
> This is løvely døcumentation.
>
> =cut
> END
>
> say $string;
>
>
> Then…
>
> dinah:~$ perl demo.pl
> =encoding utf8
>
> This is løvely døcumentation.
>
> =cut
>
> dinah:~$ pod2text demo.pl
> This is løvely døcumentation.
>
>
> It's all a muddle.
>

Yes, I worded it imprecisely. But it remains that even when abusing the
parsers to see each other's components, they still only follow their own
encoding declarations.

-Dan
Re: "use v5.36.0" should imply ASCII source
On Fri, Aug 06, 2021 at 11:45:21PM -0700, Darren Duncan wrote:
> On 2021-08-06 5:01 p.m., Chris Prather wrote:
> >>On Aug 6, 2021, at 2:58 PM, Dan Book <grinnz@gmail.com> wrote:
> >>
> >>I don't think it's comparable. Comments are parsed by the perl
> >>interpreter, but POD is not, except to find the end of the POD.
> >
> >Additionally comments don't as far as I know have a way to declare their
> >encoding.
> Question: Is there ever a real life scenario where a single source file is
> not entirely the same encoding?

Sure. Some code, in utf8, and then a binary blob in __DATA__ which the
utf8 code reads and parses.

It's not good practice, but it's what you had to do to easily distribute
data in a CPAN distribution before File::ShareDir::Install existed.

--
David Cantrell | top google result for "topless karaoke murders"

When a man is tired of London, he is tired of life
-- Samuel Johnson
Re: "use v5.36.0" should imply ASCII source
On Thu, Aug 12, 2021 at 11:04 AM David Cantrell <david@cantrell.org.uk>
wrote:

> On Fri, Aug 06, 2021 at 11:45:21PM -0700, Darren Duncan wrote:
> > On 2021-08-06 5:01 p.m., Chris Prather wrote:
> > >>On Aug 6, 2021, at 2:58 PM, Dan Book <grinnz@gmail.com> wrote:
> > >>
> > >>I don't think it's comparable. Comments are parsed by the perl
> > >>interpreter, but POD is not, except to find the end of the POD.
> > >
> > >Additionally comments don't as far as I know have a way to declare
> their
> > >encoding.
> > Question: Is there ever a real life scenario where a single source file
> is
> > not entirely the same encoding?
>
> Sure. Some code, in utf8, and then a binary blob in __DATA__ which the
> utf8 code reads and parses.
>
> It's not good practice, but it's what you had to do to easily distribute
> data in a CPAN distribution before File::ShareDir::Install existed.
>

Note this will currently break because the filehandle *is* shared between
code and DATA, unlike with POD. "use utf8" applies to both. But I would
consider the use case of non-textual data in DATA exceedingly rare.

-Dan
Re: "use v5.36.0" should imply ASCII source
On Fri, Aug 6, 2021 at 5:23 PM Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>
> Porters,
>
> I recently posted the suggestion that "use v5.36.0" should imply "use utf8", which led to a pretty large thread in which Felipe Gasper repeatedly said "This is going to make things worse, not better." I spent a lot of time grumbling about this to myself, figuring out exactly how to rebut this, and then deciding that I tentatively, partly, agreed with him.
>
> We want each improvement to be a ratcheting up in language usability, when possible, rather than "we made things worse so we could make them better." At present, because we don't (and can't) know whether a string is text or bytes, we don't (and can't) automatically encode it when it hits a bytestream. We also don't know reliably whether a given output handle is already expecting to do that encoding for us.
>
> I am 100% certain that adding "use utf8" to the feature bundle would be better for me, but I already have a pretty strong grasp of the I/O model of Perl. I'm not sure it's better enough for everybody.
>
> At the PSC, we had a long talk about this, and another proposal was made:
>
> We introduce a new stricture, which I'll call "source_encoding". Under "use strict 'source_encoding'", the compiler will raise an exception when the source contains non-ASCII content unless the utf8 pragma is in effect. The error raised can drive the programmer to documentation explaining the various trade-offs. That is: you can turn on utf8 and deal with how this affects your I/O, or you can disable the stricture, or you can restate your non-ASCII content as ASCII by using escaping constructs.
>
> I'm not sure this is an improvement, but I think it is. This prevents the "I forgot to add utf8 and so only discovered after runtime that I have doubly-encoded my output" bug.
>
> --
> rjbs

After thinking about this again, I had another idea.

The reason implying 'use utf8' is a problem is because of the impact
it has on string semantics. Maybe we can just have it not impact
string semantics. Make 'use v5.36.0;' decode the source as UTF-8, but
store string literals as byte strings rather than characters. The
strings would still be required to be UTF-8 encoded, but would be
stored with the utf8 flag off. This would allow using UTF-8 encoded
content in comments, Pod, or even in function names, but would not
create the confusion with strings and IO.

This seems possibly hard to document, which may indicate that it is a
terrible idea.
Re: "use v5.36.0" should imply ASCII source
On Fri, Aug 6, 2021 at 8:23 PM Paul "LeoNerd" Evans
<leonerd@leonerd.org.uk> wrote:
> It's far easier for everyone - implementation and users alike - to give
> a very simple rule:
>
> After `use VERSION>=5.36` and until any `use utf8` there must be no
> non-ASCII bytes whatsoever.

This would need the addition of "unless they are after an __END__
marker". I think it will inevitably make the rules for UTF-8 Pod
confusing.

>
> Yes this does lead to an annoying dual declaration of both `use utf8`
> and `=encoding UTF-8` - perhaps that can be helped in some way?
Re: "use v5.36.0" should imply ASCII source
> On Aug 16, 2021, at 8:00 AM, Graham Knop <haarg@haarg.org> wrote:
>
> After thinking about this again, I had another idea.
>
> The reason implying 'use utf8' is a problem is because of the impact
> it has on string semantics. Maybe we can just have it not impact
> string semantics. Make 'use v5.36.0;' decode the source as UTF-8, but
> store string literals as byte strings rather than characters. The
> strings would still be required to be UTF-8 encoded, but would be
> stored with the utf8 flag off. This would allow using UTF-8 encoded
> content in comments, Pod, or even in function names, but would not
> create the confusion with strings and IO.

I thought of this sometime back, but more in the context of adding flexibility to utf8.pm:

{
use utf8 decode => 'no_strings'; # What Graham envisions
my $foo = "é"; # 2 code points
}

{
use utf8 decode => 'all'; # status quo
my $foo = "é"; # 1 code point
}

I personally would think decode=no_strings could be added to the feature bundle with little trouble. The use case for leaving strings undecoded doesn’t seem to apply for things besides strings.

-F
Re: "use v5.36.0" should imply ASCII source
On Mon, 16 Aug 2021 08:51:30 -0400, Felipe Gasper
<felipe@felipegasper.com> wrote:

> > On Aug 16, 2021, at 8:00 AM, Graham Knop <haarg@haarg.org> wrote:
> >
> > After thinking about this again, I had another idea.
> >
> > The reason implying 'use utf8' is a problem is because of the impact
> > it has on string semantics. Maybe we can just have it not impact
> > string semantics. Make 'use v5.36.0;' decode the source as UTF-8,
> > but store string literals as byte strings rather than characters.
> > The strings would still be required to be UTF-8 encoded, but would
> > be stored with the utf8 flag off. This would allow using UTF-8
> > encoded content in comments, Pod, or even in function names, but
> > would not create the confusion with strings and IO.
>
> I thought of this sometime back, but more in the context of adding
> flexibility to utf8.pm:
>
> {
> use utf8 decode => 'no_strings'; # What Graham envisions
> my $foo = "é"; # 2 code points
> }
>
> {
> use utf8 decode => 'all'; # status quo
> my $foo = "é"; # 1 code point
> }
>
> I personally would think decode=no_strings could be added to the
> feature bundle with little trouble. The use case for leaving strings
> undecoded doesn’t seem to apply for things besides strings.

In that vein, to ease porting from older ISO encoded source files

{ use utf8 decode => 'no_strings'; # What Graham envisions
my $foo = "é"; # 2 code points
}
{ use utf8 decode => 'all'; # status quo
my $foo = "é"; # 1 code point
}
{ use utf8 convert => "utf-8"; # or convert ISO => "UTF-8"
my $foo = "é"; # This ISO-8859-1 é will be upgraded to UTF-8
} # 1 codepoint

If well-documented and completely lexical, the path forward is
extremely easy and fast, and it will trigger coders to make their code
more 2021+. Note that a lot of software was written in times when
editors did not have a clue about multibyte encodings and (Windows)
people still used Alt-234 and the like to enter diacriticals.

> -F

--
H.Merijn Brand https://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.33 porting perl5 on HP-UX, AIX, and Linux
https://tux.nl/email.html http://qa.perl.org https://www.test-smoke.org
Re: "use v5.36.0" should imply ASCII source
2021-8-7 0:23 Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

>
> I'm not *sure* this is an improvement, but I think it is. This prevents
> the "I forgot to add utf8 and so only discovered after runtime that I have
> doubly-encoded my output" bug.
>
>
Is it okay that our consensus is to write Perl source code using UTF-8 in
the future?

This first step prevents unpredictable latin-1 bugs:

- "use feature 'source_encoding';": the source must be ASCII only (ASCII is
a subset of UTF-8).
- "use feature 'source_encoding'; use utf8;": the source must be UTF-8.
Re: "use v5.36.0" should imply ASCII source
On Thu, Aug 12, 2021 at 11:08:48AM -0400, Dan Book wrote:
> On Thu, Aug 12, 2021 at 11:04 AM David Cantrell <david@cantrell.org.uk>
> wrote:
> > On Fri, Aug 06, 2021 at 11:45:21PM -0700, Darren Duncan wrote:
> > > Question: Is there ever a real life scenario where a single
> > > source file is not entirely the same encoding?
> > Sure. Some code, in utf8, and then a binary blob in __DATA__ which the
> > utf8 code reads and parses.
> >
> > It's not good practice, but it's what you had to do to easily distribute
> > data in a CPAN distribution before File::ShareDir::Install existed.
> Note this will currently break because the filehandle *is* shared between
> code and DATA, unlike with POD. "use utf8" applies to both. But I would
> consider the use case of non-textual data in DATA exceedingly rare.

I stopped doing it a few years ago. I think I've seen it in test suites
for some image-processing modules in the past.

--
David Cantrell | Godless Liberal Elitist

You can't spell AWESOME without ME!
RE: "use v5.36.0" should imply ASCII source
From: David Cantrell

> On Thu, Aug 12, 2021 at 11:08:48AM -0400, Dan Book wrote:
> > On Thu, Aug 12, 2021 at 11:04 AM David Cantrell wrote:
> > > On Fri, Aug 06, 2021 at 11:45:21PM -0700, Darren Duncan wrote:
> > > > Question: Is there ever a real life scenario where a single
> > > > source file is not entirely the same encoding?
> > > Sure. Some code, in utf8, and then a binary blob in __DATA__ which
> > > the
> > > utf8 code reads and parses.
> > >
> > > It's not good practice, but it's what you had to do to easily
> > > distribute data in a CPAN distribution before File::ShareDir::Install existed.
> > Note this will currently break because the filehandle *is* shared
> > between code and DATA, unlike with POD. "use utf8" applies to both.
> > But I would consider the use case of non-textual data in DATA exceedingly rare.
>
> I stopped doing it a few years ago. I think I've seen it in test suites for
> some image-processing modules in the past.

My use case for binary __DATA__ was uncompressing Compress::Zlib data; I
used this in production back in the 5.8.8 era, and it was rather nice to
my taste.

I'm not using this technique anymore, though.

Re: "use v5.36.0" should imply ASCII source
On Sat, Aug 07, 2021 at 02:01:14PM +0200, Martijn Lievaart wrote:
> On 07-08-2021 at 08:51, Darren Duncan wrote:

> > If there is a "use utf8;" anywhere in a Perl file, would it not be
> > reasonable to interpret that it is describing the entire file and not
> > just the portion of the file below that statement?
>
>
> ... I don't feel that is reasonable. It's not how "use" works in general.
> "Use" affects the file being parsed from that point on. If you want it to
> affect the whole file, put it as the first statement.

It's not just "not reasonable" - it's not possible.

The perl parser can't restart.

Things like reading source code from a pipe (or terminal) could be worked
around (with sufficient buffering), but that's not the fatal problem here.

The problem is that all actions of the parser happen immediately, and
are committed to the symbol table as they are done. So for starters:

sub foo {
# I am good
}

sub bar {
1 2 3;
# I have a syntax error
}


will generate a definition for &foo before it even starts parsing bar,
and the failure to parse bar won't delete the definition of &foo

So if you wrap the above in an eval (or similar - a require or do in an
eval) and trap the error, you still get &foo.


BEGIN blocks give many many more ways for globally visible side effects to
happen immediately.

So there simply isn't a way to rewind and redo the parse of a file, because
the parsing of a file is not a transaction that ultimately commits or
rolls back - it's kind of AutoCommit, potentially at a
statement-by-statement granularity.
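
The eval case described above can be sketched as follows. The names foo and bar come from the example earlier in this message; the rest is illustrative.

```perl
use strict;
use warnings;

# Source text with one good sub followed by one that has a syntax error.
my $source = <<'END';
sub foo {
    return "I am good";
}

sub bar {
    1 2 3;
}
END

# The string eval fails as a whole because of the syntax error in bar ...
my $ok = eval "$source; 1";
print "eval failed\n" unless $ok;

# ... but foo was already installed in the symbol table when its closing
# brace was parsed, and the later failure does not remove it.
print defined(&foo) ? "foo is defined\n" : "foo is missing\n";
```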

Nicholas Clark
Re: "use v5.36.0" should imply ASCII source
On Mon, Aug 16, 2021, at 8:00 AM, Graham Knop wrote:
> On Fri, Aug 6, 2021 at 5:23 PM Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>> At the PSC, we had a long talk about this, and another proposal was made:
>>
>> We introduce a new stricture, which I'll call "source_encoding". Under "use strict 'source_encoding'", the compiler will raise an exception when the source contains non-ASCII content unless the utf8 pragma is in effect. The error raised can drive the programmer to documentation explaining the various trade-offs. That is: you can turn on utf8 and deal with how this affects your I/O, or you can disable the stricture, or you can restate your non-ASCII content as ASCII by using escaping constructs.
>
> After thinking about this again, I had another idea.
>
> The reason implying 'use utf8' is a problem is because of the impact it has on string semantics. Maybe we can just have it not impact string semantics. Make 'use v5.36.0;' decode the source as UTF-8, but store string literals as byte strings rather than characters. The strings would still be required to be UTF-8 encoded, but would be stored with the utf8 flag off. This would allow using UTF-8 encoded content in comments, Pod, or even in function names, but would not create the confusion with strings and IO.

I said I'd write a reply to this and I didn't. *Mea culpa*.

I think there are two big questions, here:

*ONE:* What's the end state we'd like to get to?

*TWO:* What's a good next step, keeping in mind that we might not ever get past that next step?

My take is this: The end state I'd like is that strings are in one of three states: declared text, declared bytes, unknown. Semantics exist for how to combine these and deal with I/O discipline. The source code is Unicode and string literals are assumed to be text. A new string literal syntax exists for byte strings, like `qb"..."`.

For my money, a useful next step is that we encourage people to opt-in to "source code is unicode and string literals are text." This means that the programmer is then responsible for thinking about how this will affect their I/O. That concern is already there, we're just pushing around the complexity like a lump under the rug. I think this push is a good one. It lets us enable non-ASCII syntax, and it's pretty well understood. Also, we already have something for qb"...." in the form of "do { use bytes; qq{...} }" but we could probably add a qb, too, if we needed it.

--
rjbs
Re: "use v5.36.0" should imply ASCII source
On Sun, Oct 3, 2021 at 2:57 PM Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

> On Mon, Aug 16, 2021, at 8:00 AM, Graham Knop wrote:
>
> On Fri, Aug 6, 2021 at 5:23 PM Ricardo Signes <perl.p5p@rjbs.manxome.org>
> wrote:
>
> At the PSC, we had a long talk about this, and another proposal was made:
>
> We introduce a new stricture, which I'll call "source_encoding". Under
> "use strict 'source_encoding'", the compiler will raise an exception when
> the source contains non-ASCII content unless the utf8 pragma is in effect.
> The error raised can drive the programmer to documentation explaining the
> various trade-offs. That is: you can turn on utf8 and deal with how this
> affects your I/O, or you can disable the stricture, or you can restate your
> non-ASCII content as ASCII by using escaping constructs.
>
>
> After thinking about this again, I had another idea.
>
> The reason implying 'use utf8' is a problem is because of the impact it
> has on string semantics. Maybe we can just have it not impact string
> semantics. Make 'use v5.36.0;' decode the source as UTF-8, but store string
> literals as byte strings rather than characters. The strings would still be
> required to be UTF-8 encoded, but would be stored with the utf8 flag off.
> This would allow using UTF-8 encoded content in comments, Pod, or even in
> function names, but would not create the confusion with strings and IO.
>
>
> I said I'd write a reply to this and I didn't. *Mea culpa*.
>
> I think there are two big questions, here:
>
> *ONE:* What's the end state we'd like to get to?
>
> *TWO:* What's a good next step, keeping in mind that we might not ever
> get past that next step?
>
> My take is this: The end state I'd like is that strings are in one of
> three states: declared text, declared bytes, unknown. Semantics exist for
> how to combine these and deal with I/O discipline. The source code is
> Unicode and string literals are assumed to be text. A new string literal
> syntax exists for byte strings, like qb"...".
>
> For my money, a useful next step is that we encourage people to opt-in to
> "source code is unicode and string literals are text." This means that the
> programmer is then responsible for thinking about how this will affect
> their I/O. That concern is already there, we're just pushing around the
> complexity like a lump under the rug. I think this push is a good one. It
> lets us enable non-ASCII syntax, and it's pretty well understood. Also, we
> already have something for qb"...." in the form of "do { use bytes; qq{...}
> }" but we could probably add a qb, too, if we needed it.
>

"use bytes" is an abstraction breakage, not an interface, so I would prefer
the qb alternative, unless and until "use bytes" did nothing other than
what "no utf8" currently does (but that could be an alternative for your
suggestion).

I agree very much with the end state proposed. I like the proposed next
step but I don't know how we get there. Even spreading understanding of the
current semantics is an uphill battle; too many people just don't
understand encoding, and that has to be baked into our approach. I think it
is possible, but not easy, to sufficiently document a new assumption for
whatever shape this feature may take. It's problematic that making
"assumption failures" reliably obvious when they occur is difficult to
impossible, ironically the sort of problem we are trying to fix here. I
don't have a conclusion here except that the most useful option won't
necessarily be the most expected (nor is the current state).

-Dan
Re: "use v5.36.0" should imply ASCII source
2021-10-4 3:57 Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

>
> *ONE:* What's the end state we'd like to get to?
>
>
I have a question.

echo -e '１' | perl -p -E 's/\d/1/'

The '１' in the echo argument is a Japanese fullwidth digit (U+FF11) encoded as UTF-8. The desired output is ASCII 1.

Current Output(UTF-8 １)

１

Ideal Output(ASCII 1)

1

Would you ideally want this to work on UNIX/Linux systems?
Re: "use v5.36.0" should imply ASCII source
> On Oct 4, 2021, at 4:45 AM, Yuki Kimoto <kimoto.yuki@gmail.com> wrote:
>
>
> 2021-10-4 3:57 Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>
> ONE: What's the end state we'd like to get to?
>
>
> I have a question.
>
> echo -e '１' | perl -p -E 's/\d/1/'
>
> '１' of echo argument is Japanese UTF-8. Output is ASCII 1.
>
> Current Output(UTF-8 １)
>
> １
>
> Ideal Output(ASCII 1)
>
> 1
>
> Do you want this to work ideally in the UNIX/Linux system?

For that to happen you would pass the `-CIO` flag to perl, which causes STDIN & STDOUT to automatically decode/encode UTF-8.

The one-liner as-is outputs "\xef\xbc\x91" (U+FF11 in UTF-8) instead of ASCII 1 because those 3 bytes are what Perl receives on STDIN, and nothing is decoding those to U+FF11. Your s/\d/1/ only works on *digits*, and none of U+00EF, U+00BC, or U+0091 is. So no change happens.
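
A minimal sketch of the difference, using the core Encode module to do by hand what -CI would do for STDIN (the variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The three bytes Perl receives on STDIN: U+FF11 (FULLWIDTH DIGIT ONE) in UTF-8.
my $bytes = "\xEF\xBC\x91";

# Undecoded: \d sees U+00EF, U+00BC and U+0091, none of which is a digit.
(my $undecoded = $bytes) =~ s/\d/1/;
print $undecoded eq $bytes ? "unchanged\n" : "changed\n";    # prints "unchanged"

# Decoded (which -CI does for STDIN): \d matches the single code point U+FF11.
(my $decoded = decode('UTF-8', $bytes)) =~ s/\d/1/;
print "$decoded\n";    # prints "1"
```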

-FG
Re: "use v5.36.0" should imply ASCII source
2021-10-4 22:21 Felipe Gasper <felipe@felipegasper.com> wrote:

>
> > On Oct 4, 2021, at 4:45 AM, Yuki Kimoto <kimoto.yuki@gmail.com> wrote:
> >
> >
> > 2021-10-4 3:57 Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
> >
> > ONE: What's the end state we'd like to get to?
> >
> >
> > I have a question.
> >
> > echo -e '１' | perl -p -E 's/\d/1/'
> >
> > '１' of echo argument is Japanese UTF-8. Output is ASCII 1.
> >
> > Current Output(UTF-8 １)
> >
> > １
> >
> > Ideal Output(ASCII 1)
> >
> > 1
> >
> > Do you want this to work ideally in the UNIX/Linux system?
>
> For that to happen you would pass the `-CIO` flag to perl, which causes
> STDIN & STDOUT to automatically decode/encode UTF-8.
>
> The one-liner as-is outputs "\xef\xbc\x91" (U+FF11 in UTF-8) instead of
> ASCII 1 because those 3 bytes are what Perl receives on STDIN, and nothing
> is decoding those to U+FF11. Your s/\d/1/ only works on *digits*, and none
> of U+00EF, U+00BC, or U+0091 is. So no change happens.
>
> -FG


I understand: to get that result, I can use the -CIO flag. I will spend
some time learning these flags.
Re: "use v5.36.0" should imply ASCII source
Ric asks "ONE: What's the end state we'd like to get to?".

I'm thinking about the goal.

1. Enable utf8 by default under "use vX" in the future.

2. In one-liners too, it can be used in the same way as a normal Perl
program that uses "use utf8", "Encode::decode" and "Encode::encode".

Point 2 needs a little more description.

----------------------------------------------------
The source is UTF-8, input strings are decoded from UTF-8 (arguments (A),
stdin (I), input file streams (i)),

and output strings are encoded to UTF-8 (stdout (O), stderr (E), output file
streams (o)).

In a one-liner, I need to write it the following way. -CSAD is the same as -CIOEAio.

echo -e '???' | perl -Mutf8 -CSAD -p -e 's/\d\w?/1ai/'

Input

???

Output

1ai

The replacement is successful as expected.

I would like to be able to write this more easily, for example with a --utf8 option.

echo -e '???' | perl --utf8 -p -e 's/\d\w?/1ai/'
----------------------------------------------------------------------------------

I think this is independent of the topic of the string flag.

What do you think?

2021-10-5 17:25 Yuki Kimoto <kimoto.yuki@gmail.com> wrote:

>
>
> 2021-10-4 22:21 Felipe Gasper <felipe@felipegasper.com> wrote:
>
>>
>> > On Oct 4, 2021, at 4:45 AM, Yuki Kimoto <kimoto.yuki@gmail.com> wrote:
>> >
>> >
>> > 2021-10-4 3:57 Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>> >
>> > ONE: What's the end state we'd like to get to?
>> >
>> >
>> > I have a question.
>> >
>> > echo -e '１' | perl -p -E 's/\d/1/'
>> >
>> > '１' of echo argument is Japanese UTF-8. Output is ASCII 1.
>> >
>> > Current Output(UTF-8 １)
>> >
>> > １
>> >
>> > Ideal Output(ASCII 1)
>> >
>> > 1
>> >
>> > Do you want this to work ideally in the UNIX/Linux system?
>>
>> For that to happen you would pass the `-CIO` flag to perl, which causes
>> STDIN & STDOUT to automatically decode/encode UTF-8.
>>
>> The one-liner as-is outputs "\xef\xbc\x91" (U+FF11 in UTF-8) instead of
>> ASCII 1 because those 3 bytes are what Perl receives on STDIN, and nothing
>> is decoding those to U+FF11. Your s/\d/1/ only works on *digits*, and none
>> of U+00EF, U+00BC, or U+0091 is. So no change happens.
>>
>> -FG
>
>
> I understand if I get the result, I can use the -CIO flag. I will try to
> learn these flags for a while.
>
>
Re: "use v5.36.0" should imply ASCII source [ In reply to ]
On Sun, Oct 3, 2021, at 2:56 PM, Ricardo Signes wrote:
> *ONE:* What's the end state we'd like to get to?
>
> *TWO:* What's a good next step, keeping in mind that we might not ever get past that next step?
>
> My take is this: The end state I'd like is that strings are in one of three states: declared text, declared bytes, unknown. Semantics exist for how to combine these and deal with I/O discipline. The source code is Unicode and string literals are assumed to be text. A new string literal syntax exists for byte strings, like `qb"..."`.
>
> For my money, a useful next step is that we encourage people to opt-in to "source code is unicode and string literals are text." This means that the programmer is then responsible for thinking about how this will affect their I/O. That concern is already there, we're just pushing around the complexity like a lump under the rug. I think this push is a good one. It lets us enable non-ASCII syntax, and it's pretty well understood. Also, we already have something for qb"...." in the form of "do { use bytes; qq{...} }" but we could probably add a qb, too, if we needed it.

I want to bump this thread, noting: I filed a draft RFC <https://github.com/Perl/RFCs/pull/5> on this, and think it's good to move forward. (I think we can separate the question of "what utf8 do you get with *use utf8*" to future consideration and to make that consistent. I don't think there's a practical argument to be made that we should keep its current weirdness.)

I do think that creating improvements for non-ASCII syntax is a compelling step we can take in the near future, but for now, I would like to still have source encoding as a pragma like this, which can be made ASCII by default under use vX.

--
rjbs
Re: "use v5.36.0" should imply ASCII source [ In reply to ]
2021-11-22 6:00 Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

>
> My take is this: The end state I'd like is that strings are in one of
> three states: declared text, declared bytes, unknown. Semantics exist for
> how to combine these and deal with I/O discipline. The source code is
> Unicode and string literals are assumed to be text. A new string literal
> syntax exists for byte strings, like qb"...".
>
I think a flag for text is needed, instead of the confusing and often misused
utf8::is_utf8.

if (is_text($text)) {
    say Encode::encode('UTF-8', $text);
}


> For my money, a useful next step is that we encourage people to opt-in to
> "source code is unicode and string literals are text." This means that the
> programmer is then responsible for thinking about how this will affect
> their I/O. That concern is already there, we're just pushing around the
> complexity like a lump under the rug. I think this push is a good one. It
> lets us enable non-ASCII syntax, and it's pretty well understood. Also, we
> already have something for qb"...." in the form of "do { use bytes; qq{...}
> }" but we could probably add a qb, too, if we needed it.
>
>
I agree with this.

use v5.40;
# Text (a decoded string). The literal is interpreted as UTF-8.
my $text = "abcde";

# Bytes, for when you need faster index-based access
my $bytes = qb"abcde";