Mailing List Archive

On Sun, Feb 27, 2022 at 1:22 PM Karl Williamson <public@khwilliamson.com>
wrote:

> On 2/26/22 22:25, Dan Book wrote:
> > How would interpreting characters in a different encoding not silently
> > change behavior?
>
> Because, as I said, if it chooses a different encoding than what it
> currently would do, it raises a compilation warning.
>

Ah I see now. You were referring to the silently component; I was speaking
on the behavior change regardless of whether it's silent (of course, the
warning is better than not).

-Dan

Re: tightening up source code encoding semantics [ In reply to ]

Feb 27, 2022, 10:35 AM

Post #28 of 39 (833 views)

On 2/27/22 04:01, G.W. Haywood via perl5-porters wrote:
> Hi there,
>
> On Sun, 27 Feb 2022, Dan Book wrote:
>> On Sat, Feb 26, 2022 Karl Williamson wrote:
>>> On 2/23/22, G.W. Haywood via perl5-porters wrote:
>>>> On Wed, 23 Feb 2022, Dan Book wrote:
>>>>
>>>>> In my opinion, changing program behavior based on a pretty reliable
>>>>> guess
>>>>> rather than a declaration would be a mistake.
>>>>
>>>> +1
>>>>
>>>> "Pretty reliable" == "rarely experienced problem, difficult to
>>>> diagnose".
>>>>
>>>
>>> I believe you guys don't grasp the proposal
>
> To make sure I've grasped it, I've just been over the thread again.
>
> I'm pretty sure I've grasped it. :)
>
>>> No perl program would silently change behavior from the existing
>>> baseline as a result of this proposal.
>>
>> How would interpreting characters in a different encoding not silently
>> change behavior?
>>
>> I concur with the rest of your message but I am familiar with the
>> reliability of the guess, and still believe it is a mistake in any such
>> widespread application.
>
> I too remain of the opinion that this is a bridge too far. I don't want
> to get into nit-picking so I won't start digging holes but I do think
> that there are things mentioned in the proposal that merit more discussion.
> Here are two:
>
> On Wed, 23 Feb 2022 Karl Williamson wrote:
>
>> I believe the only existing programs this scenario would affect are
>> ones that (most likely, unsafely) mix UTF-8 and Latin1.
>
> This was almost an aside in the discussion but it seems to me that
> it's one of the more important issues. Isn't there a case for catching
> potentially unsafe usage in some way if it isn't being caught already?

Please read the original post on this thread. This whole thread is
about trying to prevent unsafe usage. My proposal would do this with
less churn to existing code than the proposal in that original post.
>
>> An advantage is that a 'use utf8' would no longer be required in
>> almost all circumstances.
>
> At this point in our history I see this as a disadvantage, but I admit
> I'm probably numbered amongst the dinosaurs of the coding fraternity.
> Is a consensus to be had?
>

I can't answer that alone

Re: tightening up source code encoding semantics [ In reply to ]

Feb 27, 2022, 11:51 AM

Post #29 of 39 (833 views)

On 2/27/22 11:30, Dan Book wrote:
> On Sun, Feb 27, 2022 at 1:22 PM Karl Williamson <public@khwilliamson.com>
> wrote:
>
> On 2/26/22 22:25, Dan Book wrote:
> > How would interpreting characters in a different encoding not
> silently
> > change behavior?
>
> Because, as I said, if it chooses a different encoding than what it
> currently would do, it raises a compilation warning.
>
>
> Ah I see now. You were referring to the silently component; I was
> speaking on the behavior change regardless of whether it's silent (of
> course, the warning is better than not).
>
> -Dan

I then don't get your objection. It appears you don't feel the warning
is good enough. Do you consider warnings in general to be good enough?
If so, what makes one not be good enough, and in particular this one?

Re: tightening up source code encoding semantics [ In reply to ]

perl5-porters at perl

Feb 27, 2022, 4:06 PM

Post #30 of 39 (833 views)

Hi there,

On Sun, 27 Feb 2022, Karl Williamson wrote:
> On 2/27/22 04:01, G.W. Haywood via perl5-porters wrote:
>> On Sat, Feb 26, 2022 Karl Williamson wrote:
>>>
>>> I believe you guys don't grasp the proposal
>>
>> ... just been over the thread again. ... merit more discussion.
>>
>> On Wed, 23 Feb 2022 Karl Williamson wrote:
>>
>>> I believe the only existing programs this scenario would affect are
>>> ones that (most likely, unsafely) mix UTF-8 and Latin1.
>>
>> This was almost an aside in the discussion but it seems to me that
>> it's one of the more important issues. Isn't there a case for catching
>> potentially unsafe usage in some way if it isn't being caught already?
>
> Please read the original post on this thread. This whole thread is about
> trying to prevent unsafe usage. ...

Well as I said I re-read the thread before posting. I'm afraid that to
me the OP reads more like a sermon than a clear statement of the problem
and proposed solutions but it nevertheless strikes some chords here.

Preventing unsafe usage doesn't seem to be contentious, it's just how
it's attempted that might be. I guess my main worry is that it looks
like there's an awful lot of tinkering going on, and that might cause
some ripples. It also looks like a lot of effort is being dissipated
because developers bound themselves hand and foot before setting out.

While I can't honestly say I'd like it very much, I'd be comfortable
with somebody saying

"Welcome to Perl 7. The source is UTF8".

I'm a lot less comfortable with "As of Perl 5.36.8 your sources will
need to ... because 0.N% of programs have mixed UTF-8 and Latin1, and
most of these probably do it unsafely". Sure there are ways to shoot
yourself in the foot. Lots of them. Is this one a serious problem?
"Doctor, it hurts when I do this ..."

FWIW I'd be fine with nothing but ASCII in code for the rest of my
days, but I don't want to add to the understandable frustration so I'm
out of this now.

--

73,
Ged.

Re: tightening up source code encoding semantics [ In reply to ]

Apr 11, 2022, 2:10 PM

Post #31 of 39 (812 views)

On 2/21/22 19:55, Ricardo Signes wrote:
> Porters,
>
> This is a long email which ends up with me mostly spitballing an idea or
> two about how to improve our handling of source code encoding. Sorry?
>

https://github.com/Perl/perl5/issues/11334 is affected by this proposal

Re: tightening up source code encoding semantics [ In reply to ]

Jun 17, 2022, 6:58 PM

Post #32 of 39 (748 views)

On Sat, Feb 26, 2022, at 23:56, Karl Williamson wrote:
> [ things about how automatic detection could work ]

I will restate, tersely, what I think Karl said. I hope Karl can then say "yes, that's right [or close enough]" or "no."
* if the choices are Latin-1 or UTF-8, it is possible to predict with high confidence which a line of input is
* we can use this to avoid having to declare the encoding
* if encoding is declared, and is at odds with what is detected, a warning (or error) could be issued
So, first off: is that about right?
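
A minimal sketch of the kind of per-line guess summarized above. It is not Karl's actual algorithm: the helper name guess_encoding is hypothetical, and it assumes the only candidates are ASCII, UTF-8 and Latin-1. The idea is that text with high-bit bytes which happens to be well-formed UTF-8 is very unlikely to be Latin-1.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of perl: guess how a line of source is encoded.
sub guess_encoding {
    my ($line) = @_;

    # No high-bit bytes: the line reads the same in ASCII, Latin-1 and UTF-8.
    return 'ASCII' unless $line =~ /[^\x00-\x7F]/;

    # utf8::decode works in place and returns false on malformed input,
    # so run it on a copy and keep only the verdict.
    my $copy = $line;
    return utf8::decode($copy) ? 'UTF-8' : 'Latin-1';
}

# The sample lines use \x escapes so that this file itself stays pure ASCII.
for my $line ("cafe", "caf\xC3\xA9", "caf\xE9") {
    my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $line;
    printf "%-8s %s\n", guess_encoding($line), $hex;
}
```

A declared encoding could then be compared against this guess, with a warning (or error) raised when the two disagree.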

Next: I think this still requires that the program says "my source should be decoded at all". I *do* agree with the assertion that we can "guess" whether input is UTF-8 or Latin-1, but that's not the only relevant question. Imagine this program:
#!/usr/bin/perl
use v5.36;
my $str1 = "??????";
say $str1;

Right now, no matter what content is actually in that string literal, the same bytes that were in the source will be sent to stdout. Imagine that we say "We can detect that the string is UTF-8 bytes, so we decode the bytes in the string literal so that $str1 contains the Unicode codepoints encoded in it." When we print that string, we will get a wide string warning, and we will deserve it. This, more or less, is why this proposal ended up existing rather than the previous one to make "use vX" enable utf8.

It was Felipe G., I believe, who said that users would end up more confused when the [lack of] automatic filehandle discipline didn't match the implicit source decoding. I think that claim was correct. I think we'd do users a disservice if we built strings by decoding the source literals based on encoding detection — not because the detection will be wrong, but because right now there is a bytes-in/bytes-out expectation.
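
The bytes-in/bytes-out expectation, and the warning that appears once a literal holds codepoints, can be sketched as follows. This is an illustration under stated assumptions, not what the compiler would do: the literal is spelled with \x escapes so the file stays ASCII, Encode::decode stands in for the implicit source decoding, and print_raw is a hypothetical helper that prints to an in-memory handle with no encoding layer.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 octets for U+263A, spelled with escapes so this file is pure ASCII.
my $octets = "\xE2\x98\xBA";

# What an implicitly decoded literal would hold instead: one codepoint.
my $text = decode('UTF-8', $octets);

# Hypothetical helper: print to a handle with no encoding layer and
# report what was written and how many warnings were raised.
sub print_raw {
    my ($string) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    open my $fh, '>', \my $buffer or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh;
    return ($buffer, scalar @warnings);
}

my ($bytes_out, $bytes_warned) = print_raw($octets);
my ($text_out,  $text_warned)  = print_raw($text);

printf "octets: length %d, warnings %d\n", length($octets), $bytes_warned;
printf "text:   length %d, warnings %d\n", length($text),   $text_warned;
```

The octet string goes out unchanged and silently; the codepoint string triggers perl's "Wide character" warning because the handle was never told how to encode it.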

Karl: Please tell me if you think I am way off base, here.

I *do* think this all leads to a more exciting possibility, though!

We *could* automatically detect source encoding, but forbid non-ASCII in string literals without declaration. This would allow non-ASCII syntax freely, but would require users clarify that they know their literals will be decoded into codepoint strings rather than octet strings. (If I wanted to keep banging the "adverbs on quote-like operators" drum, I would say that we could easily do this on a per-literal basis that way.) I think the problem we're seeing here is the conflation of text and buffer types in Perl 5, and I feel like we're finding a nice way to smoosh the lump under the carpet into one place, but I don't think we can eliminate it just yet.

--
rjbs

Re: tightening up source code encoding semantics [ In reply to ]

grinnz at gmail

Jun 17, 2022, 8:57 PM

Post #33 of 39 (748 views)

On Fri, Jun 17, 2022 at 9:59 PM Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

> On Sat, Feb 26, 2022, at 23:56, Karl Williamson wrote:
>
> [ things about how automatic detection could work ]
>
>
> I will restate, tersely, what I think Karl said. I hope Karl can then say
> "yes, that's right [or close enough]" or "no."
>
> - if the choices are Latin-1 or UTF-8, It is possible to predict with
> high confidence which a line of input is
> - we can use this to avoid having to declare the encoding
> - if encoding is declared, and is at odds with what is detected, a
> warning (or error) could be issued
>
> So, first off: is that about right?
>
> Next: I think this still requires that the program says "my source should
> be decoded at all". I *do* agree with the assertion that we can "guess"
> whether input is UTF-8 or Latin-1, but that's not the only relevant
> question. Imagine this program:
>
> #!/usr/bin/perl
> use v5.36;
> my $str1 = "??????";
> say $str1;
>
>
> Right now, no matter what content is actually in that string literal, the
> same bytes that were in the source will be sent to stdout. Imagine that we
> say "We can detect that the string is UTF-8 bytes, so we decode the bytes
> in the string literal so that $str1 contains the Unicode codepoints encoded
> in it." When we print that string, we will get a wide string warning, and
> we will deserve it. This, more or less, is why this proposal ended up
> existing rather than the previous one to make "use vX" enable utf8.
>
> It was Felipe G., I believe, who said that users would end up more
> confused when the [lack of] automatic filehandle discipline didn't match
> the implicit source decoding. I think that claim was correct. I think
> we'd do users a disservice if we built strings by decoding the source
> literals based on encoding detection — not because the detection will be
> wrong, but because right now there is a bytes-in/bytes-out expectation.
>
> Karl: Please tell me if you think I am way off base, here.
>
> I *do* think this all leads to a more exciting possibility, though!
>
> We *could* automatically detect source encoding, but forbid non-ASCII in
> string literals without declaration. This would allow non-ASCII syntax
> freely, but would require users clarify that they know their literals will
> be decoded into codepoint strings rather than octet strings. (If I wanted
> to keep banging the "adverbs on quote-like operators" drum, I would say
> that we could easily do this on a per-literal basis that way.) I think the
> problem we're seeing here is the conflation of text and buffer types in
> Perl 5, and I feel like we're finding a nice way to smoosh the lump under
> the carpet into one place, but I don't think we can eliminate it just yet.
>

Due to the wide variety of uses for bytes in source code, I continue to
think any attempt at autodetection that would change the behavior of the
program is a mistake.

-Dan

Re: tightening up source code encoding semantics [ In reply to ]

jkeenan at pobox

Jun 18, 2022, 3:58 AM

Post #34 of 39 (748 views)

On 6/17/22 21:58, Ricardo Signes wrote:

>
> Next: I think this still requires that the program says "my source
> should be decoded at all".

Should there be a "not" after "should" in the above?

Re: tightening up source code encoding semantics [ In reply to ]

Jun 18, 2022, 8:06 AM

Post #35 of 39 (747 views)

On Sat, Jun 18, 2022, at 06:58, James E Keenan wrote:
> On 6/17/22 21:58, Ricardo Signes wrote:
>
> > Next: I think this still requires that the program says "my source should be decoded at all".
>
> Should there be a "not" after "should" in the above?

No.

Right now, if you have a literal string which, in the source, is UTF-8 encoded text, the string in perl land will be the UTF-8 bytes. If we want it to instead be the encoded codepoints, this should be declared.

--
rjbs

Re: tightening up source code encoding semantics [ In reply to ]

Jun 21, 2022, 9:17 AM

Post #36 of 39 (747 views)

On 6/17/22 19:58, Ricardo Signes wrote:
> On Sat, Feb 26, 2022, at 23:56, Karl Williamson wrote:
>> [ things about how automatic detection could work ]
>
> I will restate, tersely, what I think Karl said. I hope Karl can then
> say "yes, that's right [or close enough]" or "no."
>
> * if the choices are Latin-1 or UTF-8, It is possible to predict with
> high confidence which a line of input is
> * we can use this to avoid having to declare the encoding
> * if encoding is declared, and is at odds with what is detected, a
> warning (or error) could be issued
>
> So, first off: is that about right?

Yes. But also we could issue a warning if no encoding is declared and
we decided that it is utf8, hence any time the current behavior is
changed a warning would be raised.
>
> Next: I think this still requires that the program says "my source
> should be decoded at all". I /do/ agree with the assertion that we can
> "guess" whether input is UTF-8 or Latin-1, but that's not the only
> relevant question. Imagine this program:
>
> #!/usr/bin/perl
> use v5.36;
> my $str1 = "??????";
> say $str1;
>
>
> Right now, no matter what content is actually in that string literal,
> the same bytes that were in the source will be sent to stdout. Imagine
> that we say "We can detect that the string is UTF-8 bytes, so we decode
> the bytes in the string literal so that $str1 contains the Unicode
> codepoints encoded in it." When we print that string, we will get a
> wide string warning, and we will deserve it. This, more or less, is why
> this proposal ended up existing rather than the previous one to make
> "use vX" enable utf8.
>
> It was Felipe G., I believe, who said that users would end up more
> confused when the [lack of] automatic filehandle discipline didn't match
> the implicit source decoding. I think that claim was correct. I think
> we'd do users a disservice if we built strings by decoding the source
> literals based on encoding detection — not because the detection will be
> wrong, but because right now there is a bytes-in/bytes-out expectation.
>
> Karl: Please tell me if you think I am way off base, here.

I think I finally understand the issue here; and no you're on base.

But I will beat the drum again against ever using the word 'decode' or
its variants. It is impossible to decode. Everything is always
encoded as something. You can switch encodings, but you can't decode.
I suppose it's clear if you say decode to X. But it doesn't make sense
to decode to an encoding. I presume that what is meant is to decode to
Perl's internal format, but Perl has multiple different internal
formats. So when people use the word 'decode', I don't know what they
actually mean. And I suspect they don't either.
>
> I /do/ think this all leads to a more exciting possibility, though!
>
> We /could/ automatically detect source encoding, but forbid non-ASCII in
> string literals without declaration. This would allow non-ASCII syntax
> freely, but would require users clarify that they know their literals
> will be decoded into codepoint strings rather than octet strings. (If I
> wanted to keep banging the "adverbs on quote-like operators" drum, I
> would say that we could easily do this on a per-literal basis that
> way.) I think the problem we're seeing here is the conflation of text
> and buffer types in Perl 5, and I feel like we're finding a nice way to
> smoosh the lump under the carpet into one place, but I don't think we
> can eliminate it just yet.

This sounds reasonable to me.
>
> --
> rjbs

Re: tightening up source code encoding semantics [ In reply to ]

Jun 22, 2022, 7:17 PM

Post #37 of 39 (732 views)

On Tue, Jun 21, 2022, at 12:17, Karl Williamson wrote:
> > Karl: Please tell me if you think I am way off base, here.
>
> I think I finally understand the issue here; and no you're on base.

Okay, good. I will revisit the proposal and try to make sure we get to something we all/both think is good.

> But I will beat the drum again against ever using the word 'decode' or its variants. It is impossible to decode.

Because I value easy communication with you, I will try to avoid it. But I think it's often clear to me what it means: the "decode" operation maps from a sequence of bytes to a sequence of codepoints.

Obviously the codepoints have to be represented in the computer memory as bytes, but logically in the program they are now treated as codepoints. So when I say "should we decode the source?" I mean "should the compiler decode the source text so that the variables formed out of its literals are codepoint sequences rather than byte sequences."

I don't think people mean "to decode is to transcode into Perl's internal byte format." They mean "to decode is to transform from a byte sequence (in a known encoding and repertoire) to a codepoint sequence." (Well, some people mean that. Some people don't know what they mean. That's just how people are…)
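
The sense of "decode" described above, a mapping from a byte sequence to a codepoint sequence, can be sketched with the Encode module. This is only an illustration of the terminology; the sample string is an assumption chosen for the example, written with escapes so the file stays ASCII.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two octets that are the UTF-8 encoding of U+00E9.
my $octets = "\xC3\xA9";
my @byte_values = map { ord($_) } split //, $octets;

# Decoding maps the byte sequence to a codepoint sequence.
my $codepoints = decode('UTF-8', $octets);
my @codepoint_values = map { ord($_) } split //, $codepoints;

# Encoding maps the codepoint sequence back to bytes.
my $round_trip = encode('UTF-8', $codepoints);

printf "bytes:      %s\n", join ' ', map { sprintf '0x%02X', $_ } @byte_values;
printf "codepoints: %s\n", join ' ', map { sprintf 'U+%04X', $_ } @codepoint_values;
printf "round trip: %s\n", $round_trip eq $octets ? 'same octets' : 'different';
```

How perl stores the codepoint string internally does not matter to the program; what changes is that each element is now a codepoint rather than an octet.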

So: while I don't think it's meaningless or impossible to talk about, I will gladly concede that you find it distracting, and try to be more verbose.

--
rjbs

Re: tightening up source code encoding semantics [ In reply to ]

Jul 15, 2022, 9:10 AM

Post #38 of 39 (673 views)

Porters,

Okay, I'm replying to this thread, but sort of starting anew.

What we want to avoid:
* runtime encoding bugs that could be compile time
* more boilerplate than should be necessary
* breaking old code (apart from the quite outré)
* being locked into ASCII "plus maybe Latin-1" forever for the non-literal source code
We've been through a lot of options, which I will not recount here (sorry). One key pair of points:
* we can with high confidence know whether a document is UTF-8
* we can't know what existing programs do with strings, so we can't detect encoding bugs at compile time (or, really, even run time)
So, here's my new run at the problem (don't reply until you get to the end of this email, okay?):
* the goal state is that Perl programs are always encoded as UTF-8 text files
* literal strings are byte strings (sequences of the octets found in the source document)
* under "use utf8", literal strings are text strings (sequences of codepoints represented by the octets in the source)
* because the source document must be valid Unicode text, a source document of the bytes \x22 \xFF \x22 is *not* legal, because it is not legal UTF-8
* right now, the default is that any byte sequence is legal in the source document, so the program \x22\xFF\x22 is legal, and produces a string whose only element is chr(0xFF)
* this should be rejected at read time, because the source document should be UTF-8
I'm going to stop the bullet list there. Here's what I think: I am describing what, I think, is the *right* state of affairs for Perl 5 if we don't introduce a Str v. Buf type distinction. On the other hand, I imagine the road to getting there: programs saved as Latin-1 encoded files with non-ASCII literals (to say nothing of variable names) need to be warned about, re-encoded, and so on. Is this worth it? I don't know, and I don't know the extent of the work required, but it feels like "surely a bunch."
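
A sketch of the three-byte example from the list above, assuming a strict UTF-8 check via Encode as a stand-in for whatever read-time validation perl might do. The helper name is_valid_utf8_source is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: true if the octets form well-formed UTF-8.
sub is_valid_utf8_source {
    my ($octets) = @_;
    my $copy = $octets;    # decode with a CHECK value may modify its argument
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# The program from the list: a double quote, the byte 0xFF, a double quote.
my $program = "\x22\xFF\x22";

# Today's perl accepts it and yields a one-element string.
my $value = eval $program;

printf "valid UTF-8:  %s\n", is_valid_utf8_source($program) ? 'yes' : 'no';
printf "string value: length %d, ord 0x%02X\n", length($value), ord($value);
```

Under the goal state described above, the same document would be rejected when read, because 0xFF can never appear in well-formed UTF-8.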

*So I want to go back toward the original proposal.*

We should have something like "use ascii;" that says "this source code must be entirely in ASCII". If you say "use utf8", it overrides "use ascii". We aim to turn "use ascii" on in v5.x.0. Then users can't accidentally write in an undeclared encoding. You can't wonder what the behavior of `"????"` is, octet/codepoint-wise, because it is a compile error unless you declared "use utf8". You can't declare "It's UTF-8 including non-ASCII codepoints in the source, but the literal strings are octet strings." Too bad. We could have a qb{...} someday.
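
The rule proposed above can be sketched as a check over source text. This is a rough illustration only: "use ascii" is a proposal in this thread, source_allowed is a hypothetical helper, and the textual search for "use utf8" is a crude stand-in for a real lexically scoped pragma.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed rule. Returns true if the source
# would be accepted: it is pure ASCII, or it contains a "use utf8;" line.
sub source_allowed {
    my ($source) = @_;
    return 1 if $source !~ /[^\x00-\x7F]/;
    return 1 if $source =~ /^\s*use\s+utf8\s*;/m;
    return 0;
}

# Sample sources, with non-ASCII bytes written as escapes to keep this file ASCII.
my @examples = (
    [ 'ASCII only',            qq{my \$s = "cafe";\n} ],
    [ 'non-ASCII, undeclared', qq{my \$s = "caf\xC3\xA9";\n} ],
    [ 'non-ASCII, use utf8',   qq{use utf8;\nmy \$s = "caf\xC3\xA9";\n} ],
);

for my $example (@examples) {
    my ($label, $source) = @$example;
    printf "%-22s %s\n", $label, source_allowed($source) ? 'accepted' : 'compile error';
}
```

The point of the design is that the middle case, where the octet/codepoint question is ambiguous, never gets as far as running.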

I think we have various technical means to provide a "Perl, but with coherent source code encoding semantics" better than that, but I don't think we have the will, and I don't know whether we *should*, relative to our "don't break running code" goals.

So: should we have "use ascii", or something else, or nothing?

--
rjbs

Re: tightening up source code encoding semantics [ In reply to ]

kimoto.yuki at gmail

Jul 15, 2022, 3:19 PM

Post #39 of 39 (673 views)