
tightening up source code encoding semantics
Porters,

This is a long email which ends up with me mostly spitballing an idea or two about how to improve our handling of source code encoding. Sorry?

I've been talking with Karl about source::encoding, utf8, and related topics. We got talking about whether "no source::encoding" made sense. Meanwhile, Paul was posting about disallowing downgrade from utf8. Then Karl asked about bytes.pm.

I think the whole situation could do with another round of "Yeah, but what would the best world be?" I will start by saying, "In the best world, bytes.pm would not exist." But it does, and I think we can generally allow it to continue to … do what it does. I will not refer to bytes.pm again in this email.

The big question is, how are we to allow Perl source code to be encoded? I think there are a few options worth mentioning:
* ASCII only, all non-ASCII must be represented by escape sequences
* UTF-8 only, all non-ASCII data must be represented by escape sequences
* bytes only, all bytes read from source become the corresponding codepoint in the source (this is sometimes described as "It's Latin-1 by default", which has been a contentious claim over the years)
* a mixture of bytes and UTF-8
The option we've given, for years, is the last one. We start in bytes mode. "use utf8" indicates that the source document is in UTF-8. When utf8 leaves effect, either because its scope ends or because of "no utf8", we return to bytes mode. This is pretty terrible, in my opinion. What's one's editor to make of this?
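For illustration, a minimal sketch of that scoping behaviour, assuming the
file itself is saved as UTF-8 (so "ß" is the two octets C3 9F on disk):

use strict;
use warnings;
use feature 'say';

{
    use utf8;           # source octets in this scope are decoded as UTF-8
    say length "GROß";  # 4 -- C3 9F become the single character U+00DF
}
{
    no utf8;            # back to bytes mode: each source octet is a codepoint
    say length "GROß";  # 5 -- C3 and 9F are taken as two separate characters
}

The editor, of course, sees the same octets in both blocks.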

If we imagine that the reader can correctly swap between reading bytes and UTF-8 at scope boundaries (which I think I've seen recent evidence that it cannot reliably do), this may be a technically sustainable position. I think it's a *bad* position, though.

"The source is bytes" is a bad position and always has been one, with the *possible* exception of string literals. Unfortunately, we have relatively terrible failure modes around non-ASCII outside of string literals.

*Program:*
<<GROß;
foo
GROß

*Output:*
Can't find string terminator "GRO" anywhere before EOF at - line 1.

I think what we really want is to say *either* "This program has stupid legacy behavior" *or* "this program is encoded in UTF-8". Then we want to strongly, *strongly* encourage the second option. You may want to cry out, now, "I thought you said months ago that we wouldn't force everyone to use UTF-8 encoded source!" I am not quite contradicting myself.

Remember, fellow porter, that ASCII encoded data is a subset of UTF-8 encoded data. Once the source is declared to be in UTF-8, it's much less of a problem to say "specifically, entirely codepoints 0-127 except in scopes where that restriction is lifted." I think the problem with "no utf8" is not that it lets you disallow Japanese text, but that it switches back to bytes mode.

The whole thing makes me think that we want source::encoding (or something like it) to say "this document is UTF-8" and optionally "but only ASCII characters." Once that's said, it can't be undone. There is no "no source::encoding", only a switch to ASCII or not. Ideally, this would be the natural state of the program, but given the "the boilerplate should be a single line" doctrine, I think this is what we want implied by "use v5.x".

This gets us back to the "use v5.x should imply ascii encoding", but further to, "and you can't switch it off". I'd say something like:
* you must declare source encoding before any non-ASCII byte is encountered
* you must declare source encoding at the outermost lexical scope in a file, if you are to declare it at all
--
rjbs
Re: tightening up source code encoding semantics
On Mon, Feb 21, 2022 at 9:56 PM Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

> Porters,
>
> This is a long email which ends up with me mostly spitballing an idea or
> two about how to improve our handling of source code encoding. Sorry?
>
> I've been talking with Karl about source::encoding, utf8, and related
> topics. We got talking about whether "no source::encoding" made sense.
> Meanwhile, Paul was posting about disallowing downgrade from utf8. Then
> Karl asked about bytes.pm.
>
> I think the whole situation could do with another round of "Yeah, but what
> would the best world be?" I will start by saying, "In the best world,
> bytes.pm would not exist." But it does, and I think we can generally
> allow it to continue to … do what it does. I will not refer to bytes.pm
> again in this email.
>
> The big question is, how are we to allow Perl source code to be encoded?
> I think there are a few options worth mentioning:
>
> - ASCII only, all non-ASCII must be represented by escape sequences
> - UTF-8 only, all non-ASCII data must be represented by escape
> sequences
> - bytes only, all bytes read from source become the corresponding
> codepoint in the source (this is sometimes described as "It's Latin-1 by
> default", which has been a contentious claim over the years)
> - a mixture of bytes and UTF-8
>
>
Is the second one meant to just be "UTF-8 only" with no caveats? UTF-8
without non-ASCII data is ... just ASCII.

-Dan
Re: tightening up source code encoding semantics
On 22-02-2022 at 03:55, Ricardo Signes wrote:

> The whole thing makes me think that we want source::encoding (or
> something like it) to say "this document is UTF-8" and optionally "but
> only ASCII characters."  Once that's said, it can't be undone.  There
> is no "no source::encoding", only a switch to ASCII or not.  Ideally,
> this would be the natural state of the program, but given the "the
> boilerplate should be a single line" doctrine, I think this is what we
> want implied by "use v5.x".
>

I would not call it "UTF8 but only ASCII characters". That is confusing
as hell. Call it what it is, ASCII.


> you must declare source encoding before any non-ASCII byte is encountered
>
> * you must declare source encoding at the outermost lexical scope in
> a file, if you are to declare it at all
>
>

This makes so much sense. Would it make sense to continue the current
behaviour, but print a deprecation warning when non-ASCII is encountered
without an explicit or implicit 'use utf8'? (implicit, because this
opens the way for instance for use v5.040 to imply use utf8).


But how much existing code does it break? That is still the big problem
with this proposal: it breaks currently perfectly valid Perl code. Or
did I miss something?


M4
Re: tightening up source code encoding semantics
Hi there,

On Mon, 21 Feb 2022, Ricardo Signes wrote:

> ... Once the source is declared to be in UTF-8, it's much less of a
> problem to say "specifically, entirely codepoints 0-127 except in
> scopes where that restriction is lifted."

I'd be pleased to be able to do that. Maybe something like the

use stricter;

that I mentioned?

> ... I think the problem with "no utf8" is not that it lets you
> disallow Japanese text, but that it switches back to bytes mode.

That's awful.

> ...
> This gets us back to the "use v5.x should imply ascii encoding", but
> further to, "and you can't switch it off" ... something like:
>
> * you must declare source encoding before any non-ASCII byte is encountered

I could live with that. I'd be happy to.

> * you must declare source encoding at the outermost lexical scope in
> a file, if you are to declare it at all

I haven't thought through implications but I think I like that too.

--

73,
Ged.
Re: tightening up source code encoding semantics
> On Feb 21, 2022, at 21:55, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>
> The whole thing makes me think that we want source::encoding (or something like it) to say "this document is UTF-8" and optionally "but only ASCII characters." Once that's said, it can't be undone. There is no "no source::encoding", only a switch to ASCII or not. Ideally, this would be the natural state of the program, but given the "the boilerplate should be a single line" doctrine, I think this is what we want implied by "use v5.x".

IMO “modern Perl” should require source code to be valid UTF-8, full stop. There seems little reason why Rik’s “GROß” example should fail.

Auto-decoding of string literals is a separate, more problematic question. Valid use cases exist either way. If Perl reliably differentiated between decoded/text and non-decoded/byte strings, auto-decode would be a sensible default, but that’s not where we are. As I wrote months ago, auto-decode makes `print "hello"` subtly wrong, which will frustrate a neophyte’s already-thorny first encounter with character encoding in Perl.

For context: cPanel’s internal rule is “all strings are byte strings unless you really need text”. It’s rare that we need Unicode semantics, and forgoing both decode and encode steps all but eliminates that class of bugs for us. (FWIW I think this would actually serve many Perl applications besides cPanel better than the decode/encode workflow.)

Requiring source code to be valid UTF-8, but *not* auto-decoding literals, would solve the “GROß” problem while still avoiding the print-hello-is-subtly-wrong awkwardness.
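To make the contrast concrete, a hedged sketch, assuming a UTF-8-saved
file, a UTF-8 terminal, and no :encoding layer on STDOUT:

use warnings;
use feature 'say';

{
    # With auto-decoded literals (what "use utf8" does today):
    use utf8;
    say length "GROß";  # 4 -- a character string
    say "☃";            # "Wide character in say" warning on a handle with no
                        # :encoding layer; the kind of surprise described above
}

# Under "source must be valid UTF-8, but literals are not auto-decoded",
# the same literal stays octets, as it does today without "use utf8":
say length "GROß";      # 5 -- G, R, O, 0xC3, 0x9F
say "GROß";             # the raw octets round-trip unchanged to the terminal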

-F
Re: tightening up source code encoding semantics
On Tue, 22 Feb 2022 08:45:52 -0500, Felipe Gasper <felipe@felipegasper.com> wrote:

> > On Feb 21, 2022, at 21:55, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
> >
> > The whole thing makes me think that we want source::encoding (or
> > something like it) to say "this document is UTF-8" and optionally
> > "but only ASCII characters." Once that's said, it can't be undone.
> > There is no "no source::encoding", only a switch to ASCII or not.
> > Ideally, this would be the natural state of the program, but given
> > the "the boilerplate should be a single line" doctrine, I think
> > this is what we want implied by "use v5.x".
>
> IMO “modern Perl” should require source code to be valid UTF-8, full
> stop.

Respectfully disagree. There are still plenty of environments where
entering UTF-8 in source files is a real problem, and entering
ISO-8859-1 is easy.

> There seems little reason why Rik’s “GROß” example should fail.

What if the .pl is in iso-8859? Currently it fails just like Rik's example

$ cat test.pl
#!env perl

use 5.18.3;
use warnings;

say <<GROß;
Hello Iso-8859-1
GROß

$ dump test.pl
[DUMP 0.6.01]

00000000 23 21 2F 70 72 6F 2F 62 69 6E 2F 70 65 72 6C 0A #!/pro/bin/perl.
00000010 0A 75 73 65 20 35 2E 31 38 2E 33 3B 0A 75 73 65 .use 5.18.3;.use
00000020 20 77 61 72 6E 69 6E 67 73 3B 0A 0A 73 61 79 20 warnings;..say
00000030 3C 3C 47 52 4F DF 3B 0A 48 65 6C 6C 6F 20 49 73 <<GRO.;.Hello Is
00000040 6F 2D 38 38 35 39 2D 31 0A 47 52 4F DF 0A o-8859-1.GRO..

$ perl test.pl
Can't find string terminator "GRO" anywhere before EOF at test.pl line 6.


Anyway, heredoc separators should be quoted (your opinion may differ)

$ cat test.pl
#!env perl

use 5.18.3;
use warnings;

say <<"GROß";
Hello Iso-8859-1
GROß

$ dump test.pl
[DUMP 0.6.01]

00000000 23 21 2F 70 72 6F 2F 62 69 6E 2F 70 65 72 6C 0A #!/pro/bin/perl.
00000010 0A 75 73 65 20 35 2E 31 38 2E 33 3B 0A 75 73 65 .use 5.18.3;.use
00000020 20 77 61 72 6E 69 6E 67 73 3B 0A 0A 73 61 79 20 warnings;..say
00000030 3C 3C 22 47 52 4F DF 22 3B 0A 48 65 6C 6C 6F 20 <<"GRO.";.Hello
00000040 49 73 6F 2D 38 38 35 39 2D 31 0A 47 52 4F DF 0A Iso-8859-1.GRO..

$ perl test.pl
Hello Iso-8859-1


That all said, I would not object to moving to UTF-8 as in almost every
case where I would use this, "\x{..}", "\x{....}", and "\N{.....}"
would be the correct approach

--
H.Merijn Brand https://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.33 porting perl5 on HP-UX, AIX, and Linux
https://tux.nl/email.html http://qa.perl.org https://www.test-smoke.org
Re: tightening up source code encoding semantics
> On Feb 22, 2022, at 09:00, H.Merijn Brand <perl5@tux.freedom.nl> wrote:
>
> On Tue, 22 Feb 2022 08:45:52 -0500, Felipe Gasper <felipe@felipegasper.com> wrote:
>
>>> On Feb 21, 2022, at 21:55, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>>>
>>> The whole thing makes me think that we want source::encoding (or
>>> something like it) to say "this document is UTF-8" and optionally
>>> "but only ASCII characters." Once that's said, it can't be undone.
>>> There is no "no source::encoding", only a switch to ASCII or not.
>>> Ideally, this would be the natural state of the program, but given
>>> the "the boilerplate should be a single line" doctrine, I think
>>> this is what we want implied by "use v5.x".
>>
>> IMO “modern Perl” should require source code to be valid UTF-8, full
>> stop.
>
> Respectfully disagree. There are still plenty environments where
> entering UTF-8 in source files is a real problem, and entering
> ISO-8859-1 is easy.

Sorry, I meant to write: modern Perl should, by *default*, require Perl source code to be valid UTF-8. Perl should, of course, still be able to work in single-byte contexts.

>
> What if the .pl is in iso-8859? Currently it fails just like Rik's example

While I think `<<GROß` in UTF-8 should work, it would surprise me if the benefit from teaching Perl to parse the same heredoc in other encodings as well justified the additional development & maintenance effort.

That said, I’m an Anglophone, so I may not perceive such benefits as readily as, say, a continental European.

> That all said, I would not object to moving to UTF-8 as in almost every
> case where I would use this, "\x{..}", "\x{....}", and "\N{.....}"
> would be the correct approach

I’m not sure what you mean. \N{.....} et al. generally create Unicode strings, not their UTF-8-encoded variants.

I do frequently need to type curly-quotes (“”, and ‘’) via keyboard shortcuts. I’d find it irksome if an official preference were established in Perl for me to write something like: \N{LEFT_CURLY_DOUBLE_QUOTE}Hello\N{RIGHT_CURLY_DOUBLE_QUOTE} (or whatever their real Unicode names are) rather than just option-[, Hello, and shift-option-[. In fact, such would break our Locale::Maketext-based localization tools, which parse source code to extract strings to send to translators.

-F
Re: tightening up source code encoding semantics
On Tue, 22 Feb 2022 09:37:14 -0500, Felipe Gasper <felipe@felipegasper.com> wrote:

> > On Feb 22, 2022, at 09:00, H.Merijn Brand <perl5@tux.freedom.nl> wrote:
> >
> > On Tue, 22 Feb 2022 08:45:52 -0500, Felipe Gasper <felipe@felipegasper.com> wrote:
> >
> [...]
> >>
> >> IMO “modern Perl” should require source code to be valid UTF-8, full
> >> stop.
> >
> > Respectfully disagree. There are still plenty environments where
> > entering UTF-8 in source files is a real problem, and entering
> > ISO-8859-1 is easy.
>
> Sorry, I meant to write: modern Perl should, by *default*, require
> Perl source code to be valid UTF-8. Perl should, of course, still be
> able to work in single-byte contexts.
>
> > What if the .pl is in iso-8859? Currently it fails just like Rik's
> > example
>
> While I think `<<GROß>` in UTF-8 should work, it would surprise me if
> the benefit from teaching Perl to parse the same heredoc in other
> encodings as well justified the additional development & maintenance
> effort.
>
> That said, I’m an Anglophone, so I may not perceive such benefits as
> readily as, say, a continental European.
>
> > That all said, I would not object to moving to UTF-8 as in almost
> > every case where I would use this, "\x{..}", "\x{....}", and
> > "\N{.....}" would be the correct approach
>
> I’m not sure what you mean. \N{.....} et al. generally create Unicode
> strings, not their UTF-8-encoded variants.

I personally never use hardcoded non-ASCII characters in perl source
code if I can prevent it. If I need iso-8859-1, I use \x{..} inside
double quotes. If I need utf-8, I use "\x{....}" and/or "\N{....}".
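For instance (the strings here are just examples; \N{...} names have not
needed an explicit "use charnames" since perl 5.16):

my $latin1  = "Queensr\x{FF}che";                       # U+00FF via \x{..}
my $unicode = "snowman: \x{2603}";                      # U+2603 via \x{....}
my $named   = "caf\N{LATIN SMALL LETTER E WITH ACUTE}"; # U+00E9 via \N{...}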

> I do frequently need to type curly-quotes (“”, and ‘’) via keyboard
> shortcuts. I’d find it irksome if an official preference were
> established in Perl for me to write something like:
> \N{LEFT_CURLY_DOUBLE_QUOTE}Hello\N{RIGHT_CURLY_DOUBLE_QUOTE} (or

I *loathe* special quotes in error messages and source code: they are
just a cause of confusion and errors. If a C compiler or other tool tells
me that <special-quote>file-name<special-quote> contains an error,
double-clicking on the filename most often includes the quotation, and
these quotations are seldom recognized by my shells. Don't tell me to
use a different shell then; using "intelligent" quotes in
error messages sucks. Period.

> whatever their real Unicode names are) rather than just option-[,
> Hello, and shift-option-[. In fact, such would break our
> Locale::Maketext-based localization tools, which parse source code to
> extract strings to send to translators.

I know a lot of people love localization, but I try to stay away from
that as much as possible, as it complicates finding the cause of
problems. The original message - most often in English - is much easier
to google than some form of translation.

> -F

--
H.Merijn Brand https://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.33 porting perl5 on HP-UX, AIX, and Linux
https://tux.nl/email.html http://qa.perl.org https://www.test-smoke.org
Re: tightening up source code encoding semantics
On Tue, Feb 22, 2022, at 1:55 AM, Dan Book wrote:
> On Mon, Feb 21, 2022 at 9:56 PM Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:
>> The big question is, how are we to allow Perl source code to be encoded? I think there are a few options worth mentioning:
>> * ASCII only, all non-ASCII must be represented by escape sequences
>> * UTF-8 only, all non-ASCII data must be represented by escape sequences
>> * bytes only, all bytes read from source become the corresponding codepoint in the source (this is sometimes described as "It's Latin-1 by default", which has been a contentious claim over the years)
>> * a mixture of bytes and UTF-8
>
> Is the second one meant to just be "UTF-8 only" with no caveats? UTF-8 without non-ASCII data is ... just ASCII.

The second one was very poorly written, I must've seen something shiny while writing it.

The second one should've been "all non-Unicode or non-textual data", not "non-ASCII data". That is:
my $string = "Queensrÿche"; # source contains UTF-8, $string contains U+00FF
my $buffer = "Queensr\xC3\xBFche"; # string, meant to contain UTF-8 (not text), does

--
rjbs
RE: tightening up source code encoding semantics
* you must declare source encoding before any non-ASCII byte is encountered
will the following simple programs continue to work, given that "t-u8.pl" contains correct utf-8?
vad@bonitah:~/sdb1$ perl -w
print "??????\n";
??????
vad@bonitah:~/sdb1$ perl -w t-u8.pl
??????
Press any key to continue...
This is a very simple and reasonable thing to do (AKA DWIM)


Re: tightening up source code encoding semantics
On Tue, Feb 22, 2022, at 12:41 PM, Konovalov, Vadim wrote:
> * you must declare source encoding before any non-ASCII byte is encountered
> will the following simple programs continue to work, given that “t-u8.pl” contains correct utf-8?
> vad@bonitah:~/sdb1$ perl -w
> print "??????\n";
> ??????
> vad@bonitah:~/sdb1$ perl -w t-u8.pl
> ??????
> Press any key to continue...
> This is a very simple and reasonable thing to do (AKA DWIM)

Yes. I think I should've been clearer:
* If source encoding is declared at all, it must be before the first non-ASCII byte.

In my imaginary world, this program is okay:
say "?? ????????";

This program is okay:
use source::encoding 'utf8';
say "?? ????????";

This program is not:
say "??????";
use source::encoding 'utf8';
say "?? ????????";

--
rjbs
Re: tightening up source code encoding semantics
On Tue, Feb 22, 2022 at 4:30 PM Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

> On Tue, Feb 22, 2022, at 12:41 PM, Konovalov, Vadim wrote:
>
>
> - you must declare source encoding before any non-ASCII byte is
> encountered
>
> will the following simple programs continue to work, given that “t-u8.pl”
> contains correct utf-8?
>
> vad@bonitah:~/sdb1$ perl -w
>
> print "??????\n";
>
> ??????
>
> vad@bonitah:~/sdb1$ perl -w t-u8.pl
>
> ??????
>
> Press any key to continue...
>
> This is a very simple and reasonable thing to do (AKA DWIM)
>
>
> Yes. I think I should've been clearer:
>
> - If source encoding is declared at all, it must be before the first
> non-ASCII byte.
>
>
> In my imaginary world, this program is okay:
>
> say "?? ????????";
>
>
> This program is okay:
>
> use source::encoding 'utf8';
> say "?? ????????";
>
>
> This program is not:
>
> say "??????";
> use source::encoding 'utf8';
> say "?? ????????";
>
>
>
Sounds very reasonable.

-Dan
Re: tightening up source code encoding semantics
On Tue, 22 Feb 2022 16:30:04 -0500, "Ricardo Signes" <perl.p5p@rjbs.manxome.org> wrote:

> On Tue, Feb 22, 2022, at 12:41 PM, Konovalov, Vadim wrote:
> > * you must declare source encoding before any non-ASCII byte is encountered
> > will the following simple programs continue to work, given that “t-u8.pl” contains correct utf-8?
> > vad@bonitah:~/sdb1$ perl -w
> > print "??????\n";
> > ??????
> > vad@bonitah:~/sdb1$ perl -w t-u8.pl
> > ??????
> > Press any key to continue...
> > This is a very simple and reasonable thing to do (AKA DWIM)
>
> Yes. I think I should've been clearer:
> * If source encoding is declared at all, it must be before the first non-ASCII byte.

+1

> In my imaginary world, this program is okay:
> say "?? ????????";
>
> This program is okay:
> use source::encoding 'utf8';
> say "?? ????????";
>
> This program is not:
> say "??????";
> use source::encoding 'utf8';
> say "?? ????????";

--
H.Merijn Brand https://tux.nl Perl Monger http://amsterdam.pm.org/
using perl5.00307 .. 5.33 porting perl5 on HP-UX, AIX, and Linux
https://tux.nl/email.html http://qa.perl.org https://www.test-smoke.org
Re: tightening up source code encoding semantics
On 2/21/22 19:55, Ricardo Signes wrote:
> Porters,
>
> This is a long email which ends up with me mostly spitballing an idea or
> two about how to improve our handling of source code encoding.  Sorry?
>
> I've been talking with Karl about source::encoding, utf8, and related
> topics.  We got talking about whether "no source::encoding" made sense.
> Meanwhile, Paul was posting about disallowing downgrade from utf8.  Then
> Karl asked about bytes.pm.
>
> I think the whole situation could do with another round of "Yeah, but
> what would the best world be?"  I will start by saying, "In the best
> world, bytes.pm would not exist."  But it does, and I think we can
> generally allow it to continue to … do what it does.  I will not refer
> to bytes.pm again in this email.
>
> The big question is, how are we to allow Perl source code to be
> encoded?  I think there are a few options worth mentioning:
>
> * ASCII only, all non-ASCII must be represented by escape sequences
> * UTF-8 only, all non-ASCII data must be represented by escape sequences
> * bytes only, all bytes read from source become the corresponding
> codepoint in the source (this is sometimes described as "It's
> Latin-1 by default", which has been a contentious claim over the years)
> * a mixture of bytes and UTF-8
>
> The option we've given, for years, is the last one.  We start in bytes
> mode.  "use utf8" indicates that the source document is in UTF-8.  When
> utf8 leaves effect, either because its scope ends or because of "no
> utf8", we return to bytes mode.  This is pretty terrible, in my
> opinion.  What's one's editor to make of this?
>
> If we imagine that the reader can correctly swap between reading bytes
> and UTF-8 at scope boundaries (which I think I've seen recent evidence
> that it cannot reliably do), this may be a technically sustainable
> position.  I think it's a /bad/ position, though.
>
> "The source is bytes" is a bad position and always has been one, with
> the /possible/ exception of string literals.  Unfortunately, we have
> relatively terrible failure modes around non-ASCII outside of string
> literals.
>
> *Program:*
>
> <<GROß;
> foo
> GROß
>
>
> *Output:*
>
> Can't find string terminator "GRO" anywhere before EOF at - line 1.
>
>
> I think what we really want is to say /either/ "This program has stupid
> legacy behavior" /or/ "this program is encoded in UTF-8".  Then we want
> to strongly, /strongly/ encourage the second option.  You may want to
> cry out, now, "I thought you said months ago that we wouldn't force
> everyone to use UTF-8 encoded source!"  I am not quite contradicting myself.
>
> Remember, fellow porter, that ASCII encoded data is a subset of UTF-8
> encoded data.  Once the source is declared to be in UTF-8, it's much
> less of a problem to say "specifically, entirely codepoints 0-127 except
> in scopes where that restriction is lifted."  I think the problem with
> "no utf8" is not that it lets you disallow Japanese text, but that it
> switches back to bytes mode.
>
> The whole thing makes me think that we want source::encoding (or
> something like it) to say "this document is UTF-8" and optionally "but
> only ASCII characters."  Once that's said, it can't be undone.  There is
> no "no source::encoding", only a switch to ASCII or not.  Ideally, this
> would be the natural state of the program, but given the "the
> boilerplate should be a single line" doctrine, I think this is what we
> want implied by "use v5.x".
>
> This gets us back to the "use v5.x should imply ascii encoding", but
> further to, "and you can't switch it off".  I'd say something like:
>
> * you must declare source encoding before any non-ASCII byte is
> encountered
> * you must declare source encoding at the outermost lexical scope in a
> file, if you are to declare it at all
>
> --
> rjbs
>

An option to think about is that it's possible to pretty reliably guess
the encoding upon encountering the first line containing non-ASCII.
Pod::Simple does this successfully and the choices are UTF-8 vs Windows
CP1252, which is quite a bit harder to distinguish from UTF-8 than our
alternative, Latin1. There have been no reports of problems with its
technique since I beefed it up some years ago.

The confusables for the Latin1 vs UTF-8 case all look like a Latin1
letter or the multiplication sign or division sign, followed by one or
more Latin1 punctuation/symbols or C1 controls. If you look at their
graphics, they all look like mojibake. Hence I'm confident, even
without the Pod::Simple experience, that it is extremely unlikely we
would guess wrong.

Here's how it could work.

You wouldn't need an encoding declaration in your file unless
1) the very unlikely case where we guessed wrong
2) you want to forbid non-ASCII in your file, as the original email
thread discussed.

Absent such a declaration, Perl would parse the file like it does today.
When it encounters the first line containing a non-ASCII, it would
make its guess, and if the guess is UTF-8, raise a warning, if enabled.

'no utf8' would be the way to say "Don't guess UTF-8". It would throw
an error if we had already seen what we took as UTF-8.

'use ascii' (however it is spelled) would cause an error to be thrown if
a non-ASCII is encountered within its scope.

'no ascii' would be a no-op outside the scope of 'use ascii'. Otherwise
it would restore the behavior to whatever it was when the 'use ascii'
was encountered.

I believe the only existing programs this scenario would affect are ones
that (most likely, unsafely) mix UTF-8 and Latin1.

An advantage is that a 'use utf8' would no longer be required in almost
all circumstances.
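For concreteness, a rough sketch of the guess step described above,
applied to the first line containing non-ASCII (illustrative only; the
sub name is invented here, and the real check would live in the tokenizer):

sub guess_encoding_of_line {
    my ($octets) = @_;
    return 'ascii' unless $octets =~ /[^\x00-\x7F]/;   # nothing to guess
    my $copy = $octets;
    # utf8::decode() returns true only for well-formed UTF-8 octets
    return utf8::decode($copy) ? 'utf-8' : 'latin-1';
}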
Re: tightening up source code encoding semantics
On Wed, Feb 23, 2022 at 9:18 AM Karl Williamson <public@khwilliamson.com>
wrote:

> On 2/21/22 19:55, Ricardo Signes wrote:
> > Porters,
> >
> > This is a long email which ends up with me mostly spitballing an idea or
> > two about how to improve our handling of source code encoding. Sorry?
> >
> > I've been talking with Karl about source::encoding, utf8, and related
> > topics. We got talking about whether "no source::encoding" made sense.
> > Meanwhile, Paul was posting about disallowing downgrade from utf8. Then
> > Karl asked about bytes.pm.
> >
> > I think the whole situation could do with another round of "Yeah, but
> > what would the best world be?" I will start by saying, "In the best
> > world, bytes.pm would not exist." But it does, and I think we can
> > generally allow it to continue to … do what it does. I will not refer
> > to bytes.pm again in this email.
> >
> > The big question is, how are we to allow Perl source code to be
> > encoded? I think there are a few options worth mentioning:
> >
> > * ASCII only, all non-ASCII must be represented by escape sequences
> > * UTF-8 only, all non-ASCII data must be represented by escape
> sequences
> > * bytes only, all bytes read from source become the corresponding
> > codepoint in the source (this is sometimes described as "It's
> > Latin-1 by default", which has been a contentious claim over the
> years)
> > * a mixture of bytes and UTF-8
> >
> > The option we've given, for years, is the last one. We start in bytes
> > mode. "use utf8" indicates that the source document is in UTF-8. When
> > utf8 leaves effect, either because its scope ends or because of "no
> > utf8", we return to bytes mode. This is pretty terrible, in my
> > opinion. What's one's editor to make of this?
> >
> > If we imagine that the reader can correctly swap between reading bytes
> > and UTF-8 at scope boundaries (which I think I've seen recent evidence
> > that it cannot reliably do), this may be a technically sustainable
> > position. I think it's a /bad/ position, though.
> >
> > "The source is bytes" is a bad position and always has been one, with
> > the /possible/ exception of string literals. Unfortunately, we have
> > relatively terrible failure modes around non-ASCII outside of string
> > literals.
> >
> > *Program:*
> >
> > <<GROß;
> > foo
> > GROß
> >
> >
> > *Output:*
> >
> > Can't find string terminator "GRO" anywhere before EOF at - line 1.
> >
> >
> > I think what we really want is to say /either/ "This program has stupid
> > legacy behavior" /or/ "this program is encoded in UTF-8". Then we want
> > to strongly, /strongly/ encourage the second option. You may want to
> > cry out, now, "I thought you said months ago that we wouldn't force
> > everyone to use UTF-8 encoded source!" I am not quite contradicting
> myself.
> >
> > Remember, fellow porter, that ASCII encoded data is a subset of UTF-8
> > encoded data. Once the source is declared to be in UTF-8, it's much
> > less of a problem to say "specifically, entirely codepoints 0-127 except
> > in scopes where that restriction is lifted." I think the problem with
> > "no utf8" is not that it lets you disallow Japanese text, but that it
> > switches back to bytes mode.
> >
> > The whole thing makes me think that we want source::encoding (or
> > something like it) to say "this document is UTF-8" and optionally "but
> > only ASCII characters." Once that's said, it can't be undone. There is
> > no "no source::encoding", only a switch to ASCII or not. Ideally, this
> > would be the natural state of the program, but given the "the
> > boilerplate should be a single line" doctrine, I think this is what we
> > want implied by "use v5.x".
> >
> > This gets us back to the "use v5.x should imply ascii encoding", but
> > further to, "and you can't switch it off". I'd say something like:
> >
> > * you must declare source encoding before any non-ASCII byte is
> > encountered
> > * you must declare source encoding at the outermost lexical scope in a
> > file, if you are to declare it at all
> >
> > --
> > rjbs
> >
>
> An option to think about is that it's possible to pretty reliably guess
> the encoding upon encountering the first line containing non-ASCII.
> Pod::Simple does this successfully and the choices are UTF-8 vs Windows
> CP1252, which is quite a bit harder to distinguish from UTF-8 than our
> alternative, Latin1. There have been no reports of problems with its
> technique since I beefed it up some years ago.
>
> The confusables for the Latin1 vs UTF-8 case all look like a Latin1
> letter or the multiplication sign or division sign, followed by one or
> more Latin1 punctuation/symbols or C1 controls. If you look at their
> graphics, they all look like mojibake. Hence I'm confident, even
> without the Pod::Simple experience, that it is extremely unlikely we
> would guess wrong.
>
> Here's how it could work.
>
> You wouldn't need an encoding declaration in your file unless
> 1) the very unlikely case where we guessed wrong
> 2) you want to forbid non-ASCII in your file, as the original email
> thread discussed.
>
> Absent such a declaration, Perl would parse the file like it does today.
> When it encounters the first line containing a non-ASCII, it would
> make its guess, and if the guess is UTF-8, raise a warning, if enabled.
>
> 'no utf8' would be the way to say "Don't guess UTF-8'. It would throw
> an error if we had already seen what we took as UTF-8.
>
> 'use ascii' (however it is spelled) would cause an error to be thrown if
> a non-ASCII is encountered within its scope.
>
> 'no ascii' would be a no-op outside the scope of 'use ascii'. Otherwise
> it would restore the behavior to whatever it was when the 'use ascii'
> was encountered.
>
> I believe the only existing programs this scenario would effect are ones
> that (most likely, unsafely) mix UTF-8 and Latin1.
>
> An advantage is that a 'use utf8' would no longer be required in almost
> all circumstances.
>

In my opinion, changing program behavior based on a pretty reliable guess
rather than a declaration would be a mistake.

-Dan
Re: tightening up source code encoding semantics
Hi there,

On Wed, 23 Feb 2022, Dan Book wrote:

> In my opinion, changing program behavior based on a pretty reliable guess
> rather than a declaration would be a mistake.

+1

"Pretty reliable" == "rarely experienced problem, difficult to diagnose".

--

73,
Ged.
Re: tightening up source code encoding semantics
On Thu, 24 Feb 2022, 00:15 G.W. Haywood via perl5-porters, <
perl5-porters@perl.org> wrote:

> Hi there,
>
> On Wed, 23 Feb 2022, Dan Book wrote:
>
> > In my opinion, changing program behavior based on a pretty reliable guess
> > rather than a declaration would be a mistake.
>
> +1
>
> "Pretty reliable" == "rarely experienced problem, difficult to diagnose".
>

Yeah, I don't know that I agree. Karl is an expert in these matters, and
pretty conservative about his opinions. If he says pretty reliable I would
assume he is being conservative and actually means "very robust" and not
argue unless I had clear data to contradict him. I would not base my
rejection on opinions and intuition. It's perfectly possible that what he
proposes is reliable and capable of detecting things that could cause
trouble.

I'd like to hear more before we just dismiss his proposal.

Yves
Re: tightening up source code encoding semantics
2022-2-23 23:18 Karl Williamson <public@khwilliamson.com> wrote:

> On 2/21/22 19:55, Ricardo Signes wrote:
> > Porters,
> >
> > This is a long email which ends up with me mostly spitballing an idea or
> > two about how to improve our handling of source code encoding. Sorry?
> >
> > I've been talking with Karl about source::encoding, utf8, and related
> > topics. We got talking about whether "no source::encoding" made sense.
> > Meanwhile, Paul was posting about disallowing downgrade from utf8. Then
> > Karl asked about bytes.pm.
> >
> > I think the whole situation could do with another round of "Yeah, but
> > what would the best world be?" I will start by saying, "In the best
> > world, bytes.pm would not exist." But it does, and I think we can
> > generally allow it to continue to … do what it does. I will not refer
> > to bytes.pm again in this email.
> >
> > The big question is, how are we to allow Perl source code to be
> > encoded? I think there are a few options worth mentioning:
> >
> > * ASCII only, all non-ASCII must be represented by escape sequences
> > * UTF-8 only, all non-ASCII data must be represented by escape
> sequences
> > * bytes only, all bytes read from source become the corresponding
> > codepoint in the source (this is sometimes described as "It's
> > Latin-1 by default", which has been a contentious claim over the
> years)
> > * a mixture of bytes and UTF-8
> >
> > The option we've given, for years, is the last one. We start in bytes
> > mode. "use utf8" indicates that the source document is in UTF-8. When
> > utf8 leaves effect, either because its scope ends or because of "no
> > utf8", we return to bytes mode. This is pretty terrible, in my
> > opinion. What's one's editor to make of this?
> >
> > If we imagine that the reader can correctly swap between reading bytes
> > and UTF-8 at scope boundaries (which I think I've seen recent evidence
> > that it cannot reliably do), this may be a technically sustainable
> > position. I think it's a /bad/ position, though.
> >
> > "The source is bytes" is a bad position and always has been one, with
> > the /possible/ exception of string literals. Unfortunately, we have
> > relatively terrible failure modes around non-ASCII outside of string
> > literals.
> >
> > *Program:*
> >
> > <<GROß;
> > foo
> > GROß
> >
> >
> > *Output:*
> >
> > Can't find string terminator "GRO" anywhere before EOF at - line 1.
> >
> >
> > I think what we really want is to say /either/ "This program has stupid
> > legacy behavior" /or/ "this program is encoded in UTF-8". Then we want
> > to strongly, /strongly/ encourage the second option. You may want to
> > cry out, now, "I thought you said months ago that we wouldn't force
> > everyone to use UTF-8 encoded source!" I am not quite contradicting
> myself.
> >
> > Remember, fellow porter, that ASCII encoded data is a subset of UTF-8
> > encoded data. Once the source is declared to be in UTF-8, it's much
> > less of a problem to say "specifically, entirely codepoints 0-127 except
> > in scopes where that restriction is lifted." I think the problem with
> > "no utf8" is not that it lets you disallow Japanese text, but that it
> > switches back to bytes mode.
> >
> > The whole thing makes me think that we want source::encoding (or
> > something like it) to say "this document is UTF-8" and optionally "but
> > only ASCII characters." Once that's said, it can't be undone. There is
> > no "no source::encoding", only a switch to ASCII or not. Ideally, this
> > would be the natural state of the program, but given the "the
> > boilerplate should be a single line" doctrine, I think this is what we
> > want implied by "use v5.x".
> >
> > This gets us back to the "use v5.x should imply ascii encoding", but
> > further to, "and you can't switch it off". I'd say something like:
> >
> > * you must declare source encoding before any non-ASCII byte is
> > encountered
> > * you must declare source encoding at the outermost lexical scope in a
> > file, if you are to declare it at all
> >
> > --
> > rjbs
> >
>
> An option to think about is that it's possible to pretty reliably guess
> the encoding upon encountering the first line containing non-ASCII.
> Pod::Simple does this successfully and the choices are UTF-8 vs Windows
> CP1252, which is quite a bit harder to distinguish from UTF-8 than our
> alternative, Latin1. There have been no reports of problems with its
> technique since I beefed it up some years ago.
>
> The confusables for the Latin1 vs UTF-8 case all look like a Latin1
> letter or the multiplication sign or division sign, followed by one or
> more Latin1 punctuation/symbols or C1 controls. If you look at their
> graphics, they all look like mojibake. Hence I'm confident, even
> without the Pod::Simple experience, that it is extremely unlikely we
> would guess wrong.
>
> Here's how it could work.
>
> You wouldn't need an encoding declaration in your file unless
> 1) the very unlikely case where we guessed wrong
> 2) you want to forbid non-ASCII in your file, as the original email
> thread discussed.
>
> Absent such a declaration, Perl would parse the file like it does today.
> When it encounters the first line containing a non-ASCII, it would
> make its guess, and if the guess is UTF-8, raise a warning, if enabled.
>
> 'no utf8' would be the way to say "Don't guess UTF-8'. It would throw
> an error if we had already seen what we took as UTF-8.
>
> 'use ascii' (however it is spelled) would cause an error to be thrown if
> a non-ASCII is encountered within its scope.
>
> 'no ascii' would be a no-op outside the scope of 'use ascii'. Otherwise
> it would restore the behavior to whatever it was when the 'use ascii'
> was encountered.
>
> I believe the only existing programs this scenario would effect are ones
> that (most likely, unsafely) mix UTF-8 and Latin1.
>
> An advantage is that a 'use utf8' would no longer be required in almost
> all circumstances.
>

My understanding:

In most cases, the Perl tokenizer can guess correctly whether the source
code is written in Latin-1 or UTF-8.

Only source code that mixes UTF-8 and Latin-1 can actually cause problems.

If users want to make sure it is UTF-8, use "use utf8".

If users want to make sure it is ASCII, use "use ascii".
Re: tightening up source code encoding semantics
Hi there,

On Thu, 24 Feb 2022, demerphq wrote:
> On Thu, 24 Feb 2022, 00:15 G.W. Haywood via perl5-porters wrote:
>> On Wed, 23 Feb 2022, Dan Book wrote:
>>
>>> In my opinion, changing program behavior based on a pretty reliable guess
>>> rather than a declaration would be a mistake.
>>
>> +1
>>
>> "Pretty reliable" == "rarely experienced problem, difficult to diagnose".
>>
>
> Yeah, I don't know that I agree. Karl is an expert in these matters, and
> pretty conservative about his opinions. If he says pretty reliable I would
> assume he is being conservative and actually means "very robust" and not
> argue unless I had clear data to contradict him. i would not base my
> rejection on opinions and intuition. It's perfectly possible that what he
> proposes is reliable and capable of detecting things that could cause
> trouble.
>
> I'd like to hear more before we just dismiss his proposal.

If we're talking about things with the potential to negatively affect
people all over the planet for years to come, it's either reliable or
it isn't. If an expletive is appropriate here, then it invites more,
later, of the kind more commonly experienced with software issues.

I'm thinking if it's more unlikely to get it wrong than it is that say
twelve fortuitous cosmic rays will accidentally empty my bank account,
then that will probably do. But probabilities are notoriously tricky.
Guesstimates for example of hash collision resistance have turned out
to be significantly overestimated, e.g. by 2^15+ for MD5 and SHA-1.

If we can delete the intensifier, consider me persuaded and move on.

--

73,
Ged.
Re: tightening up source code encoding semantics
On 2/23/22 09:14, G.W. Haywood via perl5-porters wrote:
> Hi there,
>
> On Wed, 23 Feb 2022, Dan Book wrote:
>
>> In my opinion, changing program behavior based on a pretty reliable guess
>> rather than a declaration would be a mistake.
>
> +1
>
> "Pretty reliable" == "rarely experienced problem, difficult to diagnose".
>

I believe you guys don't grasp the proposal

No perl program would silently change behavior from the existing
baseline as a result of this proposal.

Please let that soak in.

The proposal does not lead to hard to diagnose issues, because it will
warn at compilation time if it chooses a different interpretation than
the current one.

The advantage it has over the original proposal in this thread is that it
is more lenient: many fewer programs would have to change as a result, and
in almost all instances 'use utf8' would not be required in a program,
which has been a promise in our documentation for a long time.

I should have used a stronger term than 'pretty reliable'. It is well
known that the likelihood of UTF-8 being confused with most other
encodings goes down quite fast as the number of non-ASCII characters in
a string increases. I have never seen a value claimed, but I wouldn't
be surprised if it weren't exponential.

The syntax of UTF-8 consists of a start byte consisting of 2 or more
initial 1 bits followed by a 0, and then any pattern of bits. That
means at least the first three bits are fixed. You have 110xxxxx or
1110xxxx or 11110xxx, etc.

The start byte is followed by some number of continuation bytes, each of
which begins with '10', and then any pattern of 6 bits.

But the constraint is that the number of bytes in a single UTF-8
character is the number of leading set bits in its start byte. That
means that a character is constrained both in its bit patterns, and
length. A following character must be an ASCII one with the leading bit
0, or another sequence of bytes following the constraints I gave above.
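Those structural rules can be written down directly. A sketch of a strict
well-formedness check over the source octets (illustrative only; the
tokenizer would do this in C, and this pattern also rejects surrogates and
overlong forms):

sub octets_are_well_formed_utf8 {
    my ($octets) = @_;
    return $octets =~ m{ \A (?:
          [\x00-\x7F]                         # ASCII, leading bit 0
        | [\xC2-\xDF]         [\x80-\xBF]     # 110xxxxx 10xxxxxx
        | \xE0 [\xA0-\xBF]    [\x80-\xBF]     # 1110xxxx, overlongs excluded
        | [\xE1-\xEC\xEE\xEF] [\x80-\xBF]{2}  # 1110xxxx 10xxxxxx 10xxxxxx
        | \xED [\x80-\x9F]    [\x80-\xBF]     # exclude UTF-16 surrogates
        | \xF0 [\x90-\xBF]    [\x80-\xBF]{2}  # 11110xxx, overlongs excluded
        | [\xF1-\xF3]         [\x80-\xBF]{3}
        | \xF4 [\x80-\x8F]    [\x80-\xBF]{2}  # cap at U+10FFFF
    )* \z }x;
}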

All the bytes, of course, could individually be Latin1 ones. It turns
out that all the possible start bytes are one of:
a) multiplication sign ×
b) division sign ÷
c) Any one of 62 Latin 1 letters like ø

These must be followed by a sequence of 1 or more characters that are C1
controls or Latin 1 punctuation or symbols, like ¥.

A three byte UTF-8 character must end with two Latin1 controls or
symbols in a row.

The programs that are most likely to fool this proposal are those with a
single character that could be two Latin1 bytes, or a single UTF-8 one.
Attached is a printout of all such ones that don't involve a
non-printable C1 control. Find some that are at all likely to be the
single character in a file that would confuse the algorithm in the proposal.
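A sketch of how one could enumerate those two-byte confusables (this is
not the attached program, just an illustration of the same idea):

use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';  # so both interpretations display properly

for my $lead (0xC2 .. 0xDF) {         # every two-byte UTF-8 start byte
    for my $cont (0xA0 .. 0xBF) {     # printable continuations; skips C1 controls
        my $as_latin1 = chr($lead) . chr($cont);                       # two Latin-1 characters
        my $as_utf8   = chr( (($lead & 0x1F) << 6) | ($cont & 0x3F) ); # one decoded character
        printf "%02X %02X  %s  <->  %s\n", $lead, $cont, $as_latin1, $as_utf8;
    }
}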

Yes, one could have a letter followed by a superscript digit; that makes
some sense. But would it be the only such sequence of bytes in the
file, and every other byte is either ASCII or of the same form? It's
very unlikely, but even if that is the case you would be warned at
compilation time.
Re: tightening up source code encoding semantics
On Sat, Feb 26, 2022 at 11:57 PM Karl Williamson <public@khwilliamson.com>
wrote:

> On 2/23/22 09:14, G.W. Haywood via perl5-porters wrote:
> > Hi there,
> >
> > On Wed, 23 Feb 2022, Dan Book wrote:
> >
> >> In my opinion, changing program behavior based on a pretty reliable
> guess
> >> rather than a declaration would be a mistake.
> >
> > +1
> >
> > "Pretty reliable" == "rarely experienced problem, difficult to diagnose".
> >
>
> I believe you guys don't grasp the proposal
>
> No perl program would silently change behavior from the existing
> baseline as a result of this proposal.


How would interpreting characters in a different encoding not silently
change behavior?

I concur with the rest of your message but I am familiar with the
reliability of the guess, and still believe it is a mistake in any such
widespread application.

-Dan
Re: tightening up source code encoding semantics
On Sun, 27 Feb 2022 at 05:57, Karl Williamson <public@khwilliamson.com>
wrote:

> On 2/23/22 09:14, G.W. Haywood via perl5-porters wrote:
> > Hi there,
> >
> > On Wed, 23 Feb 2022, Dan Book wrote:
> >
> >> In my opinion, changing program behavior based on a pretty reliable
> guess
> >> rather than a declaration would be a mistake.
> >
> > +1
> >
> > "Pretty reliable" == "rarely experienced problem, difficult to diagnose".
> >
>
> I believe you guys don't grasp the proposal
>
> No perl program would silently change behavior from the existing
> baseline as a result of this proposal.
>
> Please let that soak in.
>
> The proposal does not lead to hard to diagnose issues, because it will
> warn at compilation time if it chooses a different interpretation than
> the current one.
>

I think this is a key point. If this logic would choose to interpret things
in a strange way the developer would be informed.


>
> The advantage it has over the original one in this thread is it is more
> lenient; many fewer programs would have to change as a result; and in
> almost all instances 'use utf8' would not be required in a program,
> which has been a promise in our documentation for a long time.
>
> I should have used a stronger term than 'pretty reliable'. It is well
> known that the likelihood of UTF-8 being confused with most other
> encodings goes down quite fast as the number of non-ASCII characters in
> a string increases. I have never seen a value claimed, but I wouldn't
> be surprised if it weren't exponential.
>
> The syntax of UTF-8 consists of a start byte consisting of 2 or more
> initial 1 bits followed by a 0, and then any pattern of bits. That
> means at least the first three bits are fixed. You have 110xxxxx or
> 1110xxxx or 11110xxx, etc.
>
> The start byte is followed by some number of continuation bytes, each of
> which begins with '10', and then any pattern of 6 bits.
>
> But the constraint is that the number of bytes in a single UTF-8
> character is the number of leading set bits in its start byte. That
> means that a character is constrained both in its bit patterns, and
> length. A following character must be an ASCII one with the leading bit
> 0, or another sequence of bytes following the constraints I gave above.
>
> All the bytes, of course, could individually be Latin1 ones. It turns
> out that all the possible start bytes are one of:
> a) multiplication sign ×
> b) division sign ÷
> c) Any one of 62 Latin 1 letters like ø
>
> These must be followed by a sequence of 1 or more characters that are C1
> controls or Latin 1 punctuation or symbols, like ¥.
>
> A three byte UTF-8 character must end with two Latin1 controls or
> symbols in a row.
>
> The programs that are most like to fool this proposal are those with
> single character that could be two Latin1 bytes, or a single UTF-8 one.
> Attached is a printout of all such ones that don't involve a
> non-printable C1 control. Find some that at are at all likely to be the
> single character in a file that would confuse the algorithm in the
> proposal.
>

If I understand you correctly, what you are saying is that if you detected
*any* case where the code could not be reliably and correctly processed we
would throw an exception. And if we detected anything that simply couldn't
be valid UTF-8 we would know the file does not contain utf8.

So for instance what you are saying is that if someone wrote "×" (with the
× being the single octet \xD7) you instantly know that this is not utf8.
Similarly if someone wrote '÷' (with the ÷ being the single octet \xF7), we
would also know that this is not utf8. And so on and so forth.

$ perl -MData::Dumper -MEncode=encode_utf8 -wle'for my $str ("\xF7\x27",
qq(\xD7")) { my $s= $str; utf8::decode($s) or printf "can not decode <%s>
|%s| %s\n",encode_utf8($s),join(" ", map { unpack("H*",$_) } split //,
$s),Data::Dumper::qquote($s);} '
can not decode <÷'> |f7 27| "\367'"
can not decode <×"> |d7 22| "\327\""

This lines up with code I wrote for my previous job, "recurse_decode_utf8",
which pretty much universally replaced decode_utf8, since multiply encoding
data as UTF-8 is a very common occurrence when interoperating with older
MySQL and DBI/DBD::MySQL versions, and other encoding-agnostic remote
systems like memcached and redis and whatnot. I think I saw one bug report
related to it doing it wrong, and indeed it was a two-byte sequence.

> Yes, one could have a letter followed by a superscript digit; that makes
> some sense. But would it be the only such sequence of bytes in the
> file, and every other byte is either ASCII or of the same form? It's
> very unlikely, but even if that is the case you would be warned at
> compilation time.


This makes perfect sense to me. The design of utf8 makes these kinds of
things really easy to do, and as you say, once you have more than a very
small number of characters the chance of an error essentially goes to zero.
If a file contained that few characters we could warn that our detection
was being confused and that it was falling back to the old interpretation.
I guess then those scripts *would* need a pragma, is that correct? So it
doesn't totally do away with the need for the pragma, and those who didn't
trust the heuristic to do the right thing could still use it.

I like it a lot!

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
Re: tightening up source code encoding semantics
Hi there,

On Sun, 27 Feb 2022, Dan Book wrote:
> On Sat, Feb 26, 2022 Karl Williamson wrote:
>> On 2/23/22, G.W. Haywood via perl5-porters wrote:
>>> On Wed, 23 Feb 2022, Dan Book wrote:
>>>
>>>> In my opinion, changing program behavior based on a pretty reliable guess
>>>> rather than a declaration would be a mistake.
>>>
>>> +1
>>>
>>> "Pretty reliable" == "rarely experienced problem, difficult to diagnose".
>>>
>>
>> I believe you guys don't grasp the proposal

To make sure I've grasped it, I've just been over the thread again.

I'm pretty sure I've grasped it. :)

>> No perl program would silently change behavior from the existing
>> baseline as a result of this proposal.
>
> How would interpreting characters in a different encoding not silently
> change behavior?
>
> I concur with the rest of your message but I am familiar with the
> reliability of the guess, and still believe it is a mistake in any such
> widespread application.

I too remain of the opinion that this is a bridge too far. I don't want
to get into nit-picking so I won't start digging holes, but I do think
that there are things mentioned in the proposal that merit more discussion.
Here are two:

On Wed, 23 Feb 2022 Karl Williamson wrote:

> I believe the only existing programs this scenario would effect are
> ones that (most likely, unsafely) mix UTF-8 and Latin1.

This was almost an aside in the discussion but it seems to me that
it's one of the more important issues. Isn't there a case for catching
potentially unsafe usage in some way if it isn't being caught already?

> An advantage is that a 'use utf8' would no longer be required in
> almost all circumstances.

At this point in our history I see this as a disadvantage, but I admit
I'm probably numbered amongst the dinosaurs of the coding fraternity.
Is a consensus to be had?

--

73,
Ged.
Re: tightening up source code encoding semantics
On Sun, 27 Feb 2022 at 12:01, G.W. Haywood via perl5-porters <
perl5-porters@perl.org> wrote:

> > An advantage is that a 'use utf8' would no longer be required in
> > almost all circumstances.
>
> At this point in our history I see this as a disadvantage, but I admit
> I'm probably numbered amongst the dinosaurs of the coding fraternity.
> Is a consensus to be had?
>

I don't understand, why would it be a disadvantage? Nothing is stopping you
or others from including a pragma (I say this generically because I am not
clear on which pragma it would be), but why is it bad to *not* need one
99.999% of the time? Especially if we can detect that you forgot it when
you DO need it?

I could understand it would be a disadvantage if sometimes omitting it
would produce a negative outcome and you wouldn't realize it, but Karl has
explained that wouldn't be an issue.

cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"
Re: tightening up source code encoding semantics
Hi there,

On Sun, 27 Feb 2022, demerphq wrote:
> On Sun, 27 Feb 2022 at 12:01, G.W. Haywood via perl5-porters wrote:
>> On Wed, Feb 23, 2022 at 9:18 AM Karl Williamson wrote:
>>>
>>> An advantage is that a 'use utf8' would no longer be required in
>>> almost all circumstances.
>>
>> At this point in our history I see this as a disadvantage, but I admit
>> I'm probably numbered amongst the dinosaurs of the coding fraternity.
>> Is a consensus to be had?
>
> I dont understand, why would it be a disadvantage? ...

Well of the points raised I don't think this is the most important,
but I almost don't understand how anyone would see it as an advantage.
I don't want to sound cranky but suppose we decided that you won't have
to write

use strict;

any more? Would that be a similar advantage?

Like I said, I'm a dinosaur. I'd like to feel that if it does NOT say

use utf8;

then there's a whole trash-can full of worms into which I won't have
to get up to the elbows and which frankly I would dread because it can
waste such a lot of time while contributing not one groat. To try to
amplify I wrote a paragraph here, but it seemed rather too much like a
rant so I tried to put it another way. That seemed like a rant too so
I'll leave it at that and just ask again if there's any consensus - my
reason for making the point in the first place. Is there?

--

73,
Ged.
Re: tightening up source code encoding semantics [ In reply to ]
On 2/26/22 22:25, Dan Book wrote:
> How would interpreting characters in a different encoding not silently
> change behavior?

Because, as I said, if it chooses a different encoding than what it
currently would do, it raises a compilation warning.
Re: tightening up source code encoding semantics [ In reply to ]
On Sun, Feb 27, 2022 at 1:22 PM Karl Williamson <public@khwilliamson.com>
wrote:

> On 2/26/22 22:25, Dan Book wrote:
> > How would interpreting characters in a different encoding not silently
> > change behavior?
>
> Because, as I said, if it chooses a different encoding than what it
> currently would do, it raises a compilation warning.
>

Ah, I see now. You were referring to the "silently" component; I was speaking
of the behavior change regardless of whether it's silent (of course, the
warning is better than nothing).

-Dan
Re: tightening up source code encoding semantics [ In reply to ]
On 2/27/22 04:01, G.W. Haywood via perl5-porters wrote:
> Hi there,
>
> On Sun, 27 Feb 2022, Dan Book wrote:
>> On Sat, Feb 26, 2022 Karl Williamson wrote:
>>> On 2/23/22, G.W. Haywood via perl5-porters wrote:
>>>> On Wed, 23 Feb 2022, Dan Book wrote:
>>>>
>>>>> In my opinion, changing program behavior based on a pretty reliable
>>>>> guess
>>>>> rather than a declaration would be a mistake.
>>>>
>>>> +1
>>>>
>>>> "Pretty reliable" == "rarely experienced problem, difficult to
>>>> diagnose".
>>>>
>>>
>>> I believe you guys don't grasp the proposal
>
> To make sure I've grasped it, I've just been over the thread again.
>
> I'm pretty sure I've grasped it. :)
>
>>> No perl program would silently change behavior from the existing
>>> baseline as a result of this proposal.
>>
>> How would interpreting characters in a different encoding not silently
>> change behavior?
>>
>> I concur with the rest of your message but I am familiar with the
>> reliability of the guess, and still believe it is a mistake in any such
>> widespread application.
>
> I too remain of the opinion that this is a bridge too far.  I don't want
> to get into nit-picking, so I won't start digging holes, but I do think
> that some things mentioned in the proposal merit more discussion.
> Here are two:
>
> On Wed, 23 Feb 2022 Karl Williamson wrote:
>
>> I believe the only existing programs this scenario would affect are
>> ones that (most likely, unsafely) mix UTF-8 and Latin1.
>
> This was almost an aside in the discussion but it seems to me that
> it's one of the more important issues.  Isn't there a case for catching
> potentially unsafe usage in some way if it isn't being caught already?

Please read the original post on this thread. This whole thread is
about trying to prevent unsafe usage. My proposal would do this with
less churn to existing code than the proposal in that original post.
>
>> An advantage is that a 'use utf8' would no longer be required in
>> almost all circumstances.
>
> At this point in our history I see this as a disadvantage, but I admit
> I'm probably numbered amongst the dinosaurs of the coding fraternity.
> Is a consensus to be had?
>

I can't answer that alone.
Re: tightening up source code encoding semantics [ In reply to ]
On 2/27/22 11:30, Dan Book wrote:
> On Sun, Feb 27, 2022 at 1:22 PM Karl Williamson <public@khwilliamson.com> wrote:
>
> On 2/26/22 22:25, Dan Book wrote:
> > How would interpreting characters in a different encoding not
> silently
> > change behavior?
>
> Because, as I said, if it chooses  a different encoding than what it
> currently would do, it raises a compilation warning.
>
>
> Ah I see now. You were referring to the silently component; I was
> speaking on the behavior change regardless of whether it's silent (of
> course, the warning is better than not).
>
> -Dan

Then I don't get your objection. It appears you don't feel the warning
is good enough. Do you consider warnings in general to be good enough?
If so, what makes a warning not good enough, and why this one in particular?
Re: tightening up source code encoding semantics [ In reply to ]
Hi there,

On Sun, 27 Feb 2022, Karl Williamson wrote:
> On 2/27/22 04:01, G.W. Haywood via perl5-porters wrote:
>> On Sat, Feb 26, 2022 Karl Williamson wrote:
>>>
>>> I believe you guys don't grasp the proposal
>>
>> ... just been over the thread again. ... merit more discussion.
>>
>> On Wed, 23 Feb 2022 Karl Williamson wrote:
>>
>>> I believe the only existing programs this scenario would affect are
>>> ones that (most likely, unsafely) mix UTF-8 and Latin1.
>>
>> This was almost an aside in the discussion but it seems to me that
>> it's one of the more important issues.  Isn't there a case for catching
>> potentially unsafe usage in some way if it isn't being caught already?
>
> Please read the original post on this thread. This whole thread is about
> trying to prevent unsafe usage. ...

Well, as I said, I re-read the thread before posting. I'm afraid that to
me the OP reads more like a sermon than a clear statement of the problem
and proposed solutions, but it nevertheless strikes some chords here.

Preventing unsafe usage doesn't seem to be contentious; it's just how
it's attempted that might be. I guess my main worry is that it looks
like there's an awful lot of tinkering going on, and that might cause
some ripples. It also looks like a lot of effort is being dissipated
because developers bound themselves hand and foot before setting out.

While I can't honestly say I'd like it very much, I'd be comfortable
with somebody saying

"Welcome to Perl 7. The source is UTF8".

I'm a lot less comfortable with "As of Perl 5.36.8 your sources will
need to ... because 0.N% of programs have mixed UTF-8 and Latin1, and
most of these probably do it unsafely". Sure there are ways to shoot
yourself in the foot. Lots of them. Is this one a serious problem?
"Doctor, it hurts when I do this ..."

FWIW I'd be fine with nothing but ASCII in code for the rest of my
days, but I don't want to add to the understandable frustration so I'm
out of this now.

--

73,
Ged.
Re: tightening up source code encoding semantics [ In reply to ]
On 2/21/22 19:55, Ricardo Signes wrote:
> Porters,
>
> This is a long email which ends up with me mostly spitballing an idea or
> two about how to improve our handling of source code encoding.  Sorry?
>

https://github.com/Perl/perl5/issues/11334 is affected by this proposal
Re: tightening up source code encoding semantics [ In reply to ]
On Sat, Feb 26, 2022, at 23:56, Karl Williamson wrote:
> [ things about how automatic detection could work ]

I will restate, tersely, what I think Karl said. I hope Karl can then say "yes, that's right [or close enough]" or "no."
* if the choices are Latin-1 or UTF-8, it is possible to predict with high confidence which one a line of input is
* we can use this to avoid having to declare the encoding
* if encoding is declared, and is at odds with what is detected, a warning (or error) could be issued
So, first off: is that about right?

Next: I think this still requires that the program says "my source should be decoded at all". I *do* agree with the assertion that we can "guess" whether input is UTF-8 or Latin-1, but that's not the only relevant question. Imagine this program:
#!/usr/bin/perl
use v5.36;
my $str1 = "??????";
say $str1;

Right now, no matter what content is actually in that string literal, the same bytes that were in the source will be sent to stdout. Imagine that we say "We can detect that the string is UTF-8 bytes, so we decode the bytes in the string literal so that $str1 contains the Unicode codepoints encoded in it." When we print that string, we will get a wide string warning, and we will deserve it. This, more or less, is why this proposal ended up existing rather than the previous one to make "use vX" enable utf8.

It was Felipe G., I believe, who said that users would end up more confused when the [lack of] automatic filehandle discipline didn't match the implicit source decoding. I think that claim was correct. I think we'd do users a disservice if we built strings by decoding the source literals based on encoding detection — not because the detection will be wrong, but because right now there is a bytes-in/bytes-out expectation.
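
To make that concrete, here is a minimal sketch of the failure mode, which is the same one you can get today by opting in with "use utf8" and then not setting a handle layer. The string content is only illustrative, and it assumes a terminal expecting UTF-8:
#!/usr/bin/perl
use v5.36;
use utf8;              # declare: the source is UTF-8, so literals are codepoint strings

my $str1 = "αβγ";      # three codepoints (six octets in the file)
say length $str1;      # 3
say $str1;             # warns "Wide character in say": STDOUT has no :encoding layer

binmode STDOUT, ':encoding(UTF-8)';
say $str1;             # encoded on the way out; no warning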

Karl: Please tell me if you think I am way off base, here.

I *do* think this all leads to a more exciting possibility, though!

We *could* automatically detect source encoding, but forbid non-ASCII in string literals without declaration. This would allow non-ASCII syntax freely, but would require users clarify that they know their literals will be decoded into codepoint strings rather than octet strings. (If I wanted to keep banging the "adverbs on quote-like operators" drum, I would say that we could easily do this on a per-literal basis that way.) I think the problem we're seeing here is the conflation of text and buffer types in Perl 5, and I feel like we're finding a nice way to smoosh the lump under the carpet into one place, but I don't think we can eliminate it just yet.

--
rjbs
Re: tightening up source code encoding semantics [ In reply to ]
On Fri, Jun 17, 2022 at 9:59 PM Ricardo Signes <perl.p5p@rjbs.manxome.org>
wrote:

> On Sat, Feb 26, 2022, at 23:56, Karl Williamson wrote:
>
> [ things about how automatic detection could work ]
>
>
> I will restate, tersely, what I think Karl said. I hope Karl can then say
> "yes, that's right [or close enough]" or "no."
>
> - if the choices are Latin-1 or UTF-8, it is possible to predict with
> high confidence which one a line of input is
> - we can use this to avoid having to declare the encoding
> - if encoding is declared, and is at odds with what is detected, a
> warning (or error) could be issued
>
> So, first off: is that about right?
>
> Next: I think this still requires that the program says "my source should
> be decoded at all". I *do* agree with the assertion that we can "guess"
> whether input is UTF-8 or Latin-1, but that's not the only relevant
> question. Imagine this program:
>
> #!/usr/bin/perl
> use v5.36;
> my $str1 = "??????";
> say $str1;
>
>
> Right now, no matter what content is actually in that string literal, the
> same bytes that were in the source will be sent to stdout. Imagine that we
> say "We can detect that the string is UTF-8 bytes, so we decode the bytes
> in the string literal so that $str1 contains the Unicode codepoints encoded
> in it." When we print that string, we will get a wide string warning, and
> we will deserve it. This, more or less, is why this proposal ended up
> existing rather than the previous one to make "use vX" enable utf8.
>
> It was Felipe G., I believe, who said that users would end up more
> confused when the [lack of] automatic filehandle discipline didn't match
> the implicit source decoding. I think that claim was correct. I think
> we'd do users a disservice if we built strings by decoding the source
> literals based on encoding detection — not because the detection will be
> wrong, but because right now there is a bytes-in/bytes-out expectation.
>
> Karl: Please tell me if you think I am way off base, here.
>
> I *do* think this all leads to a more exciting possibility, though!
>
> We *could* automatically detect source encoding, but forbid non-ASCII in
> string literals without declaration. This would allow non-ASCII syntax
> freely, but would require users clarify that they know their literals will
> be decoded into codepoint strings rather than octet strings. (If I wanted
> to keep banging the "adverbs on quote-like operators" drum, I would say
> that we could easily do this on a per-literal basis that way.) I think the
> problem we're seeing here is the conflation of text and buffer types in
> Perl 5, and I feel like we're finding a nice way to smoosh the lump under
> the carpet into one place, but I don't think we can eliminate it just yet.
>

Due to the wide variety of uses for bytes in source code, I continue to
think any attempt at autodetection that would change the behavior of the
program is a mistake.

-Dan
Re: tightening up source code encoding semantics [ In reply to ]
On 6/17/22 21:58, Ricardo Signes wrote:

>
> Next:  I think this still requires that the program says "my source
> should be decoded at all".

Should there be a "not" after "should" in the above?
Re: tightening up source code encoding semantics [ In reply to ]
On Sat, Jun 18, 2022, at 06:58, James E Keenan wrote:
> On 6/17/22 21:58, Ricardo Signes wrote:
>
> > Next: I think this still requires that the program says "my source should be decoded at all".
>
> Should there be a "not" after "should" in the above?

No.

Right now, if you have a literal string which, in the source, is UTF-8 encoded text, the string in perl land will be the UTF-8 bytes. If we want it to instead be the codepoints those bytes encode, this should be declared.
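
A minimal sketch of the distinction (it assumes the snippet is saved in a file encoded as UTF-8, and the é is just an example character):
use strict;
use warnings;

my $octets = "é";              # no declaration yet: the two octets 0xC3 0xA9
print length($octets), "\n";   # 2

use utf8;                      # from here on, the parser treats the source as UTF-8
my $chars = "é";               # one codepoint, U+00E9
print length($chars), "\n";    # 1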

--
rjbs
Re: tightening up source code encoding semantics [ In reply to ]
On 6/17/22 19:58, Ricardo Signes wrote:
> On Sat, Feb 26, 2022, at 23:56, Karl Williamson wrote:
>> [ things about how automatic detection could work ]
>
> I will restate, tersely, what I think Karl said.  I hope Karl can then
> say "yes, that's right [or close enough]" or "no."
>
> * if the choices are Latin-1 or UTF-8, it is possible to predict with
> high confidence which one a line of input is
> * we can use this to avoid having to declare the encoding
> * if encoding is declared, and is at odds with what is detected, a
> warning (or error) could be issued
>
> So, first off: is that about right?

Yes. But we could also issue a warning if no encoding is declared and
we decided that it is utf8, so that any time the current behavior would
change, a warning would be raised.
>
> Next:  I think this still requires that the program says "my source
> should be decoded at all".  I /do/ agree with the assertion that we can
> "guess" whether input is UTF-8 or Latin-1, but that's not the only
> relevant question.  Imagine this program:
>
> #!/usr/bin/perl
> use v5.36;
> my $str1 = "??????";
> say $str1;
>
>
> Right now, no matter what content is actually in that string literal,
> the same bytes that were in the source will be sent to stdout.  Imagine
> that we say "We can detect that the string is UTF-8 bytes, so we decode
> the bytes in the string literal so that $str1 contains the Unicode
> codepoints encoded in it."  When we print that string, we will get a
> wide string warning, and we will deserve it.  This, more or less, is why
> this proposal ended up existing rather than the previous one to make
> "use vX" enable utf8.
>
> It was Felipe G., I believe, who said that users would end up more
> confused when the [lack of] automatic filehandle discipline didn't match
> the implicit source decoding.  I think that claim was correct.  I think
> we'd do users a disservice if we built strings by decoding the source
> literals based on encoding detection — not because the detection will be
> wrong, but because right now there is a bytes-in/bytes-out expectation.
>
> Karl:  Please tell me if you think I am way off base, here.

I think I finally understand the issue here; and no, you're on base.

But I will beat the drum again against ever using the word 'decode' or
its variants. It is impossible to decode. Everything is always
encoded as something. You can switch encodings, but you can't decode.
I suppose it's clear if you say decode to X. But it doesn't make sense
to decode to an encoding. I presume that what is meant is to decode to
Perl's internal format, but Perl has multiple different internal
formats. So when people use the word 'decode', I don't know what they
actually mean. And I suspect they don't either.
>
> I /do/ think this all leads to a more exciting possibility, though!
>
> We /could/ automatically detect source encoding, but forbid non-ASCII in
> string literals without declaration.  This would allow non-ASCII syntax
> freely, but would require users clarify that they know their literals
> will be decoded into codepoint strings rather than octet strings.  (If I
> wanted to keep banging the "adverbs on quote-like operators" drum, I
> would say that we could easily do this on a per-literal basis that
> way.)  I think the problem we're seeing here is the conflation of text
> and buffer types in Perl 5, and I feel like we're finding a nice way to
> smoosh the lump under the carpet into one place, but I don't think we
> can eliminate it just yet.

This sounds reasonable to me.
>
> --
> rjbs
Re: tightening up source code encoding semantics [ In reply to ]
On Tue, Jun 21, 2022, at 12:17, Karl Williamson wrote:
> > Karl: Please tell me if you think I am way off base, here.
>
> I think I finally understand the issue here; and no you're on base.

Okay, good. I will revisit the proposal and try to make sure we get to something we all/both think is good.

> But I will beat the drum again against ever using the word 'decode' or its variants. It is impossible to decode.

Because I value easy communication with you, I will try to avoid it. But I think it's often clear to me what it means: the "decode" operation maps from a sequence of bytes to a sequence of codepoints.

Obviously the codepoints have to be represented in the computer memory as bytes, but logically in the program they are now treated as codepoints. So when I say "should we decode the source?" I mean "should the compiler decode the source text so that the variables formed out of its literals are codepoint sequences rather than byte sequences."

I don't think people mean "to decode is to transcode into Perl's internal byte format." They mean "to decode is to transform from a byte sequence (in a known encoding and repertoire) to a codepoint sequence." (Well, some people mean that. Some people don't know what they mean. That's just how people are…)
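
In Encode terms, the operation I mean is just this (a trivial sketch):
use Encode qw(decode);

my $octets = "\xC3\xA9";                 # two bytes: the UTF-8 encoding of U+00E9
my $text   = decode('UTF-8', $octets);   # one codepoint: "\x{E9}"

printf "%d octets in, %d codepoint out\n", length($octets), length($text);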

So: while I don't think it's meaningless or impossible to talk about, I will gladly concede that you find it distracting, and try to be more verbose.

--
rjbs
Re: tightening up source code encoding semantics [ In reply to ]
Porters,

Okay, I'm replying to this thread, but sort of starting anew.

What we want to avoid:
* runtime encoding bugs that could be compile time
* more boilerplate than should be necessary
* breaking old code (apart from the quite outré)
* being locked into ASCII "plus maybe Latin-1" forever for the non-literal source code
We've been through a lot of options, which I will not recount here (sorry). One key pair of points:
* we can with high confidence know whether a document is UTF-8 (a rough sketch of such a check follows these two points)
* we can't know what existing programs do with strings, so we can't detect encoding bugs at compile time (or, really, even at run time)
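
This is not the proposed implementation, just a rough sketch of what that first point could mean in code (octets that decode as strict UTF-8 and contain at least one non-ASCII byte are called UTF-8; anything that fails strict decoding is called Latin-1; everything else is plain ASCII):
use Encode ();

sub guess_source_encoding {
    my ($octets) = @_;
    my $copy = $octets;    # decoding with FB_CROAK may modify the buffer it is given
    my $is_utf8 = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    return 'Latin-1' if !$is_utf8;

    # NB: some Latin-1 texts also happen to be valid UTF-8, which is why this
    # is "high confidence" rather than certainty.
    return $octets =~ /[^\x00-\x7F]/ ? 'UTF-8' : 'ASCII';
}
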
So, here's my new run at the problem (don't reply until you get to the end of this email, okay?):
* the goal state is that Perl programs are always encoded as UTF-8 text files
* literal strings are byte strings (sequences of the octets found in the source document)
* under "use utf8", literal strings are text strings (sequences of codepoints represented by the octets in the source)
* because the source document must be valid Unicode text, a source document of the bytes \x22 \xFF \x22 is *not* legal, because it is not legal UTF-8
* right now, the default is that any byte sequence is legal in the source document, so the program \x22\xFF\x22 is legal, and produces a string whose only element is chr(\xFF) (demonstrated in the sketch just after this list)
* this should be rejected at read time, because the source document should be UTF-8
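
Here is the sketch referenced above: today's default behavior, demonstrated with a string eval so that the raw 0xFF byte can be embedded without saving a separate file.
use strict;
use warnings;

my $source = qq{"\xFF"};      # the three bytes \x22 \xFF \x22, as a piece of source text
my $value  = eval $source;    # legal today: the source is read as bytes
printf "length %d, ord 0x%X\n", length($value), ord($value);   # length 1, ord 0xFF
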
I'm going to stop the bullet list there. Here's what I think: I am describing what, I think, is the *right* state of affairs for Perl 5 if we don't introduce a Str v. Buf type distinction. On the other hand, I imagine the road to getting there: programs saved as Latin-1 encoded files with non-ASCII literals (to say nothing of variable names) need to be warned about, re-encoded, and so on. Is this worth it? I don't know, and I don't know the extent of the work required, but it feels like "surely a bunch."

*So I want to go back toward the original proposal.*

We should have something like "use ascii;" that says "this source code must be entirely in ASCII". If you say "use utf8", it overrides "use ascii". We aim to turn "use ascii" on in v5.x.0. Then users can't accidentally write in an undeclared encoding. You can't wonder what the behavior of `"????"` is, octet/codepoint-wise, because it is a compile error unless you declared "use utf8". You can't declare "It's UTF-8, including non-ASCII codepoints in the source, but the literal strings are octet strings." Too bad. We could have a qb{...} someday.
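
To be clear about what "use ascii" would enforce, here is a rough standalone sketch of the same check, scanning a file for any non-ASCII byte. The pragma itself does not exist; this is only the rule it would apply:
use strict;
use warnings;

my $file = shift // $0;
open my $fh, '<:raw', $file or die "can't open $file: $!";
while (my $line = <$fh>) {
    if ($line =~ /([^\x00-\x7F])/) {
        die sprintf "%s: non-ASCII byte 0x%02X on line %d; declare 'use utf8' or use an escape\n",
            $file, ord($1), $.;
    }
}
print "$file is pure ASCII\n";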

I think we have various technical means to provide a "Perl, but with coherent source code encoding semantics" better than that, but I don't think we have the will, and I don't know whether we *should*, relative to our "don't break running code" goals.

So: should we have "use ascii", or something else, or nothing?

--
rjbs
Re: tightening up source code encoding semantics [ In reply to ]
On 2022-7-16 1:10, Ricardo Signes <perl.p5p@rjbs.manxome.org> wrote:

I'm going to stop the bullet list there. Here's what I think: I am
> describing what, I think, is the *right* state of affairs for Perl 5 if
> we don't introduce a Str v. Buf type distinction. On the other hand, I
> imagine the road to getting there: programs saved as Latin-1 encoded files
> with non-ASCII literals (to say nothing of variable names) need to be
> warned about, re-encoded, and so on. Is this worth it? I don't know, and
> I don't know the extent of the work required, but it feels like "surely a
> bunch."
>

I think the user whose code needs to be re-encoded wants to know

1. "What code needs to be changed?"
2. "How do I convert the code?"
3. "Is there a tool that can convert automatically?"