Mailing List Archive

decode WAS Re: tightening up source code encoding semantics
> On Jun 21, 2022, at 12:17, Karl Williamson <public@khwilliamson.com> wrote:
>
> But I will beat the drum again against ever using the word 'decode' or its variants. It is impossible to decode. Everything is always encoded as something. You can switch encodings, but you can't decode. I suppose it's clear if you say decode to X. But it doesn't make sense to decode to an encoding. I presume that what is meant is to decode to Perl's internal format, but Perl has multiple different internal formats. So when people use the word 'decode', I don't know what they actually mean. And I suspect they don't either.

“Decode to X” makes no sense, IMO; rather, we decode *from* X. A “JSON decode”, for example, takes bytes of JSON and decodes them to a data structure. We could then “encode to JSON” if we want to round-trip that operation.
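
Concretely, a minimal sketch of that round trip with the core JSON::PP
module (any JSON module would do):

    use JSON::PP qw(decode_json encode_json);

    my $json = '{"drink":"caf\u00e9"}';   # bytes of JSON text
    my $data = decode_json($json);        # decode *from* JSON: bytes -> data structure
    my $back = encode_json($data);        # encode *to* JSON: data structure -> bytes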

“Decode” seems a pretty ubiquitous term among languages I’ve seen:

--------
JavaScript:
(new TextDecoder).decode(new Uint8Array([0xc3, 0xa9]))

AssemblyScript:
String.UTF8.decode( (new Uint8Array([0xc3, 0xa9])).buffer )

Python:
b'\xc3\xa9'.decode()

Julia:
decode([0xc3, 0xa9], "UTF-8")

PHP:
utf8_decode("\xc3\xa9")

Go:
utf8.DecodeRuneInString()
--------
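
Perl itself spells it the same way, via the core Encode module; a
minimal sketch:

    use Encode qw(decode);

    my $str = decode('UTF-8', "\xc3\xa9");   # yields "\x{e9}", i.e. é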

Does these languages’ use of “decode” confuse you as its use in Perl does?

From a Perl maintainer’s perspective, you’re right: every string has an encoding for its storage in memory, either bytes/Latin-1 or “generalized UTF-8”. To a Perl *user*, though, Perl’s memory is “off-limits”, and a string is just an opaque sequence of code points, with no notion of encoding.

To a Perl user, for example, "\x{100}" is basically [256]: a one-element uint sequence whose sole member is 256. There’s no encoding in sight. A Perl maintainer, of course, might see that and think of the UTF8 flag.
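
That opacity is demonstrable: the internal representation can be
switched without the user-visible string changing at all. A minimal
sketch, using the utf8::downgrade()/utf8::upgrade() functions that
exist precisely to poke at the internals:

    my $s = my $t = "\x{E9}";   # é, code point 233
    utf8::downgrade($s);        # store internally as bytes/Latin-1
    utf8::upgrade($t);          # store internally as "generalized UTF-8"

    print ord($s), " ", ord($t), "\n";      # 233 233
    print $s eq $t ? "equal" : "unequal";   # equal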

The places where Perl users have to consider Perl’s internals (i.e., “When Unicode Does Not Happen” in perlunicode) are abstraction leaks.

The one thing most of those languages above have that Perl lacks is a type system that distinguishes characters from bytes. But that doesn’t mean that the types are nonexistent; they’re just the Perl user’s purview rather than the language’s. So to a Perl user, “decode these bytes as UTF-8” makes perfect sense, even though to Perl itself it’s just a transform of one opaque uint sequence to another.
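
Viewed that way, a UTF-8 decode is exactly such a transform; a minimal
sketch (the %vd format prints each member of the uint sequence):

    use Encode qw(decode);

    my $in  = "\xc3\xa9";              # uint sequence (195, 169)
    my $out = decode('UTF-8', $in);    # uint sequence (233)

    printf "%vd -> %vd\n", $in, $out;  # prints: 195.169 -> 233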

Given that you believe those of us who use the term “decode” don’t know what we mean: wherein lies my confusion?

-F
Re: decode WAS Re: tightening up source code encoding semantics
On Wed, Jun 22, 2022 at 7:21 AM Felipe Gasper <felipe@felipegasper.com> wrote:

> [snip]

I agree with Felipe: the operation of decoding from, and encoding to, a
byte encoding is a perfectly well-known and well-defined concept.
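
A minimal sketch of that round trip with the core Encode module,
assuming the input is valid UTF-8:

    use Encode qw(decode encode);

    my $bytes = "\xc3\xa9";
    my $text  = decode('UTF-8', $bytes);   # decode *from* the byte encoding
    my $again = encode('UTF-8', $text);    # encode *to* it again
    print $again eq $bytes ? "round-trips\n" : "does not\n";   # round-trips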

-Dan
Re: decode WAS Re: tightening up source code encoding semantics
On 6/22/22 05:21, Felipe Gasper wrote:
>
> [snip]
>
> Does these languages’ use of “decode” confuse you as its use in Perl does?

The only one of those languages I have ever used is JavaScript, a long
time ago, and I never needed to use that aspect of it.
>
> From a Perl maintainer’s perspective, you’re right: every string has an encoding for its storage in memory, either bytes/Latin-1 or “generalized UTF-8”. To a Perl *user*, though, Perl’s memory is “off-limits”, and a string is just an opaque sequence of code points, with no notion of encoding.
>
> [snip]
>
> Given that you believe those of us who use the term “decode” don’t know what we mean: wherein lies my confusion?

I certainly didn't have anyone in particular in mind when I wrote that,
so I hope you didn't take it personally. And I can't answer that
question, for various reasons involving my lack of knowledge of, among
other things, you, your blind spots, and my blind spots.

But I can say that it is more complicated than UTF-8/bytes/Latin-1, or
perhaps more precisely C0, ASCII, C1, Latin-1. Perl supports, to some
extent, any single-byte locale, so it could be Latin-2, -3, ..., or even
some of the mostly obsolete 7-bit national locales. So a string of
bytes may be encoded in any of a bunch of different scripts: a
particular byte may be Thai, or Cyrillic, or Hebrew, or .... I pay
attention to these kinds of possibilities that many Perl programmers
will never encounter.
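
For instance, a sketch with the core Encode module, decoding the same
byte under four different single-byte encodings:

    use Encode qw(decode);

    printf "%-10s U+%04X\n", $_, ord decode($_, "\xE9")
        for qw(ISO-8859-1 ISO-8859-5 ISO-8859-7 ISO-8859-8);
    # ISO-8859-1 U+00E9   é  (Latin)
    # ISO-8859-5 U+0449   щ  (Cyrillic)
    # ISO-8859-7 U+03B9   ι  (Greek)
    # ISO-8859-8 U+05D9   י  (Hebrew)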

So maybe people can speak confidently that they understand this without
realizing there's more to the story, and maybe for their purposes
"decode" works. But I've been mucking about in the weeds long enough
that I've internalized some gotchas that most people need never be
aware of.