> On Jun 21, 2022, at 12:17, Karl Williamson <public@khwilliamson.com> wrote:
>
> But I will beat the drum again against ever using the word 'decode' or its variants. It is impossible to decode. Everything is always encoded as something. You can switch encodings, but you can't decode. I suppose it's clear if you say decode to X. But it doesn't make sense to decode to an encoding. I presume that what is meant is to decode to Perl's internal format, but Perl has multiple different internal formats. So when people use the word 'decode', I don't know what they actually mean. And I suspect they don't either.
“Decode to X” makes no sense, IMO; rather, we decode *from* X. A “JSON decode”, for example, takes bytes of JSON and decodes them to a data structure. We could then “encode to JSON” if we want to round-trip that operation.
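For instance, a minimal round-trip sketch with the core JSON::PP module (the JSON document here is just illustrative):
--------
use JSON::PP qw(decode_json encode_json);

# Decode *from* JSON: UTF-8 bytes in, Perl data structure out.
my $struct = decode_json('{"name":"caf\u00e9"}');

# Encode *to* JSON: data structure in, UTF-8 bytes out again.
my $bytes = encode_json($struct);
--------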
“Decode” seems a pretty ubiquitous term among languages I’ve seen:
--------
JavaScript:
(new TextDecoder).decode(new Uint8Array([0xc3, 0xa9]))
AssemblyScript:
String.UTF8.decode( (new Uint8Array([0xc3, 0xa9])).buffer )
Python:
b'\xc3\xa9'.decode()
Julia:
decode([0xc3, 0xa9], "UTF-8")
PHP:
utf8_decode("\xc3\xa9")
Go:
utf8.DecodeRuneInString("\xc3\xa9")
--------
Does these languages’ use of “decode” confuse you the way its use in Perl does?
From a Perl maintainer’s perspective, you’re right: every string is stored in memory with one of two encodings, either bytes/Latin-1 or “generalized UTF-8”. To a Perl *user*, though, Perl’s memory is “off-limits”, and a string is just an opaque sequence of code points, with no notion of encoding.
To a Perl user, for example, "\x{100}" is basically [256]: a uint sequence with a single member, 256. There’s no encoding in sight. A Perl maintainer, of course, might see that and think of the UTF8 flag.
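A quick sketch of that user-level view (utf8::upgrade below changes only the internal storage format, nothing the user can observe):
--------
my $str = "\x{100}";
printf "%d\n", length $str;   # 1: a single member ...
printf "%d\n", ord $str;      # 256: ... whose value is 256

my $x = "\xe9";
my $y = "\xe9";
utf8::upgrade($y);            # store $y as "generalized UTF-8" internally
print $x eq $y ? "same\n" : "different\n";   # same: the formats are invisible
--------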
The places where Perl users have to consider Perl’s internals--i.e., “When Unicode Does Not Happen” in perlunicode--are abstraction leaks.
The one thing most of those languages above have that Perl lacks is a type system that distinguishes characters from bytes. But that doesn’t mean that the types are nonexistent; they’re just the Perl user’s purview rather than the language’s. So to a Perl user, “decode these bytes as UTF-8” makes perfect sense, even though to Perl itself it’s just a transform of one opaque uint sequence to another.
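E.g., with the core Encode module (a sketch; the variable names are mine):
--------
use Encode qw(decode);

my $bytes = "\xc3\xa9";                # [0xC3, 0xA9]: two members
my $text  = decode('UTF-8', $bytes);   # [0xE9]: one member

printf "%d -> %d\n", length $bytes, length $text;   # 2 -> 1
--------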
Given that you believe those of us who use the term “decode” don’t know what we mean: wherein lies my confusion?
-F