There is way too much written here so I will be responding as I can.
On Thu, Sep 2, 2021 at 9:21 AM demerphq <demerphq@gmail.com> wrote:
> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com>
> wrote:
>
>>
>> > On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
>> >
>> > On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com>
>> wrote:
>> > Per recent IRC discussion …
>> >
>> > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
>> confusion regarding the flag’s significance. Some think it indicates
>> whether a given PV stores text versus binary. Some think it means that the
>> PV is valid UTF-8. Still others likely hold other inaccurate views.
>> >
>> > The problem here is the naming. For example, consider `perl -e'my $foo
>> = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that
>> its code points (assuming use of a UTF-8 terminal) correspond to the bytes
>> that encode “é” in UTF-8.
>> >
>> > Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a
>> square/rectangle relationship. All strings are "rectangles", all "squares"
>> are rectangles, some strings are squares, but unless SQUARE flag is ON perl
>> should assume it is a rectangle, not a square. The SQUARE flag should only
>> be set when the rectangle has been proved conclusively to be a square. That
>> the SQUARE flag is off does not mean the rectangle is not a square, merely
>> that the square has not been proved to be such.
>>
>> You’re defining “a UTF-8 string” as “a string whose PV is marked as
>> UTF-8”. I’m defining it as “a string whose Perl-visible code points happen
>> to be valid UTF-8”.
>>
>
> I don't find your definition to be very useful, nor descriptive of how perl
> manages these matters, so I am not using it. You are confusing different
> levels of abstraction. Your definition also would include cases where the
> data is already encoded and flagged as utf8. So it doesn't make sense to me.
>
> Here is the set of definitions that I am operating from:
>
> A "string" is a programming concept inside of Perl which is used to
> represent "text" buffers of memory. There are three levels of abstraction
> for strings, two of which are tightly coupled. The three are the codepoint
> level, semantic level and encoding level.
>
> At the codepoint level you can think of strings as variable length arrays
> of numbers (codepoints), where the numbers are restricted to 0 to 0x10FFFF.
>
> At the semantics level you can think of these numbers (codepoints) as
> representing characters from some form of text with specific rules for
> certain operations like case-folding, as well as a well defined mapping to
> graphemes which are displayed to our eyes when those numbers are rendered
> by a display device like a terminal.
>
> The encoding level of abstraction addresses how those numbers (codepoints)
> will be represented as bytes (octets) in memory inside of Perl, and when
> you directly write the data to disk or to some other output stream.
>
> There are two sets of codepoint range, semantics and encoding available,
> which are controlled by a flag associated with the string called the UTF8
> flag. When set this flag indicates that the string can represent codepoints
> 0 to 0x10FFFF, should have Unicode semantics applied to it, and that its in
> memory representation is variable-width utf8. When the flag is not set it
> indicates the string can represent codepoints 0 to 255, has ASCII
> case-folding semantics, and that its in memory representation is fixed
> width octets.
>
> In order to be able to combine these two types of strings we need to
> define some operations:
>
> upgrading/downgrading: converting a string from one set of semantics and
> encoding to the other while preserving exactly the codepoint level
> representation. By tradition we call it upgrading when we go from Latin-1
> to Unicode with the result being UTF8 on, and we call it downgrading when
> we go from Unicode to Latin1 with the result being UTF8-off. These
> operations are NOT symmetrical. It is *not* possible to downgrade every
> Unicode string to Latin-1, however it is possible to upgrade every Latin-1
> string to Unicode. By tradition upgrade and downgrade functions are noops
> when their input is already in the form expected as the result, but this is
> by tradition only.
>
> decoding/encoding: converting a string from one form to the other in a way
> that transforms the codepoints from one form to a potentially different
> form. Traditionally we speak of decode_utf8() taking a latin1 string
> containing octets that make up a utf8 encoded string, and returning a
> string which is UTF8 on which represents the Unicode version of those
> octets. For well formed input this results in no change to the underlying
> string, but the flag is flipped on. Vice versa we speak of encode_utf8()
> which converts its input to a utf8 encoded form, regardless of what form it
> was represented internally.
>
This is incorrect. Decode converts a string of bytes at the logical level
(upgraded or downgraded does not matter) and returns a string of characters
at the logical level (upgraded or downgraded does not matter). It may
commonly use upgraded or downgraded strings as the input or output for
efficiency but this is not required.
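A minimal sketch of that point, using the core utf8:: functions already shown
elsewhere in this thread (the variable names are mine, purely illustrative).
The same two logical bytes are decoded once from a downgraded scalar and once
from an upgraded one:

```perl
use strict;
use warnings;

# Two logical bytes, 0xC3 0xA9: the UTF-8 encoding of U+00E9.
my $down = "\xC3\xA9";
my $up   = "\xC3\xA9";
utf8::upgrade($up);    # same two ordinals, different internal storage

# Record the internal state of the inputs before decoding.
my $down_flag_before = utf8::is_utf8($down) ? 1 : 0;
my $up_flag_before   = utf8::is_utf8($up)   ? 1 : 0;

# Decode both in place. The logical input is identical in both cases.
my $down_ok = utf8::decode($down) ? 1 : 0;
my $up_ok   = utf8::decode($up)   ? 1 : 0;

printf "flags before decode: down=%d up=%d\n", $down_flag_before, $up_flag_before;
printf "decode succeeded:    down=%d up=%d\n", $down_ok, $up_ok;
printf "results equal:       %s\n", $down eq $up ? "yes" : "no";
printf "length=%d ord=U+%04X\n", length($up), ord($up);
```

Both inputs should decode to the same one-character string, whatever their
internal form was beforehand.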
>
>
>> What you call “a UTF-8 string” is what I propose we call, per existing
>> nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with
>> corresponding code changes. Then the term “UTF-8 string” makes sense from a
>> pure-Perl context without requiring Perl programmers to worry about
>> interpreter internals.
>>
>>
> No. The flag does not mean "upgraded" it means "unicode semantics, utf8
> encoding". Upgrading is one way to get such a string, and it might even be
> the most common, but the most important and likely to be correct way is
> explicit decoding.
>
> If we are to rename the flag then we should just rename it as the UNICODE
> flag. Would have saved a world of confusion.
>
This is exactly what we have defined as "upgraded". Decoding does not
define the internal format of the resulting string at all. A string is in
the upgraded internal format exactly when its UTF8 flag is on.
>
>> Whether Perl stores that code point as one byte or as two is Perl’s
>> business alone … right?
>>
>
> Well it would be weird if we stored Unicode data in a form not supported
> by Unicode. Don't you think? There is no single octet representation of the
> codepoint E9 defined by Unicode as far as I know.
>
>
>>
>> > I do not understand your point that only the initiated can understand
>> this flag. It means one and only one thing: that the perl internals should
>> assume that the buffer contains utf8 encoded data and that perl should
>> apply unicode semantics when doing character and case-sensitive operations,
>> and that perl can make certain assumptions when it is processing the data
>> (eg that it is not malformed).
>>
>> The behaviour you’re talking about is what the unicode_strings and
>> unicode_eval features specifically do away with (i.e., fix), right?
>
>
> I'm not familiar with those enough to comment. I assume they relate to what
> assumptions Perl should make about strings which are constructed as
> literals in the source code, where there is a great deal of ambiguity about
> what is going on compared to actual code that constructs such strings,
> where things are exact.
>
They do not. They relate to consistently applying unicode rules to the
logical contents of the strings (in practice, making sure to work with
upgraded strings internally). The only mechanism that affects the
interpretation of literal strings is "use utf8".
>
>
>>
>> You’re omitting what IMO is the most obvious purpose of the flag: to
>> indicate whether the code points that the PV stores are the plain bytes, or
>> are the UTF-8-decoded code points. This is why you can print() the string
>> in either upgraded or downgraded forms, and it comes out the same.
>>
>
> It's hard to say what you are referring to here. If you mean codepoints
> 0-127, then it is unsurprising as the representation of them is equivalent
> in ASCII and UTF8 and Latin1. But if you mean a codepoint above the ASCII
> plane, then no they should not come out the same. If you are piping that
> data to a file I would expect the octets written to that file to be
> different. (assuming a binary filehandle with no layers magically
> transforming things). If your terminal renders them the same then I assume
> it is doing some magic behind the scenes to deal with malformed utf8.
>
Not correct. An upgraded or downgraded string prints identically because
you are printing the logical ordinals, which this operation does not
change. Whether those ordinals are interpreted as bytes or Unicode
characters depends on what you are printing to, but in either case the
internally-stored bytes are irrelevant to the user except to determine what
those logical ordinals are.
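This can be checked with in-memory filehandles. The following is only a
sketch: bytes_written() is a helper name I made up for illustration, and the
two layers are just the byte-oriented and character-oriented cases.

```perl
use strict;
use warnings;

# The same logical string in both internal forms.
my $down = "caf\xE9";
my $up   = $down;
utf8::upgrade($up);

# Hypothetical helper: print $string through $layer into an in-memory
# buffer and return the bytes that were written, as hex.
sub bytes_written {
    my ($string, $layer) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer)
        or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close($fh) or die "cannot close in-memory handle: $!";
    return join ' ', map { sprintf '%02X', ord } split //, $buffer;
}

for my $layer (':raw', ':encoding(UTF-8)') {
    printf "%-18s down: %s\n", $layer, bytes_written($down, $layer);
    printf "%-18s up:   %s\n", $layer, bytes_written($up,   $layer);
}
```

The layer decides what ends up in the file; the internal form of the scalar
does not.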
>
>>
>> > I also know what happens here:
>> >
>> > my $foo="\x{c3}\x{a9}";
>> > utf8::decode($foo);
>> > Dump($foo);
>> >
>> > SV = PV(0x2303fc0) at 0x2324c98
>> > REFCNT = 1
>> > FLAGS = (POK,IsCOW,pPOK,UTF8)
>> > PV = 0x2328300 "\303\251"\0 [UTF8 "\x{e9}"]
>> > CUR = 2
>> > LEN = 10
>> > COW_REFCNT = 1
>> >
>> > That is, i start off with two octets, C3 - A9, which happens to be the
>> encoding for the codepoint E9, which happens to be é.
>> > I then tell perl to "decode" those octets, which really means I tell
>> perl to check that the octets actually do make up valid utf8. And if perl
>> agrees that indeed these are valid utf8 octets, then it turns the flag on.
>> Now it doesn't matter if you *meant* to construct utf8 in the variable fed
>> to decode, all that matters is that at an octet level those octet happen to
>> make up valid utf8.
>>
>> I think you’re actually breaking the abstraction here by assuming that
>> Perl implements the decode by setting a flag.
>>
>>
> No I am not. The flag is there to tell the perl internals how to
> manipulate the string. decode's task is to take arbitrary strings of octets
> and ensure that they can be decoded as valid utf8 and possibly to do some
> conversion (eg for forbidden utf8 sequences or other normalization) as it
> does so and then SETS THE FLAG. Only once decode is done is the string
> "Unicode" and is the string "utf8". Prior to that it was just random
> octets. It doesn't need to do anything BUT set the flag because its internal
> encoding matches the external encoding in this case. If it was decoding
> UTF16LE then it would have to do conversion as well.
>
Not correct. The flag is there only to tell Perl internals whether the
internal bytes represent the ordinals directly or via UTF-8-like encoding.
The result of decoding can be downgraded, and an upgraded string can be
decoded; these are perfectly cromulent operations if the logical contents
are as expected. A unicode string can exist without ever having been
decoded; all that is required is to call a function that interprets the
ordinals as a unicode string.
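As a rough illustration (variable names are mine), here are three scalars
that hold the same logical string: one straight out of utf8::decode, one
decoded and then downgraded, and one built with chr() that never went through
a decode step at all:

```perl
use strict;
use warnings;

# Decoded from the UTF-8 bytes C3 A9; the flag ends up on.
my $decoded = "\xC3\xA9";
utf8::decode($decoded);

# A copy of the decoded result, downgraded; the flag ends up off.
my $downgraded = $decoded;
utf8::downgrade($downgraded);

# Built directly from the ordinal; never decoded.
my $built = chr(0xE9);

for my $pair ([decoded => $decoded], [downgraded => $downgraded], [built => $built]) {
    my ($name, $string) = @$pair;
    printf "%-10s flag=%d length=%d ord=U+%04X\n",
        $name, (utf8::is_utf8($string) ? 1 : 0), length($string), ord($string);
}

print "all equal\n" if $decoded eq $downgraded && $downgraded eq $built;
```

The flag differs between them, but length(), ord() and eq all see the same
one-character string.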
>
>
>> It would be just as legitimate to mutate the PV to store a single octet,
>> 0xe9, and leave the UTF8 flag off.
>
>
> Nope. That would mean that Perl would use ASCII/Latin-1 case folding rules
> on the result, which would be wrong. It should use Unicode case folding
> rules for codepoint E9 if it was decoded as that codepoint. (Change the
> example to \x{DF} and you can see these issues in the flesh, \x{DF} should
> match "ss" in Unicode, but in ASCII/Latin1 it only matches \x{DF}. The lc()
> version of \x{DF} is "ss" but in Latin-1/Ascii there are no multi-byte case
> folds). Even more suggestive that Perl doing this would be wrong is that
> in fact there is NO valid Unicode encoding of codepoint E9 which is only 1
> octet long. So that would be extremely wrong of Perl to use a non Unicode
> encoding of unicode data, don't you think? Also, what would perl do when the
> codepoint doesn't fit into a single octet? Your argument might have some
> merit if you were arguing that Perl could have decoded it into
> "\x{E9}\0\0\0" and set the UTF-32 flag, but as stated it doesn't make sense.
>
Not correct. Under old rules, yes, the UTF8 flag determined whether Unicode
rules were used in various operations; this was an abstraction break, and so
the unicode_strings feature was added to fix the problem, and it has been
enabled in feature bundles since 5.12.
-Dan
>