Mailing List Archive

Re: Pre-RFC: Rename SVf_UTF8 et al.
There is way too much written here so I will be responding as I can.

On Thu, Sep 2, 2021 at 9:21 AM demerphq <demerphq@gmail.com> wrote:

> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com>
> wrote:
>
>>
>> > On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
>> >
>> > On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com>
>> wrote:
>> > Per recent IRC discussion …
>> >
>> > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
>> confusion regarding the flag’s significance. Some think it indicates
>> whether a given PV stores text versus binary. Some think it means that the
>> PV is valid UTF-8. Still others likely hold other inaccurate views.
>> >
>> > The problem here is the naming. For example, consider `perl -e'my $foo
>> = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that
>> its code points (assuming use of a UTF-8 terminal) correspond to the bytes
>> that encode “é” in UTF-8.
>> >
>> > Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a
>> square/rectangle relationship. All strings are "rectangles", all "squares"
>> are rectangles, some strings are squares, but unless SQUARE flag is ON perl
>> should assume it is a rectangle, not a square. The SQUARE flag should only
>> be set when the rectangle has been proved conclusively to be a square. That
>> the SQUARE flag is off does not mean the rectangle is not a square, merely
>> that the square has not been proved to be such.
>>
>> You’re defining “a UTF-8 string” as “a string whose PV is marked as
>> UTF-8”. I’m defining it as “a string whose Perl-visible code points happen
>> to be valid UTF-8”.
>>
>
> I don't find your definition to be very useful, nor descriptive of how perl
> manages these matters, so I am not using it. You are confusing different
> levels of abstraction. Your definition also would include cases where the
> data is already encoded and flagged as utf8. So it doesn't make sense to me.
>
> Here is the set of definitions that I am operating from:
>
> A "string" is a programming concept inside of Perl which is used to
> represent "text" buffers of memory. There are three levels of abstraction
> for strings, two of which are tightly coupled. The three are the codepoint
> level, semantic level and encoding level.
>
> At the codepoint level you can think of strings as variable length arrays
> of numbers (codepoints), where the numbers are restricted to 0 to 0x10FFFF.
>
> At the semantics level you can think of these numbers (codepoints) as
> representing characters from some form of text with specific rules for
> certain operations like case-folding, as well as a well defined mapping to
> graphemes which are displayed to our eyes when those numbers are rendered
> by a display device like a terminal.
>
> The encoding level of abstraction addresses how those numbers (codepoints)
> will be represented as bytes (octets) in memory inside of Perl, and when
> you directly write the data to disk or to some other output stream.
>
> There are two sets of codepoint range, semantics and encoding available,
> which are controlled by a flag associated with the string called the UTF8
> flag. When set this flag indicates that the string can represent codepoints
> 0 to 0x10FFFF, should have Unicode semantics applied to it, and that its in
> memory representation is variable-width utf8. When the flag is not set it
> indicates the string can represent codepoints 0 to 255, has ASCII
> case-folding semantics, and that its in memory representation is fixed
> width octets.
>
> In order to be able to combine these two types of strings we need to
> define some operations:
>
> upgrading/downgrading: converting a string from one set of semantics and
> encoding to the other while preserving exactly the codepoint level
> representation. By tradition we call it upgrading when we go from Latin-1
> to Unicode with the result being UTF8 on, and we call it downgrading when
> we go from Unicode to Latin1 with the result being UTF8-off. These
> operations are NOT symmetrical. It is *not* possible to downgrade every
> Unicode string to Latin-1, however it is possible to upgrade every Latin-1
> string to Unicode. By tradition upgrade and downgrade functions are noops
> when their input is already in the form expected as the result, but this is
> by tradition only.
>
> decoding/encoding: converting a string from one form to the other in a way
> that transforms the codepoints from one form to a potentially different
> form. Traditionally we speak of decode_utf8() taking a latin1 string
> containing octets that make up a utf8 encoded string, and returning a
> string which is UTF8 on which represents the Unicode version of those
> octets. For well formed input this results in no change to the underlying
> string, but the flag is flipped on. Vice versa we speak of encode_utf8()
> which converts its input to a utf8 encoded form, regardless of what form it
> was represented internally.
>

This is incorrect. Decode converts a string of bytes at the logical level
(upgraded or downgraded does not matter) and returns a string of characters
at the logical level (upgraded or downgraded does not matter). It may
commonly use upgraded or downgraded strings as the input or output for
efficiency but this is not required.
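
A minimal sketch of the point above, assuming a Perl with the core Encode module: the same two logical bytes are decoded once from a downgraded scalar and once from an upgraded one. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode ();

# Two logical bytes, 0xC3 0xA9, stored downgraded (UTF8 flag off).
my $downgraded = "\xC3\xA9";

# The same two logical bytes, stored upgraded (UTF8 flag on).
my $upgraded = "\xC3\xA9";
utf8::upgrade($upgraded);

# Decode both; the internal storage of the input should not matter.
my $from_downgraded = Encode::decode('UTF-8', $downgraded);
my $from_upgraded   = Encode::decode('UTF-8', $upgraded);

# Show the resulting ordinals in hex.
printf "from downgraded input: %vX\n", $from_downgraded;
printf "from upgraded input:   %vX\n", $from_upgraded;
```

Both calls are expected to yield the single character U+00E9, since decoding looks at the logical ordinals of its input rather than at how they are stored.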


>
>
>> What you call “a UTF-8 string” is what I propose we call, per existing
>> nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with
>> corresponding code changes. Then the term “UTF-8 string” makes sense from a
>> pure-Perl context without requiring Perl programmers to worry about
>> interpreter internals.
>>
>>
> No. The flag does not mean "upgraded" it means "unicode semantics, utf8
> encoding". Upgrading is one way to get such a string, and it might even be
> the most common, but the most important and likely to be correct way is
> explicit decoding.
>
> If we are to rename the flag then we should just rename it as the UNICODE
> flag. Would have saved a world of confusion.
>

This is exactly what we have defined as "upgraded". Decoding does not
define the internal format of the resulting string at all. The only
internal format which is upgraded is when the UTF8 flag is on.


>
>> Whether Perl stores that code point as one byte or as two is Perl’s
>> business alone … right?
>>
>
> Well it would be weird if we stored Unicode data in a form not supported
> by Unicode. Don't you think? There is no single octet representation of the
> codepoint E9 defined by Unicode as far as I know.
>
>
>>
>> > I do not understand your point that only the initiated can understand
>> this flag. It means one and only one thing: that the perl internals should
>> assume that the buffer contains utf8 encoded data and that perl should
>> apply unicode semantics when doing character and case-sensitive operations,
>> and that perl can make certain assumptions when it is processing the data (eg
>> that it is not malformed).
>>
>> The behaviour you’re talking about is what the unicode_strings and
>> unicode_eval features specifically do away with (i.e., fix), right?
>
>
> I'm not familiar with those enough to comment. I assume they relate to what
> assumptions Perl should make about strings which are constructed as
> literals in the source code, where there is a great deal of ambiguity about
> what is going on compared to actual code that constructs such strings,
> where things are exact.
>

They do not. They relate to consistently applying unicode rules to the
logical contents of the strings (in practice, making sure to work with
upgraded strings internally). The only mechanism that affects the
interpretation of literal strings is "use utf8".


>
>
>>
>> You’re omitting what IMO is the most obvious purpose of the flag: to
>> indicate whether the code points that the PV stores are the plain bytes, or
>> are the UTF-8-decoded code points. This is why you can print() the string
>> in either upgraded or downgraded forms, and it comes out the same.
>>
>
> It's hard to say what you are referring to here. If you mean codepoints
> 0-127, then it is unsurprising as the representation of them is equivalent
> in ASCII and UTF8 and Latin1. But if you mean a codepoint above the ASCII
> plane, then no they should not come out the same. If you are piping that
> data to a file I would expect the octets written to that file to be
> different. (assuming a binary filehandle with no layers magically
> transforming things). If your terminal renders them the same then I assume
> it is doing some magic behind the scenes to deal with malformed utf8.
>

Not correct. An upgraded or downgraded string prints identically because
you are printing the logical ordinals which do not change by this
operation. Whether those ordinals are interpreted as bytes or Unicode
characters depends on what you are printing to, but in either case the
internally-stored bytes are irrelevant to the user except to determine what
those logical ordinals are.
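
One way to check this with code is to print the same ordinal in both internal forms to in-memory filehandles and compare the bytes written. This is only a sketch; the helper `bytes_written` is a hypothetical name introduced for the illustration.

```perl
use strict;
use warnings;

# Print $string to an in-memory filehandle opened with the given layers
# and return the bytes that were written.
sub bytes_written {
    my ($layers, $string) = @_;
    my $buffer = '';
    open my $fh, ">$layers", \$buffer or die "open failed: $!";
    print {$fh} $string;
    close $fh or die "close failed: $!";
    return $buffer;
}

my $downgraded = "\xE9";      # one ordinal, 0xE9, UTF8 flag off
my $upgraded   = "\xE9";
utf8::upgrade($upgraded);     # same ordinal, UTF8 flag on

# A binary handle writes the ordinal as a single byte in both cases.
my $raw_down = bytes_written(':raw', $downgraded);
my $raw_up   = bytes_written(':raw', $upgraded);

# An encoding layer writes the UTF-8 encoding of U+00E9 in both cases.
my $enc_down = bytes_written(':encoding(UTF-8)', $downgraded);
my $enc_up   = bytes_written(':encoding(UTF-8)', $upgraded);

printf "%-18s %s %s\n", ':raw',             unpack('H*', $raw_down), unpack('H*', $raw_up);
printf "%-18s %s %s\n", ':encoding(UTF-8)', unpack('H*', $enc_down), unpack('H*', $enc_up);
```

What changes the output is the layer on the filehandle, not the internal form of the string.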


>
>>
>> > I also know what happens here:
>> >
>> > my $foo="\x{c3}\x{a9}";
>> > utf8::decode($foo);
>> > Dump($foo);
>> >
>> > SV = PV(0x2303fc0) at 0x2324c98
>> > REFCNT = 1
>> > FLAGS = (POK,IsCOW,pPOK,UTF8)
>> > PV = 0x2328300 "\303\251"\0 [UTF8 "\x{e9}"]
>> > CUR = 2
>> > LEN = 10
>> > COW_REFCNT = 1
>> >
>> > That is, i start off with two octets, C3 - A9, which happens to be the
>> encoding for the codepoint E9, which happens to be é.
>> > I then tell perl to "decode" those octets, which really means I tell
>> perl to check that the octets actually do make up valid utf8. And if perl
>> agrees that indeed these are valid utf8 octets, then it turns the flag on.
>> Now it doesn't matter if you *meant* to construct utf8 in the variable fed
>> to decode, all that matters is that at an octet level those octet happen to
>> make up valid utf8.
>>
>> I think you’re actually breaking the abstraction here by assuming that
>> Perl implements the decode by setting a flag.
>>
>>
> No I am not. The flag is there to tell the perl internals how to
> manipulate the string. decode's task is to take arbitrary strings of octets
> and ensure that they can be decoded as valid utf8 and possibly to do some
> conversion (eg for forbidden utf8 sequences or other normalization) as it
> does so and then SETS THE FLAG. Only once decode is done is the string
> "Unicode" and is the string "utf8". Prior to that it was just random
> octets. It doesn't need to do anything BUT set the flag because its internal
> encoding matches the external encoding in this case. If it was decoding
> UTF16LE then it would have to do conversion as well.
>

Not correct. The flag is there only to tell Perl internals whether the
internal bytes represent the ordinals directly or via UTF-8-like encoding.
The result of decoding can be downgraded, and an upgraded string can be
decoded; these are perfectly cromulent operations if the logical contents
are as expected. A unicode string can exist without ever having been
decoded, all that is required is to call a function that interprets the
ordinals as a unicode string.
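
As a rough illustration, assuming the core Encode module, the result of a decode can be downgraded without changing its logical content, and an equal string can be built without any decode step at all. Variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

# Decode two UTF-8 bytes into the one-character string U+00E9.
my $decoded = Encode::decode('UTF-8', "\xC3\xA9");

# Downgrade a copy of the decoded result: storage changes, content does not.
my $downgraded = $decoded;
utf8::downgrade($downgraded);

# Build the same logical string from its ordinal, with no decoding involved.
my $never_decoded = chr(0xE9);

printf "decoded:       %vX (length %d)\n", $decoded, length $decoded;
printf "downgraded:    %vX (flag %s)\n", $downgraded,
    utf8::is_utf8($downgraded) ? 'on' : 'off';
printf "never decoded: %vX\n", $never_decoded;
```

All three scalars compare equal with `eq`, because string comparison works on the logical ordinals.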


>
>
>> It would be just as legitimate to mutate the PV to store a single octet,
>> 0xe9, and leave the UTF8 flag off.
>
>
> Nope. That would mean that Perl would use ASCII/Latin-1 case folding rules
> on the result, which would be wrong. It should use Unicode case folding
> rules for codepoint E9 if it was decoded as that codepoint. (Change the
> example to \x{DF} and you can see these issues in the flesh, \x{DF} should
> match "ss" in Unicode, but in ASCII/Latin1 it only matches \x{DF}. The lc()
> version of \x{DF} is "ss" but in Latin-1/Ascii there are no multi-byte case
> folds). Even more suggestive that Perl doing this would be wrong is that
> in fact there is NO valid Unicode encoding of codepoint E9 which is only 1
> octet long. So that would be extremely wrong of Perl to use a non Unicode
> encoding of unicode data don't you think? Also, what would perl do when the
> codepoint doesn't fit into a single octet? Your argument might have some
> merit if you were arguing that Perl could have decoded it into
> "\x{E9}\0\0\0" and set the UTF-32 flag, but as stated it doesn't make sense.
>

Not correct. Under old rules, yes, the UTF8 flag determined whether Unicode
rules are used in various operations; this was an abstraction break, and so
the unicode_strings feature was added to fix the problem, and enabled in
feature bundles since 5.12.
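
A small sketch of the difference, using `uc` on the ordinal 0xE9 in both internal forms, with and without the `unicode_strings` feature (no locale in effect). Variable names are illustrative.

```perl
use strict;
use warnings;

my $downgraded = "\xE9";      # U+00E9 stored as one byte, UTF8 flag off
my $upgraded   = "\xE9";
utf8::upgrade($upgraded);     # same ordinal, UTF8 flag on

my ($old_down, $old_up, $new_down, $new_up);
{
    # Old rules: the internal form leaks into the result.
    no feature 'unicode_strings';
    $old_down = uc $downgraded;   # stays E9
    $old_up   = uc $upgraded;     # becomes C9
}
{
    # With the feature, both forms get Unicode rules.
    use feature 'unicode_strings';
    $new_down = uc $downgraded;   # becomes C9
    $new_up   = uc $upgraded;     # becomes C9
}

printf "without feature: %vX vs %vX\n", $old_down, $old_up;
printf "with feature:    %vX vs %vX\n", $new_down, $new_up;
```

Without the feature, two strings that compare equal give different `uc` results, which is the abstraction break described above.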

-Dan

>
Re: Pre-RFC: Rename SVf_UTF8 et al.
On Thu, Sep 2, 2021 at 9:53 AM demerphq <demerphq@gmail.com> wrote:

> Having said that I have seen a lot of people for one reason or another get
> encoding wrong in various ways, especially with MySQL or other over-wire
> situations. Double encoding errors are common (eg where people accidentally
> upgrade already encoded but flag-off utf8 data). At work we have a function
> called recurse_decode_utf8() which takes a string and does its best to
> "reduce" it to its minimal form by repeatedly turning off the utf8 flag,
> and then executing decode_utf8() on the string and then downgrade until the
> decode operation throws an error. Widespread use of this function on string
> data almost completely eliminated all of our utf8 problems. (I'll post the
> code in another mail.)
>

If it works for this case fine, but please do not suggest this for general
use. This is guessing, and results in decoding strings which were already
characters (false positives), because there is no way to differentiate a
valid string of UTF-8 bytes from a string of characters whose ordinals
happen to form a valid UTF-8 byte sequence. The correct solution is to fix
your double encoding, and always decode a string the exact number of times
it was encoded. The use of the utf8 flag to "decode" is an unrelated
problem.
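
A minimal sketch of such a false positive: a string that is already characters, but whose ordinals happen to look like UTF-8 bytes, is silently changed by a speculative decode. Variable names are illustrative.

```perl
use strict;
use warnings;

# Already-decoded text: the two characters U+00C3 and U+00A9.
my $characters = "\x{C3}\x{A9}";

# A "decode it just in case" step cannot tell this apart from UTF-8 bytes.
my $guessed = $characters;
utf8::decode($guessed);

printf "before guess: %vX (length %d)\n", $characters, length $characters;
printf "after guess:  %vX (length %d)\n", $guessed,    length $guessed;
```

The two-character text has become the single character U+00E9, and nothing in the string itself could have told the guessing code that this was wrong.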

-Dan
Re: Pre-RFC: Rename SVf_UTF8 et al.
On Thu, Sep 2, 2021 at 10:02 AM demerphq <demerphq@gmail.com> wrote:

>
> I was rereading this and I thought of something to add here. Part of the
> confusion with Perl strings is that we try to hide the flag. We don't really
> want people to look at it and think about it. Instead we provide a handful
> of verbs which can be used to force the string to the shape we want, or
> throw an error if we can't (or sometimes be a no-op).
>
> I mean, if I want to be sure I have a latin-1 string then I would do
> something like:
>
> eval { utf8::downgrade($str); 1 } or warn "Cant downgrade string!";
>
> And if I want to be sure I have a utf8 string then I would do something like:
>
> utf8::upgrade($str);
>
> I wonder if we made accessing the flag state more socially acceptable
> whether people would find this less confusing.
>

This is fine and I often recommend use of these functions to work around
broken abstractions (in Perl, XS or user code mistakenly using the utf8
flag). The problem is relying on the flag state for things it does not
represent, and propagating such issues.
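
For illustration, a short sketch of what these functions do and do not change: the storage form and the flag move, the logical string does not. The second argument to `utf8::downgrade` is its documented fail-ok parameter; variable names are illustrative.

```perl
use strict;
use warnings;

my $string = "caf\xE9";            # four ordinals, UTF8 flag off

# Upgrade a copy: the flag turns on, the content stays equal.
my $copy = $string;
utf8::upgrade($copy);
my $equal_after_upgrade = $copy eq $string;
my $flag_after_upgrade  = utf8::is_utf8($copy);

# Downgrade it again: the flag turns off, the content still stays equal.
utf8::downgrade($copy);
my $equal_after_downgrade = $copy eq $string;
my $flag_after_downgrade  = utf8::is_utf8($copy);

# A string with an ordinal above 255 cannot be downgraded. With a true
# second argument, utf8::downgrade returns false instead of dying.
my $wide = "\x{263A}";
my $wide_downgraded = utf8::downgrade($wide, 1);

print "equal after upgrade:   ", ($equal_after_upgrade   ? 'yes' : 'no'), "\n";
print "equal after downgrade: ", ($equal_after_downgrade ? 'yes' : 'no'), "\n";
print "wide string downgraded: ", ($wide_downgraded ? 'yes' : 'no'), "\n";
```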

As a side note, latin-1 is a convenient way to refer to downgraded strings
but since we are discussing internals it's important to note that they are
not specifically latin-1 strings, any more than upgraded strings are
specifically Unicode strings. A downgraded string may only consist of
ordinals in the byte range due to being stored that way, but what those
byte ordinals represent (if they even represent bytes) is up to what the
string is used for and whether the unicode_strings feature is in effect.
latin-1 mostly works as a description because the latin-1 code space maps
exactly to the first 256 codepoints of Unicode.

-Dan
Re: Pre-RFC: Rename SVf_UTF8 et al.
I want to get the basic knowledge to join this discussion.

Would you tell me the following things?

1. Do the following things mean the same or different?

my $bytes = Encode::encode('UTF-8', $string);

utf8::encode($string);
my $bytes = $string;

2. Do the following things mean the same or different?

my $string = Encode::decode('UTF-8', $bytes);

utf8::decode($bytes);
my $string = $bytes;

3. Do the following things mean the same or different?

# Perl
utf8::decode

# XS
sv_utf8_decode

4. Do the following things mean the same or different?

# Perl
utf8::encode

# XS
sv_utf8_encode

My first interest is the difference between the Perl world and the XS world.
Re: Pre-RFC: Rename SVf_UTF8 et al.
On Thu, Sep 2, 2021 at 9:03 PM Yuki Kimoto <kimoto.yuki@gmail.com> wrote:

> I want to get the basic knowledge to join this discussion.
>
> Would you tell me the following things?
>
> 1. Do the following things mean the same or different?
>
> my $bytes = Encode::encode('UTF-8', $string);
>
> utf8::encode($string);
> my $bytes = $string;
>

Similar, with some implementation differences: Encode::encode doesn't
modify $string in place (with those arguments), and utf8::encode does;
Encode::encode with UTF-8 will encode invalid codepoints (such as
surrogates, supercharacters) to replacement characters (with those
arguments) and utf8::encode will naively encode them with Perl's internal
encoding method like other codepoints (which can result in bytestrings
which UTF-8 decoders may consider invalid).
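
A small sketch of the two calls side by side, assuming the core Encode module. For ordinary text both produce the same bytes; the difference is that `utf8::encode` rewrites its argument. The surrogate case is printed rather than asserted, since the exact bytes depend on the behaviour described above.

```perl
use strict;
use warnings;
use Encode ();

my $string = "caf\x{E9}";

# Encode::encode returns a new byte string and leaves $string alone.
my $copied = Encode::encode('UTF-8', $string);

# utf8::encode converts its argument in place.
my $in_place = $string;
utf8::encode($in_place);

printf "Encode::encode: %s\n", unpack('H*', $copied);
printf "utf8::encode:   %s\n", unpack('H*', $in_place);
printf "original still has %d characters\n", length $string;

# A surrogate code point, where the two are described as differing.
my $surrogate = chr(0xD800);
my $strict    = Encode::encode('UTF-8', $surrogate);
my $lax       = $surrogate;
utf8::encode($lax);
printf "surrogate via Encode::encode: %s\n", unpack('H*', $strict);
printf "surrogate via utf8::encode:   %s\n", unpack('H*', $lax);
```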


> 2. Do the following things mean the same or different?
>
> my $string = Encode::decode('UTF-8', $bytes);
>
> utf8::decode($bytes);
> my $string = $bytes;
>

Similar as above, but additionally, if the bytes cannot be interpreted as
even Perl's lax internal encoding, utf8::decode will return false and leave
the string unchanged; whereas Encode::decode decodes malformed byte
sequences to replacement characters (with those arguments). Encode::decode
will also decode invalid codepoints to replacement characters, but
utf8::decode will naively accept them.
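
To illustrate the malformed-input case, here is a sketch assuming the core Encode module and a byte string containing 0xFF, which can never appear in UTF-8. Variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

my $malformed = "a\xFFb";

# utf8::decode reports failure and leaves the string as it was.
my $in_place = $malformed;
my $ok = utf8::decode($in_place);

# Encode::decode (default arguments) substitutes U+FFFD for the bad byte.
my $lenient = Encode::decode('UTF-8', $malformed);

printf "utf8::decode returned %s, string is %vX\n",
    ($ok ? 'true' : 'false'), $in_place;
printf "Encode::decode gave %vX\n", $lenient;
```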


> 3. Do the following things mean the same or different?
>
> # Perl
> utf8::decode
>
> # XS
> sv_utf8_decode
>

These are the same.

> 4. Do the following things mean the same or different?
>
> # Perl
> utf8::encode
>
> # XS
> sv_utf8_encode
>

These are the same.

Overall, all of these change the logical contents of the string from bytes
to the Unicode characters they represent, or from Unicode characters to
representative bytes.

-Dan
Re: Pre-RFC: Rename SVf_UTF8 et al.
2021-9-3 10:30 Dan Book <grinnz@gmail.com> wrote :

> On Thu, Sep 2, 2021 at 9:03 PM Yuki Kimoto <kimoto.yuki@gmail.com> wrote:
>
>> I want to get the basic knowledge to join this discussion.
>>
>> Would you tell me the following things?
>>
>> 1. Do the following things mean the same or different?
>>
>> my $bytes = Encode::encode('UTF-8', $string);
>>
>> utf8::encode($string);
>> my $bytes = $string;
>>
>
> Similar, with some implementation differences: Encode::encode doesn't
> modify $string in place (with those arguments), and utf8::encode does;
> Encode::encode with UTF-8 will encode invalid codepoints (such as
> surrogates, supercharacters) to replacement characters (with those
> arguments) and utf8::encode will naively encode them with Perl's internal
> encoding method like other codepoints (which can result in bytestrings
> which UTF-8 decoders may consider invalid).
>
>
>> 2. Do the following things mean the same or different?
>>
>> my $string = Encode::decode('UTF-8', $bytes);
>>
>> utf8::decode($bytes);
>> my $string = $bytes;
>>
>
> Similar as above, but additionally, if the bytes cannot be interpreted as
> even Perl's lax internal encoding, utf8::decode will return false and leave
> the string unchanged; whereas Encode::decode decodes malformed byte
> sequences to replacement characters (with those arguments). Encode::decode
> will also decode invalid codepoints to replacement characters, but
> utf8::decode will naively accept them.
>
>
>> 3. Do the following things mean the same or different?
>>
>> # Perl
>> utf8::decode
>>
>> # XS
>> sv_utf8_decode
>>
>
> These are the same.
>
>> 4. Do the following things mean the same or different?
>>
>> # Perl
>> utf8::encode
>>
>> # XS
>> sv_utf8_encode
>>
>
> These are the same.
>
> Overall, all of these change the logical contents of the string from bytes
> to the Unicode characters they represent, or from Unicode characters to
> representative bytes.
>
> -Dan
>

Dan

Thank you.

I have some time to understand this.
Re: Pre-RFC: Rename SVf_UTF8 et al.
On Thu, 2 Sept 2021 at 16:39, Dan Book <grinnz@gmail.com> wrote:

> There is way too much written here so I will be responding as I can.
>
> On Thu, Sep 2, 2021 at 9:21 AM demerphq <demerphq@gmail.com> wrote:
>
>> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com>
>> wrote:
>>
>>>
>>> > On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
>>> >
>>> > On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com>
>>> wrote:
>>> > Per recent IRC discussion …
>>> >
>>> > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
>>> confusion regarding the flag’s significance. Some think it indicates
>>> whether a given PV stores text versus binary. Some think it means that the
>>> PV is valid UTF-8. Still others likely hold other inaccurate views.
>>> >
>>> > The problem here is the naming. For example, consider `perl -e'my $foo
>>> = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that
>>> its code points (assuming use of a UTF-8 terminal) correspond to the bytes
>>> that encode “é” in UTF-8.
>>> >
>>> > Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a
>>> square/rectangle relationship. All strings are "rectangles", all "squares"
>>> are rectangles, some strings are squares, but unless SQUARE flag is ON perl
>>> should assume it is a rectangle, not a square. The SQUARE flag should only
>>> be set when the rectangle has been proved conclusively to be a square. That
>>> the SQUARE flag is off does not mean the rectangle is not a square, merely
>>> that the square has not been proved to be such.
>>>
>>> You’re defining “a UTF-8 string” as “a string whose PV is marked as
>>> UTF-8”. I’m defining it as “a string whose Perl-visible code points happen
>>> to be valid UTF-8”.
>>>
>>
>> I don't find your definition to be very useful, nor descriptive of how
>> perl manages these matters, so I am not using it. You are confusing
>> different levels of abstraction. Your definition also would include cases
>> where the data is already encoded and flagged as utf8. So it doesn't make
>> sense to me.
>>
>> Here is the set of definitions that I am operating from:
>>
>> A "string" is a programming concept inside of Perl which is used to
>> represent "text" buffers of memory. There are three levels of abstraction
>> for strings, two of which are tightly coupled. The three are the codepoint
>> level, semantic level and encoding level.
>>
>> At the codepoint level you can think of strings as variable length
>> arrays of numbers (codepoints), where the numbers are restricted to 0 to
>> 0x10FFFF.
>>
>> At the semantics level you can think of these numbers (codepoints) as
>> representing characters from some form of text with specific rules for
>> certain operations like case-folding, as well as a well defined mapping to
>> graphemes which are displayed to our eyes when those numbers are rendered
>> by a display device like a terminal.
>>
>> The encoding level of abstraction addresses how those numbers
>> (codepoints) will be represented as bytes (octets) in memory inside of
>> Perl, and when you directly write the data to disk or to some other output
>> stream.
>>
>> There are two sets of codepoint range, semantics and encoding available,
>> which are controlled by a flag associated with the string called the UTF8
>> flag. When set this flag indicates that the string can represent codepoints
>> 0 to 0x10FFFF, should have Unicode semantics applied to it, and that its in
>> memory representation is variable-width utf8. When the flag is not set it
>> indicates the string can represent codepoints 0 to 255, has ASCII
>> case-folding semantics, and that its in memory representation is fixed
>> width octets.
>>
>> In order to be able to combine these two types of strings we need to
>> define some operations:
>>
>> upgrading/downgrading: converting a string from one set of semantics and
>> encoding to the other while preserving exactly the codepoint level
>> representation. By tradition we call it upgrading when we go from Latin-1
>> to Unicode with the result being UTF8 on, and we call it downgrading when
>> we go from Unicode to Latin1 with the result being UTF8-off. These
>> operations are NOT symmetrical. It is *not* possible to downgrade every
>> Unicode string to Latin-1, however it is possible to upgrade every Latin-1
>> string to Unicode. By tradition upgrade and downgrade functions are noops
>> when their input is already in the form expected as the result, but this is
>> by tradition only.
>>
>> decoding/encoding: converting a string from one form to the other in a
>> way that transforms the codepoints from one form to a potentially different
>> form. Traditionally we speak of decode_utf8() taking a latin1 string
>> containing octets that make up a utf8 encoded string, and returning a
>> string which is UTF8 on which represents the Unicode version of those
>> octets. For well formed input this results in no change to the underlying
>> string, but the flag is flipped on. Vice versa we speak of encode_utf8()
>> which converts its input to a utf8 encoded form, regardless of what form it
>> was represented internally.
>>
>
> This is incorrect. Decode converts a string of bytes at the logical level
> (upgraded or downgraded does not matter) and returns a string of characters
> at the logical level (upgraded or downgraded does not matter). It may
> commonly use upgraded or downgraded strings as the input or output for
> efficiency but this is not required.
>

Nope *you* are wrong. Decoding does not use upgrading or downgrading.
Decoding utf8 is logically equivalent to an upgrade operation when the
string contains only codepoints 0-127. For any codepoint ABOVE that it does
something very different.



>
>>
>>
>>> What you call “a UTF-8 string” is what I propose we call, per existing
>>> nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with
>>> corresponding code changes. Then the term “UTF-8 string” makes sense from a
>>> pure-Perl context without requiring Perl programmers to worry about
>>> interpreter internals.
>>>
>>>
>> No. The flag does not mean "upgraded" it means "unicode semantics, utf8
>> encoding". Upgrading is one way to get such a string, and it might even be
>> the most common, but the most important and likely to be correct way is
>> explicit decoding.
>>
>> If we are to rename the flag then we should just rename it as the UNICODE
>> flag. Would have saved a world of confusion.
>>
>
> This is exactly what we have defined as "upgraded". Decoding does not
> define the internal format of the resulting string at all. The only
> internal format which is upgraded is when the UTF8 flag is on.
>

Your definition is wrong then. You seem to have "upgrading" and "decoding"
muddled.

Decoding most definitely DOES define the internal format of the result
string. If you decode utf8 the result is a UTF8 on string. If that string
contained utf8 representing codepoints above 127 then the result will be
different.

If you upgrade the string: "\303\251" you will end up with a utf8 on string
which contains two codepoints, "\303" and "\251". You will NOT end up with
the correct codepoint E9.
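
The contrast between the two operations can be sketched directly on the string "\303\251". Variable names are illustrative.

```perl
use strict;
use warnings;

# Upgrade: the internal storage changes, the two ordinals stay the same.
my $upgraded = "\303\251";
utf8::upgrade($upgraded);

# Decode: the two ordinals are read as UTF-8 and become one ordinal.
my $decoded = "\303\251";
utf8::decode($decoded);

my @upgraded_ordinals = map { ord } split //, $upgraded;
my @decoded_ordinals  = map { ord } split //, $decoded;

printf "upgraded: %s\n", join ' ', map { sprintf '%02X', $_ } @upgraded_ordinals;
printf "decoded:  %s\n", join ' ', map { sprintf '%02X', $_ } @decoded_ordinals;
```

Upgrading keeps the ordinals C3 and A9, while decoding produces the single ordinal E9, so the two operations are only interchangeable for strings limited to codepoints 0-127.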


>
>>
>>> Whether Perl stores that code point as one byte or as two is Perl’s
>>> business alone … right?
>>>
>>
>> Well it would be weird if we stored Unicode data in a form not supported
>> by Unicode. Don't you think? There is no single octet representation of the
>> codepoint E9 defined by Unicode as far as I know.
>>
>>
>>>
>>> > I do not understand your point that only the initiated can understand
>>> this flag. It means one and only one thing: that the perl internals should
>>> assume that the buffer contains utf8 encoded data and that perl should
>>> apply unicode semantics when doing character and case-sensitive operations,
>>> and that perl can make certain assumptions when it is processing the data (eg
>>> that it is not malformed).
>>>
>>> The behaviour you’re talking about is what the unicode_strings and
>>> unicode_eval features specifically do away with (i.e., fix), right?
>>
>>
>> I'm not familiar with those enough to comment. I assume they relate to
>> what assumptions Perl should make about strings which are constructed as
>> literals in the source code, where there is a great deal of ambiguity about
>> what is going on compared to actual code that constructs such strings,
>> where things are exact.
>>
>
> They do not. They relate to consistently applying unicode rules to the
> logical contents of the strings (in practice, making sure to work with
> upgraded strings internally). The only mechanism that affects the
> interpretation of literal strings is "use utf8".
>

I'll read up on this.


>
>
>>
>>
>>>
>>> You’re omitting what IMO is the most obvious purpose of the flag: to
>>> indicate whether the code points that the PV stores are the plain bytes, or
>>> are the UTF-8-decoded code points. This is why you can print() the string
>>> in either upgraded or downgraded forms, and it comes out the same.
>>>
>>
>> It's hard to say what you are referring to here. If you mean codepoints
>> 0-127, then it is unsurprising as the representation of them is equivalent
>> in ASCII and UTF8 and Latin1. But if you mean a codepoint above the ASCII
>> plane, then no they should not come out the same. If you are piping that
>> data to a file I would expect the octets written to that file to be
>> different. (assuming a binary filehandle with no layers magically
>> transforming things). If your terminal renders them the same then I assume
>> it is doing some magic behind the scenes to deal with malformed utf8.
>>
>
> Not correct. An upgraded or downgraded string prints identically because
> you are printing the logical ordinals which do not change by this
> operation. Whether those ordinals are interpreted as bytes or Unicode
> characters depends what you are printing to, but in either case the
> internally-stored bytes are irrelevant to the user except to determine what
> those logical ordinals are.
>

Dude, you keep saying I am not correct when what I have said is easily
verifiable.

If you print chr(0xe9) to a filehandle and it does not contain the octet E9
then there is a problem.

If you print chr(0xe9) to a utf8 terminal it should render a Unicode
replacement character for a broken utf8 sequence.

If you print an encoded chr(0xe9) then it should render the glyph for E9.

If you think anything else is happening then prove it with code.


>
>>
>>>
>>> > I also know what happens here:
>>> >
>>> > my $foo="\x{c3}\x{a9}";
>>> > utf8::decode($foo);
>>> > Dump($foo);
>>> >
>>> > SV = PV(0x2303fc0) at 0x2324c98
>>> > REFCNT = 1
>>> > FLAGS = (POK,IsCOW,pPOK,UTF8)
>>> > PV = 0x2328300 "\303\251"\0 [UTF8 "\x{e9}"]
>>> > CUR = 2
>>> > LEN = 10
>>> > COW_REFCNT = 1
>>> >
>>> > That is, i start off with two octets, C3 - A9, which happens to be the
>>> encoding for the codepoint E9, which happens to be é.
>>> > I then tell perl to "decode" those octets, which really means I tell
>>> perl to check that the octets actually do make up valid utf8. And if perl
>>> agrees that indeed these are valid utf8 octets, then it turns the flag on.
>>> Now it doesn't matter if you *meant* to construct utf8 in the variable fed
>>> to decode, all that matters is that at an octet level those octet happen to
>>> make up valid utf8.
>>>
>>> I think you’re actually breaking the abstraction here by assuming that
>>> Perl implements the decode by setting a flag.
>>>
>>>
>> No I am not. The flag is there to tell the perl internals how to
>> manipulate the string. decode's task is to take arbitrary strings of octets
>> and ensure that they can be decoded as valid utf8 and possibly to do some
>> conversion (eg for forbidden utf8 sequences or other normalization) as it
>> does so and then SETS THE FLAG. Only once decode is done is the string
>> "Unicode" and is the string "utf8". Prior to that it was just random
>> octets. It doesnt need to do anything BUT set the flag because its internal
>> encoding matches the external encoding in this case. If it was decoding
>> UTF16LE then it would have to do conversion as well.
>>
>
> Not correct. The flag is there only to tell Perl internals whether the
> internal bytes represent the ordinals directly or via UTF-8-like encoding.
> The result of decoding can be downgraded, and an upgraded string can be
> decoded,
>

Show me the code. As far as I know decode operations do not operate on
unicode strings. Calling decode_utf8 on a string which is utf8 is a noop.


> these are perfectly cromulent operations if the logical contents are as
> expected. A unicode string can exist without ever having been decoded, all
> that is required is to call a function that interprets the ordinals as a
> unicode string.
>

"A unicode string can exist without ever having been decoded, all that is
required is to call a function that interprets the ordinals as a unicode
string."

And that function that does that interpretation is called decode. You just
contradicted yourself.


>>
>>> It would be just as legitimate to mutate the PV to store a single octet,
>>> 0xe9, and leave the UTF8 flag off.
>>
>>
>> Nope. That would mean that Perl would use ASCII/Latin-1 case folding
>> rules on the result, which would be wrong. It should use Unicode case
>> folding rules for codepoint E9 if it was decoded as that codepoint. (Change
>> the example to \x{DF} and you can see these issues in the flesh, \x{DF}
>> should match "ss" in Unicode, but in ASCII/Latin1 it only matches \x{DF}.
>> The lc() version of \x{DF} is "ss" but in Latin-1/Ascii there are no
>> multi-byte case folds). Even more suggestive that Perl doing this would be
>> wrong is that in fact there is NO valid Unicode encoding of codepoint E9
>> which is only 1 octet long. So that would be extremely wrong of Perl to use
>> a non Unicode encoding of unicode data dont you think? Also, what would
>> perl do when the codepoint doesn't fit into a single octet? Your argument
>> might have some merit if you were arguing that Perl could have decoded it
>> into "\x{E9}\0\0\0" and set the UTF-32 flag, but as stated it doesn't make
>> sense.
>>
>
> Not correct. Under old rules, yes, the UTF8 flag determined whether
> Unicode rules are used in various operations; this was an abstraction
> break, and so the unicode_strings feature was added to fix the problem, and
> enabled in feature bundles since 5.12
>

Ah, ok, so if you *change* the default mode of perl it does something
different than I described, and that makes my comments "incorrect"? What I
described is how "normal" perl without any new features enabled works. If
there are features that change what I have said, feel free to use them. But
it doesn't change that what I said is an accurate version of how the perl
internals normally function.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]
On Fri, 3 Sept 2021 at 14:30, demerphq <demerphq@gmail.com> wrote:

> On Thu, 2 Sept 2021 at 16:39, Dan Book <grinnz@gmail.com> wrote:
>
>> On Thu, Sep 2, 2021 at 9:21 AM demerphq <demerphq@gmail.com> wrote:
>>
>>> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com>
>>> wrote:
>>>
>>>>
>>>> What you call “a UTF-8 string” is what I propose we call, per existing
>>>> nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with
>>>> corresponding code changes. Then the term “UTF-8 string” makes sense from a
>>>> pure-Perl context without requiring Perl programmers to worry about
>>>> interpreter internals.
>>>>
>>>
>>>>
>>> No. The flag does not mean "upgraded" it means "unicode semantics, utf8
>>> encoding". Upgrading is one way to get such a string, and it might even be
>>> the most common, but the most important and likely to be correct way is
>>> explicit decoding.
>>>
>>> If we are to rename the flag then we should just rename it as the
>>> UNICODE flag. Would have saved a world of confusion.
>>>
>>
>> This is exactly what we have defined as "upgraded". Decoding does not
>> define the internal format of the resulting string at all. The only
>> internal format which is upgraded is when the UTF8 flag is on.
>>
>
> Your definition is wrong then. You seem to have "upgrading" and "decoding"
> muddled.
>
> Decoding most definitely DOES define the internal format of the result
> string. If you decode utf8 the result is a UTF8 on string. If that string
> contained utf8 representing codepoints above 127 then the result will be
> different.
>

Given this:

perl -e'use Devel::Peek; use Encode; print Dump(Encode::decode("UTF-8", "example"))'
SV = PV(0x55b88281b2b0) at 0x55b88272e4e0
REFCNT = 2
FLAGS = (TEMP,POK,pPOK,UTF8)
PV = 0x55b8827f45d0 "example"\0 [UTF8 "example"]
CUR = 7
LEN = 10

I think the current behaviour is at least inefficient, if perhaps not
outright *wrong*... why would decoding enforce the UTF8 flag?

Put another way, if the resulting string has only codepoints 0..127, why
not leave the flag off so that string operations can be more efficient?

This extends to common cases such as UTF8-safe filter chains:

echo "example" | perl -CSD -lne'use Devel::Peek; s{e$}{es}; print Dump($_)'
SV = PV(0x556c3aab2000) at 0x556c3aaeb3e8
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x556c3aae41b0 "examples"\0 [UTF8 "examples"]
CUR = 8
LEN = 24

If that's not taking the faster pure-ASCII path for input, this would seem
like an easy optimisation opportunity. If the behaviour only happened with
the non-validating `utf8` decoding, then maybe it could be explained away
by not wanting to walk the entire length of the string... but then I'd at
least expect it to be different with the "UTF-8" encoding layer:

echo "example" | perl -lne'use Devel::Peek; BEGIN { binmode STDIN, ":encoding(UTF-8)" } s{e$}{es}; print Dump($_)'
SV = PV(0x55d8d2b76000) at 0x55d8d2baf4a8
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x55d8d2bad300 "examples"\0 [UTF8 "examples"]
CUR = 8
LEN = 24

So yes, decoding does set the UTF8 flag - but I'd argue that it
*shouldn't*, and the current behaviour is somewhere between a historical
accident and an oversight. To be clear, I'd expect the same non-UTF8 status
in the examples so far, as we see from this:

perl -e'use Devel::Peek; use utf8; my $text = "example"; print Dump($text)'
SV = PV(0x55be3864aff0) at 0x55be3866fe60
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK)
PV = 0x55be3867f020 "example"\0
CUR = 7
LEN = 10
COW_REFCNT = 1

What am I missing here?
Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]
On Fri, Sep 3, 2021 at 8:30 AM demerphq <demerphq@gmail.com> wrote:

> On Thu, 2 Sept 2021 at 16:39, Dan Book <grinnz@gmail.com> wrote:
>
>> There is way too much written here so I will be responding as I can.
>>
>> On Thu, Sep 2, 2021 at 9:21 AM demerphq <demerphq@gmail.com> wrote:
>>
>>> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com>
>>> wrote:
>>>
>>>>
>>>> > On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
>>>> >
>>>> > On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com>
>>>> wrote:
>>>> > Per recent IRC discussion …
>>>> >
>>>> > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
>>>> confusion regarding the flag’s significance. Some think it indicates
>>>> whether a given PV stores text versus binary. Some think it means that the
>>>> PV is valid UTF-8. Still others likely hold other inaccurate views.
>>>> >
>>>> > The problem here is the naming. For example, consider `perl -e'my
>>>> $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact
>>>> that its code points (assuming use of a UTF-8 terminal) correspond to the
>>>> bytes that encode “é” in UTF-8.
>>>> >
>>>> > Nope. It might contain utf8, but it is not UTF8-ON. Think of it like
>>>> a square/rectangle relationship. All strings are "rectangles", all
>>>> "squares" are rectangles, some strings are squares, but unless SQUARE flag
>>>> is ON perl should assume it is a rectangle, not a square. The SQUARE flag
>>>> should only be set when the rectangle has been proved conclusively to be a
>>>> square. That the SQUARE flag is off does not mean the rectangle is not a
>>>> square, merely that the square has not been proved to be such.
>>>>
>>>> You’re defining “a UTF-8 string” as “a string whose PV is marked as
>>>> UTF-8”. I’m defining it as “a string whose Perl-visible code points happen
>>>> to be valid UTF-8”.
>>>>
>>>
>>> I dont find your definition to be very useful, nor descriptive of how
>>> perl manages these matters, so I am not using it. You are confusing
>>> different levels of abstraction. Your definition also would include cases
>>> where the data is already encoded and flagged as utf8. So it doesn't make
>>> sense to me.
>>>
>>> Here is the set of definitions that I am operating from:
>>>
>>> A "string" is a programming concept inside of Perl which is used to
>>> represent "text" buffers of memory. There are three level of abstraction
>>> for strings, two of which are tightly coupled. The three are the codepoint
>>> level, semantic level and encoding level.
>>>
>>> At the codepoint levels you can think of strings as variable length
>>> arrays of numbers (codepoints), where the numbers are restricted to 0 to
>>> 0x10FFFF.
>>>
>>> At the semantics level you can think of these numbers (codepoints) of
>>> representing characters from some form of text with specific rules for
>>> certain operations like case-folding, as well as a well defined mapping to
>>> graphemes which are displayed to our eyes when those numbers are rendered
>>> by a display device like a terminal.
>>>
>>> The encoding level of abstraction addresses how those numbers
>>> (codepoints) will be represented as bytes (octets) in memory inside of
>>> Perl, and when you directly write the data to disk or to some other output
>>> stream.
>>>
>>> There are two sets of codepoint range, semantics and encoding available,
>>> which are controlled by a flag associated with the string called the UTF8
>>> flag. When set this flag indicates that the string can represent codepoints
>>> 0 to 0x10FFFF, should have Unicode semantics applied to it, and that its in
>>> memory representation is variable-width utf8. When the flag is not set it
>>> indicates the string can represent codepoints 0 to 255, has ASCII
>>> case-folding semantics, and that its in memory representation is fixed
>>> width octets.
>>>
>>> In order to be able to combine these two types of strings we need to
>>> define some operations:
>>>
>>> upgrading/downgrading: converting a string from one set of semantics and
>>> encoding to the other while preserving exactly the codepoint level
>>> representation. By tradition we call it upgrading when we go from Latin-1
>>> to Unicode with the result being UTF8 on, and we call it downgrading when
>>> we go from Unicode to Latin1 with the result being UTF8-off. These
>>> operations are NOT symmetrical. It is *not* possible to downgrade every
>>> Unicode string to Latin-1, however it is possible to upgrade every Latin-1
>>> string to Unicode. By tradition upgrade and downgrade functions are noops
>>> when their input is already in the form expected as the result, but this is
>>> by tradition only.
>>>
>>> decoding/encoding: converting a string from one form to the other in a
>>> way that transforms the codepoints from one form to a potentially different
>>> form. Traditionally we speak of decode_utf8() taking a latin1 string
>>> containing octets that make up a utf8 encoded string, and returning a
>>> string which is UTF8 on which represents the Unicode version of those
>>> octets. For well formed input this results in no change to the underlying
>>> string, but the flag is flipped on. Vice versa we speak of encode_utf8()
>>> which converts its input to a utf8 encoded form, regardless of what form it
>>> was represented internally.
>>>
>>
>> This is incorrect. Decode converts a string of bytes at the logical level
>> (upgraded or downgraded does not matter) and returns a string of characters
>> at the logical level (upgraded or downgraded does not matter). It may
>> commonly use upgraded or downgraded strings as the input or output for
>> efficiency but this is not required.
>>
>
> Nope *you* are wrong. Decoding does not use upgrading or downgrading.
> Decoding utf8 is logically equivalent to an upgrade operation when the
> string contains only codepoints 0-127. For any codepoint ABOVE that it does
> something very different.
>
>
>
>>
>>>
>>>
>>>> What you call “a UTF-8 string” is what I propose we call, per existing
>>>> nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with
>>>> corresponding code changes. Then the term “UTF-8 string” makes sense from a
>>>> pure-Perl context without requiring Perl programmers to worry about
>>>> interpreter internals.
>>>>
>>>>
>>> No. The flag does not mean "upgraded" it means "unicode semantics, utf8
>>> encoding". Upgrading is one way to get such a string, and it might even be
>>> the most common, but the most important and likely to be correct way is
>>> explicit decoding.
>>>
>>> If we are to rename the flag then we should just rename it as the
>>> UNICODE flag. Would have saved a world of confusion.
>>>
>>
>> This is exactly what we have defined as "upgraded". Decoding does not
>> define the internal format of the resulting string at all. The only
>> internal format which is upgraded is when the UTF8 flag is on.
>>
>
> Your definition is wrong then. You seem to have "upgrading" and "decoding"
> muddled.
>
> Decoding most definitely DOES define the internal format of the result
> string. If you decode utf8 the result is a UTF8 on string. If that string
> contained utf8 representing codepoints above 127 then the result will be
> different.
>
> If you upgrade the string: "\303\251" you will end up with a utf8 on
> string which contains two codepoints, "\303" and "\251". You will NOT end
> up with the correct codepoint E9
>

It rather sounds to me like your disagreement is mostly on definitions.
This happens a lot in discussing perl unicode support.


>
>>
>>>
>>>
>>>>
>>>> You’re omitting what IMO is the most obvious purpose of the flag: to
>>>> indicate whether the code points that the PV stores are the plain bytes, or
>>>> are the UTF-8-decoded code points. This is why you can print() the string
>>>> in either upgraded or downgraded forms, and it comes out the same.
>>>>
>>>
>>> Its hard to say what you are referring to here. If you mean codepoints
>>> 0-127, then it is unsurprising as the representation of them is equivalent
>>> in ASCII and UTF8 and Latin1. But if you mean a codepoint above the ASCII
>>> plane, then no they should not come out the same. If you are piping that
>>> data to a file I would expect the octets written to that file to be
>>> different. (assuming a binary filehandle with no layers magically
>>> transforming things). If your terminal renders them the same then I assume
>>> it is doing some magic behind the scenes to deal with malformed utf8.
>>>
>>
>> Not correct. An upgraded or downgraded string prints identically because
>> you are printing the logical ordinals which do not change by this
>> operation. Whether those ordinals are interpreted as bytes or Unicode
>> characters depends what you are printing to, but in either case the
>> internally-stored bytes are irrelevant to the user except to determine what
>> those logical ordinals are
>>
>
> Dude, you keep saying I am not correct when what I have said is easily
> verifiable.
>
> If you print chr(0xe9) to a filehandle and it does not contain the octet
> E9 then there is a problem
>
> If you print chr(0xe9) to a utf8 terminal it should render a Unicode
> replacement character for a broken utf8 sequence.
>
> If you print an encoded chr(0xe9) then it should render the glyph for E9.
>
> If you think anything else is happening then prove it with code.
>

That is all true in the absence of an :encoding(...) or :utf8 layer.

An upgraded E9 will also still print E9 (and thus be broken utf-8).


>
>>>
>>>>
>>>> > I also know what happens here:
>>>> >
>>>> > my $foo="\x{c3}\x{a9}";
>>>> > utf8::decode($foo);
>>>> > Dump($foo);
>>>> >
>>>> > SV = PV(0x2303fc0) at 0x2324c98
>>>> > REFCNT = 1
>>>> > FLAGS = (POK,IsCOW,pPOK,UTF8)
>>>> > PV = 0x2328300 "\303\251"\0 [UTF8 "\x{e9}"]
>>>> > CUR = 2
>>>> > LEN = 10
>>>> > COW_REFCNT = 1
>>>> >
>>>> > That is, i start off with two octets, C3 - A9, which happens to be
>>>> the encoding for the codepoint E9, which happens to be é.
>>>> > I then tell perl to "decode" those octets, which really means I tell
>>>> perl to check that the octets actually do make up valid utf8. And if perl
>>>> agrees that indeed these are valid utf8 octets, then it turns the flag on.
>>>> Now it doesn't matter if you *meant* to construct utf8 in the variable fed
>>>> to decode, all that matters is that at an octet level those octet happen to
>>>> make up valid utf8.
>>>>
>>>> I think you’re actually breaking the abstraction here by assuming that
>>>> Perl implements the decode by setting a flag.
>>>>
>>>>
>>> No I am not. The flag is there to tell the perl internals how
>>> to manipulate the string. decode's task is to take arbitrary strings of
>>> octets and ensure that they can be decoded as valid utf8 and possibly to do
>>> some conversion (eg for forbidden utf8 sequences or other normalization) as
>>> it does so and then SETS THE FLAG. Only once decode is done is the string
>>> "Unicode" and is the string "utf8". Prior to that it was just random
>>> octets. It doesnt need to do anything BUT set the flag because its internal
>>> encoding matches the external encoding in this case. If it was decoding
>>> UTF16LE then it would have to do conversion as well.
>>>
>>
>> Not correct. The flag is there only to tell Perl internals whether the
>> internal bytes represent the ordinals directly or via UTF-8-like encoding.
>> The result of decoding can be downgraded, and an upgraded string can be
>> decoded,
>>
>
> Show me the code. As far as I know decode operations do not operate on
> unicode strings. Calling decode_utf8 on a string which is utf8 is a noop.
>

Almost. It will try to downgrade the string, and if that fails it will
return false (and thus noop). It will decode a latin1-safe unicode string.

So «my $s = "\303\251"; utf8::upgrade($s); utf8::decode($s)» will result in
$s being equal to "\x{e9}" (and will be upgraded).

Leon
Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]
> On Sep 3, 2021, at 3:48 AM, Tom Molesworth via perl5-porters <perl5-porters@perl.org> wrote:
>
> On Fri, 3 Sept 2021 at 14:30, demerphq <demerphq@gmail.com> wrote:
> On Thu, 2 Sept 2021 at 16:39, Dan Book <grinnz@gmail.com> wrote:
> On Thu, Sep 2, 2021 at 9:21 AM demerphq <demerphq@gmail.com> wrote:
> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com> wrote:
>
> What you call “a UTF-8 string” is what I propose we call, per existing nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with corresponding code changes. Then the term “UTF-8 string” makes sense from a pure-Perl context without requiring Perl programmers to worry about interpreter internals.
>
>
> No. The flag does not mean "upgraded" it means "unicode semantics, utf8 encoding". Upgrading is one way to get such a string, and it might even be the most common, but the most important and likely to be correct way is explicit decoding.
>
> If we are to rename the flag then we should just rename it as the UNICODE flag. Would have saved a world of confusion.
>
> This is exactly what we have defined as "upgraded". Decoding does not define the internal format of the resulting string at all. The only internal format which is upgraded is when the UTF8 flag is on.
>
> Your definition is wrong then. You seem to have "upgrading" and "decoding" muddled.
>
> Decoding most definitely DOES define the internal format of the result string. If you decode utf8 the result is a UTF8 on string. If that string contained utf8 representing codepoints above 127 then the result will be different.
>
> Given this:
>
> perl -e'use Devel::Peek; use Encode; print Dump(Encode::decode("UTF-8", "example"))'
> SV = PV(0x55b88281b2b0) at 0x55b88272e4e0
> REFCNT = 2
> FLAGS = (TEMP,POK,pPOK,UTF8)
> PV = 0x55b8827f45d0 "example"\0 [UTF8 "example"]
> CUR = 7
> LEN = 10
>
> I think the current behaviour is at least inefficient, if perhaps not outright *wrong*... why would decoding enforce the UTF8 flag?
>
> Put another way, if the resulting string has only codepoints 0..127, why not leave the flag off so that string operations can be more efficient?
>
> This extends to common cases such as UTF8-safe filter chains:
>
> echo "example" | perl -CSD -lne'use Devel::Peek; s{e$}{es}; print Dump($_)'
> SV = PV(0x556c3aab2000) at 0x556c3aaeb3e8
> REFCNT = 1
> FLAGS = (POK,pPOK,UTF8)
> PV = 0x556c3aae41b0 "examples"\0 [UTF8 "examples"]
> CUR = 8
> LEN = 24
>
> If that's not taking the faster pure-ASCII path for input, this would seem like an easy optimisation opportunity. If the behaviour only happened with the non-validating `utf8` decoding, then maybe it could be explained away by not wanting to walk the entire length of the string... but then I'd at least expect it to be different with the "UTF-8" encoding layer:
>
> echo "example" | perl -lne'use Devel::Peek; BEGIN { binmode STDIN, ":encoding(UTF-8)" } s{e$}{es}; print Dump($_)'
> SV = PV(0x55d8d2b76000) at 0x55d8d2baf4a8
> REFCNT = 1
> FLAGS = (POK,pPOK,UTF8)
> PV = 0x55d8d2bad300 "examples"\0 [UTF8 "examples"]
> CUR = 8
> LEN = 24
>
> So yes, decoding does set the UTF8 flag - but I'd argue that it *shouldn't*, and the current behaviour is somewhere between a historical accident and an oversight. To be clear, I'd expect the same non-UTF8 status in the examples so far, as we see from this:
>
> perl -e'use Devel::Peek; use utf8; my $text = "example"; print Dump($text)'
> SV = PV(0x55be3864aff0) at 0x55be3866fe60
> REFCNT = 1
> FLAGS = (POK,IsCOW,pPOK)
> PV = 0x55be3867f020 "example"\0
> CUR = 7
> LEN = 10
> COW_REFCNT = 1
>
> What am I missing here?

Try utf8::decode(); it uses Perl’s internal decoder and behaves as you expect (i.e., it leaves invariant strings alone).

Unicode::UTF8 mimics Encode.pm, though.

These are implementation details, though; the only thing the decoding algorithm itself requires is the correct translation of code points.

-FG
Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]
> On Sep 3, 2021, at 2:30 AM, demerphq <demerphq@gmail.com> wrote:
>
> On Thu, 2 Sept 2021 at 16:39, Dan Book <grinnz@gmail.com> wrote:
> There is way too much written here so I will be responding as I can.
>
> On Thu, Sep 2, 2021 at 9:21 AM demerphq <demerphq@gmail.com> wrote:
> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com> wrote:
>
> > On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
> >
>
> decoding/encoding: converting a string from one form to the other in a way that transforms the codepoints from one form to a potentially different form. Traditional we speak of decode_utf8() taking a latin1 string containing octets that make up a utf8 encoded string, and returning a string which is UTF8 on which represents the Unicode version of those octets. For well formed input this results in no change to the underlying string, but the flag is flipped on. Vice versa we speak of encode_utf8() which converts its input to a utf8 encoded form, regardless of what form it was represented internally.
>
> This is incorrect. Decode converts a string of bytes at the logical level (upgraded or downgraded does not matter) and returns a string of characters at the logical level (upgraded or downgraded does not matter). It may commonly use upgraded or downgraded strings as the input or output for efficiency but this is not required.
>
> Nope *you* are wrong. Decoding does not use upgrading or downgrading. Decoding utf8 is logically equivalent to an upgrade operation when the string contains only codepoints 0-127. For any codepoint ABOVE that it does something very different.

Decoding doesn’t *use* upgrading nor downgrading, but it accepts either and may output either.

> Decoding most definitely DOES define the internal format of the result string. If you decode utf8 the result is a UTF8 on string. If that string contained utf8 representing codepoints above 127 then the result will be different.

This is wrong. Example:

> perl -MDevel::Peek -e'my $foo = "e"; utf8::decode($foo); Dump $foo'
SV = PV(0x7fdb0a804c70) at 0x7fdb0b00ccd0
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x7fdb0a501cc0 "e"\0
CUR = 1
LEN = 10

As an *implementation detail*, utf8::decode *happens* to set the flag when given UTF-8 for code points 128-255:

> perl -MDevel::Peek -e'my $foo = "é"; utf8::decode($foo); Dump $foo'
SV = PV(0x7fd600804c70) at 0x7fd6008162d0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x7fd6018026e0 "\303\251"\0 [UTF8 "\x{e9}"]
CUR = 2
LEN = 10

… but it would be just as valid -- and would print() the same way -- if utf8::decode() modified the PV to contain just \xe9.

> FG: You’re omitting what IMO is the most obvious purpose of the flag: to indicate whether the code points that the PV stores are the plain bytes, or are the UTF-8-decoded code points. This is why you can print() the string in either upgraded or downgraded forms, and it comes out the same.
>
> Yves: Its hard to say what you are referring to here. If you mean codepoints 0-127, then it is unsurprising as the representation of them is equivalent in ASCII and UTF8 and Latin1. But if you mean a codepoint above the ASCII plane, then no they should not come out the same. If you are piping that data to a file I would expect the octets written to that file to be different. (assuming a binary filehandle with no layers magically transforming things). If your terminal renders them the same then I assume it is doing some magic behind the scenes to deal with malformed utf8.
>
> DB: Not correct. An upgraded or downgraded string prints identically because you are printing the logical ordinals which do not change by this operation. Whether those ordinals are interpreted as bytes or Unicode characters depends what you are printing to, but in either case the internally-stored bytes are irrelevant to the user except to determine what those logical ordinals are
>
> Yves: Dude, you keep saying I am not correct when what I have said is easily verifiable.
>
> If you print chr(0xe9) to a filehandle and it does not contain the octet E9 then there is a problem
>
> If you print chr(0xe9) to a utf8 terminal it should render a Unicode replacement character for a broken utf8 sequence.
>
> If you print an encoded chr(0xe9) then it should render the glyph for E9.
>
> If you think anything else is happening then prove it with code.

These illustrate Dan’s point (assuming a UTF-8 terminal):

> perl -e'my $foo = "\xc3\xa9"; print $foo'
é

> perl -e'my $foo = "\xc3\xa9"; utf8::upgrade($foo); print $foo'
é

Upgraded or downgraded doesn’t change the logical content of the string; the important thing is the codepoints.

The cases you’ve mentioned -- pattern matching, system calls, and the like -- where a string’s internal storage *does* matter, e.g.:

> perl -e'my $foo = "é"; exec "echo", $foo'
é

> perl -e'my $foo = "é"; utf8::upgrade($foo); exec "echo", $foo'
é

... are bugs in Perl. This is why the feature bundles enable the features that fix (some of) those bugs. (And why IMO Sys::Binmode should join them.)


> Show me the code. As far as I know decode operations do not operate on unicode strings. Calling decode_utf8 on a string which is utf8 is a noop.

This isn’t true for either definition of “UTF-8 string”. This shows an upgraded string whose codepoints are UTF-8 being decoded:

> perl -MDevel::Peek -e'my $foo = "\xc3\xa9"; utf8::upgrade($foo); Dump $foo; utf8::decode($foo); Dump $foo;'
SV = PV(0x7f93eb804c70) at 0x7f93ec00c8d0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x7f93eb5019c0 "\303\203\302\251"\0 [UTF8 "\x{c3}\x{a9}"]
CUR = 4
LEN = 10
SV = PV(0x7f93eb804c70) at 0x7f93ec00c8d0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x7f93eb5019c0 "\303\251"\0 [UTF8 "\x{e9}"]
CUR = 2
LEN = 10

> It would be just as legitimate to mutate the PV to store a single octet, 0xe9, and leave the UTF8 flag off.
>
> Nope. That would mean that Perl would use ASCII/Latin-1 case folding rules on the result, which would be wrong. It should use Unicode case folding rules for codepoint E9 if it was decoded as that codepoint. (Change the example to \x{DF} and you can see these issues in the flesh, \x{DF} should match "ss" in Unicode, but in ASCII/Latin1 it only matches \x{DF}. The lc() version of \x{DF} is "ss" but in Latin-1/Ascii there are no multi-byte case folds). Even more suggestive that Perl doing this would be wrong is that in fact there is NO valid Unicode encoding of codepoint E9 which is only 1 octet long. So that would be extremely wrong of Perl to use a non Unicode encoding of unicode data dont you think? Also, what would perl do when the codepoint doesn't fit into a single octet? Your argument might have some merit if you were arguing that Perl could have decoded it into "\x{E9}\0\0\0" and set the UTF-32 flag, but as stated it doesn't make sense.
>
> Not correct. Under old rules, yes, the UTF8 flag determined whether Unicode rules are used in various operations; this was an abstraction break, and so the unicode_strings feature was added to fix the problem, and enabled in feature bundles since 5.12
>
> Ah, ok, so if you *change* the default mode of perl it does something different than I described, and that makes my comments "incorrect"? What i described is how "normal" perl without any new features enabled works. If there are features that change what I have said feel free to use them. But it doesnt change that what I said is an accurate version of how the perl internals normally function.

The problem is that Perl’s default behaviour is inconsistent: when outputting to filehandles, computing length() or ord(), comparing strings, etc., all code points are the same regardless of the internal storage format. But when doing pattern-matches Perl treats upgraded/wide/UTF8-flagged strings differently from downgraded/narrow/non-flagged ones.

The latter behaviour is considered a bug.

-FG
Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]
On Wed, 18 Aug 2021 13:18:34 -0400
Felipe Gasper <felipe@felipegasper.com> wrote:

> PROPOSAL: Rename the following identifiers in code and documentation,
> leaving macros for the old ones as aliases:
> - SVf_UTF8 -> SVf_PVUPGRADED
> - SvUTF8 -> Sv_PVUPGRADED
> - SvUTF8_on -> Sv_PVUPGRADED_on
> - SvUTF8_off -> Sv_PVUPGRADED_off
> - SvPOK_only_UTF8 -> SvPOK_only_UPGRADED

This got briefly mentioned as a side-comment on PSC today.

Thoughts are "What about WIDE"? As in

SVf_WIDE (though really I'd want to call that SVppv_WIDE)
SvWIDE
SvWIDE_on
etc...

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/
Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]
On Fri, Sep 3, 2021, at 10:24 AM, Paul "LeoNerd" Evans wrote:
> Thoughts are "What about WIDE"? As in
>
> SVf_WIDE (though really I'd want to call that SVppv_WIDE)
> SvWIDE
> SvWIDE_on

If we don't like wide because of "wide character in", there's always COOKED.

--
rjbs
Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]
> On Sep 3, 2021, at 10:24 AM, Paul LeoNerd Evans <leonerd@leonerd.org.uk> wrote:
>
> On Wed, 18 Aug 2021 13:18:34 -0400
> Felipe Gasper <felipe@felipegasper.com> wrote:
>
>> PROPOSAL: Rename the following identifiers in code and documentation,
>> leaving macros for the old ones as aliases:
>> - SVf_UTF8 -> SVf_PVUPGRADED
>> - SvUTF8 -> Sv_PVUPGRADED
>> - SvUTF8_on -> Sv_PVUPGRADED_on
>> - SvUTF8_off -> Sv_PVUPGRADED_off
>> - SvPOK_only_UTF8 -> SvPOK_only_UPGRADED
>
> This got briefly mentioned as a side-comment on PSC today.
>
> Thoughts are "What about WIDE"? As in
>
> SVf_WIDE (though really I'd want to call that SVppv_WIDE)
> SvWIDE
> SvWIDE_on
> etc...

IMO it would at least improve on the status quo.

My only reservation would be potential confusion with the notion of “wide character” (anything >255); someone might see that flag and think it means there’s a wide character in the string or some such.

-F
Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]
On Fri, 3 Sep 2021 11:44:18 -0400
Felipe Gasper <felipe@felipegasper.com> wrote:

> My only reservation would be potential confusion with the notion of
> “wide character” (anything >255); someone might see that flag and
> think it means there’s a wide character in the string or some such.

I think that's fine. An SVppv_WIDE string might well contain a "wide
character". If the string isn't SVppv_WIDE (i.e. it's "narrow"?), then
it definitely does not.
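
Roughly, in Perl terms (a sketch that uses the core utf8::is_utf8() and utf8::upgrade() functions to peek at the flag under discussion; the helper has_wide_char is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: true if any code point is above 255.
sub has_wide_char { return $_[0] =~ /[^\x00-\xFF]/ ? 1 : 0 }

my $narrow   = "caf\xE9";            # flag off
my $upgraded = "caf\xE9";
utf8::upgrade($upgraded);            # flag on, but no code point above 255
my $smiley   = "\x{263A}";           # flag on, and it has to be

for my $s ($narrow, $upgraded, $smiley) {
    printf "flag=%d wide_char=%d\n",
        (utf8::is_utf8($s) ? 1 : 0), has_wide_char($s);
}
```

So the flag being on is necessary for a wide character to be present, but not sufficient.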

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/
Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]
On Thu, Sep 2, 2021, at 9:20 AM, demerphq wrote:
> No. The flag does not mean "upgraded" it means "unicode semantics, utf8 encoding". Upgrading is one way to get such a string, and it might even be the most common, but the most important and likely to be correct way is explicit decoding.

You wrote a whole lot, but this quote is, I think, at the center of what I have found confusing.

The utf8 flag on a scalar doesn't mean Unicode semantics. That way lies The Unicode Bug. Under the unicode_strings feature, recommended and in the version bundle since v5.12 (2010), all strings have unicode semantics and are treated as a sequence of codepoints when performing textish operations.

perl -E 'say "word" if "\xFF" =~ /\w/'

This string hasn't been upgraded, hasn't been decoded, and prior to unicode_strings, would not have matched.

My take here is that unicode_strings is a *bugfix* (fixing the "Unicode Bug"), and it sounds like you are implying that it is not, and that the correct behavior to learn is that the utf8 flag on a scalar is the *correct* way to know whether Unicode semantics would be applied. This surprises me.
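
To make the scoping concrete, here is a small sketch (assuming no locale is in effect) in which the same downgraded string is matched outside and inside the feature's lexical scope:

```perl
use strict;
use warnings;

my $str = "\xFF";   # never upgraded, never decoded

# Outside the feature's scope the match depends on internal storage.
my $legacy = $str =~ /\w/ ? 1 : 0;

my $fixed;
{
    use feature 'unicode_strings';
    # Same string, same storage; only the lexical scope differs.
    $fixed = $str =~ /\w/ ? 1 : 0;
}

print "without unicode_strings: $legacy\n";
print "with unicode_strings:    $fixed\n";
```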

--
rjbs
Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]
> On Sep 3, 2021, at 10:24 AM, Paul LeoNerd Evans <leonerd@leonerd.org.uk> wrote:
>
> On Wed, 18 Aug 2021 13:18:34 -0400
> Felipe Gasper <felipe@felipegasper.com> wrote:
>
>> PROPOSAL: Rename the following identifiers in code and documentation,
>> leaving macros for the old ones as aliases:
>> - SVf_UTF8 -> SVf_PVUPGRADED
>> - SvUTF8 -> Sv_PVUPGRADED
>> - SvUTF8_on -> Sv_PVUPGRADED_on
>> - SvUTF8_off -> Sv_PVUPGRADED_off
>> - SvPOK_only_UTF8 -> SvPOK_only_UPGRADED
>
> This got briefly mentioned as a side-comment on PSC today.
>
> Thoughts are "What about WIDE"? As in
>
> SVf_WIDE (though really I'd want to call that SVppv_WIDE)
> SvWIDE
> SvWIDE_on
> etc...

Now that this thread seems to have “settled” a bit, I wonder where this idea stands in the general mindset:


a) Good idea, worth the overhead of renaming a long-established identifier.

b) Good idea, but *not* worth that overhead.

c) Bad idea; the status quo is better than either of the proposed renames.

d) … some other stance?


To recap, arguments in favour include:

1) More accurate: “wide” encoding allows things that UTF-8 proper forbids, so calling it “UTF8” isn’t quite right.

2) The rename discourages thinking of the flag as indicating a “UTF-8 string”--a widely-held misconception.

3) The upheaval would highlight how the abstraction *should* work and hopefully right some lingering misconceptions out and about.
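
As an illustration of point 1, a sketch assuming the Encode module that ships with perl (the %samples and %result names are only for this example): a surrogate and a code point above U+10FFFF both fit in Perl's internal encoding, while Encode's strict "UTF-8" should refuse them.

```perl
use strict;
use warnings;
no warnings 'utf8';   # silence surrogate / non-Unicode warnings
use Encode ();

my %samples = (
    surrogate     => chr(0xD800),
    above_unicode => chr(0x110000),
);

my %result;
for my $name (sort keys %samples) {
    my $str = $samples{$name};

    # Perl's internal encoding represents the code point without complaint.
    my $internal = $str;
    utf8::encode($internal);

    # Strict "UTF-8" (as opposed to the lax "utf8") should reject it.
    my $ok = eval {
        Encode::encode('UTF-8', $str, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };

    $result{$name} = { bytes => length($internal), strict_ok => $ok ? 1 : 0 };
    printf "%-13s internal bytes=%d strict UTF-8 %s\n",
        $name, $result{$name}{bytes}, ($ok ? "accepted" : "rejected");
}
```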


Thank you!


-FG
Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]
2021-9-16 5:02 Felipe Gasper <felipe@felipegasper.com> wrote:

>
>
> 1) More accurate: “wide” encoding allows things that UTF-8 proper forbids,
> so calling it “UTF8” isn’t quite right.
>
>
>
I am currently learning about UTF-8 and Unicode, looking for good ideas.

May I share my categorization of UTF-8?

A. Text - "text" here means a Perl text string

1. Loose UTF-8

This is not valid UTF-8

This contains

3-byte surrogate

4-byte super characters(over U+10FFFF)

This doesn't contain

latin-1 code

2. Valid UTF-8

This is valid UTF-8

This doesn't contain

3-byte surrogate

4-byte super characters(over U+10FFFF)

3. Valid Minimal UTF-8 (this is for security)

This is valid and minimal UTF-8 (normalized to the minimum number of
bytes)

? is ? (? doesn't ? + ")

B. Bytes

Any bytes.
Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]
Hi Felipe,

We discussed this in a recent PSC meeting, and agreed that we’d like this to progress.

So can you submit this as a formal RFC please?

Cheers,
Neil
Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]
I just want to repeat my strong objection to this. I really don't think we
should be forcing this on XS developers or muddying the waters any more.

Please don't do this.

cheers,
Yves



On Fri, 12 Nov 2021 at 14:00, Neil Bowers <neilb@neilb.org> wrote:

> Hi Felipe,
>
> We discussed this in a recent PSC meeting, and agreed that we’d like this
> to progress.
>
> So can you submit this as a formal RFC please?
>
> Cheers,
> Neil
>


--
perl -Mre=debug -e "/just|another|perl|hacker/"
Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]
On Wed, Sep 15, 2021 at 10:02 PM Felipe Gasper <felipe@felipegasper.com>
wrote:

>
> > On Sep 3, 2021, at 10:24 AM, Paul LeoNerd Evans <leonerd@leonerd.org.uk>
> wrote:
> >
> > On Wed, 18 Aug 2021 13:18:34 -0400
> > Felipe Gasper <felipe@felipegasper.com> wrote:
> >
> >> PROPOSAL: Rename the following identifiers in code and documentation,
> >> leaving macros for the old ones as aliases:
> >> - SVf_UTF8 -> SVf_PVUPGRADED
> >> - SvUTF8 -> Sv_PVUPGRADED
> >> - SvUTF8_on -> Sv_PVUPGRADED_on
> >> - SvUTF8_off -> Sv_PVUPGRADED_off
> >> - SvPOK_only_UTF8 -> SvPOK_only_UPGRADED
> >
> > This got briefly mentioned as a side-comment on PSC today.
> >
> > Thoughts are "What about WIDE"? As in
> >
> > SVf_WIDE (though really I'd want to call that SVppv_WIDE)
> > SvWIDE
> > SvWIDE_on
> > etc...
>
> Now that this thread seems to have “settled” a bit, I wonder where this
> idea stands in the general mindset:
>
>
> a) Good idea, worth the overhead of renaming a long-established identifier.
>
> b) Good idea, but *not* worth that overhead.
>
> c) Bad idea; the status quo is better than either of the proposed renames.
>
> d) … some other stance?
>

I strongly believe it's not worth the overhead (from an effort and
confusion POV), and less strongly feel it's not a good idea.

Leon
