On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com> wrote:
>
> > On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
> >
> > On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com>
> wrote:
> > Per recent IRC discussion …
> >
> > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
> confusion regarding the flag’s significance. Some think it indicates
> whether a given PV stores text versus binary. Some think it means that the
> PV is valid UTF-8. Still others likely hold other inaccurate views.
> >
> > The problem here is the naming. For example, consider `perl -e'my $foo =
> "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its
> code points (assuming use of a UTF-8 terminal) correspond to the bytes that
> encode “é” in UTF-8.
> >
> > Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a
> square/rectangle relationship. All strings are "rectangles", all "squares"
> are rectangles, some strings are squares, but unless SQUARE flag is ON perl
> should assume it is a rectangle, not a square. The SQUARE flag should only
> be set when the rectangle has been proved conclusively to be a square. That
> the SQUARE flag is off does not mean the rectangle is not a square, merely
> that the square has not been proved to be such.
>
> You’re defining “a UTF-8 string” as “a string whose PV is marked as
> UTF-8”. I’m defining it as “a string whose Perl-visible code points happen
> to be valid UTF-8”.
>
I don't find your definition very useful, nor descriptive of how perl
manages these matters, so I am not using it. You are confusing different
levels of abstraction. Your definition would also include cases where the
data is already encoded and flagged as utf8, so it doesn't make sense to
me.

Here is the set of definitions that I am operating from:

A "string" is a programming concept inside of Perl which is used to
represent "text" buffers of memory. There are three levels of abstraction
for strings, two of which are tightly coupled. The three are the codepoint
level, the semantic level and the encoding level.

At the codepoint level you can think of strings as variable-length arrays
of numbers (codepoints), where the numbers are restricted to 0 to 0x10FFFF.

At the semantic level you can think of these numbers (codepoints) as
representing characters from some form of text, with specific rules for
certain operations like case-folding, as well as a well-defined mapping to
graphemes which are displayed to our eyes when those numbers are rendered
by a display device like a terminal.

The encoding level of abstraction addresses how those numbers (codepoints)
will be represented as bytes (octets) in memory inside of Perl, and when
you directly write the data to disk or to some other output stream.

There are two sets of codepoint range, semantics and encoding available,
which are controlled by a flag associated with the string called the UTF8
flag. When set, this flag indicates that the string can represent
codepoints 0 to 0x10FFFF, should have Unicode semantics applied to it, and
that its in-memory representation is variable-width utf8. When the flag is
not set, it indicates that the string can represent codepoints 0 to 255,
has ASCII case-folding semantics, and that its in-memory representation is
fixed-width octets.
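
To make the "two sets of semantics" point concrete, here is a minimal
sketch. It assumes a perl run without the unicode_strings feature and
without any locale in effect; the variable names are mine, purely for
illustration. The two strings hold the same codepoint and compare equal,
yet uc() treats them differently depending on the flag:

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" and no "use v5.12" or
# later here: either would give both strings Unicode semantics.

# The same single codepoint, U+00DF, stored in the two internal forms.
my $down = "\x{DF}";
utf8::downgrade($down);    # UTF8 flag off: one octet, ASCII case rules
my $up = "\x{DF}";
utf8::upgrade($up);        # UTF8 flag on: two octets, Unicode case rules

# At the codepoint level the two strings are identical.
my $same = ($down eq $up) ? 1 : 0;

# At the semantic level they are not: only the flagged string upper-cases
# U+00DF to "SS".
my $uc_down = uc($down);
my $uc_up   = uc($up);

# Print codepoints rather than raw characters to keep the output ASCII.
my $show = sub { join " ", map { sprintf "U+%04X", ord } split //, $_[0] };
printf "equal=%d uc(down)=%s uc(up)=%s\n",
    $same, $show->($uc_down), $show->($uc_up);
```

I use uc() here only because it is the simplest operation that exposes the
difference; case-insensitive matching follows the same split.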

In order to be able to combine these two types of strings we need to define
some operations:

upgrading/downgrading: converting a string from one set of semantics and
encoding to the other while preserving exactly the codepoint level
representation. By tradition we call it upgrading when we go from Latin-1
to Unicode, with the result being UTF8-on, and we call it downgrading when
we go from Unicode to Latin-1, with the result being UTF8-off. These
operations are NOT symmetrical. It is *not* possible to downgrade every
Unicode string to Latin-1, however it is possible to upgrade every Latin-1
string to Unicode. By tradition upgrade and downgrade functions are no-ops
when their input is already in the form expected as the result, but this is
by tradition only.
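
As a rough illustration of that asymmetry, here is a small sketch using the
core utf8:: functions (the variable names are just for illustration).
Upgrading changes the flag but not the codepoints, and downgrading only
works when every codepoint fits in an octet:

```perl
use strict;
use warnings;

# Start with a UTF8-off string holding the single codepoint U+00E9.
my $str = "\x{E9}";
utf8::downgrade($str);
my $flag_before = utf8::is_utf8($str) ? 1 : 0;

# Upgrade: the flag and the internal octets change, the codepoints do not.
utf8::upgrade($str);
my $flag_after = utf8::is_utf8($str) ? 1 : 0;
my $len_after  = length($str);
my $ord_after  = ord($str);

# Downgrading works here because every codepoint is <= 255.
my $down_ok = utf8::downgrade($str, 1) ? 1 : 0;

# A codepoint above 255 cannot be downgraded. The second argument (FAIL_OK)
# makes utf8::downgrade() return false instead of dying.
my $wide    = "\x{263A}";
my $wide_ok = utf8::downgrade($wide, 1) ? 1 : 0;

printf "flag %d -> %d, length=%d, ord=0x%X, downgrade ok=%d, wide ok=%d\n",
    $flag_before, $flag_after, $len_after, $ord_after, $down_ok, $wide_ok;
```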

decoding/encoding: converting a string from one form to the other in a way
that transforms the codepoints from one form to a potentially different
form. Traditionally we speak of decode_utf8() taking a Latin-1 string
containing octets that make up a utf8 encoded string, and returning a
string which is UTF8-on and which represents the Unicode version of those
octets. For well-formed input this results in no change to the underlying
buffer, but the flag is flipped on. Vice versa we speak of encode_utf8(),
which converts its input to a utf8 encoded form, regardless of what form it
was represented in internally.
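
A minimal sketch of the codepoint-level effect, using the in-place
utf8::decode() and utf8::encode() functions rather than the Encode module
(variable names are illustrative only). Unlike an upgrade, a decode changes
how many codepoints the string holds:

```perl
use strict;
use warnings;

# Two octets that happen to be the UTF-8 encoding of U+00E9.
my $octets     = "\x{C3}\x{A9}";
my $len_octets = length($octets);    # two codepoints: 0xC3 and 0xA9

# Decoding reinterprets those octets as one codepoint.
my $text       = $octets;
my $decode_ok  = utf8::decode($text) ? 1 : 0;
my $len_text   = length($text);
my $ord_text   = ord($text);
my $text_flag  = utf8::is_utf8($text) ? 1 : 0;

# Encoding goes back to the octet view, with the flag off again.
my $bytes = $text;
utf8::encode($bytes);
my $round_trip = ($bytes eq $octets) ? 1 : 0;
my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;

printf "octets=%d decoded=%d (0x%X, flag=%d) round-trip=%d (flag=%d)\n",
    $len_octets, $len_text, $ord_text, $text_flag, $round_trip, $bytes_flag;
```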
When we are confronted with combining the two forms of string Perl has
little choice but to use the "safe" strategy of "upgrading" the Latin-1
parts to Unicode.

Both the operations of "upgrading" and "decoding" result in UTF8-on
strings, and indeed both can result in not changing their input at all, but
when they do change their input they change it very differently. Most of
the trouble people get into with strings comes from doing an upgrade
operation when they should have done a decode operation. This is
because upgrade operations can happen implicitly based on simple rules and
thus can happen "by accident", but decode operations are always explicit so
they never happen without the involvement of the developer in some way.
This is at least partly because upgrade operations do not have any failure
modes but decode operations do.
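
Here is a sketch of how that typically goes wrong, under the assumption
that some undecoded UTF-8 octets get concatenated with a string that is
already UTF8-on (the names $input and $label are hypothetical). The
concatenation silently upgrades the octets, and encoding the result for
output then produces the familiar double-encoded bytes:

```perl
use strict;
use warnings;

my $input = "\x{C3}\x{A9}";    # UTF-8 octets for U+00E9, NOT yet decoded
my $label = "\x{263A} ";       # already a UTF8-on string

# Mistake: the concatenation implicitly upgrades $input, so each octet is
# treated as its own Latin-1 codepoint (U+00C3 and U+00A9).
my $wrong = $label . $input;

# Correct: decode explicitly first, then combine.
my $decoded = $input;
utf8::decode($decoded);
my $right = $label . $decoded;

# Encode both results as they would be written to an output stream.
my $wrong_bytes = $wrong;
utf8::encode($wrong_bytes);
my $right_bytes = $right;
utf8::encode($right_bytes);

my $wrong_hex = unpack("H*", $wrong_bytes);
my $right_hex = unpack("H*", $right_bytes);
printf "wrong: %d chars, bytes %s\n", length($wrong), $wrong_hex;
printf "right: %d chars, bytes %s\n", length($right), $right_hex;
```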

Most of the time, as long as they are only thinking about codepoints,
developers don't have to worry about this stuff. The places where they do
are when they are reading or writing data, and in some cases when they are
embedding string constants in their code where they want a particular set
of semantics and encoding. As long as people are disciplined to use
decode_utf8() before they use utf8 string data, and encode_utf8() before
they emit it, then the complexities above should be transparent to the
developer.
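
In code, that discipline looks roughly like the sketch below. It uses
decode_utf8() and encode_utf8() from the Encode module; the input octets
are made up and stand in for data read from a file or socket:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Pretend these octets just arrived from outside (UTF-8 for "caf" + U+00E9).
my $raw_in = "caf\x{C3}\x{A9}";

# Decode at the input boundary ...
my $text = decode_utf8($raw_in);

# ... work purely in terms of codepoints in the middle ...
my $upper = uc($text);

# ... and encode again at the output boundary.
my $raw_out = encode_utf8($upper);

printf "chars=%d last=U+%04X out=%s\n",
    length($text), ord(substr($upper, -1)), unpack("H*", $raw_out);
```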
> What you call “a UTF-8 string” is what I propose we call, per existing
> nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with
> corresponding code changes. Then the term “UTF-8 string” makes sense from a
> pure-Perl context without requiring Perl programmers to worry about
> interpreter internals.
>
>
No. The flag does not mean "upgraded"; it means "Unicode semantics, utf8
encoding". Upgrading is one way to get such a string, and it might even be
the most common, but the most important way, and the one most likely to be
correct, is explicit decoding.

If we are to rename the flag then we should just rename it as the UNICODE
flag. That would have saved a world of confusion.

> The “UTF-8 flag”, however, is likely *not* set on this string. By
> contrast, consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has the
> “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points
> (in this case, only 1) aren’t valid UTF-8.
> >
> > Except it is valid UTF-8: (at least in my utf8 terminal).
> >
> > $ perl -MDevel::Peek -Mutf8 -e'my $foo = "é"; Dump($foo)'
> > SV = PV(0x153efc0) at 0x155fb38
> > REFCNT = 1
> > FLAGS = (POK,IsCOW,pPOK,UTF8)
> > PV = 0x1563240 "\303\251"\0 [UTF8 "\x{e9}"]
> > CUR = 2
> > LEN = 10
> > COW_REFCNT = 1
> >
> > So the string is UTF-8.
>
> Again, different definitions.
You can't define yourself away from how things actually work. The string is
UTF8-on because perl says so.
> The Perl-visible string contains a single code point, 0xe9. This code
> point doesn’t correspond to valid UTF-8 bytes,
Codepoints and octets (bytes) are abstractions at different levels in Perl
and Unicode, so "this codepoint doesn't correspond to valid UTF-8 bytes"
doesn't really make any sense as a sentence. Codepoints are integers from 0
to 0x10FFFF. They can be *encoded* in a variety of ways as octets; for
instance the codepoint E9 has at least 5 different representations at the
octet level under Unicode: "\x{E9}\x{00}", "\x{00}\x{E9}",
"\x{E9}\x{00}\x{00}\x{00}", "\x{00}\x{00}\x{00}\x{E9}", and "\303\251" are
all equally valid ways of representing the codepoint E9. Notice that the
octet "E9" by itself is NOT a valid way to represent the codepoint E9 in
any Unicode encoding.
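
If you want to check those five representations yourself, something like
this sketch should do it. It assumes the Encode module's "UTF-16LE",
"UTF-16BE", "UTF-32LE", "UTF-32BE" and "UTF-8" encoding names, which write
no byte order mark:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{E9}";    # the single codepoint U+00E9

# Encode the same codepoint five ways and show the resulting octets as hex.
my %hex;
for my $enc (qw(UTF-16LE UTF-16BE UTF-32LE UTF-32BE UTF-8)) {
    my $copy = $char;    # encode a copy so $char is never touched
    $hex{$enc} = unpack("H*", encode($enc, $copy));
    printf "%-8s %s\n", $enc, $hex{$enc};
}
```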
The dump above shows correctly the octet and codepoint representation of
the string. The buffer contains "\303\251" which is the UTF8
representation of the codepoint E9, and the flag is on which is why it
understands that this is a single codepoint, not two.
> so IMO it doesn’t make sense to call it a “UTF-8 string”.
After the operations I performed it is a Unicode string encoded in UTF-8,
as represented by the UTF-8 flag in the dump.
> Whether Perl stores that code point as one byte or as two is Perl’s
> business alone … right?
>
Well, it would be weird if we stored Unicode data in a form not supported
by Unicode, don't you think? There is no single-octet representation of the
codepoint E9 defined by Unicode as far as I know.
>
> > I do not understand your point that only the initiated can understand
> this flag. It means one and only one thing: that the perl internals should
> assume that the buffer contains utf8 encoded data and that perl should
> apply unicode semantics when doing character and case-sensitive operations,
> and that perl can make certain assumptions when it processing the data (eg
> that is not malformed).
>
> The behaviour you’re talking about is what the unicode_strings and
> unicode_eval features specifically do away with (i.e., fix), right?
I'm not familiar enough with those to comment. I assume they relate to what
assumptions Perl should make about strings which are constructed as
literals in the source code, where there is a great deal of ambiguity about
what is going on compared to actual code that constructs such strings,
where things are exact.
>
> You’re omitting what IMO is the most obvious purpose of the flag: to
> indicate whether the code points that the PV stores are the plain bytes, or
> are the UTF-8-decoded code points. This is why you can print() the string
> in either upgraded or downgraded forms, and it comes out the same.
>
It's hard to say what you are referring to here. If you mean codepoints
0-127, then it is unsurprising, as the representation of them is equivalent
in ASCII, UTF8 and Latin-1. But if you mean a codepoint above the ASCII
plane, then no, they should not come out the same. If you are piping that
data to a file I would expect the octets written to that file to be
different (assuming a binary filehandle with no layers magically
transforming things). If your terminal renders them the same then I assume
it is doing some magic behind the scenes to deal with malformed utf8.
>
> > BTW, your scheme needs to account for WAS_UTF8 as well. Most people dont
> know it, but there are actually three types of strings in the perl
> internals, UTF8-ON, UTF8-OFF, UTF8-OFF + WAS_UTF8. It only manifests in
> hash keys. But it needs to be accounted for as well in any renaming. Perl
> dictates that keys which are character-wise equivalent hash the same
> regardless of the UTF8 flag (or put alternative, the hash should be of the
> codepoints the string represents NOT the octets that make up that
> representation). This means UTF8-ON keys are always downgraded on lookup or
> store in a hash. If the downgrade is successful the key is marked as
> WAS-UTF8 and the downgraded string is stored and hashed, if it was
> unsuccessful (eg it contains codepoints above 255) it is marked as UTF8-ON
> and the original buffer is hashed. When the key is extracted with keys() or
> each() if the WASUTF8 flag is set the string is upgraded back to the UTF8
> form.
>
> Thank you for this. I knew about the was-UTF8 status but didn’t know why
> it exists.
>
> > I think you need to step back and consider that strings are sequences of
> octets. Sometimes those octets are ordered such that they can be
> interpreted as utf8. The UTF-8 flag being on tells perl that it can and
> should treat the octets as utf8.
>
> C strings are sequences of octets, yes. Perl strings, though, are
> sequences of code points, not octets. In this they’re more like JavaScript
> strings than C strings.
>
Perl strings are very similar to C strings when the flag is off, and
JavaScript strings when the flag is on.
>
> > my $foo = "é";
> >
> > I don't know exactly what that code does without doing an octet level
> investigation of the data. It could be one octet and in latin-1 or it could
> be two octets and be Unicode in one of several formats (utf8, utf-16BE
> utf-16LE) and still be rendered identically in an editor or browser.
>
> Sorry, I assumed we all use UTF-8 terminals. :) But yes, I should have
> written it as two \x escapes, sorry.
>
I do, but this email is being rendered by gmail in a browser. Any number of
conversions of the actual bytes on disk could have happened between you and
me. For all I know you might have written your email in a text editor using
UTF-32.
>
> > I also know what happens here:
> >
> > my $foo="\x{c3}\x{a9}";
> > utf8::decode($foo);
> > Dump($foo);
> >
> > SV = PV(0x2303fc0) at 0x2324c98
> > REFCNT = 1
> > FLAGS = (POK,IsCOW,pPOK,UTF8)
> > PV = 0x2328300 "\303\251"\0 [UTF8 "\x{e9}"]
> > CUR = 2
> > LEN = 10
> > COW_REFCNT = 1
> >
> > That is, i start off with two octets, C3 - A9, which happens to be the
> encoding for the codepoint E9, which happens to be é.
> > I then tell perl to "decode" those octets, which really means I tell
> perl to check that the octets actually do make up valid utf8. And if perl
> agrees that indeed these are valid utf8 octets, then it turns the flag on.
> Now it doesn't matter if you *meant* to construct utf8 in the variable fed
> to decode, all that matters is that at an octet level those octet happen to
> make up valid utf8.
>
> I think you’re actually breaking the abstraction here by assuming that
> Perl implements the decode by setting a flag.
>
>
No I am not. The flag is there to tell the perl internals how to
manipulate the string. decode's task is to take arbitrary strings of
octets, ensure that they can be decoded as valid utf8, possibly do some
conversion as it does so (eg for forbidden utf8 sequences or other
normalization), and then SET THE FLAG. Only once decode is done is the
string "Unicode" and is the string "utf8". Prior to that it was just random
octets. It doesn't need to do anything BUT set the flag because its
internal encoding matches the external encoding in this case. If it was
decoding UTF16LE then it would have to do conversion as well.
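
A quick sketch of the "check, then set the flag" behaviour, using the core
utf8::decode() function (variable names are illustrative). Well-formed
octets get the flag; a lone E9 octet, which is not well-formed utf8, is
rejected and left alone:

```perl
use strict;
use warnings;

my $good = "\x{C3}\x{A9}";    # well-formed UTF-8 for U+00E9
my $bad  = "\x{E9}";          # a lone E9 octet is not well-formed UTF-8

# utf8::decode() works in place and returns false on malformed input.
my $good_ok = utf8::decode($good) ? 1 : 0;
my $bad_ok  = utf8::decode($bad)  ? 1 : 0;

my $good_flag = utf8::is_utf8($good) ? 1 : 0;
my $bad_flag  = utf8::is_utf8($bad)  ? 1 : 0;

printf "good: ok=%d flag=%d length=%d\n", $good_ok, $good_flag, length($good);
printf "bad:  ok=%d flag=%d length=%d\n", $bad_ok,  $bad_flag,  length($bad);
```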
> It would be just as legitimate to mutate the PV to store a single octet,
> 0xe9, and leave the UTF8 flag off.
Nope. That would mean that Perl would use ASCII/Latin-1 case folding rules
on the result, which would be wrong. It should use Unicode case folding
rules for codepoint E9 if it was decoded as that codepoint. (Change the
example to \x{DF} and you can see these issues in the flesh: \x{DF} should
match "ss" in Unicode, but in ASCII/Latin-1 it only matches \x{DF}. The
fc() version of \x{DF} is "ss", but in Latin-1/ASCII there are no
multi-character case folds.) Even more suggestive that it would be wrong
for Perl to do this is that there is in fact NO valid Unicode encoding of
codepoint E9 which is only 1 octet long. So it would be extremely wrong
for Perl to use a non-Unicode encoding of Unicode data, don't you think?
Also, what would perl do when the codepoint doesn't fit into a single
octet? Your argument might have some merit if you were arguing that Perl
could have decoded it into "\x{E9}\0\0\0" and set the UTF-32 flag, but as
stated it doesn't make sense.
> Perl doesn’t do that, of course, because it’s easier just to set a flag,
> but as long as the string content is the single code point 0xe9 it doesn’t
> really matter how Perl achieves that.
>
Yes, Perl deliberately chose to use Utf8 internally for the same reason
Unicode defined utf8 the way it did, so that all of the existing ASCII data
would still be valid when interpreted as Unicode, thus avoiding storage and
performance penalties alternative schemes might impose.
> (Notwithstanding, of course, the abstraction leaks that things like the
> unicode_strings feature and Sys::Binmode fix.)
>
> There are parts of the code that appear to go the other way and prioritize
> downgraded storage. Perl_refcounted_he_fetch_pvn(), for example.
>
I would not have put it like that. With hashing you don't have a lot of
choices if you want the Unicode form of Latin-1 strings to hash the same.
You could decode to codepoints and then use a codepoint-by-codepoint
hashing algorithm, but that is slow, and as far as I know there aren't any
published hash algorithms that do this. So to stay safe with the hash
function you can either downgrade the strings which can be downgraded and
hash the result, or upgrade the strings and hash the upgraded form.
Upgraded strings are on average larger than their downgraded equivalents,
so hashing them is more expensive, and there is an assumption that most
keys will actually be ASCII, so they don't need to be downgraded. When you
consider that perl was an early adopter of Unicode and was bolting it on
to a Latin-1 codebase, the bias seems pretty reasonable.
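
The visible effect of all this is easy to check from pure Perl. In the
sketch below (variable names are mine), the upgraded and downgraded forms
of the same codepoint land on the same hash key:

```perl
use strict;
use warnings;

# The same codepoint, U+00E9, in both internal forms.
my $down = "\x{E9}";
utf8::downgrade($down);
my $up = "\x{E9}";
utf8::upgrade($up);

# Keys that are character-wise equal address the same hash entry,
# whatever the state of their UTF8 flag.
my %h;
$h{$down} = "first";
$h{$up}   = "second";

my $count = scalar keys %h;
my $value = $h{$down};
printf "keys=%d value=%s\n", $count, $value;
```
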
cheers,
yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"