Mailing List Archive: Pre-RFC: Rename SVf_UTF8 et al.

Pre-RFC: Rename SVf_UTF8 et al.

felipe at felipegasper

Aug 18, 2021, 10:18 AM

Post #1 of 46 (2412 views)

Per recent IRC discussion …

PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of confusion regarding the flag’s significance. Some think it indicates whether a given PV stores text versus binary. Some think it means that the PV is valid UTF-8. Still others likely hold other inaccurate views.

The problem here is the naming. For example, consider `perl -e'my $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its code points (assuming use of a UTF-8 terminal) correspond to the bytes that encode “é” in UTF-8. The “UTF-8 flag”, however, is likely *not* set on this string. By contrast, consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has the “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points (in this case, only 1) aren’t valid UTF-8.

The fact that quite often a “UTF-8 string” lacks the “UTF-8 flag”, and a “UTF-8-flagged” string is (usually) *not* a “UTF-8 string”, makes little sense except to the “highly initiated”.
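
The mismatch between the two one-liners can be checked from Perl itself. This is only a minimal sketch, assuming the script file is saved as UTF-8; the variable names are illustrative, and utf8::is_utf8 is used as the Perl-level view of the flag.

```perl
use strict;
use warnings;

# No `use utf8` in scope: the literal is read as the two octets 0xC3 0xA9.
my $octets = "é";

# `use utf8` in scope: the same literal is read as one code point, U+00E9.
my $char = do { use utf8; "é" };

printf "octets: length=%d flagged=%d\n",
    length($octets), utf8::is_utf8($octets) ? 1 : 0;
printf "char:   length=%d flagged=%d\n",
    length($char), utf8::is_utf8($char) ? 1 : 0;

# utf8::decode() returns false when its argument is not valid UTF-8.
my $octets_copy  = $octets;
my $char_copy    = $char;
my $octets_valid = utf8::decode($octets_copy) ? 1 : 0;
my $char_valid   = utf8::decode($char_copy)   ? 1 : 0;
print "octets valid as UTF-8: $octets_valid\n";
print "char valid as UTF-8:   $char_valid\n";
```

Under those assumptions the unflagged string should be the one that decodes as UTF-8, and the flagged string the one that does not.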

Another problem is that “UTF-8” doesn’t really describe the “upgraded” format. This format is what Perl historically called “lax UTF-8” and is now widely called “generalized UTF-8”, which includes unpaired surrogates and code points above Unicode’s maximum.
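
To illustrate the difference from strict UTF-8, here is a small sketch. It assumes the Encode module is available and relies on its documented distinction between the strict "UTF-8" encoding and the lax "utf8" one; the variable names are made up for the example.

```perl
use strict;
use warnings;
no warnings 'utf8';   # silence warnings about surrogates and non-Unicode code points
use Encode ();

my $beyond    = chr(0x110000);  # one past Unicode's maximum, U+10FFFF
my $surrogate = chr(0xD800);    # an unpaired surrogate

for my $str ($beyond, $surrogate) {
    # Perl holds each of these as an ordinary one-character string.
    my $lax = $str;
    utf8::encode($lax);         # octets as the upgraded format stores them

    # Encode's strict "UTF-8" is expected to refuse the same code point.
    my $src = $str;
    my $strict_ok =
        eval { Encode::encode('UTF-8', $src, Encode::FB_CROAK()); 1 } ? 1 : 0;

    printf "U+%04X: length=%d lax_octets=%d strict_ok=%d\n",
        ord($str), length($str), length($lax), $strict_ok;
}
```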

PROPOSAL: Rename the following identifiers in code and documentation, leaving macros for the old ones as aliases:
- SVf_UTF8 -> SVf_PVUPGRADED
- SvUTF8 -> Sv_PVUPGRADED
- SvUTF8_on -> Sv_PVUPGRADED_on
- SvUTF8_off -> Sv_PVUPGRADED_off
- SvPOK_only_UTF8 -> SvPOK_only_UPGRADED

Note that flags like REFCOUNTED_HE_KEY_UTF8 do not need a rename because these indicate an actual (if incomplete/invalidated) UTF-8 decoding step.

BENEFITS: Over time, this rename will minimize the confusion between Perl’s upgraded-PV storage format and UTF-8. The rename may also compel current users of the language who hold mistaken mental models of the flag’s purpose to reexamine their understanding, hopefully for the better.

POTENTIAL COMPLICATIONS: The mismatch between amended documentation and existing documentation may cause confusion; it should, though, be an auspicious confusion that eventually clarifies rather than misleads.

Re: Pre-RFC: Rename SVf_UTF8 et al.

Aug 18, 2021, 12:50 PM

Post #2 of 46 (2412 views)

On Wed, 18 Aug 2021 13:18:34 -0400
Felipe Gasper <felipe@felipegasper.com> wrote:

> PROPOSAL: Rename the following identifiers in code and documentation, leaving macros for the old ones as aliases:
> - SVf_UTF8 -> SVf_PVUPGRADED
> - SvUTF8 -> Sv_PVUPGRADED
> - SvUTF8_on -> Sv_PVUPGRADED_on
> - SvUTF8_off -> Sv_PVUPGRADED_off
> - SvPOK_only_UTF8 -> SvPOK_only_UPGRADED

utf8::is_utf8 probably should be renamed too. Anyway, +1 from me.

Re: Pre-RFC: Rename SVf_UTF8 et al.

grinnz at gmail

Aug 18, 2021, 1:08 PM

Post #3 of 46 (2412 views)

On Wed, Aug 18, 2021 at 3:50 PM Tomasz Konojacki <me@xenu.pl> wrote:

> utf8::is_utf8 probably should be renamed too. Anyway, +1 from me.
>
>
Frankly, it (and upgrade/downgrade) shouldn't even be in the utf8::
namespace; it's named that for internal reasons, not interface reasons.

-Dan

Re: Pre-RFC: Rename SVf_UTF8 et al.

public at khwilliamson

Aug 18, 2021, 1:13 PM

Post #4 of 46 (2412 views)

On 8/18/21 2:08 PM, Dan Book wrote:
> On Wed, Aug 18, 2021 at 3:50 PM Tomasz Konojacki <me@xenu.pl> wrote:
>
> utf8::is_utf8 probably should be renamed too. Anyway, +1 from me.
>
> Frankly it (and upgrade/downgrade) shouldn't even be in the utf8::
> namespace, it's named that for internal reasons not interface reasons.
>
> -Dan

Upgrade and downgrade tell me nothing. I don't object to renaming, but
something better than these needs to be found.

Re: Pre-RFC: Rename SVf_UTF8 et al.

grinnz at gmail

Aug 18, 2021, 1:19 PM

Post #5 of 46 (2412 views)

On Wed, Aug 18, 2021 at 4:13 PM Karl Williamson <public@khwilliamson.com>
wrote:

> On 8/18/21 2:08 PM, Dan Book wrote:
> > On Wed, Aug 18, 2021 at 3:50 PM Tomasz Konojacki <me@xenu.pl> wrote:
> >
> > utf8::is_utf8 probably should be renamed too. Anyway, +1 from me.
> >
> > Frankly it (and upgrade/downgrade) shouldn't even be in the utf8::
> > namespace, it's named that for internal reasons not interface reasons.
> >
> > -Dan
>
> Upgrade and downgrade tell me nothing. I don't object to renaming, but
> something better than these needs to be found
>

It is related to the two possible string formats. Do you know of any other
name for them than UTF8/non-UTF8 (which is a misleading name to expose to
the logical string layer, which may separately be UTF-8 encoded or not) or
upgraded/downgraded?

-Dan
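
A short sketch of the two storage formats mentioned above, using only the core utf8:: functions plus bytes::length to peek at the stored octets. The variable names are illustrative, and the point is only that upgrading or downgrading leaves the logical string unchanged.

```perl
use strict;
use warnings;
use bytes ();   # load bytes::length() without enabling the pragma

my $down = "\xE9";          # U+00E9 in the downgraded format: one octet
my $up   = $down;
utf8::upgrade($up);         # same string in the upgraded format: two octets

printf "equal=%d length=%d/%d flagged=%d/%d stored_octets=%d/%d\n",
    ($down eq $up ? 1 : 0),
    length($down), length($up),
    (utf8::is_utf8($down) ? 1 : 0), (utf8::is_utf8($up) ? 1 : 0),
    bytes::length($down), bytes::length($up);

# Downgrading only succeeds when every code point fits in one octet.
my $wide = "\x{263A}";
my $fits = utf8::downgrade($wide, 1) ? 1 : 0;   # second argument: fail quietly
print "U+263A can be downgraded: $fits\n";
```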

Re: Pre-RFC: Rename SVf_UTF8 et al.

fawaka at gmail

Aug 18, 2021, 1:24 PM

Post #6 of 46 (2412 views)

On Wed, Aug 18, 2021 at 7:17 PM Felipe Gasper <felipe@felipegasper.com>
wrote:

> PROPOSAL: Rename the following identifiers in code and documentation,
> leaving macros for the old ones as aliases:
> - SVf_UTF8 -> SVf_PVUPGRADED
> - SvUTF8 -> Sv_PVUPGRADED
> - SvUTF8_on -> Sv_PVUPGRADED_on
> - SvUTF8_off -> Sv_PVUPGRADED_off
> - SvPOK_only_UTF8 -> SvPOK_only_UPGRADED

I would disagree. Perl code should not have to care/see what the internal
encoding is (it's breaking the encapsulation, really), but perl's internals
very much do and should care about the internal encoding.

So to me this logic only makes sense for the perl-visible side of things
(e.g. utf8::upgrade), not on the C-side.

Leon

Re: Pre-RFC: Rename SVf_UTF8 et al.

felipe at felipegasper

Aug 18, 2021, 1:25 PM

Post #7 of 46 (2412 views)

> On Aug 18, 2021, at 4:19 PM, Dan Book <grinnz@gmail.com> wrote:
>
> On Wed, Aug 18, 2021 at 4:13 PM Karl Williamson <public@khwilliamson.com> wrote:
> On 8/18/21 2:08 PM, Dan Book wrote:
> > On Wed, Aug 18, 2021 at 3:50 PM Tomasz Konojacki <me@xenu.pl> wrote:
> >
> > utf8::is_utf8 probably should be renamed too. Anyway, +1 from me.
> >
> > Frankly it (and upgrade/downgrade) shouldn't even be in the utf8::
> > namespace, it's named that for internal reasons not interface reasons.
> >
> > -Dan
>
> Upgrade and downgrade tell me nothing. I don't object to renaming, but
> something better than these needs to be found
>
> It is related to the two possible string formats. Do you know of any other name for them than UTF8/non-UTF8 (which is a misleading name to expose to the logical string layer, which may separately be UTF-8 encoded or not) or upgraded/downgraded?

RJBS called it “the wide flag” in a presentation some years back. SVf_WIDEPV may clash with the “wide character” warning, though.

SVf_BIGPV?

-F

Re: Pre-RFC: Rename SVf_UTF8 et al.

grinnz at gmail

Aug 18, 2021, 1:31 PM

Post #8 of 46 (2412 views)

On Wed, Aug 18, 2021 at 4:24 PM Leon Timmermans <fawaka@gmail.com> wrote:

> I would disagree. Perl code should not have to care/see what the internal
> encoding is (it's breaking the encapsulation, really), but perl's internals
> very much do and should care about the internal encoding.
>
> So to me this logic only makes sense for the perl-visible side of things
> (e.g. utf8::upgrade), not on the C-side.
>

I would agree except that people not working on the internals also have to
use these functions (for XS code), and thus misuse them because they think
they're related to the logical contents of the string.

-Dan

Re: Pre-RFC: Rename SVf_UTF8 et al.

felipe at felipegasper

Aug 18, 2021, 1:35 PM

Post #9 of 46 (2412 views)

> On Aug 18, 2021, at 4:24 PM, Leon Timmermans <fawaka@gmail.com> wrote:
>
> I would disagree. Perl code should not have to care/see what the internal encoding is (it's breaking the encapsulation, really), but perl's internals very much do and should care about the internal encoding.

This isn’t really true, though. Pure Perl code also frequently has to care about the internal encoding due to the many instances where Perl itself leaks it.

Example:
-----
perl -Mutf8 -MJSON::PP -e'my $foo = JSON::PP::decode_json( JSON::PP::encode_json(["é"]) )->[0]; exec "echo", $foo'
-----
This *should* print mojibake, but it happens to print “é” because of the leak.

When/if that leaky behaviour gets fixed -- 5.36 feature bundle, maybe? -- then it’ll make more sense to consider the PV encoding a wholly internal matter.

-FG
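
The leak in the exec example above can be sketched without JSON::PP. This is an illustration under assumptions, not a definitive test: octets_seen_by_child is a hypothetical helper, the child is the running perl ($^X), and list-form pipe open is assumed to work on the platform.

```perl
use strict;
use warnings;

delete local $ENV{PERL_UNICODE};   # keep -C style settings out of the child

my $down = "\xE9";
my $up   = $down;
utf8::upgrade($up);                # equal to $down as far as Perl code can tell

# Hypothetical helper: run a child perl that echoes its one argument and
# return the raw octets that the child received.
sub octets_seen_by_child {
    my ($arg) = @_;
    open my $fh, '-|', $^X, '-e', 'binmode STDOUT; print $ARGV[0]', $arg
        or die "cannot spawn $^X: $!";
    local $/;
    my $out = <$fh>;
    close $fh;
    return $out;
}

printf "strings equal: %d\n", ($down eq $up ? 1 : 0);
printf "octets seen for the downgraded copy: %d\n",
    length(octets_seen_by_child($down));
printf "octets seen for the upgraded copy:   %d\n",
    length(octets_seen_by_child($up));
```

On a perl where the leak is present, the two counts would differ even though the strings compare equal; a perl that fixed the leak would report the same count for both.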

Re: Pre-RFC: Rename SVf_UTF8 et al.

kimoto.yuki at gmail

Aug 18, 2021, 6:09 PM

Post #10 of 46 (2412 views)

2021-8-19 2:17 Felipe Gasper <felipe@felipegasper.com> wrote:

> Per recent IRC discussion …
>
> PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
> confusion regarding the flag’s significance. Some think it indicates
> whether a given PV stores text versus binary. Some think it means that the
> PV is valid UTF-8. Still others likely hold other inaccurate views.
> [...] it should, though, be an auspicious confusion that eventually clarifies rather than
> misleads.

I feel that the starting point for this discussion is that people
mistakenly believe the current Perl implementation can distinguish between
binary and text.

On this point, I agree with Felipe.

People are likely to believe:

utf8::is_utf8 : 0 : this string is binary
utf8::is_utf8 : 1 : this string is text

However, this is completely wrong.

Current Perl can't make this distinction.

Perl freely changes the internal representation for performance and convenience.

What the flag actually means is:

- 0: the PV is interpreted as bytes
- 1: the PV is interpreted as UTF-8 characters
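
A minimal sketch of why the flag cannot be read as "text versus binary". The sample values are illustrative; only core utf8:: functions are used.

```perl
use strict;
use warnings;

my $text   = "hello";              # clearly text, yet stored without the flag
my $binary = "\x89PNG\r\n\x1a\n";  # PNG signature: clearly binary
my $blob   = $binary;
utf8::upgrade($blob);              # same values, now stored with the flag

printf "text flagged:   %d\n", utf8::is_utf8($text)   ? 1 : 0;
printf "binary flagged: %d\n", utf8::is_utf8($binary) ? 1 : 0;
printf "blob flagged:   %d\n", utf8::is_utf8($blob)   ? 1 : 0;
printf "blob still equals binary: %d\n", ($blob eq $binary ? 1 : 0);
```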

Re: Pre-RFC: Rename SVf_UTF8 et al.

sergey.aleynikov at gmail

Aug 20, 2021, 12:04 AM

Post #11 of 46 (2412 views)

On Wed, 18 Aug 2021 at 20:17, Felipe Gasper <felipe@felipegasper.com> wrote:
> The problem here is the naming. For example, consider `perl -e'my $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its code points (assuming use of a UTF-8 terminal) correspond to the bytes that encode “é” in UTF-8. The “UTF-8 flag”, however, is likely *not* set on this string

There's no "likely" about it. For literal strings, there is a deterministic
set of rules (though they may not be documented).

> Here $foo has the “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points (in this case, only 1) aren’t valid UTF-8.

Maybe I don't understand you, but perl can't have invalid UTF8 in
literals under 'use utf8'.

> PROPOSAL: Rename the following identifiers in code and documentation, leaving macros for the old ones as aliases:

Which will only bring more confusion going forward. If you want to
fight SVf_UTF8 confusion, the problem lies not in its name, but in
the logic behind it. You're trying to sweep this issue under the rug,
but what really makes things this messy is this flag's mere existence
(and it still might be better than Python's choice for their Unicode
strings). -1 from me.

Best regards,
Sergey Aleynikov


Re: Pre-RFC: Rename SVf_UTF8 et al.

Aug 20, 2021, 1:42 AM

Post #12 of 46 (2412 views)

On Fri, Aug 20, 2021 at 9:05 AM Sergey Aleynikov
<sergey.aleynikov@gmail.com> wrote:
>
> On Wed, 18 Aug 2021 at 20:17, Felipe Gasper <felipe@felipegasper.com> wrote:
> > The problem here is the naming. For example, consider `perl -e'my $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its code points (assuming use of a UTF-8 terminal) correspond to the bytes that encode “é” in UTF-8. The “UTF-8 flag”, however, is likely *not* set on this string
>
> There's no likeness. For literal string, there're deterministic rules
> set (though they may not be documented).
>
> >Here $foo has the “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points (in this case, only 1) aren’t valid UTF-8.
>
> Maybe I don't understand you, but perl can't have invalid UTF8 in
> literals under 'use utf8'.

But the contents of the string are not "UTF-8". UTF-8 is byte encoding
for Unicode codepoints. From a language perspective (not considering
perl's implementation), the contents of the string is a single
codepoint. It is not a UTF-8 byte sequence.
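To make that distinction concrete, here is a minimal sketch (not from the thread; the variable and helper names are illustrative) using only the core utf8:: functions. The character is written as an escape so the source file's encoding plays no part.

```perl
use strict;
use warnings;

# Hypothetical helper: show a string as a list of code points in hex.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $string;
}

# The string from the example: a single code point, U+00E9.
my $foo = "\x{e9}";

# Its UTF-8 encoding is a different string: two code points, C3 and A9.
my $encoded = $foo;
utf8::encode($encoded);

printf "string:  length=%d (%s)\n", length($foo),     code_points($foo);
printf "encoded: length=%d (%s)\n", length($encoded), code_points($encoded);
```

From the language's point of view, only the second string is a UTF-8 byte sequence.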

>
> > PROPOSAL: Rename the following identifiers in code and documentation, leaving macros for the old ones as aliases:
>
> Which will only bring more confusion going forward. If you want to
> fight SVf_UTF8 confusion, the problem lies not in its name, but in
> the logic behind it. You're trying to shove this issue under the rug,
> but what really makes things this messy is this flag's mere existence
> (and it still might be better than Python's choice for their Unicode
> strings). -1 from me.
>
> Best regards,
> Sergey Aleynikov

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

fawaka at gmail

Aug 20, 2021, 1:48 AM

Post #13 of 46 (2412 views)

>
> I would agree except that people not working on the internals also have to
> use these functions (for XS code), and thus misuse them because they think
> they're related to the logical contents of the string.
>

Any code dealing with strings on a C level will need to know if it is
encoded in UTF8 or something different. Changing the internal name to
upgraded would make sense if we could change the internal implementation
(e.g. to UTF16) without *everything* exploding.

Leon

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

felipe at felipegasper

Aug 20, 2021, 6:20 AM

Post #14 of 46 (2412 views)

> On Aug 20, 2021, at 4:48 AM, Leon Timmermans <fawaka@gmail.com> wrote:
>
> I would agree except that people not working on the internals also have to use these functions (for XS code), and thus misuse them because they think they're related to the logical contents of the string.
>
> Any code dealing with strings on a C level will need to know if it is encoded in UTF8 or something different.

A bit off-topic, but worth bearing in mind: XS modules that merely interface between Perl and an external library (e.g., libcurl, libunbound) can avoid SvPV et al. in favour of the variants that preserve the abstraction.

It’s probably an overly simplistic ideal, but in theory it seems the only things that would need to care about what we currently call SVf_UTF8 are things that must read or manipulate Perl’s internals directly. Everything else -- even XS modules and embedding C applications -- can respect the abstraction.

> Changing the internal name to upgraded would make sense if we could change the internal implementation (e.g. to UTF16) without *everything* exploding.

That’s true, but the fact that the internal encoding can’t pragmatically change doesn’t invalidate the benefits of strengthening the abstraction, which include:

- Less confusion about what a “UTF-8 string” is: it’ll be clearer that upgraded/downgraded and (Perl-visible) UTF-8-ness are orthogonal qualities.

- More abstract terminology will discourage folks from trying to think too deeply about Perl’s internals.

-F

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

felipe at felipegasper

Aug 20, 2021, 6:24 AM

Post #15 of 46 (2412 views)

> On Aug 20, 2021, at 3:04 AM, Sergey Aleynikov <sergey.aleynikov@gmail.com> wrote:
>
> On Wed, 18 Aug 2021 at 20:17, Felipe Gasper <felipe@felipegasper.com> wrote:
>> The problem here is the naming. For example, consider `perl -e'my $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its code points (assuming use of a UTF-8 terminal) correspond to the bytes that encode “é” in UTF-8. The “UTF-8 flag”, however, is likely *not* set on this string
>
> There's no likeness. For literal string, there're deterministic rules
> set (though they may not be documented).

They’re not documented; ergo, they can change at any time. This is by design, right? A Perl application should not have to think about how Perl stores its code points?

> what really makes things this messy is this flag's mere existence
> (and it still might be better than Python's choice for their Unicode
> strings).

Out of curiosity, what do you think would be the ideal? Store all strings internally as UTF-8, à la Rust?

-F

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

demerphq at gmail

Aug 20, 2021, 10:05 AM

Post #16 of 46 (2412 views)

On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com> wrote:

> Per recent IRC discussion …
>
> PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
> confusion regarding the flag’s significance. Some think it indicates
> whether a given PV stores text versus binary. Some think it means that the
> PV is valid UTF-8. Still others likely hold other inaccurate views.
>
> The problem here is the naming. For example, consider `perl -e'my $foo =
> "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its
> code points (assuming use of a UTF-8 terminal) correspond to the bytes that
> encode “é” in UTF-8.

Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a
square/rectangle relationship. All strings are "rectangles", all "squares"
are rectangles, some strings are squares, but unless SQUARE flag is ON perl
should assume it is a rectangle, not a square. The SQUARE flag should
only be set when the rectangle has been proved conclusively to be a square.
That the SQUARE flag is off does not mean the rectangle is not a square,
merely that the square has not been proved to be such.

> The “UTF-8 flag”, however, is likely *not* set on this string. By contrast,
> consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has the “UTF-8 flag”
> set, but $foo is NOT a “UTF-8 string” because its code points (in this
> case, only 1) aren’t valid UTF-8.
>

Except it is valid UTF-8: (at least in my utf8 terminal).

$ perl -MDevel::Peek -Mutf8 -e'my $foo = "é"; Dump($foo)'
SV = PV(0x153efc0) at 0x155fb38
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK,UTF8)
PV = 0x1563240 "\303\251"\0 [UTF8 "\x{e9}"]
CUR = 2
LEN = 10
COW_REFCNT = 1

So the string is UTF-8.

You cannot get the UTF-8 flag on without using XS tricks and have the
buffer contain non-utf8. It is that simple. (Sure you can do it with
Encode::_utf8_on() but that is XS.)

I do not understand your point that only the initiated can understand this
flag. It means one and only one thing: that the perl internals should
assume that the buffer contains utf8 encoded data and that perl should
apply unicode semantics when doing character and case-sensitive operations,
and that perl can make certain assumptions when it is processing the data
(e.g. that it is not malformed).

When it is off it does not mean that the data cannot be utf8 data, merely
that Perl cannot and should not assume it is utf8 data, and should not try
to interpret it as utf8 data when the string is used in character
operations, and that when it is used in case-sensitive operations it should
use the traditional limited case-insensitive logic from ASCII.

Personally I think renaming this flag will just increase confusion, not
decrease.

BTW, your scheme needs to account for WAS_UTF8 as well. Most people don't
know it, but there are actually three types of strings in the perl
internals, UTF8-ON, UTF8-OFF, UTF8-OFF + WAS_UTF8. It only manifests in
hash keys. But it needs to be accounted for as well in any renaming. Perl
dictates that keys which are character-wise equivalent hash the same
regardless of the UTF8 flag (or put alternatively, the hash should be of the
codepoints the string represents NOT the octets that make up that
representation). This means UTF8-ON keys are always downgraded on lookup or
store in a hash. If the downgrade is successful the key is marked as
WAS-UTF8 and the downgraded string is stored and hashed, if it was
unsuccessful (eg it contains codepoints above 255) it is marked as UTF8-ON
and the original buffer is hashed. When the key is extracted with keys() or
each() if the WASUTF8 flag is set the string is upgraded back to the UTF8
form.
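The Perl-visible side of this can be checked without touching the internals. The following is a small sketch (not part of the original message; names are illustrative): the same code point is stored in both forms and used as a hash key. Whether the key returned by keys() carries the flag is printed rather than assumed.

```perl
use strict;
use warnings;

my $down = "\x{e9}";      # stored as a single octet
my $up   = "\x{e9}";
utf8::upgrade($up);       # same code point, stored as two octets

# Store under the upgraded form, look up with the downgraded form.
my %h = ($up => 'value');

print "strings compare equal: ", ($down eq $up      ? "yes" : "no"), "\n";
print "lookup by other form:  ", (exists $h{$down}  ? "found" : "missing"), "\n";

# Inspect the key as handed back by keys().
my ($key) = keys %h;
print "returned key equals both: ", ($key eq $down && $key eq $up ? "yes" : "no"), "\n";
print "returned key is flagged:  ", (utf8::is_utf8($key) ? "yes" : "no"), "\n";
```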

I think you need to step back and consider that strings are sequences of
octets. Sometimes those octets are ordered such that they can be
interpreted as utf8. The UTF-8 flag being on tells perl that it can and
should treat the octets as utf8.

You used examples that involve source code which I think might be confusing
you, as it introduces weird issues related to what character set your
terminal thinks it is using, and what format the text in the file is stored
in, and what operating system is in use. If you stick to examples that
only use code then all of that ambiguity goes away and it should be easy to
understand. Eg when you say:

my $foo = "é";

I don't know exactly what that code does without doing an octet level
investigation of the data. It could be one octet and in latin-1 or it could
be two octets and be Unicode in one of several formats (utf8, utf-16BE
utf-16LE) and still be rendered identically in an editor or browser.

However if you say:

my $foo= chr(0xe9); # é

I know exactly what is going on, and what $foo should contain.

I also know what happens here:

my $foo="\x{c3}\x{a9}";
utf8::decode($foo);
Dump($foo);

SV = PV(0x2303fc0) at 0x2324c98
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK,UTF8)
PV = 0x2328300 "\303\251"\0 [UTF8 "\x{e9}"]
CUR = 2
LEN = 10
COW_REFCNT = 1

That is, I start off with two octets, C3 - A9, which happens to be the
encoding for the codepoint E9, which happens to be é.
I then tell perl to "decode" those octets, which really means I tell perl
to check that the octets actually do make up valid utf8. And if perl agrees
that indeed these are valid utf8 octets, then it turns the flag on. Now it
doesn't matter if you *meant* to construct utf8 in the variable fed to
decode, all that matters is that at an octet level those octets happen to
make up valid utf8.

Try

my $foo="\x{c3}\x{a9}\x{c3}";
utf8::decode($foo);
Dump($foo);
SV = PV(0x23040a0) at 0x23249f8
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x2329350 "\303\251\303"\0
CUR = 3
LEN = 10

So here we can see that perl did nothing with this version of $foo, because
it did not contain a valid utf8 sequence. \x{c3} can never be the last byte
in valid utf8, it always must be followed by something, so perl did not
turn the UTF8 flag on.
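The two cases above can also be told apart without Devel::Peek, because utf8::decode() reports success through its return value. A minimal sketch, with illustrative variable names:

```perl
use strict;
use warnings;

my $good = "\x{c3}\x{a9}";          # well-formed UTF-8 for U+00E9
my $bad  = "\x{c3}\x{a9}\x{c3}";    # ends with a lone lead byte

# utf8::decode() works in place and returns true on success, false otherwise.
my $good_ok = utf8::decode($good) ? 1 : 0;
my $bad_ok  = utf8::decode($bad)  ? 1 : 0;

printf "good: decoded=%d length=%d\n", $good_ok, length($good);
printf "bad:  decoded=%d length=%d\n", $bad_ok,  length($bad);
```

Checking the return value is what lets calling code notice the malformed input instead of silently carrying on with undecoded octets.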

Work the problem like this a while and you will see that really this
subject is pretty simple, and there is a tremendous amount of FUD about it
when in fact it is really simple. The flag says that the buffer contains
valid octets that are not illegal utf8, and that perl should apply
utf8/unicode semantics when doing "character" operations on the string. The
flag being off means that when doing character operations it should assume
fixed width octet operations, and it should use ASCII case-folding rules.
That is it. The flag being off does not *ever* mean the data is NOT utf8,
it simply means that data has not been *validated* as utf8 and thus perl
cannot use utf8 rules to process it. That is it.

cheers,
Yves

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

grinnz at gmail

Aug 20, 2021, 10:16 AM

Post #17 of 46 (2412 views)

On Fri, Aug 20, 2021 at 1:06 PM demerphq <demerphq@gmail.com> wrote:

> On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com>
> wrote:
>
>> Per recent IRC discussion …
>>
>> PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
>> confusion regarding the flag’s significance. Some think it indicates
>> whether a given PV stores text versus binary. Some think it means that the
>> PV is valid UTF-8. Still others likely hold other inaccurate views.
>>
>> The problem here is the naming. For example, consider `perl -e'my $foo =
>> "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its
>> code points (assuming use of a UTF-8 terminal) correspond to the bytes that
>> encode “é” in UTF-8.
>
>
> Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a
> square/rectangle relationship. All strings are "rectangles", all "squares"
> are rectangles, some strings are squares, but unless SQUARE flag is ON perl
> should assume it is a rectangle, not a square. The SQUARE flag should
> only be set when the rectangle has been proved conclusively to be a square.
> That the SQUARE flag is off does not mean the rectangle is not a square,
> merely that the square has not been proved to be such.
>
>
>> The “UTF-8 flag”, however, is likely *not* set on this string. By
>> contrast, consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has the
>> “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points
>> (in this case, only 1) aren’t valid UTF-8.
>>
>
> Except it is valid UTF-8: (at least in my utf8 terminal).
>
> $ perl -MDevel::Peek -Mutf8 -e'my $foo = "é"; Dump($foo)'
> SV = PV(0x153efc0) at 0x155fb38
> REFCNT = 1
> FLAGS = (POK,IsCOW,pPOK,UTF8)
> PV = 0x1563240 "\303\251"\0 [UTF8 "\x{e9}"]
> CUR = 2
> LEN = 10
> COW_REFCNT = 1
>
> So the string is UTF-8.
>

The premise of this email seems to be about the internals of the string.
That is not the contents of the string (which is "\x{e9}" in this example).
Please re-evaluate in that context.

-Dan

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

felipe at felipegasper

Aug 20, 2021, 10:48 AM

Post #18 of 46 (2412 views)

> On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
>
> On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com> wrote:
> Per recent IRC discussion …
>
> PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of confusion regarding the flag’s significance. Some think it indicates whether a given PV stores text versus binary. Some think it means that the PV is valid UTF-8. Still others likely hold other inaccurate views.
>
> The problem here is the naming. For example, consider `perl -e'my $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its code points (assuming use of a UTF-8 terminal) correspond to the bytes that encode “é” in UTF-8.
>
> Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a square/rectangle relationship. All strings are "rectangles", all "squares" are rectangles, some strings are squares, but unless SQUARE flag is ON perl should assume it is a rectangle, not a square. The SQUARE flag should only be set when the rectangle has been proved conclusively to be a square. That the SQUARE flag is off does not mean the rectangle is not a square, merely that the square has not been proved to be such.

You’re defining “a UTF-8 string” as “a string whose PV is marked as UTF-8”. I’m defining it as “a string whose Perl-visible code points happen to be valid UTF-8”.

What you call “a UTF-8 string” is what I propose we call, per existing nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with corresponding code changes. Then the term “UTF-8 string” makes sense from a pure-Perl context without requiring Perl programmers to worry about interpreter internals.

> The “UTF-8 flag”, however, is likely *not* set on this string. By contrast, consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has the “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points (in this case, only 1) aren’t valid UTF-8.
>
> Except it is valid UTF-8: (at least in my utf8 terminal).
>
> $ perl -MDevel::Peek -Mutf8 -e'my $foo = "é"; Dump($foo)'
> SV = PV(0x153efc0) at 0x155fb38
> REFCNT = 1
> FLAGS = (POK,IsCOW,pPOK,UTF8)
> PV = 0x1563240 "\303\251"\0 [UTF8 "\x{e9}"]
> CUR = 2
> LEN = 10
> COW_REFCNT = 1
>
> So the string is UTF-8.

Again, different definitions. The Perl-visible string contains a single code point, 0xe9. This code point doesn’t correspond to valid UTF-8 bytes, so IMO it doesn’t make sense to call it a “UTF-8 string”. Whether Perl stores that code point as one byte or as two is Perl’s business alone … right?

> I do not understand your point that only the initiated can understand this flag. It means one and only one thing: that the perl internals should assume that the buffer contains utf8 encoded data and that perl should apply unicode semantics when doing character and case-sensitive operations, and that perl can make certain assumptions when it is processing the data (e.g. that it is not malformed).

The behaviour you’re talking about is what the unicode_strings and unicode_eval features specifically do away with (i.e., fix), right?
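For readers who have not run into this, here is a rough sketch of the behaviour in question (illustrative names; it assumes no `use locale` and no `use v5.12` or later, either of which would change the semantics). The same code point is upper-cased in both storage forms, first without and then with the unicode_strings feature.

```perl
use strict;
use warnings;

# Hypothetical helper: hex dump of a string's code points.
sub hex_of { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

my $down = "\x{e9}";      # U+00E9, downgraded storage
my $up   = "\x{e9}";
utf8::upgrade($up);       # U+00E9, upgraded storage

my ($plain_down, $plain_up, $fixed_down, $fixed_up);
{
    # Feature off (the default here): the result depends on the storage form.
    $plain_down = uc $down;
    $plain_up   = uc $up;
}
{
    # Feature on: the result depends only on the code points.
    use feature 'unicode_strings';
    $fixed_down = uc $down;
    $fixed_up   = uc $up;
}

printf "without feature: down=%s up=%s\n", hex_of($plain_down), hex_of($plain_up);
printf "with feature:    down=%s up=%s\n", hex_of($fixed_down), hex_of($fixed_up);
```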

You’re omitting what IMO is the most obvious purpose of the flag: to indicate whether the code points that the PV stores are the plain bytes, or are the UTF-8-decoded code points. This is why you can print() the string in either upgraded or downgraded forms, and it comes out the same.
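A small sketch of that last claim, assuming a filehandle with no encoding layer (an in-memory handle opened with `:raw` is used here so the written octets can be inspected; the helper name is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: print a string to a raw in-memory handle and
# return whatever octets ended up in the buffer.
sub octets_written {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:raw', \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "cannot close in-memory handle: $!";
    return $buffer;
}

my $down = "\x{e9}";
my $up   = $down;
utf8::upgrade($up);

my $out_down = octets_written($down);
my $out_up   = octets_written($up);

print "flag differs: ", (utf8::is_utf8($up) && !utf8::is_utf8($down) ? "yes" : "no"), "\n";
print "same output:  ", ($out_down eq $out_up ? "yes" : "no"), "\n";
printf "octets out:   %d\n", length $out_up;
```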

> BTW, your scheme needs to account for WAS_UTF8 as well. Most people don't know it, but there are actually three types of strings in the perl internals, UTF8-ON, UTF8-OFF, UTF8-OFF + WAS_UTF8. It only manifests in hash keys. But it needs to be accounted for as well in any renaming. Perl dictates that keys which are character-wise equivalent hash the same regardless of the UTF8 flag (or put alternatively, the hash should be of the codepoints the string represents NOT the octets that make up that representation). This means UTF8-ON keys are always downgraded on lookup or store in a hash. If the downgrade is successful the key is marked as WAS-UTF8 and the downgraded string is stored and hashed, if it was unsuccessful (eg it contains codepoints above 255) it is marked as UTF8-ON and the original buffer is hashed. When the key is extracted with keys() or each() if the WASUTF8 flag is set the string is upgraded back to the UTF8 form.

Thank you for this. I knew about the was-UTF8 status but didn’t know why it exists.

> I think you need to step back and consider that strings are sequences of octets. Sometimes those octets are ordered such that they can be interpreted as utf8. The UTF-8 flag being on tells perl that it can and should treat the octets as utf8.

C strings are sequences of octets, yes. Perl strings, though, are sequences of code points, not octets. In this they’re more like JavaScript strings than C strings.

> my $foo = "é";
>
> I don't know exactly what that code does without doing an octet level investigation of the data. It could be one octet and in latin-1 or it could be two octets and be Unicode in one of several formats (utf8, utf-16BE utf-16LE) and still be rendered identically in an editor or browser.

Sorry, I assumed we all use UTF-8 terminals. :) But yes, I should have written it as two \x escapes, sorry.

> I also know what happens here:
>
> my $foo="\x{c3}\x{a9}";
> utf8::decode($foo);
> Dump($foo);
>
> SV = PV(0x2303fc0) at 0x2324c98
> REFCNT = 1
> FLAGS = (POK,IsCOW,pPOK,UTF8)
> PV = 0x2328300 "\303\251"\0 [UTF8 "\x{e9}"]
> CUR = 2
> LEN = 10
> COW_REFCNT = 1
>
> That is, I start off with two octets, C3 - A9, which happens to be the encoding for the codepoint E9, which happens to be é.
> I then tell perl to "decode" those octets, which really means I tell perl to check that the octets actually do make up valid utf8. And if perl agrees that indeed these are valid utf8 octets, then it turns the flag on. Now it doesn't matter if you *meant* to construct utf8 in the variable fed to decode, all that matters is that at an octet level those octets happen to make up valid utf8.

I think you’re actually breaking the abstraction here by assuming that Perl implements the decode by setting a flag.

It would be just as legitimate to mutate the PV to store a single octet, 0xe9, and leave the UTF8 flag off. Perl doesn’t do that, of course, because it’s easier just to set a flag, but as long as the string content is the single code point 0xe9 it doesn’t really matter how Perl achieves that.
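A sketch of that alternative, done by hand with utf8::downgrade() (variable names are illustrative): after decoding, the string is moved to single-octet storage, and nothing visible from Perl code changes.

```perl
use strict;
use warnings;

my $s = "\x{c3}\x{a9}";
utf8::decode($s);                       # now the single code point U+00E9

my $flag_before = utf8::is_utf8($s) ? 1 : 0;   # storage detail, for display only
my $copy        = $s;                          # keep the decoded value to compare

utf8::downgrade($s);                    # switch to one-octet storage
my $flag_after  = utf8::is_utf8($s) ? 1 : 0;

printf "flag before=%d after=%d\n", $flag_before, $flag_after;
printf "length=%d ord=%02X still equal=%s\n",
    length($s), ord($s), ($s eq $copy ? "yes" : "no");
```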

(Notwithstanding, of course, the abstraction leaks that things like the unicode_strings feature and Sys::Binmode fix.)

There are parts of the code that appear to go the other way and prioritize downgraded storage. Perl_refcounted_he_fetch_pvn(), for example.

-FG

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

kimoto.yuki at gmail

Aug 22, 2021, 4:58 PM

Post #19 of 46 (2408 views)

Personally, I'm starting to agree with Felipe's goal.

1. Users should be able to distinguish between text and bytes.
2. Text is a sequence of Unicode code points, represented as UTF-8.
3. Perl's configuration has a default OS text character set and a default OS
file system character set.
4. Perl's standard functions (print, open, etc.) encode output strings using
the character sets from item 3 if the string is text.

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

Aug 30, 2021, 5:18 AM

Post #20 of 46 (2401 views)

(forgot to Cc this to p5p)

----- Forwarded message from Dave Mitchell <davem@iabyn.com> -----

Date: Mon, 30 Aug 2021 13:17:04 +0100
From: Dave Mitchell <davem@iabyn.com>
To: Felipe Gasper <felipe@felipegasper.com>
Subject: Re: Pre-RFC: Rename SVf_UTF8 et al.
Message-ID: <YSzMQJIeURS/AznY@iabyn.com>

On Wed, Aug 18, 2021 at 01:18:34PM -0400, Felipe Gasper wrote:
> PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of confusion regarding the flag’s significance.

The SVf_UTF8 flag has a clear and unambiguous meaning (apart from some
historical bugs): in what manner the codepoints of a string are stored as
a sequence of bytes in memory.

If people are confused by this, renaming it is only going to add to the
cognitive load and confusion.

(I'm assuming the old names are kept as aliases for backwards
compatibility.)

--
The Enterprise successfully ferries an alien VIP from one place to another
without serious incident.
-- Things That Never Happen in "Star Trek" #7

----- End forwarded message -----

--
print+qq&$}$"$/$s$,$a$d$g$s$@$.$q$,$:$.$q$^$,$@$a$~$;$.$q$m&if+map{m,^\d{0\,},,${$::{$'}}=chr($"+=$&||1)}q&10m22,42}6:17a2~2.3@3;^2dg3q/s"&=~m*\d\*.*g

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

felipe at felipegasper

Aug 30, 2021, 7:19 AM

Post #21 of 46 (2401 views)

> On Aug 30, 2021, at 8:18 AM, Dave Mitchell <davem@iabyn.com> wrote:
>
> Date: Mon, 30 Aug 2021 13:17:04 +0100
> From: Dave Mitchell <davem@iabyn.com>
> To: Felipe Gasper <felipe@felipegasper.com>
> Subject: Re: Pre-RFC: Rename SVf_UTF8 et al.
> Message-ID: <YSzMQJIeURS/AznY@iabyn.com>
>
> On Wed, Aug 18, 2021 at 01:18:34PM -0400, Felipe Gasper wrote:
>> PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of confusion regarding the flag’s significance.
>
>
> The SVf_UTF8 flag has a clear and unambiguous meaning (apart from some
> historical bugs): in what manner the codepoints of a string are stored as
> a sequence of bytes in memory.
>
> If people are confused by this, renaming it is only going to add to the
> cognitive load and confusion.

I’ve proposed some fixes for perlre.pod (https://github.com/Perl/perl5/pull/19087). These fix documentation bugs that crept in specifically because of the use of “UTF-8” to refer to “upgraded” strings. It confuses even Perl’s own maintainers.

The fact that “UTF-8 string” can mean two quite-different things causes lots of encoding bugs in the wild. The fact that Perl *can’t* help to fix these worsens the problem.

Ricardo sensed a problem here back in 2016: https://www.youtube.com/watch?v=TmTeXcEixEg&t=940s

… when he referred to the flag as WIDE, in part because the encoding in question is *not*, in fact, UTF-8. Then he said: “Some joker went ahead, and they called that the UTF-8 flag.” Chuckles ensued.

Benefits of changing the internal terminology:

- It clarifies “external”, Perl-visible encoding versus internal codepoint storage. Different terms for different things.
- More abstract terminology for the internals discourages folks from peeking behind the abstraction.
- It’s more correct. Proper UTF-8 forbids quite a lot that Perl’s “lax UTF-8” (by design) allows.
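As a rough illustration of the last point (a sketch only; it assumes the Encode module's strict "UTF-8" encoding and uses a code point just past Unicode's maximum), Perl's internal encoder accepts a value that strict UTF-8 refuses:

```perl
use strict;
use warnings;
use Encode ();

# One code point above Unicode's maximum of U+10FFFF.
my $beyond = chr(0x110000);

# Perl's own (lax) encoder accepts it.
my $lax = $beyond;
utf8::encode($lax);

# Strict UTF-8 via Encode is asked to croak on anything it cannot map.
my $copy      = $beyond;    # Encode may modify its input when a CHECK is given
my $strict_ok = eval { Encode::encode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;

printf "lax encoding produced %d octets\n", length $lax;
printf "strict encoding %s\n", $strict_ok ? "succeeded" : "was refused";
```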

Thanks for reading.

-FG

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

demerphq at gmail

Sep 2, 2021, 6:20 AM

Post #22 of 46 (2398 views)

On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com> wrote:

>
> > On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
> >
> > On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com>
> wrote:
> > Per recent IRC discussion …
> >
> > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
> confusion regarding the flag’s significance. Some think it indicates
> whether a given PV stores text versus binary. Some think it means that the
> PV is valid UTF-8. Still others likely hold other inaccurate views.
> >
> > The problem here is the naming. For example, consider `perl -e'my $foo =
> "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its
> code points (assuming use of a UTF-8 terminal) correspond to the bytes that
> encode “é” in UTF-8.
> >
> > Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a
> square/rectangle relationship. All strings are "rectangles", all "squares"
> are rectangles, some strings are squares, but unless SQUARE flag is ON perl
> should assume it is a rectangle, not a square. The SQUARE flag should only
> be set when the rectangle has been proved conclusively to be a square. That
> the SQUARE flag is off does not mean the rectangle is not a square, merely
> that the square has not been proved to be such.
>
> You’re defining “a UTF-8 string” as “a string whose PV is marked as
> UTF-8”. I’m defining it as “a string whose Perl-visible code points happen
> to be valid UTF-8”.
>

I don't find your definition to be very useful, nor descriptive of how perl
manages these matters, so I am not using it. You are confusing different
levels of abstraction. Your definition also would include cases where the
data is already encoded and flagged as utf8. So it doesn't make sense to me.

Here is the set of definitions that I am operating from:

A "string" is a programming concept inside of Perl which is used to
represent "text" buffers of memory. There are three levels of abstraction
for strings, two of which are tightly coupled. The three are the codepoint
level, semantic level and encoding level.

At the codepoint level you can think of strings as variable length arrays
of numbers (codepoints), where the numbers are restricted to 0 to 0x10FFFF.

At the semantics level you can think of these numbers (codepoints) as
representing characters from some form of text with specific rules for
certain operations like case-folding, as well as a well defined mapping to
graphemes which are displayed to our eyes when those numbers are rendered
by a display device like a terminal.

The encoding level of abstraction addresses how those numbers (codepoints)
will be represented as bytes (octets) in memory inside of Perl, and when
you directly write the data to disk or to some other output stream.

There are two sets of codepoint range, semantics and encoding available,
which are controlled by a flag associated with the string called the UTF8
flag. When set this flag indicates that the string can represent codepoints
0 to 0x10FFFF, should have Unicode semantics applied to it, and that its
in-memory representation is variable-width utf8. When the flag is not set it
indicates the string can represent codepoints 0 to 255, has ASCII
case-folding semantics, and that its in-memory representation is fixed
width octets.

In order to be able to combine these two types of strings we need to define
some operations:

upgrading/downgrading: converting a string from one set of semantics and
encoding to the other while preserving exactly the codepoint level
representation. By tradition we call it upgrading when we go from Latin-1
to Unicode with the result being UTF8 on, and we call it downgrading when
we go from Unicode to Latin1 with the result being UTF8-off. These
operations are NOT symmetrical. It is *not* possible to downgrade every
Unicode string to Latin-1, however it is possible to upgrade every Latin-1
string to Unicode. By tradition upgrade and downgrade functions are noops
when their input is already in the form expected as the result, but this is
by tradition only.
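The asymmetry can be seen with the core utf8:: functions. This is a minimal sketch with illustrative names; the second argument to utf8::downgrade() asks it to return false instead of dying when the string cannot be downgraded.

```perl
use strict;
use warnings;

my $latin = "\x{e9}";     # fits in one octet
my $wide  = "\x{100}";    # does not fit in one octet

# Upgrading works for any string and leaves the code points alone.
utf8::upgrade($latin);
my $latin_len_after_upgrade = length $latin;

# Downgrading only works when every code point is below 256.
my $latin_ok = utf8::downgrade($latin, 1) ? 1 : 0;
my $wide_ok  = utf8::downgrade($wide, 1)  ? 1 : 0;

printf "upgrade kept length at %d\n", $latin_len_after_upgrade;
printf "downgrade U+00E9: %s\n", $latin_ok ? "ok" : "failed";
printf "downgrade U+0100: %s\n", $wide_ok  ? "ok" : "failed";
```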

decoding/encoding: converting a string from one form to the other in a way
that transforms the codepoints from one form to a potentially different
form. Traditionally we speak of decode_utf8() taking a latin1 string
containing octets that make up a utf8 encoded string, and returning a
string which is UTF8 on which represents the Unicode version of those
octets. For well formed input this results in no change to the underlying
string, but the flag is flipped on. Vice versa we speak of encode_utf8()
which converts its input to a utf8 encoded form, regardless of what form it
was represented internally.

When we are confronted with combining the two forms of string Perl has
little choice but to use the "safe" strategy of "upgrading" the Latin-1
parts to Unicode.

Both the operations of "upgrading" and "decoding" result in Utf8-on
strings, and indeed both can result in not changing their input at all, but
when they do change their input they change it very differently. Most of
the places people get into trouble with strings is when they end up doing
upgrade operations when they should have done a decode operation. This is
because upgrade operations can happen implicitly based on simple rules and
thus can happen "by accident", but decode operations are always explicit so
they never happen without the involvement of the developer in some way.
This is at least partly because upgrade operations do not have any failure
modes but decode operations do.
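
The difference can be sketched as follows, using only the core utf8:: functions (variable names are illustrative). The same two octets are upgraded in one copy and decoded in the other:

```perl
use strict;
use warnings;

my $octets = "\xC3\xA9";          # the UTF-8 encoding of U+00E9, flag off

my $upgraded = $octets;
utf8::upgrade($upgraded);         # still two codepoints: U+00C3 U+00A9

my $decoded = $octets;
utf8::decode($decoded) or die "not valid UTF-8";   # one codepoint: U+00E9

my $bad    = "\xE9";              # a lone 0xE9 is not well-formed UTF-8
my $bad_ok = utf8::decode($bad) ? 1 : 0;           # decode has a failure mode

printf "upgraded: length=%d first=U+%04X\n", length($upgraded), ord($upgraded);
printf "decoded:  length=%d first=U+%04X\n", length($decoded),  ord($decoded);
print  "decode of lone E9 ", ($bad_ok ? "succeeded" : "failed"), "\n";

# Encoding the wrongly upgraded string gives the classic double-encoded result.
my $double = $upgraded;
utf8::encode($double);
my $double_hex = join ' ', map { sprintf '%02X', ord } split //, $double;
print "double-encoded octets: $double_hex\n";
```

Both results are UTF8-on, but only the decoded one holds the character the octets were meant to represent.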

Most of the time, as long as you are only thinking about codepoints,
developers don't have to worry about this stuff. The places where they do
are when they are reading or writing data, and in some cases when they are
embedding string constants in their code where they want a particular set
of semantics and encoding. As long as people are disciplined to use
decode_utf8() before they use utf8 string data, and encode_utf8() before
they emit it then the complexities above should be transparent to the
developer.

> What you call “a UTF-8 string” is what I propose we call, per existing
> nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with
> corresponding code changes. Then the term “UTF-8 string” makes sense from a
> pure-Perl context without requiring Perl programmers to worry about
> interpreter internals.
>
>
No. The flag does not mean "upgraded"; it means "unicode semantics, utf8
encoding". Upgrading is one way to get such a string, and it might even be
the most common, but the most important and likely to be correct way is
explicit decoding.

If we are to rename the flag then we should just rename it as the UNICODE
flag. Would have saved a world of confusion.

> The “UTF-8 flag”, however, is likely *not* set on this string. By
> contrast, consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has the
> “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points
> (in this case, only 1) aren’t valid UTF-8.
> >
> > Except it is valid UTF-8: (at least in my utf8 terminal).
> >
> > $ perl -MDevel::Peek -Mutf8 -e'my $foo = "é"; Dump($foo)'
> > SV = PV(0x153efc0) at 0x155fb38
> > REFCNT = 1
> > FLAGS = (POK,IsCOW,pPOK,UTF8)
> > PV = 0x1563240 "\303\251"\0 [UTF8 "\x{e9}"]
> > CUR = 2
> > LEN = 10
> > COW_REFCNT = 1
> >
> > So the string is UTF-8.
>
> Again, different definitions.

You can't define yourself away from how things actually work. The string is
UTF8 on because perl says so.

> The Perl-visible string contains a single code point, 0xe9. This code
> point doesn’t correspond to valid UTF-8 bytes,

Codepoints and octets (bytes) are abstractions at different levels in Perl
and Unicode so "this codepoint doesn't correspond to valid UTF-8 bytes"
doesn't really make any sense as a sentence. Codepoints are integers from 0
to 0x10FFFF. They can be *encoded* in a variety of ways as octets, for
instance the codepoint E9 has at least 5 different representations at the
octet level under Unicode: "\x{E9}\x{00}", "\x{00}\x{E9}",
"\x{E9}\x{00}\x{00}\x{00}", "\x{00}\x{00}\x{00}\x{E9}", and "\303\251"
(UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE and UTF-8 respectively) are
all equally valid ways of representing the codepoint E9. Notice that the
octet "E9" by itself is NOT a valid way to represent the codepoint E9 in
any Unicode encoding.
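
These representations can be reproduced with the Encode module that ships with perl. This is only a sketch; the encoding names are the ones Encode uses for the BOM-less forms:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{E9}";
my %octets;
for my $enc (qw(UTF-16LE UTF-16BE UTF-32LE UTF-32BE UTF-8)) {
    # Hex dump of the octets each encoding produces for U+00E9.
    $octets{$enc} = join ' ', map { sprintf '%02X', ord } split //, encode($enc, $char);
    printf "%-8s %s\n", $enc, $octets{$enc};
}
```

None of the five outputs is the single octet E9.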

The dump above shows correctly the octet and codepoint representation of
the string. The buffer contains "\303\251" which is the UTF8
representation of the codepoint E9, and the flag is on which is why it
understands that this is a single codepoint, not two.

> so IMO it doesn’t make sense to call it a “UTF-8 string”.

After the operations I performed it is a Unicode string encoded in UTF-8,
as represented by the UTF-8 flag in the dump.

> Whether Perl stores that code point as one byte or as two is Perl’s
> business alone … right?
>

Well it would be weird if we stored Unicode data in a form not supported by
Unicode. Don't you think? There is no single octet representation of the
codepoint E9 defined by Unicode as far as I know.

>
> > I do not understand your point that only the initiated can understand
> this flag. It means one and only one thing: that the perl internals should
> assume that the buffer contains utf8 encoded data and that perl should
> apply unicode semantics when doing character and case-sensitive operations,
> and that perl can make certain assumptions when it processing the data (eg
> that is not malformed).
>
> The behaviour you’re talking about is what the unicode_strings and
> unicode_eval features specifically do away with (i.e., fix), right?

I'm not familiar with those enough to comment. I assume they relate to what
assumptions Perl should make about strings which are constructed as
literals in the source code, where there is a great deal of ambiguity about
what is going on compared to actual code that constructs such strings,
where things are exact.

>
> You’re omitting what IMO is the most obvious purpose of the flag: to
> indicate whether the code points that the PV stores are the plain bytes, or
> are the UTF-8-decoded code points. This is why you can print() the string
> in either upgraded or downgraded forms, and it comes out the same.
>

It's hard to say what you are referring to here. If you mean codepoints
0-127, then it is unsurprising as the representation of them is equivalent
in ASCII and UTF8 and Latin1. But if you mean a codepoint above the ASCII
plane, then no they should not come out the same. If you are piping that
data to a file I would expect the octets written to that file to be
different. (assuming a binary filehandle with no layers magically
transforming things). If your terminal renders them the same then I assume
it is doing some magic behind the scenes to deal with malformed utf8.

>
> > BTW, your scheme needs to account for WAS_UTF8 as well. Most people dont
> know it, but there are actually three types of strings in the perl
> internals, UTF8-ON, UTF8-OFF, UTF8-OFF + WAS_UTF8. It only manifests in
> hash keys. But it needs to be accounted for as well in any renaming. Perl
> dictates that keys which are character-wise equivalent hash the same
> regardless of the UTF8 flag (or put alternative, the hash should be of the
> codepoints the string represents NOT the octets that make up that
> representation). This means UTF8-ON keys are always downgraded on lookup or
> store in a hash. If the downgrade is successful the key is marked as
> WAS-UTF8 and the downgraded string is stored and hashed, if it was
> unsuccessful (eg it contains codepoints above 255) it is marked as UTF8-ON
> and the original buffer is hashed. When the key is extracted with keys() or
> each() if the WASUTF8 flag is set the string is upgraded back to the UTF8
> form.
>
> Thank you for this. I knew about the was-UTF8 status but didn’t know why
> it exists.
>
> > I think you need to step back and consider that strings are sequences of
> octets. Sometimes those octets are ordered such that they can be
> interpreted as utf8. The UTF-8 flag being on tells perl that it can and
> should treat the octets as utf8.
>
> C strings are sequences of octets, yes. Perl strings, though, are
> sequences of code points, not octets. In this they’re more like JavaScript
> strings than C strings.
>

Perl strings are very similar to C strings when the flag is off, and
JavaScript strings when the flag is on.

>
> > my $foo = "é";
> >
> > I don't know exactly what that code does without doing an octet level
> investigation of the data. It could be one octet and in latin-1 or it could
> be two octets and be Unicode in one of several formats (utf8, utf-16BE
> utf-16LE) and still be rendered identically in an editor or browser.
>
> Sorry, I assumed we all use UTF-8 terminals. :) But yes, I should have
> written it as two \x escapes, sorry.
>

I do, but this email is being rendered by gmail in a browser. Any number of
conversions of the actual bytes on disk could have happened between you and
me. For all I know you might have written your email in a text editor using
UTF-32.

>
> > I also know what happens here:
> >
> > my $foo="\x{c3}\x{a9}";
> > utf8::decode($foo);
> > Dump($foo);
> >
> > SV = PV(0x2303fc0) at 0x2324c98
> > REFCNT = 1
> > FLAGS = (POK,IsCOW,pPOK,UTF8)
> > PV = 0x2328300 "\303\251"\0 [UTF8 "\x{e9}"]
> > CUR = 2
> > LEN = 10
> > COW_REFCNT = 1
> >
> > That is, i start off with two octets, C3 - A9, which happens to be the
> encoding for the codepoint E9, which happens to be é.
> > I then tell perl to "decode" those octets, which really means I tell
> perl to check that the octets actually do make up valid utf8. And if perl
> agrees that indeed these are valid utf8 octets, then it turns the flag on.
> Now it doesn't matter if you *meant* to construct utf8 in the variable fed
> to decode, all that matters is that at an octet level those octet happen to
> make up valid utf8.
>
> I think you’re actually breaking the abstraction here by assuming that
> Perl implements the decode by setting a flag.
>
>
No I am not. The flag is there to tell the perl internals how to
manipulate the string. decode's task is to take arbitrary strings of octets
and ensure that they can be decoded as valid utf8 and possibly to do some
conversion (eg for forbidden utf8 sequences or other normalization) as it
does so and then SETS THE FLAG. Only once decode is done is the string
"Unicode" and is the string "utf8". Prior to that it was just random
octets. It doesn't need to do anything BUT set the flag because its internal
encoding matches the external encoding in this case. If it was decoding
UTF16LE then it would have to do conversion as well.
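
For comparison, a sketch using Encode::decode (variable names are illustrative) where the same character arrives in two external encodings. The UTF-8 input already matches the internal form, while the UTF-16LE input has to be converted, yet both end up as the same UTF8-on string:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Same character, two external encodings.
my $from_utf8  = decode('UTF-8',    "\xC3\xA9");
my $from_utf16 = decode('UTF-16LE', "\xE9\x00");

my $same      = $from_utf8 eq $from_utf16 ? 1 : 0;
my $flag_utf8 = utf8::is_utf8($from_utf8)  ? 1 : 0;
my $flag_u16  = utf8::is_utf8($from_utf16) ? 1 : 0;
printf "same=%d flags=%d/%d codepoint=U+%04X\n",
    $same, $flag_utf8, $flag_u16, ord($from_utf16);
```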

> It would be just as legitimate to mutate the PV to store a single octet,
> 0xe9, and leave the UTF8 flag off.

Nope. That would mean that Perl would use ASCII/Latin-1 case folding rules
on the result, which would be wrong. It should use Unicode case folding
rules for codepoint E9 if it was decoded as that codepoint. (Change the
example to \x{DF} and you can see these issues in the flesh: \x{DF} should
match "ss" case-insensitively in Unicode, but in ASCII/Latin1 it only
matches \x{DF}. The case-folded (fc()) version of \x{DF} is "ss" but in
Latin-1/ASCII there are no multi-character case folds.) Even more
suggestive that Perl doing this would be wrong is that in fact there is NO
valid Unicode encoding of codepoint E9 which is only 1 octet long. So it
would be extremely wrong of Perl to use a non-Unicode encoding of unicode
data, don't you think? Also, what would perl do when the codepoint doesn't
fit into a single octet? Your argument might have some merit if you were
arguing that Perl could have decoded it into "\x{E9}\0\0\0" and set the
UTF-32 flag, but as stated it doesn't make sense.

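The semantic difference can be sketched with uc(), assuming a script that does not enable the unicode_strings feature and does not "use locale" (variable names are illustrative):

```perl
use strict;
use warnings;
# Deliberately no "use v5.12" or later here: that would enable the
# unicode_strings feature and hide the difference.

my $off = "\xDF";            # LATIN SMALL LETTER SHARP S, flag off
my $on  = $off;
utf8::upgrade($on);          # same codepoint, flag on

my $uc_off = uc $off;        # native rules: nothing above 0x7F has case
my $uc_on  = uc $on;         # Unicode rules: multi-character mapping to "SS"

printf "equal=%d uc_off_len=%d uc_on=%s\n",
    ($off eq $on ? 1 : 0), length($uc_off), $uc_on;
```

The two strings compare equal, yet the flag-off one is left untouched by uc() while the flag-on one becomes "SS".
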
> Perl doesn’t do that, of course, because it’s easier just to set a flag,
> but as long as the string content is the single code point 0xe9 it doesn’t
> really matter how Perl achieves that.
>

Yes, Perl deliberately chose to use Utf8 internally for the same reason
Unicode defined utf8 the way it did, so that all of the existing ASCII data
would still be valid when interpreted as Unicode, thus avoiding storage and
performance penalties alternative schemes might impose.

> (Notwithstanding, of course, the abstraction leaks that things like the
> unicode_strings feature and Sys::Binmode fix.)
>
> There are parts of the code that appear to go the other way and prioritize
> downgraded storage. Perl_refcounted_he_fetch_pvn(), for example.
>

I would not have put it like that. With hashing you don't have a lot of
choices if you want the unicode form of latin-1 strings to hash the same.
You could decode to codepoints and then use a codepoint by codepoint
hashing algorithm, but that is slow and as far as I know there aren't any
published hash algorithms that do this. So to stay safe with the hash
function you can either downgrade strings which can be downgraded and then
hash the result, or you can upgrade the strings and hash the upgraded form.
Upgraded strings are on average larger than their downgraded equivalents,
so hashing them is more expensive, and there is an assumption that most
keys will actually be ASCII so they don't need to be downgraded. When you
consider that perl was an early adopter of Unicode and was bolting it on to
a latin-1 codebase the bias seems pretty reasonable.
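
The user-visible effect of this can be sketched in a few lines (variable names are illustrative): a flag-off key and its upgraded twin address the same hash slot.

```perl
use strict;
use warnings;

my $off = "\xE9";
my $on  = $off;
utf8::upgrade($on);          # same codepoint, different internal form

my %h;
$h{$off} = 'stored via flag-off key';
$h{$on}  = 'stored via flag-on key';   # same codepoints, so same slot

my ($key) = keys %h;
printf "keys=%d value=%s key=U+%04X\n", scalar(keys %h), $h{$off}, ord($key);
```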

cheers,
yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: Pre-RFC: Rename SVf_UTF8 et al.

demerphq at gmail

Sep 2, 2021, 6:52 AM

Post #23 of 46 (2398 views)

On Mon, 23 Aug 2021 at 01:59, Yuki Kimoto <kimoto.yuki@gmail.com> wrote:

> Personally, I'm starting to agree on the goal of Felipe.
>
> 1. Being able to distinguish between Text and Bytes from user
>

It seems like what you want is to redefine our use of Unicode-string/UTF-8
flag to be "Text" and then to call the other form "Bytes", but that doesn't
make sense. We define non-utf8 data to be implicitly ASCII/Latin-1. ASCII
because of case folding rules. Latin-1 because of conversion to Unicode,
which defines codepoints 0-255 to be equivalent to the codepoints 0-255 in
latin1. And we implicitly assume the equivalency in our operations.

> 2. Text is Unicode code point which is represented by UTF-8
>

chr(65) returns a latin-1 (i.e. non-UTF8-flagged) character/string "A" which
happens to be octet identical but not flag identical to the Unicode
character "A". Are you suggesting that chr() doesn't return Text? Wouldn't
that be weird? And in concatenation what is supposed to happen when you
have Bytes . Text? Is that even legal in your scheme?

Take this further, is an operation like lc() even legal on "Bytes"?
Currently: lc(chr(65)) eq "a". Since chr(65) doesn't return a Unicode
character, and thus is not Text, shouldn't the lc() die? Or would you also
want to change that?
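
The current behaviour is easy to check with a short sketch (variable names are illustrative):

```perl
use strict;
use warnings;

my $ch    = chr(65);
my $flag  = utf8::is_utf8($ch) ? 1 : 0;   # flag is off for chr(65)
my $lower = lc $ch;                       # lc() works regardless

my $uni = $ch;
utf8::upgrade($uni);                      # the flag-on "A"
my $equal = $ch eq $uni ? 1 : 0;          # still the same string to Perl code

print "flag=$flag lc=$lower equal=$equal\n";
```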

> 3. Perl config has default OS text character set and OS file system
> character set
>

As far as I know the assumption that all non-Unicode data is Latin-1 is
baked into Perl in a very firm way. So I don't see how this could be related
to the OS.

> 4. Perl standard function(print, open, etc) output string by encoding
> above 3 character set if the string is Text.
>

I don't see how we could change this. Anyone who cares exactly how data is
emitted to disk or any other "wire" format should be using Encode to
explicitly encode their data.

Perl strings are what perl strings are. I find that the people who have
trouble with them are usually the ones who like to pretend they work
differently than they do, instead of just respecting how they work and
being very explicit when they need to care, which for me personally has
been pretty rarely, eg, specialized output code or processing code.
(Parsing emails is a good place where you can get burned with encoding
issues and learn a lot.)

Having said that I have seen a lot of people for one reason or another get
encoding wrong in various ways, especially with MySQL or other over-wire
situations. Double encoding errors are common (eg where people accidentally
upgrade already encoded but flag-off utf8 data). At work we have a function
called recurse_decode_utf8() which takes a string and does its best to
"reduce" it to its minimal form by repeatedly turning off the utf8 flag,
and then executing decode_utf8() on the string and then downgrade until the
decode operation throws an error. Widespread use of this function on string
data almost completely eliminated all of our utf8 problems. (I'll post the
code in another mail.)
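
Until then, here is a rough sketch of the idea. It is a hypothetical reconstruction from the description above, NOT the actual recurse_decode_utf8(); the name recurse_decode_utf8_sketch, the iteration limit and the exact stopping rules are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical sketch; NOT the recurse_decode_utf8() mentioned above.
sub recurse_decode_utf8_sketch {
    my ($str) = @_;
    for (1 .. 10) {                              # arbitrary safety limit
        my $octets = $str;
        last unless utf8::downgrade($octets, 1); # codepoints > 0xFF: nothing left to undo
        last unless $octets =~ /[\x80-\xFF]/;    # pure ASCII: already minimal
        last unless utf8::decode($octets);       # not well-formed UTF-8: stop here
        $str = $octets;
    }
    return $str;
}

my $double = "\xC3\x83\xC2\xA9";                 # U+00E9 encoded as UTF-8 twice
my $fixed  = recurse_decode_utf8_sketch($double);
printf "length=%d codepoint=U+%04X\n", length($fixed), ord($fixed);
```

Note that this is a heuristic: genuine Latin-1 text whose octets happen to form valid UTF-8 would also be "reduced", so it only makes sense where the data is known to be text that may have been encoded too many times.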

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: Pre-RFC: Rename SVf_UTF8 et al.

demerphq at gmail

Sep 2, 2021, 7:02 AM

Post #24 of 46 (2398 views)

On Thu, 2 Sept 2021 at 15:20, demerphq <demerphq@gmail.com> wrote:

> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com>
> wrote:
>
>>
>> > On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
>> >
>> > On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com>
>> wrote:
>> > Per recent IRC discussion …
>> >
>> > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
>> confusion regarding the flag’s significance. Some think it indicates
>> whether a given PV stores text versus binary. Some think it means that the
>> PV is valid UTF-8. Still others likely hold other inaccurate views.
>> >
>> > The problem here is the naming. For example, consider `perl -e'my $foo
>> = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that
>> its code points (assuming use of a UTF-8 terminal) correspond to the bytes
>> that encode “é” in UTF-8.
>> >
>> > Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a
>> square/rectangle relationship. All strings are "rectangles", all "squares"
>> are rectangles, some strings are squares, but unless SQUARE flag is ON perl
>> should assume it is a rectangle, not a square. The SQUARE flag should only
>> be set when the rectangle has been proved conclusively to be a square. That
>> the SQUARE flag is off does not mean the rectangle is not a square, merely
>> that the square has not been proved to be such.
>>
>> You’re defining “a UTF-8 string” as “a string whose PV is marked as
>> UTF-8”. I’m defining it as “a string whose Perl-visible code points happen
>> to be valid UTF-8”.
>>
>
> I dont find your definition to be very useful, nor descriptive of how perl
> manages these matters, so I am not using it. You are confusing different
> levels of abstraction. Your definition also would include cases where the
> data is already encoded and flagged as utf8. So it doesn't make sense to me.
>
> Here is the set of definitions that I am operating from:
>
> A "string" is a programming concept inside of Perl which is used to
> represent "text" buffers of memory. There are three level of abstraction
> for strings, two of which are tightly coupled. The three are the codepoint
> level, semantic level and encoding level.
>
> At the codepoint levels you can think of strings as variable length arrays
> of numbers (codepoints), where the numbers are restricted to 0 to 0x10FFFF.
>
> At the semantics level you can think of these numbers (codepoints) of
> representing characters from some form of text with specific rules for
> certain operations like case-folding, as well as a well defined mapping to
> graphemes which are displayed to our eyes when those numbers are rendered
> by a display device like a terminal.
>
> The encoding level of abstraction addresses how those numbers (codepoints)
> will be represented as bytes (octets) in memory inside of Perl, and when
> you directly write the data to disk or to some other output stream.
>
> There are two sets of codepoint range, semantics and encoding available,
> which are controlled by a flag associated with the string called the UTF8
> flag. When set this flag indicates that the string can represent codepoints
> 0 to 0x10FFFF, should have Unicode semantics applied to it, and that its in
> memory representation is variable-width utf8. When the flag is not set it
> indicates the string can represent codepoints 0 to 255, has ASCII
> case-folding semantics, and that its in memory representation is fixed
> width octets.
>
> In order to be able to combine these two types of strings we need to
> define some operations:
>
> upgrading/downgrading: converting a string from one set of semantics and
> encoding to the other while preserving exactly the codepoint level
> representation. By tradition we call it upgrading when we go from Latin-1
> to Unicode with the result being UTF8 on, and we call it downgrading when
> we go from Unicode to Latin1 with the result being UTF8-off. These
> operations are NOT symmetrical. It is *not* possible to downgrade every
> Unicode string to Latin-1, however it is possible to upgrade every Latin-1
> string to Unicode. By tradition upgrade and downgrade functions are noops
> when their input is already in the form expected as the result, but this is
> by tradition only.
>
> decoding/encoding: converting a string from one form to the other in a way
> that transforms the codepoints from one form to a potentially different
> form. Traditional we speak of decode_utf8() taking a latin1 string
> containing octets that make up a utf8 encoded string, and returning a
> string which is UTF8 on which represents the Unicode version of those
> octets. For well formed input this results in no change to the underlying
> string, but the flag is flipped on. Vice versa we speak of encode_utf8()
> which converts its input to a utf8 encoded form, regardless of what form it
> was represented internally.
>
> When we are confronted with combining the two forms of string Perl has
> little choice but to use the "safe" strategy of "upgrading" the Latin-1
> parts to Unicode.
>
> Both the operations of "upgrading" and "decoding" result in Utf8-on
> strings, and indeed both can result in not changing their input at all, but
> when they do change their input they change it very differently. Most of
> the places people get into trouble with strings is when they end up doing
> upgrade operations when they should have done a decode operation. This is
> because upgrade operations can happen implicitly based on simple rules and
> thus can happen "by accident", but decode operations are always explicit so
> they never happen without the involvement of the developer in some way.
> This is at least partly because upgrade operations do not have any failure
> modes but decode operations do.
>
> Most of the time, as long as you are only thinking about codepoints,
> developers dont have to worry about this stuff. The places where they do
> are when they are reading or writing data, and in some cases when they are
> embedding string constants in their code where they want a particular set
> of semantics and encoding. As long as people are disciplined to use
> decode_utf8() before they use utf8 string data, and encode_utf8 before
> they emit it then the complexities above should be transparent to the
> developer.
>

I was rereading this and I thought of something to add here. Part of the
confusion with Perl strings is that we try to hide the flag. We don't really
want people to look at it and think about it. Instead we provide a handful
of verbs which can be used to force the string to the shape we want, or
throw an error if we can't (or sometimes be a no-op).

I mean, if I want to be sure I have a latin-1 string then I would do
something like:

eval { utf8::downgrade($str); 1 } or warn "Cant downgrade string!";

And if I want to be sure I have a utf8 string then I would do something like:

utf8::upgrade($str);

I wonder if we made accessing the flag state more socially acceptable
whether people would find this less confusing.

Yves

Re: Pre-RFC: Rename SVf_UTF8 et al.

felipe at felipegasper

Sep 2, 2021, 7:06 AM

Post #25 of 46 (2398 views)

> On Sep 2, 2021, at 10:02 AM, demerphq <demerphq@gmail.com> wrote:
>
> I wonder if we made accessing the flag state more socially acceptable whether people would find this less confusing.

----
use v5.34;
use Sys::Binmode;

# When would I ever need to look at the flag here?
----

(Responses to the other stuff pending.)

-FG