Mailing List Archive: Pre-RFC: Rename SVf_UTF8 et al.

Pre-RFC: Rename SVf_UTF8 et al.

felipe at felipegasper

Aug 18, 2021, 10:18 AM

Post #1 of 46

Per recent IRC discussion …

PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of confusion regarding the flag’s significance. Some think it indicates whether a given PV stores text versus binary. Some think it means that the PV is valid UTF-8. Still others likely hold other inaccurate views.

The problem here is the naming. For example, consider `perl -e'my $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its code points (assuming use of a UTF-8 terminal) correspond to the bytes that encode “é” in UTF-8. The “UTF-8 flag”, however, is likely *not* set on this string. By contrast, consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has the “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points (in this case, only 1) aren’t valid UTF-8.

The fact that quite often a “UTF-8 string” lacks the “UTF-8 flag”, and a “UTF-8-flagged” string is (usually) *not* a “UTF-8 string”, makes little sense except to the “highly initiated”.
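
The two cases above can be reproduced in one script. This is a minimal sketch rather than part of the original proposal: it uses escape sequences instead of a literal “é” so that the result does not depend on the source file's or terminal's encoding, and it uses `utf8::decode` as a stand-in for what `use utf8` does to the literal.

```perl
use strict;
use warnings;

# Case 1: no decoding. The two UTF-8 bytes of "é" are stored as two
# code points (0xC3, 0xA9). This is a "UTF-8 string" in the everyday
# sense, yet the flag is off.
my $bytes = "\xc3\xa9";
printf "bytes: length=%d flag=%d\n",
    length($bytes), utf8::is_utf8($bytes) ? 1 : 0;

# Case 2: decode those bytes (roughly what `use utf8` does for a literal).
# The string now holds a single code point, U+00E9, and the flag is on,
# but its contents are no longer a UTF-8 byte sequence.
my $chars = $bytes;
utf8::decode($chars);
printf "chars: length=%d flag=%d ord=0x%X\n",
    length($chars), utf8::is_utf8($chars) ? 1 : 0, ord($chars);
```

The first line reports length 2 with the flag off; the second reports length 1 with the flag on.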

Another problem is that “UTF-8” doesn’t really describe the “upgraded” format. This format is what Perl historically called “lax UTF-8” and is now widely called “generalized UTF-8”, which includes unpaired surrogates and code points above Unicode’s maximum.

PROPOSAL: Rename the following identifiers in code and documentation, leaving macros for the old ones as aliases:
- SVf_UTF8 -> SVf_PVUPGRADED
- SvUTF8 -> Sv_PVUPGRADED
- SvUTF8_on -> Sv_PVUPGRADED_on
- SvUTF8_off -> Sv_PVUPGRADED_off
- SvPOK_only_UTF8 -> SvPOK_only_UPGRADED

Note that flags like REFCOUNTED_HE_KEY_UTF8 do not need a rename because these indicate an actual (if incomplete/invalidated) UTF-8 decoding step.

BENEFITS: Over time, this rename will minimize the confusion between Perl’s upgraded-PV storage format and UTF-8. The rename may also compel current users of the language who hold mistaken mental models of the flag’s purpose to reexamine their understanding, hopefully for the better.

POTENTIAL COMPLICATIONS: The mismatch between amended documentation and existing documentation may cause confusion; it should, though, be an auspicious confusion that eventually clarifies rather than misleads.

Re: Pre-RFC: Rename SVf_UTF8 et al.

Aug 18, 2021, 12:50 PM

Post #2 of 46

On Wed, 18 Aug 2021 13:18:34 -0400
Felipe Gasper <felipe@felipegasper.com> wrote:

> PROPOSAL: Rename the following identifiers in code and documentation, leaving macros for the old ones as aliases:
> - SVf_UTF8 -> SVf_PVUPGRADED
> - SvUTF8 -> Sv_PVUPGRADED
> - SvUTF8_on -> Sv_PVUPGRADED_on
> - SvUTF8_off -> Sv_PVUPGRADED_off
> - SvPOK_only_UTF8 -> SvPOK_only_UPGRADED

utf8::is_utf8 probably should be renamed too. Anyway, +1 from me.

Re: Pre-RFC: Rename SVf_UTF8 et al.

grinnz at gmail

Aug 18, 2021, 1:08 PM

Post #3 of 46

On Wed, Aug 18, 2021 at 3:50 PM Tomasz Konojacki <me@xenu.pl> wrote:

> utf8::is_utf8 probably should be renamed too. Anyway, +1 from me.
>
>
Frankly it (and upgrade/downgrade) shouldn't even be in the utf8::
namespace, it's named that for internal reasons not interface reasons.

-Dan

Re: Pre-RFC: Rename SVf_UTF8 et al.

public at khwilliamson

Aug 18, 2021, 1:13 PM

Post #4 of 46

On 8/18/21 2:08 PM, Dan Book wrote:
> On Wed, Aug 18, 2021 at 3:50 PM Tomasz Konojacki <me@xenu.pl
> <mailto:me@xenu.pl>> wrote:
>
> utf8::is_utf8 probably should be renamed too. Anyway, +1 from me.
>
> Frankly it (and upgrade/downgrade) shouldn't even be in the utf8::
> namespace, it's named that for internal reasons not interface reasons.
>
> -Dan

"Upgrade" and "downgrade" tell me nothing. I don't object to renaming, but
something better than these needs to be found.

Re: Pre-RFC: Rename SVf_UTF8 et al.

grinnz at gmail

Aug 18, 2021, 1:19 PM

Post #5 of 46

On Wed, Aug 18, 2021 at 4:13 PM Karl Williamson <public@khwilliamson.com>
wrote:

> On 8/18/21 2:08 PM, Dan Book wrote:
> > On Wed, Aug 18, 2021 at 3:50 PM Tomasz Konojacki <me@xenu.pl
> > <mailto:me@xenu.pl>> wrote:
> >
> > utf8::is_utf8 probably should be renamed too. Anyway, +1 from me.
> >
> > Frankly it (and upgrade/downgrade) shouldn't even be in the utf8::
> > namespace, it's named that for internal reasons not interface reasons.
> >
> > -Dan
>
> Upgrade and downgrade tell me nothing. I don't object to renaming, but
> something better than these needs to be found
>

It is related to the two possible string formats. Do you know of any other
name for them than UTF8/non-UTF8 (which is a misleading name to expose to
the logical string layer, which may separately be UTF-8 encoded or not) or
upgraded/downgraded?

-Dan

Re: Pre-RFC: Rename SVf_UTF8 et al.

fawaka at gmail

Aug 18, 2021, 1:24 PM

Post #6 of 46

On Wed, Aug 18, 2021 at 7:17 PM Felipe Gasper <felipe@felipegasper.com>
wrote:

> PROPOSAL: Rename the following identifiers in code and documentation,
> leaving macros for the old ones as aliases:
> - SVf_UTF8 -> SVf_PVUPGRADED
> - SvUTF8 -> Sv_PVUPGRADED
> - SvUTF8_on -> Sv_PVUPGRADED_on
> - SvUTF8_off -> Sv_PVUPGRADED_off
> - SvPOK_only_UTF8 -> SvPOK_only_UPGRADED

I would disagree. Perl code should not have to care/see what the internal
encoding is (it's breaking the encapsulation, really), but perl's internals
very much do and should care about the internal encoding.

So to me this logic only makes sense for the perl-visible side of things
(e.g. utf8::upgrade), not on the C-side.

Leon

Re: Pre-RFC: Rename SVf_UTF8 et al.

felipe at felipegasper

Aug 18, 2021, 1:25 PM

Post #7 of 46

> On Aug 18, 2021, at 4:19 PM, Dan Book <grinnz@gmail.com> wrote:
>
> On Wed, Aug 18, 2021 at 4:13 PM Karl Williamson <public@khwilliamson.com> wrote:
> On 8/18/21 2:08 PM, Dan Book wrote:
> > On Wed, Aug 18, 2021 at 3:50 PM Tomasz Konojacki <me@xenu.pl
> > <mailto:me@xenu.pl>> wrote:
> >
> > utf8::is_utf8 probably should be renamed too. Anyway, +1 from me.
> >
> > Frankly it (and upgrade/downgrade) shouldn't even be in the utf8::
> > namespace, it's named that for internal reasons not interface reasons.
> >
> > -Dan
>
> Upgrade and downgrade tell me nothing. I don't object to renaming, but
> something better than these needs to be found
>
> It is related to the two possible string formats. Do you know of any other name for them than UTF8/non-UTF8 (which is a misleading name to expose to the logical string layer, which may separately be UTF-8 encoded or not) or upgraded/downgraded?

RJBS called it “the wide flag” in a presentation some years back. SVf_WIDEPV may clash with the “wide character” warning, though.

SVf_BIGPV?

-F

Re: Pre-RFC: Rename SVf_UTF8 et al.

grinnz at gmail

Aug 18, 2021, 1:31 PM

Post #8 of 46

On Wed, Aug 18, 2021 at 4:24 PM Leon Timmermans <fawaka@gmail.com> wrote:

> I would disagree. Perl code should not have to care/see what the internal
> encoding is (it's breaking the encapsulation, really), but perl's internals
> very much do and should care about the internal encoding.
>
> So to me this logic only makes sense for the perl-visible side of things
> (e.g. utf8::upgrade), not on the C-side.
>

I would agree except that people not working on the internals also have to
use these functions (for XS code), and thus misuse them because they think
they're related to the logical contents of the string.

-Dan

Re: Pre-RFC: Rename SVf_UTF8 et al.

felipe at felipegasper

Aug 18, 2021, 1:35 PM

Post #9 of 46

> On Aug 18, 2021, at 4:24 PM, Leon Timmermans <fawaka@gmail.com> wrote:
>
> I would disagree. Perl code should not have to care/see what the internal encoding is (it's breaking the encapsulation, really), but perl's internals very much do and should care about the internal encoding.

This isn’t really true, though. Pure Perl code also frequently has to care about the internal encoding due to the many instances where Perl itself leaks it.

Example:
-----
perl -Mutf8 -MJSON::PP -e'my $foo = JSON::PP::decode_json( JSON::PP::encode_json(["é"]) )->[0]; exec "echo", $foo'
-----
This *should* print mojibake, but it happens to print “é” because of the leak.

When/if that leaky behaviour gets fixed -- 5.36 feature bundle, maybe? -- then it’ll make more sense to consider the PV encoding a wholly internal matter.
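
The leak can be shown without JSON::PP. What follows is a minimal sketch under stated assumptions, not the author's own test case: `internal_bytes` is a hypothetical helper, and the child-process step assumes a POSIX-like system, a perl of the vintage discussed in this thread, and no `-C`/`PERL_UNICODE` setting that decodes `@ARGV`.

```perl
use strict;
use warnings;

$| = 1;    # keep parent output ahead of the child's output

my $downgraded = "\xe9";      # one code point, stored as one byte
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);     # same code point, stored as two bytes

# At the language level the two strings are identical.
print "equal: ", ($downgraded eq $upgraded ? "yes" : "no"), "\n";

# Hypothetical helper: size of the internal buffer in bytes.
sub internal_bytes {
    my ($s) = @_;
    use bytes;
    return length $s;
}
printf "internal bytes: %d vs %d\n",
    internal_bytes($downgraded), internal_bytes($upgraded);

# The leak: the argument handed to the child is the internal buffer,
# so under the assumptions above the child should see a different
# number of bytes for two strings that Perl considers equal.
for my $arg ($downgraded, $upgraded) {
    system($^X, '-e', 'print length($ARGV[0]), "\n"', $arg);
}
```

The in-process part prints `equal: yes` and `internal bytes: 1 vs 2`; the child's output depends on the perl version and environment.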

-FG

Re: Pre-RFC: Rename SVf_UTF8 et al.

kimoto.yuki at gmail

Aug 18, 2021, 6:09 PM

Post #10 of 46

2021-8-19 2:17 Felipe Gasper <felipe@felipegasper.com> wrote:

> Per recent IRC discussion …
>
> PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
> confusion regarding the flag’s significance. Some think it indicates
> whether a given PV stores text versus binary. Some think it means that the
> PV is valid UTF-8. Still others likely hold other inaccurate views.

I feel that the starting point for this discussion is that people
mistakenly believe the current Perl implementation can distinguish between
binary and text.

On this point, I agree with Felipe.

People are likely to believe:

utf8::is_utf8 : 0 : this string is binary
utf8::is_utf8 : 1 : this string is text

However, this is completely wrong.

Current Perl can't make this distinction.

Perl freely switches between the two interpretations for performance and
depending on use.

What the flag actually means is the following:

- flag off: the PV is interpreted as bytes
- flag on: the PV is interpreted as UTF-8 characters
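
A short script makes the point concrete. This is only an illustrative sketch with made-up sample data: binary data can carry the flag and plain text can lack it, so the flag says nothing about text versus binary.

```perl
use strict;
use warnings;

my $binary = pack("C*", 0xff, 0xd8, 0xff, 0xe0);   # arbitrary non-text bytes
my $text   = "plain text";                         # ordinary ASCII text

# Upgrading changes only the internal storage, not the string's value.
my $copy = $binary;
utf8::upgrade($copy);

printf "text flag:          %d\n", utf8::is_utf8($text)   ? 1 : 0;
printf "binary flag:        %d\n", utf8::is_utf8($binary) ? 1 : 0;
printf "upgraded copy flag: %d\n", utf8::is_utf8($copy)   ? 1 : 0;
printf "copy eq binary:     %s\n", $copy eq $binary ? "yes" : "no";
```

Here the text has the flag off, the upgraded binary copy has it on, and the copy still compares equal to the original bytes.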

Re: Pre-RFC: Rename SVf_UTF8 et al.

sergey.aleynikov at gmail

Aug 20, 2021, 12:04 AM

Post #11 of 46

On Wed, 18 Aug 2021 at 20:17, Felipe Gasper <felipe@felipegasper.com> wrote:
> The problem here is the naming. For example, consider `perl -e'my $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its code points (assuming use of a UTF-8 terminal) correspond to the bytes that encode “é” in UTF-8. The “UTF-8 flag”, however, is likely *not* set on this string

There's no "likely" about it. For string literals, there is a deterministic
set of rules (though they may not be documented).

>Here $foo has the “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points (in this case, only 1) aren’t valid UTF-8.

Maybe I don't understand you, but perl can't have invalid UTF8 in
literals under 'use utf8'.

> PROPOSAL: Rename the following identifiers in code and documentation, leaving macros for the old ones as aliases:

Which will only bring more confusion going forward. If you want to
fight SVf_UTF8 confusion, the problem lies not in its name, but in
the logic behind it. You're trying to sweep this issue under the rug,
but what really makes things this messy is this flag's mere existence
(and it still might be better than Python's choice for their Unicode
strings). -1 from me.

Best regards,
Sergey Aleynikov

Re: Pre-RFC: Rename SVf_UTF8 et al.

Aug 20, 2021, 1:42 AM

Post #12 of 46

On Fri, Aug 20, 2021 at 9:05 AM Sergey Aleynikov
<sergey.aleynikov@gmail.com> wrote:
>
> On Wed, 18 Aug 2021 at 20:17, Felipe Gasper <felipe@felipegasper.com> wrote:
> > The problem here is the naming. For example, consider `perl -e'my $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its code points (assuming use of a UTF-8 terminal) correspond to the bytes that encode “é” in UTF-8. The “UTF-8 flag”, however, is likely *not* set on this string
>
> There's no "likely" about it. For literal strings, there is a deterministic
> set of rules (though they may not be documented).
>
> >Here $foo has the “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points (in this case, only 1) aren’t valid UTF-8.
>
> Maybe I don't understand you, but perl can't have invalid UTF8 in
> literals under 'use utf8'.

But the contents of the string are not "UTF-8". UTF-8 is a byte encoding
for Unicode codepoints. From a language perspective (not considering
perl's implementation), the contents of the string are a single
codepoint, not a UTF-8 byte sequence.

>
> > PROPOSAL: Rename the following identifiers in code and documentation, leaving macros for the old ones as aliases:
>
> Which will only bring more confusion going forward. If you want to
> fight SVf_UTF8 confusion, the problem lies not in its name, but in
> the logic behind it. You're trying to sweep this issue under the rug,
> but what really makes things this messy is this flag's mere existence
> (and it still might be better than Python's choice for their Unicode
> strings). -1 from me.
>
> Best regards,
> Sergey Aleynikov

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

fawaka at gmail

Aug 20, 2021, 1:48 AM

Post #13 of 46 (2419 views)

>
> I would agree except that people not working on the internals also have to
> use these functions (for XS code), and thus misuse them because they think
> they're related to the logical contents of the string.
>

Any code dealing with strings on a C level will need to know if it is
encoded in UTF-8 or something different. Changing the internal name to
"upgraded" would make sense if we could change the internal implementation
(e.g. to UTF-16) without *everything* exploding.

Leon

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

felipe at felipegasper

Aug 20, 2021, 6:20 AM

Post #14 of 46 (2419 views)

> On Aug 20, 2021, at 4:48 AM, Leon Timmermans <fawaka@gmail.com> wrote:
>
> I would agree except that people not working on the internals also have to use these functions (for XS code), and thus misuse them because they think they're related to the logical contents of the string.
>
> Any code dealing with strings on a C level will need to know if it is encoded in UTF-8 or something different.

A bit off-topic, but worth bearing in mind: XS modules that merely interface between Perl and an external library (e.g., libcurl, libunbound) can avoid SvPV et al. in favour of the variants that preserve the abstraction.

It’s probably an overly simplistic ideal, but in theory it seems the only things that would need to care about what we currently call SVf_UTF8 are things that must read or manipulate Perl’s internals directly. Everything else -- even XS modules and embedding C applications -- can respect the abstraction.

> Changing the internal name to upgraded would make sense if we could change the internal implementation (e.g. to UTF16) without *everything* exploding.

That’s true, but the fact that the internal encoding can’t pragmatically change doesn’t invalidate the benefits of strengthening the abstraction, which include:

- Less confusion about what a “UTF-8 string” is: it’ll be clearer that upgraded/downgraded and (Perl-visible) UTF-8-ness are orthogonal qualities.

- More abstract terminology will discourage folks from trying to think too deeply about Perl’s internals.

-F

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

felipe at felipegasper

Aug 20, 2021, 6:24 AM

Post #15 of 46 (2419 views)

> On Aug 20, 2021, at 3:04 AM, Sergey Aleynikov <sergey.aleynikov@gmail.com> wrote:
>
> On Wed, 18 Aug 2021 at 20:17, Felipe Gasper <felipe@felipegasper.com> wrote:
>> The problem here is the naming. For example, consider `perl -e'my $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its code points (assuming use of a UTF-8 terminal) correspond to the bytes that encode “é” in UTF-8. The “UTF-8 flag”, however, is likely *not* set on this string
>
> There's no "likely" about it. For literal strings, there is a deterministic
> set of rules (though they may not be documented).

They’re not documented; ergo, they can change at any time. This is by design, right? A Perl application should not have to think about how Perl stores its code points?

> what really makes things this messy is this flag's mere existence
> (and it still might be better than Python's choice for their Unicode
> strings).

Out of curiosity, what do you think would be the ideal? Store all strings internally as UTF-8, à la Rust?

-F

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

demerphq at gmail

Aug 20, 2021, 10:05 AM

Post #16 of 46 (2419 views)

On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com> wrote:

> Per recent IRC discussion …
>
> PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
> confusion regarding the flag’s significance. Some think it indicates
> whether a given PV stores text versus binary. Some think it means that the
> PV is valid UTF-8. Still others likely hold other inaccurate views.
>
> The problem here is the naming. For example, consider `perl -e'my $foo =
> "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its
> code points (assuming use of a UTF-8 terminal) correspond to the bytes that
> encode “é” in UTF-8.

Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a
square/rectangle relationship. All strings are "rectangles", all "squares"
are rectangles, some strings are squares, but unless SQUARE flag is ON perl
should assume it is a rectangle, not a square. The SQUARE flag should
only be set when the rectangle has been proved conclusively to be a square.
That the SQUARE flag is off does not mean the rectangle is not a square,
merely that the square has not been proved to be such.

> The “UTF-8 flag”, however, is likely *not* set on this string. By contrast,
> consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has the “UTF-8 flag”
> set, but $foo is NOT a “UTF-8 string” because its code points (in this
> case, only 1) aren’t valid UTF-8.
>

Except it is valid UTF-8: (at least in my utf8 terminal).

$ perl -MDevel::Peek -Mutf8 -e'my $foo = "é"; Dump($foo)'
SV = PV(0x153efc0) at 0x155fb38
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  PV = 0x1563240 "\303\251"\0 [UTF8 "\x{e9}"]
  CUR = 2
  LEN = 10
  COW_REFCNT = 1

So the string is UTF-8.

You cannot get the UTF-8 flag on without using XS tricks and have the
buffer contain non-utf8. It is that simple. (Sure you can do it with
Encode::_utf8_on() but that is XS.)

I do not understand your point that only the initiated can understand this
flag. It means one and only one thing: that the perl internals should
assume that the buffer contains utf8 encoded data and that perl should
apply unicode semantics when doing character and case-sensitive operations,
and that perl can make certain assumptions when it is processing the data
(e.g. that it is not malformed).

When it is off it does not mean that the data cannot be utf8 data, merely
that Perl cannot and should not assume it is utf8 data, and should not try
to interpret it as utf8 data when the string is used in character
operations, and that when it is used in case-sensitive operations it should
use the traditional limited case-insensitive logic from ASCII.
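
A minimal sketch of that difference in case handling, assuming a perl where neither "use locale" nor the unicode_strings feature is in effect at file scope (the variable names are only illustrative):

```perl
use strict;
use warnings;

# The same code point (0xE9, "é") stored in the two internal forms.
my $down = "\xe9";      # flag off: one octet in the buffer
my $up   = "\xe9";
utf8::upgrade($up);     # flag on: two octets in the buffer, same code point

# Without unicode_strings, uc() on the flag-off string only knows ASCII
# case rules, so 0xE9 is left alone; the flag-on string gets Unicode rules.
my $uc_down = uc $down;
my $uc_up   = uc $up;

# With the feature enabled lexically, the flag-off string behaves the same
# as the flag-on one.
my $uc_down_fixed;
{
    use feature 'unicode_strings';
    $uc_down_fixed = uc $down;
}

printf "flag off: %02X, flag on: %02X, flag off + feature: %02X\n",
    ord($uc_down), ord($uc_up), ord($uc_down_fixed);
```

The first result differs from the other two even though all three inputs hold the same single code point.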

Personally I think renaming this flag will just increase confusion, not
decrease.

BTW, your scheme needs to account for WAS_UTF8 as well. Most people don't
know it, but there are actually three types of strings in the perl
internals: UTF8-ON, UTF8-OFF, and UTF8-OFF + WAS_UTF8. It only manifests in
hash keys, but it needs to be accounted for as well in any renaming. Perl
dictates that keys which are character-wise equivalent hash the same
regardless of the UTF8 flag (or, put alternatively, the hash should be of the
codepoints the string represents, NOT the octets that make up that
representation). This means UTF8-ON keys are always downgraded on lookup or
store in a hash. If the downgrade is successful, the key is marked as
WAS_UTF8 and the downgraded string is stored and hashed; if it was
unsuccessful (e.g. it contains codepoints above 255), it is marked as UTF8-ON
and the original buffer is hashed. When the key is extracted with keys() or
each(), if the WAS_UTF8 flag is set the string is upgraded back to the UTF8
form.
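
The Perl-visible half of that rule (character-wise equal keys land in the same slot) can be sketched as follows. This is only an illustration with made-up variable names; it does not inspect the WAS_UTF8 bookkeeping itself:

```perl
use strict;
use warnings;

# Two strings with the same single code point, stored differently.
my $plain    = "\xe9";
my $upgraded = "\xe9";
utf8::upgrade($upgraded);

# Both are used as keys of the same hash.
my %count;
$count{$plain}++;
$count{$upgraded}++;

my $n_keys = scalar keys %count;
printf "keys=%d count=%d\n", $n_keys, $count{$plain};
```

Both increments hit one entry, so the hash ends up with a single key.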

I think you need to step back and consider that strings are sequences of
octets. Sometimes those octets are ordered such that they can be
interpreted as utf8. The UTF-8 flag being on tells perl that it can and
should treat the octets as utf8.

You used examples that involve source code which I think might be confusing
you, as it introduces weird issues related to what character set your
terminal thinks it is using, and what format the text in the file is stored
in, and what operating system is in use. If you stick to examples that
only use code then all of that ambiguity goes away and it should be easy to
understand. Eg when you say:

my $foo = "é";

I don't know exactly what that code does without doing an octet level
investigation of the data. It could be one octet and in latin-1 or it could
be two octets and be Unicode in one of several formats (utf8, utf-16BE,
utf-16LE) and still be rendered identically in an editor or browser.

However if you say:

my $foo= chr(0xe9); # é

I know exactly what is going on, and what $foo should contain.

I also know what happens here:

use Devel::Peek;
my $foo="\x{c3}\x{a9}";
utf8::decode($foo);
Dump($foo);

SV = PV(0x2303fc0) at 0x2324c98
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  PV = 0x2328300 "\303\251"\0 [UTF8 "\x{e9}"]
  CUR = 2
  LEN = 10
  COW_REFCNT = 1

That is, I start off with two octets, C3 - A9, which happen to be the
encoding for the codepoint E9, which happens to be é.
I then tell perl to "decode" those octets, which really means I tell perl
to check that the octets actually do make up valid utf8. And if perl agrees
that indeed these are valid utf8 octets, then it turns the flag on. Now it
doesn't matter if you *meant* to construct utf8 in the variable fed to
decode, all that matters is that at an octet level those octets happen to
make up valid utf8.

Try

use Devel::Peek;
my $foo="\x{c3}\x{a9}\x{c3}";
utf8::decode($foo);
Dump($foo);

SV = PV(0x23040a0) at 0x23249f8
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x2329350 "\303\251\303"\0
  CUR = 3
  LEN = 10

So here we can see that perl did nothing with this version of $foo, because
it did not contain a valid utf8 sequence. \x{c3} can never be the last byte
in valid utf8, it always must be followed by something, so perl did not
turn the UTF8 flag on.
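
The same two cases can be checked without Devel::Peek by looking at the return value of utf8::decode() and at the resulting length. A small sketch, with illustrative variable names:

```perl
use strict;
use warnings;

# Well-formed input: C3 A9 is the UTF-8 encoding of U+00E9.
my $good    = "\xc3\xa9";
my $good_ok = utf8::decode($good) ? 1 : 0;

# Malformed input: the trailing C3 starts a sequence that never finishes.
my $bad    = "\xc3\xa9\xc3";
my $bad_ok = utf8::decode($bad) ? 1 : 0;

printf "good: ok=%d length=%d first=U+%04X\n",
    $good_ok, length($good), ord($good);
printf "bad:  ok=%d length=%d\n", $bad_ok, length($bad);
```

The well-formed string becomes one character, while the malformed one is reported as a failure and keeps its three octets.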

Work the problem like this a while and you will see that this subject is
really pretty simple, despite the tremendous amount of FUD about it.
The flag says that the buffer contains
valid octets that are not illegal utf8, and that perl should apply
utf8/unicode semantics when doing "character" operations on the string. The
flag being off means that when doing character operations it should assume
fixed width octet operations, and it should use ASCII case-folding rules.
That is it. The flag being off does not *ever* mean the data is NOT utf8,
it simply means that data has not been *validated* as utf8 and thus perl
cannot use utf8 rules to process it. That is it.

cheers,
Yves

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

grinnz at gmail

Aug 20, 2021, 10:16 AM

Post #17 of 46 (2419 views)

On Fri, Aug 20, 2021 at 1:06 PM demerphq <demerphq@gmail.com> wrote:

> On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com>
> wrote:
>
>> Per recent IRC discussion …
>>
>> PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
>> confusion regarding the flag’s significance. Some think it indicates
>> whether a given PV stores text versus binary. Some think it means that the
>> PV is valid UTF-8. Still others likely hold other inaccurate views.
>>
>> The problem here is the naming. For example, consider `perl -e'my $foo =
>> "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its
>> code points (assuming use of a UTF-8 terminal) correspond to the bytes that
>> encode “é” in UTF-8.
>
>
> Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a
> square/rectangle relationship. All strings are "rectangles", all "squares"
> are rectangles, some strings are squares, but unless SQUARE flag is ON perl
> should assume it is a rectangle, not a square. The SQUARE flag should
> only be set when the rectangle has been proved conclusively to be a square.
> That the SQUARE flag is off does not mean the rectangle is not a square,
> merely that the square has not been proved to be such.
>
>
> The “UTF-8 flag”, however, is likely *not* set on this string. By
>> contrast, consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has the
>> “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points
>> (in this case, only 1) aren’t valid UTF-8.
>>
>
> Except it is valid UTF-8: (at least in my utf8 terminal).
>
> $ perl -MDevel::Peek -Mutf8 -e'my $foo = "é"; Dump($foo)'
> SV = PV(0x153efc0) at 0x155fb38
> REFCNT = 1
> FLAGS = (POK,IsCOW,pPOK,UTF8)
> PV = 0x1563240 "\303\251"\0 [UTF8 "\x{e9}"]
> CUR = 2
> LEN = 10
> COW_REFCNT = 1
>
> So the string is UTF-8.
>

The premise of this email seems to be about the internals of the string.
That is not the contents of the string (which is "\x{e9}" in this example).
Please re-evaluate in that context.

-Dan

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

felipe at felipegasper

Aug 20, 2021, 10:48 AM

Post #18 of 46 (2419 views)

> On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
>
> On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com> wrote:
> Per recent IRC discussion …
>
> PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of confusion regarding the flag’s significance. Some think it indicates whether a given PV stores text versus binary. Some think it means that the PV is valid UTF-8. Still others likely hold other inaccurate views.
>
> The problem here is the naming. For example, consider `perl -e'my $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its code points (assuming use of a UTF-8 terminal) correspond to the bytes that encode “é” in UTF-8.
>
> Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a square/rectangle relationship. All strings are "rectangles", all "squares" are rectangles, some strings are squares, but unless SQUARE flag is ON perl should assume it is a rectangle, not a square. The SQUARE flag should only be set when the rectangle has been proved conclusively to be a square. That the SQUARE flag is off does not mean the rectangle is not a square, merely that the square has not been proved to be such.

You’re defining “a UTF-8 string” as “a string whose PV is marked as UTF-8”. I’m defining it as “a string whose Perl-visible code points happen to be valid UTF-8”.

What you call “a UTF-8 string” is what I propose we call, per existing nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with corresponding code changes. Then the term “UTF-8 string” makes sense from a pure-Perl context without requiring Perl programmers to worry about interpreter internals.

> The “UTF-8 flag”, however, is likely *not* set on this string. By contrast, consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has the “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points (in this case, only 1) aren’t valid UTF-8.
>
> Except it is valid UTF-8: (at least in my utf8 terminal).
>
> $ perl -MDevel::Peek -Mutf8 -e'my $foo = "é"; Dump($foo)'
> SV = PV(0x153efc0) at 0x155fb38
> REFCNT = 1
> FLAGS = (POK,IsCOW,pPOK,UTF8)
> PV = 0x1563240 "\303\251"\0 [UTF8 "\x{e9}"]
> CUR = 2
> LEN = 10
> COW_REFCNT = 1
>
> So the string is UTF-8.

Again, different definitions. The Perl-visible string contains a single code point, 0xe9. This code point doesn’t correspond to valid UTF-8 bytes, so IMO it doesn’t make sense to call it a “UTF-8 string”. Whether Perl stores that code point as one byte or as two is Perl’s business alone … right?

> I do not understand your point that only the initiated can understand this flag. It means one and only one thing: that the perl internals should assume that the buffer contains utf8 encoded data and that perl should apply unicode semantics when doing character and case-sensitive operations, and that perl can make certain assumptions when it processing the data (eg that is not malformed).

The behaviour you’re talking about is what the unicode_strings and unicode_eval features specifically do away with (i.e., fix), right?

You’re omitting what IMO is the most obvious purpose of the flag: to indicate whether the code points that the PV stores are the plain bytes, or are the UTF-8-decoded code points. This is why you can print() the string in either upgraded or downgraded forms, and it comes out the same.
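
A rough illustration of that equivalence, using only the utf8:: built-ins (the variable names are hypothetical):

```perl
use strict;
use warnings;

# One code point, 0xE9, in downgraded form.
my $down = "\xe9";

# A copy of it in upgraded form: different buffer, same code point.
my $up = $down;
utf8::upgrade($up);

my $flag_down = utf8::is_utf8($down) ? 1 : 0;
my $flag_up   = utf8::is_utf8($up)   ? 1 : 0;

# From pure Perl the two are indistinguishable as strings.
my $equal = ($down eq $up) ? 1 : 0;

printf "flag_down=%d flag_up=%d equal=%d lengths=%d/%d\n",
    $flag_down, $flag_up, $equal, length($down), length($up);
```

Only the internal flag differs; comparison and length see the same one-character string.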

> BTW, your scheme needs to account for WAS_UTF8 as well. Most people dont know it, but there are actually three types of strings in the perl internals, UTF8-ON, UTF8-OFF, UTF8-OFF + WAS_UTF8. It only manifests in hash keys. But it needs to be accounted for as well in any renaming. Perl dictates that keys which are character-wise equivalent hash the same regardless of the UTF8 flag (or put alternative, the hash should be of the codepoints the string represents NOT the octets that make up that representation). This means UTF8-ON keys are always downgraded on lookup or store in a hash. If the downgrade is successful the key is marked as WAS-UTF8 and the downgraded string is stored and hashed, if it was unsuccessful (eg it contains codepoints above 255) it is marked as UTF8-ON and the original buffer is hashed. When the key is extracted with keys() or each() if the WASUTF8 flag is set the string is upgraded back to the UTF8 form.

Thank you for this. I knew about the was-UTF8 status but didn’t know why it exists.

> I think you need to step back and consider that strings are sequences of octets. Sometimes those octets are ordered such that they can be interpreted as utf8. The UTF-8 flag being on tells perl that it can and should treat the octets as utf8.

C strings are sequences of octets, yes. Perl strings, though, are sequences of code points, not octets. In this they’re more like JavaScript strings than C strings.

> my $foo = "é";
>
> I don't know exactly what that code does without doing an octet level investigation of the data. It could be one octet and in latin-1 or it could be two octets and be Unicode in one of several formats (utf8, utf-16BE utf-16LE) and still be rendered identically in an editor or browser.

Sorry, I assumed we all use UTF-8 terminals. :) But yes, I should have written it as two \x escapes, sorry.

> I also know what happens here:
>
> my $foo="\x{c3}\x{a9}";
> utf8::decode($foo);
> Dump($foo);
>
> SV = PV(0x2303fc0) at 0x2324c98
> REFCNT = 1
> FLAGS = (POK,IsCOW,pPOK,UTF8)
> PV = 0x2328300 "\303\251"\0 [UTF8 "\x{e9}"]
> CUR = 2
> LEN = 10
> COW_REFCNT = 1
>
> That is, i start off with two octets, C3 - A9, which happens to be the encoding for the codepoint E9, which happens to be é.
> I then tell perl to "decode" those octets, which really means I tell perl to check that the octets actually do make up valid utf8. And if perl agrees that indeed these are valid utf8 octets, then it turns the flag on. Now it doesn't matter if you *meant* to construct utf8 in the variable fed to decode, all that matters is that at an octet level those octet happen to make up valid utf8.

I think you’re actually breaking the abstraction here by assuming that Perl implements the decode by setting a flag.

It would be just as legitimate to mutate the PV to store a single octet, 0xe9, and leave the UTF8 flag off. Perl doesn’t do that, of course, because it’s easier just to set a flag, but as long as the string content is the single code point 0xe9 it doesn’t really matter how Perl achieves that.

(Notwithstanding, of course, the abstraction leaks that things like the unicode_strings feature and Sys::Binmode fix.)

There are parts of the code that appear to go the other way and prioritize downgraded storage. Perl_refcounted_he_fetch_pvn(), for example.

-FG

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

kimoto.yuki at gmail

Aug 22, 2021, 4:58 PM

Post #19 of 46 (2415 views)

Personally, I'm starting to agree with Felipe's goal.

1. Users can distinguish between Text and Bytes.
2. Text is a sequence of Unicode code points, represented as UTF-8.
3. Perl's config has a default OS text character set and OS file system
character set.
4. Perl's standard functions (print, open, etc.) output a string by encoding
it with the character sets from item 3 if the string is Text.

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

Aug 30, 2021, 5:18 AM

Post #20 of 46 (2408 views)

(forgot to Cc this to p5p)

----- Forwarded message from Dave Mitchell <davem@iabyn.com> -----

Date: Mon, 30 Aug 2021 13:17:04 +0100
From: Dave Mitchell <davem@iabyn.com>
To: Felipe Gasper <felipe@felipegasper.com>
Subject: Re: Pre-RFC: Rename SVf_UTF8 et al.
Message-ID: <YSzMQJIeURS/AznY@iabyn.com>

On Wed, Aug 18, 2021 at 01:18:34PM -0400, Felipe Gasper wrote:
> PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of confusion regarding the flag’s significance.

The SVf_UTF8 flag has a clear and unambiguous meaning (apart from some
historical bugs): in what manner the codepoints of a string are stored as
a sequence of bytes in memory.

If people are confused by this, renaming it is only going to add to the
cognitive load and confusion.

(I'm assuming the old names are kept as aliases for backwards
compatibility.)

--
The Enterprise successfully ferries an alien VIP from one place to another
without serious incident.
-- Things That Never Happen in "Star Trek" #7

----- End forwarded message -----

--
print+qq&$}$"$/$s$,$a$d$g$s$@$.$q$,$:$.$q$^$,$@$a$~$;$.$q$m&if+map{m,^\d{0\,},,${$::{$'}}=chr($"+=$&||1)}q&10m22,42}6:17a2~2.3@3;^2dg3q/s"&=~m*\d\*.*g

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

felipe at felipegasper

Aug 30, 2021, 7:19 AM

Post #21 of 46 (2408 views)

> On Aug 30, 2021, at 8:18 AM, Dave Mitchell <davem@iabyn.com> wrote:
>
> Date: Mon, 30 Aug 2021 13:17:04 +0100
> From: Dave Mitchell <davem@iabyn.com>
> To: Felipe Gasper <felipe@felipegasper.com>
> Subject: Re: Pre-RFC: Rename SVf_UTF8 et al.
> Message-ID: <YSzMQJIeURS/AznY@iabyn.com>
>
> On Wed, Aug 18, 2021 at 01:18:34PM -0400, Felipe Gasper wrote:
>> PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of confusion regarding the flag’s significance.
>
>
> The SVf_UTF8 flag has a clear and unambiguous meaning (apart from some
> historical bugs): in what manner the codepoints of a string are stored as
> a sequence of bytes in memory.
>
> If people are confused by this, renaming it is only going to add to the
> cognitive load and confusion.

I’ve proposed some fixes for perlre.pod (https://github.com/Perl/perl5/pull/19087). These fix documentation bugs that crept in specifically because of the use of “UTF-8” to refer to “upgraded” strings. It confuses even Perl’s own maintainers.

The fact that “UTF-8 string” can mean two quite-different things causes lots of encoding bugs in the wild. The fact that Perl *can’t* help to fix these worsens the problem.

Ricardo sensed a problem here back in 2016: https://www.youtube.com/watch?v=TmTeXcEixEg&t=940s

… when he referred to the flag as WIDE, in part because the encoding in question is *not*, in fact, UTF-8. Then he said: “Some joker went ahead, and they called that the UTF-8 flag.” Chuckles ensued.

Benefits of changing the internal terminology:

- It clarifies “external”, Perl-visible encoding versus internal codepoint storage. Different terms for different things.
- More abstract terminology for the internals discourages folks from peeking behind the abstraction.
- It’s more correct. Proper UTF-8 forbids quite a lot that Perl’s “lax UTF-8” (by design) allows.

Thanks for reading.

-FG

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

demerphq at gmail

Sep 2, 2021, 6:20 AM

Post #22 of 46 (2405 views)

On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com> wrote:

>
> > On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
> >
> > On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com>
> wrote:
> > Per recent IRC discussion …
> >
> > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
> confusion regarding the flag’s significance. Some think it indicates
> whether a given PV stores text versus binary. Some think it means that the
> PV is valid UTF-8. Still others likely hold other inaccurate views.
> >
> > The problem here is the naming. For example, consider `perl -e'my $foo =
> "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its
> code points (assuming use of a UTF-8 terminal) correspond to the bytes that
> encode “é” in UTF-8.
> >
> > Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a
> square/rectangle relationship. All strings are "rectangles", all "squares"
> are rectangles, some strings are squares, but unless SQUARE flag is ON perl
> should assume it is a rectangle, not a square. The SQUARE flag should only
> be set when the rectangle has been proved conclusively to be a square. That
> the SQUARE flag is off does not mean the rectangle is not a square, merely
> that the square has not been proved to be such.
>
> You’re defining “a UTF-8 string” as “a string whose PV is marked as
> UTF-8”. I’m defining it as “a string whose Perl-visible code points happen
> to be valid UTF-8”.
>

I don't find your definition to be very useful, nor descriptive of how perl
manages these matters, so I am not using it. You are confusing different
levels of abstraction. Your definition also would include cases where the
data is already encoded and flagged as utf8. So it doesn't make sense to me.

Here is the set of definitions that I am operating from:

A "string" is a programming concept inside of Perl which is used to
represent "text" buffers of memory. There are three levels of abstraction
for strings, two of which are tightly coupled. The three are the codepoint
level, semantic level and encoding level.

At the codepoint level you can think of strings as variable length arrays
of numbers (codepoints), where the numbers are restricted to 0 to 0x10FFFF.

At the semantics level you can think of these numbers (codepoints) as
representing characters from some form of text with specific rules for
certain operations like case-folding, as well as a well defined mapping to
graphemes which are displayed to our eyes when those numbers are rendered
by a display device like a terminal.

The encoding level of abstraction addresses how those numbers (codepoints)
will be represented as bytes (octets) in memory inside of Perl, and when
you directly write the data to disk or to some other output stream.

There are two sets of codepoint range, semantics and encoding available,
which are controlled by a flag associated with the string called the UTF8
flag. When set, this flag indicates that the string can represent codepoints
0 to 0x10FFFF, should have Unicode semantics applied to it, and that its
in-memory representation is variable-width utf8. When the flag is not set it
indicates the string can represent codepoints 0 to 255, has ASCII
case-folding semantics, and that its in-memory representation is fixed
width octets.

In order to be able to combine these two types of strings we need to define
some operations:

upgrading/downgrading: converting a string from one set of semantics and
encoding to the other while preserving exactly the codepoint level
representation. By tradition we call it upgrading when we go from Latin-1
to Unicode with the result being UTF8-on, and we call it downgrading when
we go from Unicode to Latin-1 with the result being UTF8-off. These
operations are NOT symmetrical. It is *not* possible to downgrade every
Unicode string to Latin-1, however it is possible to upgrade every Latin-1
string to Unicode. By tradition upgrade and downgrade functions are no-ops
when their input is already in the form expected as the result, but this is
by tradition only.
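
The asymmetry can be sketched with utf8::downgrade() and its FAIL_OK argument, which makes it return false instead of dying. The variable names are only for illustration:

```perl
use strict;
use warnings;

# Every code point here is <= 0xFF, so the string can go both ways.
my $latin = "caf\xe9";
utf8::upgrade($latin);                           # always possible
my $latin_back = utf8::downgrade($latin, 1) ? 1 : 0;

# U+263A does not fit in one octet, so this string cannot be downgraded.
my $wide      = "\x{263A}";
my $wide_back = utf8::downgrade($wide, 1) ? 1 : 0;

printf "latin downgraded=%d (flag now %d)\n",
    $latin_back, utf8::is_utf8($latin) ? 1 : 0;
printf "wide  downgraded=%d (flag now %d)\n",
    $wide_back, utf8::is_utf8($wide) ? 1 : 0;
```

The Latin-1 range string round-trips, while the wide string stays in its upgraded form.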

decoding/encoding: converting a string from one form to the other in a way
that transforms the codepoints from one form to a potentially different
form. Traditionally we speak of decode_utf8() taking a Latin-1 string
containing octets that make up a utf8 encoded string, and returning a
string which is UTF8-on which represents the Unicode version of those
octets. For well formed input this results in no change to the underlying
string, but the flag is flipped on. Vice versa we speak of encode_utf8()
which converts its input to a utf8 encoded form, regardless of what form it
was represented in internally.

When we are confronted with combining the two forms of string Perl has
little choice but to use the "safe" strategy of "upgrading" the Latin-1
parts to Unicode.

Both the operations of "upgrading" and "decoding" result in Utf8-on
strings, and indeed both can result in not changing their input at all, but
when they do change their input they change it very differently. Most of
the places people get into trouble with strings is when they end up doing
upgrade operations when they should have done a decode operation. This is
because upgrade operations can happen implicitly based on simple rules and
thus can happen "by accident", but decode operations are always explicit so
they never happen without the involvement of the developer in some way.
This is at least partly because upgrade operations do not have any failure
modes but decode operations do.
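
To make the difference concrete, here is a hedged sketch of what happens
when UTF-8 octets get upgraded instead of decoded. The variable names are
invented for the example, and the concatenation at the end stands in for
any operation that upgrades implicitly.

```perl
use strict;
use warnings;

my $octets = "\xC3\xA9";            # UTF-8 octets for U+00E9, flag off

my $upgraded = $octets;
utf8::upgrade($upgraded);           # still two codepoints: U+00C3 U+00A9

my $decoded = $octets;
utf8::decode($decoded);             # one codepoint: U+00E9

# Encoding each result for output shows the damage.
my $wire_up  = $upgraded; utf8::encode($wire_up);
my $wire_dec = $decoded;  utf8::encode($wire_dec);

printf "upgraded: %d codepoints, %d octets once encoded\n",
    length($upgraded), length($wire_up);
printf "decoded:  %d codepoints, %d octets once encoded\n",
    length($decoded), length($wire_dec);

# The implicit route: joining the octets with a wide character
# upgrades them without anyone asking for it.
my $mixed = $octets . "\x{263A}";
printf "mixed:    %d codepoints\n", length($mixed);
```

The four-octet result is the classic double-encoded form.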

Most of the time, as long as you are only thinking about codepoints,
developers don't have to worry about this stuff. The places where they do
are when they are reading or writing data, and in some cases when they are
embedding string constants in their code where they want a particular set
of semantics and encoding. As long as people are disciplined to use
decode_utf8() before they use utf8 string data, and encode_utf8 before
they emit it then the complexities above should be transparent to the
developer.

> What you call “a UTF-8 string” is what I propose we call, per existing
> nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with
> corresponding code changes. Then the term “UTF-8 string” makes sense from a
> pure-Perl context without requiring Perl programmers to worry about
> interpreter internals.
>
>
No. The flag does not mean "upgraded" it means "unicode semantics, utf8
encoding". Upgrading is one way to get such a string, and it might even be
the most common, but the most important and likely to be correct way is
explicit decoding.

If we are to rename the flag then we should just rename it as the UNICODE
flag. Would have saved a world of confusion.

> The “UTF-8 flag”, however, is likely *not* set on this string. By
> contrast, consider `perl -Mutf8 -e'my $foo = "é"'`. Here $foo has the
> “UTF-8 flag” set, but $foo is NOT a “UTF-8 string” because its code points
> (in this case, only 1) aren’t valid UTF-8.
> >
> > Except it is valid UTF-8: (at least in my utf8 terminal).
> >
> > $ perl -MDevel::Peek -Mutf8 -e'my $foo = "é"; Dump($foo)'
> > SV = PV(0x153efc0) at 0x155fb38
> > REFCNT = 1
> > FLAGS = (POK,IsCOW,pPOK,UTF8)
> > PV = 0x1563240 "\303\251"\0 [UTF8 "\x{e9}"]
> > CUR = 2
> > LEN = 10
> > COW_REFCNT = 1
> >
> > So the string is UTF-8.
>
> Again, different definitions.

You can't define yourself away from how things actually work. The string is
UTF8 on because perl says so.

> The Perl-visible string contains a single code point, 0xe9. This code
> point doesn’t correspond to valid UTF-8 bytes,

Codepoints and octets (bytes) are abstractions at different levels in Perl
and Unicode so "this codepoint doesn't correspond to valid UTF-8 bytes"
doesn't really make any sense as a sentence. Codepoints are integers from 0
to 0x10FFFF. They can be *encoded* in a variety of ways as octets, for
instance the codepoint E9 has at least 5 different representations at the
octet level under Unicode: "\x{E9}\x{00}", "\x{00}\x{E9}",
"\x{E9}\x{00}\x{00}\x{00}", "\x{00}\x{00}\x{00}\x{E9}", and "\303\251" are
all equally valid ways of representing the codepoint E9. Notice that the
octet "E9" by itself is NOT a valid way to represent the codepoint E9 in
any Unicode encoding.
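
Those five octet sequences can be reproduced with the Encode module. This
is only a sketch; the encoding names are the ones Encode ships with, and
the loop is illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{E9}";            # the single codepoint U+00E9

for my $enc (qw(UTF-16LE UTF-16BE UTF-32LE UTF-32BE UTF-8)) {
    my $octets = encode($enc, $char);
    # Show each octet of the encoded form in hex.
    printf "%-8s %s\n", $enc,
        join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;
}
```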

The dump above shows correctly the octet and codepoint representation of
the string. The buffer contains "\303\251" which is the UTF8
representation of the codepoint E9, and the flag is on which is why it
understands that this is a single codepoint, not two.

> so IMO it doesn’t make sense to call it a “UTF-8 string”.

After the operations I performed it is a Unicode string encoded in UTF-8,
as represented by the UTF-8 flag in the dump.

> Whether Perl stores that code point as one byte or as two is Perl’s
> business alone … right?
>

Well it would be weird if we stored Unicode data in a form not supported by
Unicode. Don't you think? There is no single octet representation of the
codepoint E9 defined by Unicode as far as I know.

>
> > I do not understand your point that only the initiated can understand
> this flag. It means one and only one thing: that the perl internals should
> assume that the buffer contains utf8 encoded data and that perl should
> apply unicode semantics when doing character and case-sensitive operations,
> and that perl can make certain assumptions when it processing the data (eg
> that is not malformed).
>
> The behaviour you’re talking about is what the unicode_strings and
> unicode_eval features specifically do away with (i.e., fix), right?

I'm not familiar with those enough to comment. I assume they relate to what
assumptions Perl should make about strings which are constructed as
literals in the source code, where there is a great deal of ambiguity about
what is going on compared to actual code that constructs such strings,
where things are exact.

>
> You’re omitting what IMO is the most obvious purpose of the flag: to
> indicate whether the code points that the PV stores are the plain bytes, or
> are the UTF-8-decoded code points. This is why you can print() the string
> in either upgraded or downgraded forms, and it comes out the same.
>

It's hard to say what you are referring to here. If you mean codepoints
0-127, then it is unsurprising as the representation of them is equivalent
in ASCII and UTF8 and Latin1. But if you mean a codepoint above the ASCII
plane, then no they should not come out the same. If you are piping that
data to a file I would expect the octets written to that file to be
different. (assuming a binary filehandle with no layers magically
transforming things). If your terminal renders them the same then I assume
it is doing some magic behind the scenes to deal with malformed utf8.

>
> > BTW, your scheme needs to account for WAS_UTF8 as well. Most people dont
> know it, but there are actually three types of strings in the perl
> internals, UTF8-ON, UTF8-OFF, UTF8-OFF + WAS_UTF8. It only manifests in
> hash keys. But it needs to be accounted for as well in any renaming. Perl
> dictates that keys which are character-wise equivalent hash the same
> regardless of the UTF8 flag (or put alternative, the hash should be of the
> codepoints the string represents NOT the octets that make up that
> representation). This means UTF8-ON keys are always downgraded on lookup or
> store in a hash. If the downgrade is successful the key is marked as
> WAS-UTF8 and the downgraded string is stored and hashed, if it was
> unsuccessful (eg it contains codepoints above 255) it is marked as UTF8-ON
> and the original buffer is hashed. When the key is extracted with keys() or
> each() if the WASUTF8 flag is set the string is upgraded back to the UTF8
> form.
>
> Thank you for this. I knew about the was-UTF8 status but didn’t know why
> it exists.
>
> > I think you need to step back and consider that strings are sequences of
> octets. Sometimes those octets are ordered such that they can be
> interpreted as utf8. The UTF-8 flag being on tells perl that it can and
> should treat the octets as utf8.
>
> C strings are sequences of octets, yes. Perl strings, though, are
> sequences of code points, not octets. In this they’re more like JavaScript
> strings than C strings.
>

Perl strings are very similar to C strings when the flag is off, and
JavaScript strings when the flag is on.

>
> > my $foo = "é";
> >
> > I don't know exactly what that code does without doing an octet level
> investigation of the data. It could be one octet and in latin-1 or it could
> be two octets and be Unicode in one of several formats (utf8, utf-16BE
> utf-16LE) and still be rendered identically in an editor or browser.
>
> Sorry, I assumed we all use UTF-8 terminals. :) But yes, I should have
> written it as two \x escapes, sorry.
>

I do, but this email is being rendered by gmail in a browser. Any number of
conversions of the actual bytes on disk could have happened between you and
me. For all I know you might have written your email in a text editor using
UTF-32.

>
> > I also know what happens here:
> >
> > my $foo="\x{c3}\x{a9}";
> > utf8::decode($foo);
> > Dump($foo);
> >
> > SV = PV(0x2303fc0) at 0x2324c98
> > REFCNT = 1
> > FLAGS = (POK,IsCOW,pPOK,UTF8)
> > PV = 0x2328300 "\303\251"\0 [UTF8 "\x{e9}"]
> > CUR = 2
> > LEN = 10
> > COW_REFCNT = 1
> >
> > That is, i start off with two octets, C3 - A9, which happens to be the
> encoding for the codepoint E9, which happens to be é.
> > I then tell perl to "decode" those octets, which really means I tell
> perl to check that the octets actually do make up valid utf8. And if perl
> agrees that indeed these are valid utf8 octets, then it turns the flag on.
> Now it doesn't matter if you *meant* to construct utf8 in the variable fed
> to decode, all that matters is that at an octet level those octet happen to
> make up valid utf8.
>
> I think you’re actually breaking the abstraction here by assuming that
> Perl implements the decode by setting a flag.
>
>
No I am not. The flag is there to tell the perl internals how to
manipulate the string. decode's task is to take arbitrary strings of octets
and ensure that they can be decoded as valid utf8 and possibly to do some
conversion (eg for forbidden utf8 sequences or other normalization) as it
does so and then SETS THE FLAG. Only once decode is done is the string
"Unicode" and is the string "utf8". Prior to that it was just random
octets. It doesn't need to do anything BUT set the flag because its internal
encoding matches the external encoding in this case. If it was decoding
UTF16LE then it would have to do conversion as well.

> It would be just as legitimate to mutate the PV to store a single octet,
> 0xe9, and leave the UTF8 flag off.

Nope. That would mean that Perl would use ASCII/Latin-1 case folding rules
on the result, which would be wrong. It should use Unicode case folding
rules for codepoint E9 if it was decoded as that codepoint. (Change the
example to \x{DF} and you can see these issues in the flesh, \x{DF} should
match "ss" in Unicode, but in ASCII/Latin1 it only matches \x{DF}. The
case-folded (fc()) version of \x{DF} is "ss" but in Latin-1/Ascii there are
no multi-byte case folds). Even more suggestive that Perl doing this would
be wrong is that in fact there is NO valid Unicode encoding of codepoint E9
which is only 1 octet long. So that would be extremely wrong of Perl to use
a non Unicode encoding of unicode data, don't you think? Also, what would
perl do when the codepoint doesn't fit into a single octet? Your argument
might have some merit if you were arguing that Perl could have decoded it
into "\x{E9}\0\0\0" and set the UTF-32 flag, but as stated it doesn't make
sense.

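The \x{DF} case can be seen with uc(), which behaves differently depending
on the flag when the unicode_strings feature is not enabled (this is what
perlunicode calls "The Unicode Bug"; case-insensitive matching is affected
by the same split). A minimal sketch, with a made-up helper name:

```perl
use strict;
use warnings;
# Deliberately no "use feature 'unicode_strings'" (and no "use v5.12" or
# later, which would enable it).

# Illustrative helper: render a string as a list of codepoints.
sub codepoints { join ' ', map { sprintf 'U+%04X', ord } split //, $_[0] }

my $off = "\xDF";               # flag off: ASCII case rules apply
my $on  = "\xDF";
utf8::upgrade($on);             # same codepoint, flag on: Unicode rules

print "equal as strings\n" if $off eq $on;
print "uc, flag off: ", codepoints(uc($off)), "\n";
print "uc, flag on:  ", codepoints(uc($on)),  "\n";
```

The two strings compare equal, yet uc() leaves the flag-off one alone and
turns the flag-on one into "SS".
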
> Perl doesn’t do that, of course, because it’s easier just to set a flag,
> but as long as the string content is the single code point 0xe9 it doesn’t
> really matter how Perl achieves that.
>

Yes, Perl deliberately chose to use Utf8 internally for the same reason
Unicode defined utf8 the way it did, so that all of the existing ASCII data
would still be valid when interpreted as Unicode, thus avoiding storage and
performance penalties alternative schemes might impose.

> (Notwithstanding, of course, the abstraction leaks that things like the
> unicode_strings feature and Sys::Binmode fix.)
>
> There are parts of the code that appear to go the other way and prioritize
> downgraded storage. Perl_refcounted_he_fetch_pvn(), for example.
>

I would not have put it like that. With hashing you don't have a lot of
choices if you want the unicode form of latin-1 strings to hash the same.
You can decode to the codepoints and then use a codepoint by codepoint
hashing algorithm, which is slow, and as far as I know there aren't any
published hash algorithms that do this. So to stay safe with the hash
function you can either downgrade strings which can be downgraded and then
hash the result, or you can upgrade the strings and hash the upgraded form.
Upgraded strings are on average larger than their downgraded equivalents,
so hashing them is more expensive, and there is an assumption that most
keys will actually be ASCII so they don't need to be downgraded. When you
consider that perl was an early adopter of Unicode and was bolting it on to
a latin-1 codebase the bias seems pretty reasonable.
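
From the Perl side, the visible effect is simply that the flag never
distinguishes two keys. A small sketch (variable names are illustrative):

```perl
use strict;
use warnings;

my $off = "\xE9";               # flag off
my $on  = "\xE9";
utf8::upgrade($on);             # same codepoint, flag on

my %h;
$h{$off} = 'first';
$h{$on}  = 'second';            # lands on the same entry

print scalar(keys %h), " key(s)\n";
print "value: $h{$off}\n";
```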

cheers,
yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

demerphq at gmail

Sep 2, 2021, 6:52 AM

Post #23 of 46 (2405 views)

On Mon, 23 Aug 2021 at 01:59, Yuki Kimoto <kimoto.yuki@gmail.com> wrote:

> Personally, I'm starting to agree on the goal of Felipe.
>
> 1. Being able to distinguish between Text and Bytes from user
>

It seems like what you want is to redefine our use of Unicode-string/UTF-8
flag to be "Text" and then to call the other form "Bytes", but that doesn't
make sense. We define non-utf8 data to be implicitly ASCII/Latin-1. ASCII
because of case folding rules. Latin-1 because of conversion to Unicode,
which defines codepoints 0-255 to be equivalent to the codepoints 0-255 in
latin1. And we implicitly assume the equivalency in our operations.

> 2. Text is Unicode code point which is represented by UTF-8
>

chr(65) returns a latin-1 (eg NON-UTF8 flagged) character/string "A" which
happens to be octet identical but not flag identical to the Unicode
character "A". Are you suggesting that chr() doesn't return Text? Wouldn't
that be weird? And in concatenation what is supposed to happen when you
have Bytes . Text? Is that even legal in your scheme?

Take this further, is an operation like lc() even legal on "Bytes"?
Currently: lc(chr(65)) eq "a". Since chr(65) doesn't return a Unicode
character, and thus is not Text, shouldn't the lc() die? Or would you also
want to change that?
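
For reference, this is how those cases behave today. It is a sketch that
only inspects the flag with utf8::is_utf8(); the variable names are made
up.

```perl
use strict;
use warnings;

my $cap = chr(65);              # "A", flag off
print "chr(65) flag: ", (utf8::is_utf8($cap) ? "on" : "off"), "\n";
print "lc works on it\n" if lc($cap) eq "a";

my $wide = chr(0x263A);         # cannot fit in one octet, so flag on
print "chr(0x263A) flag: ", (utf8::is_utf8($wide) ? "on" : "off"), "\n";

# Flag-off . flag-on is legal: the flag-off part is upgraded and its
# codepoints are preserved.
my $joined = "\xE9" . $wide;
printf "joined: %d codepoints, first is U+%04X\n",
    length($joined), ord($joined);
```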

> 3. Perl config has default OS text character set and OS file system
> character set
>

As far as I know the assumption that all non-Unicode data is Latin-1 is
baked into Perl in a very firm way. So I don't see how this could be related
to the OS.

> 4. Perl standard function(print, open, etc) output string by encoding
> above 3 character set if the string is Text.
>

I don't see how we could change this. Anyone who cares exactly how data is
emitted to disk or any other "wire" format should be using Encode to
explicitly encode their data.

Perl strings are what perl strings are. I find that the people who have
trouble with them are usually the ones who like to pretend they work
differently than they do, instead of just respecting how they work and
being very explicit when they need to care, which for me personally has
been pretty rarely, eg, specialized output code or processing code.
(Parsing emails is a good place where you can get burned with encoding
issues and learn a lot.)

Having said that I have seen a lot of people for one reason or another get
encoding wrong in various ways, especially with MySQL or other over-wire
situations. Double encoding errors are common (eg where people accidentally
upgrade already encoded but flag-off utf8 data). At work we have a function
called recurse_decode_utf8() which takes a string and does its best to
"reduce" it to its minimal form by repeatedly turning off the utf8 flag,
and then executing decode_utf8() on the string and then downgrade until the
decode operation throws an error. Widespread use of this function on string
data almost completely eliminated all of our utf8 problems. (I'll post the
code in another mail.)
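
In the meantime, the general idea can be sketched as follows. This is NOT
the recurse_decode_utf8() mentioned above; reduce_utf8_sketch() is a
hypothetical stand-in written from the description, using the core
utf8:: functions, with an arbitrary cap on the number of passes.

```perl
use strict;
use warnings;

# Hypothetical helper: undo repeated UTF-8 encoding by decoding until
# the data stops looking like UTF-8.
sub reduce_utf8_sketch {
    my ($str) = @_;
    for (1 .. 8) {                              # arbitrary pass limit
        my $try = $str;
        last unless utf8::downgrade($try, 1);   # needs all codepoints <= 255
        last unless utf8::decode($try);         # false if not valid UTF-8
        last if $try eq $str;                   # plain ASCII, nothing to undo
        $str = $try;
    }
    return $str;
}

my $double = "\xC3\x83\xC2\xA9";    # U+00E9 encoded as UTF-8 twice
my $fixed  = reduce_utf8_sketch($double);
printf "%d codepoint(s), first is U+%04X\n", length($fixed), ord($fixed);
```

A function like this is a heuristic: it will also "reduce" text that
legitimately consists of characters such as U+00C3 followed by U+00A9, so
it only suits data where double encoding is the likelier explanation.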

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

demerphq at gmail

Sep 2, 2021, 7:02 AM

Post #24 of 46 (2405 views)

On Thu, 2 Sept 2021 at 15:20, demerphq <demerphq@gmail.com> wrote:

> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com>
> wrote:
>
>>
>> > On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
>> >
>> > On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com>
>> wrote:
>> > Per recent IRC discussion …
>> >
>> > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
>> confusion regarding the flag’s significance. Some think it indicates
>> whether a given PV stores text versus binary. Some think it means that the
>> PV is valid UTF-8. Still others likely hold other inaccurate views.
>> >
>> > The problem here is the naming. For example, consider `perl -e'my $foo
>> = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that
>> its code points (assuming use of a UTF-8 terminal) correspond to the bytes
>> that encode “é” in UTF-8.
>> >
>> > Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a
>> square/rectangle relationship. All strings are "rectangles", all "squares"
>> are rectangles, some strings are squares, but unless SQUARE flag is ON perl
>> should assume it is a rectangle, not a square. The SQUARE flag should only
>> be set when the rectangle has been proved conclusively to be a square. That
>> the SQUARE flag is off does not mean the rectangle is not a square, merely
>> that the square has not been proved to be such.
>>
>> You’re defining “a UTF-8 string” as “a string whose PV is marked as
>> UTF-8”. I’m defining it as “a string whose Perl-visible code points happen
>> to be valid UTF-8”.
>>
>
> I dont find your definition to be very useful, nor descriptive of how perl
> manages these matters, so I am not using it. You are confusing different
> levels of abstraction. Your definition also would include cases where the
> data is already encoded and flagged as utf8. So it doesn't make sense to me.
>
> Here is the set of definitions that I am operating from:
>
> A "string" is a programming concept inside of Perl which is used to
> represent "text" buffers of memory. There are three level of abstraction
> for strings, two of which are tightly coupled. The three are the codepoint
> level, semantic level and encoding level.
>
> At the codepoint levels you can think of strings as variable length arrays
> of numbers (codepoints), where the numbers are restricted to 0 to 0x10FFFF.
>
> At the semantics level you can think of these numbers (codepoints) of
> representing characters from some form of text with specific rules for
> certain operations like case-folding, as well as a well defined mapping to
> graphemes which are displayed to our eyes when those numbers are rendered
> by a display device like a terminal.
>
> The encoding level of abstraction addresses how those numbers (codepoints)
> will be represented as bytes (octets) in memory inside of Perl, and when
> you directly write the data to disk or to some other output stream.
>
> There are two sets of codepoint range, semantics and encoding available,
> which are controlled by a flag associated with the string called the UTF8
> flag. When set this flag indicates that the string can represent codepoints
> 0 to 0x10FFFF, should have Unicode semantics applied to it, and that its in
> memory representation is variable-width utf8. When the flag is not set it
> indicates the string can represent codepoints 0 to 255, has ASCII
> case-folding semantics, and that its in memory representation is fixed
> width octets.
>
> In order to be able to combine these two types of strings we need to
> define some operations:
>
> upgrading/downgrading: converting a string from one set of semantics and
> encoding to the other while preserving exactly the codepoint level
> representation. By tradition we call it upgrading when we go from Latin-1
> to Unicode with the result being UTF8 on, and we call it downgrading when
> we go from Unicode to Latin1 with the result being UTF8-off. These
> operations are NOT symmetrical. It is *not* possible to downgrade every
> Unicode string to Latin-1, however it is possible to upgrade every Latin-1
> string to Unicode. By tradition upgrade and downgrade functions are noops
> when their input is already in the form expected as the result, but this is
> by tradition only.
>
> decoding/encoding: converting a string from one form to the other in a way
> that transforms the codepoints from one form to a potentially different
> form. Traditional we speak of decode_utf8() taking a latin1 string
> containing octets that make up a utf8 encoded string, and returning a
> string which is UTF8 on which represents the Unicode version of those
> octets. For well formed input this results in no change to the underlying
> string, but the flag is flipped on. Vice versa we speak of encode_utf8()
> which converts its input to a utf8 encoded form, regardless of what form it
> was represented internally.
>
> When we are confronted with combining the two forms of string Perl has
> little choice but to use the "safe" strategy of "upgrading" the Latin-1
> parts to Unicode.
>
> Both the operations of "upgrading" and "decoding" result in Utf8-on
> strings, and indeed both can result in not changing their input at all, but
> when they do change their input they change it very differently. Most of
> the places people get into trouble with strings is when they end up doing
> upgrade operations when they should have done a decode operation. This is
> because upgrade operations can happen implicitly based on simple rules and
> thus can happen "by accident", but decode operations are always explicit so
> they never happen without the involvement of the developer in some way.
> This is at least partly because upgrade operations do not have any failure
> modes but decode operations do.
>
> Most of the time, as long as you are only thinking about codepoints,
> developers dont have to worry about this stuff. The places where they do
> are when they are reading or writing data, and in some cases when they are
> embedding string constants in their code where they want a particular set
> of semantics and encoding. As long as people are disciplined to use
> decode_utf8() before they use utf8 string data, and encode_utf8 before
> they emit it then the complexities above should be transparent to the
> developer.
>

I was rereading this and I thought of something to add here. Part of the
confusion with Perl strings is that we try to hide the flag. We don't really
want people to look at it and think about it. Instead we provide a handful
of verbs which can be used to force the string to the shape we want, or
throw an error if we can't (or sometimes be a no-op).

I mean, if I want to be sure I have a latin-1 string then I would do
something like:

eval { utf8::downgrade($str); 1 } or warn "Cant downgrade string!";

And if I want to be sure I have a utf8 string then I would do something like:

utf8::upgrade($str);

I wonder if we made accessing the flag state more socially acceptable
whether people would find this less confusing.
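
For completeness, here is a sketch of both styles side by side: looking at
the flag with utf8::is_utf8(), and forcing the shape with the verbs. The
FAIL_OK second argument to utf8::downgrade() is the documented way to get
a false return instead of an exception; the variable names are made up.

```perl
use strict;
use warnings;

my $str = "caf\xE9";
utf8::upgrade($str);
print "flag after upgrade: ", (utf8::is_utf8($str) ? "on" : "off"), "\n";

# FAIL_OK set: returns false instead of dying if the string won't fit.
if (utf8::downgrade($str, 1)) {
    print "flag after downgrade: ", (utf8::is_utf8($str) ? "on" : "off"), "\n";
}

# Without FAIL_OK a wide character makes downgrade die, hence the eval.
my $wide = "\x{263A}";
my $ok   = eval { utf8::downgrade($wide); 1 };
print $ok ? "downgraded\n" : "downgrade died: $@";
```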

Yves

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

felipe at felipegasper

Sep 2, 2021, 7:06 AM

Post #25 of 46 (2405 views)

> On Sep 2, 2021, at 10:02 AM, demerphq <demerphq@gmail.com> wrote:
>
> On Thu, 2 Sept 2021 at 15:20, demerphq <demerphq@gmail.com> wrote:
> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com> wrote:
>
> > On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
> >
> > On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com> wrote:
> > Per recent IRC discussion …
> >
> > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of confusion regarding the flag’s significance. Some think it indicates whether a given PV stores text versus binary. Some think it means that the PV is valid UTF-8. Still others likely hold other inaccurate views.
> >
> > The problem here is the naming. For example, consider `perl -e'my $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that its code points (assuming use of a UTF-8 terminal) correspond to the bytes that encode “é” in UTF-8.
> >
> > Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a square/rectangle relationship. All strings are "rectangles", all "squares" are rectangles, some strings are squares, but unless SQUARE flag is ON perl should assume it is a rectangle, not a square. The SQUARE flag should only be set when the rectangle has been proved conclusively to be a square. That the SQUARE flag is off does not mean the rectangle is not a square, merely that the square has not been proved to be such.
>
> You’re defining “a UTF-8 string” as “a string whose PV is marked as UTF-8”. I’m defining it as “a string whose Perl-visible code points happen to be valid UTF-8”.
>
> I dont find your definition to be very useful, nor descriptive of how perl manages these matters, so I am not using it. You are confusing different levels of abstraction. Your definition also would include cases where the data is already encoded and flagged as utf8. So it doesn't make sense to me.
>
> Here is the set of definitions that I am operating from:
>
> A "string" is a programming concept inside of Perl which is used to represent "text" buffers of memory. There are three level of abstraction for strings, two of which are tightly coupled. The three are the codepoint level, semantic level and encoding level.
>
> At the codepoint levels you can think of strings as variable length arrays of numbers (codepoints), where the numbers are restricted to 0 to 0x10FFFF.
>
> At the semantics level you can think of these numbers (codepoints) of representing characters from some form of text with specific rules for certain operations like case-folding, as well as a well defined mapping to graphemes which are displayed to our eyes when those numbers are rendered by a display device like a terminal.
>
> The encoding level of abstraction addresses how those numbers (codepoints) will be represented as bytes (octets) in memory inside of Perl, and when you directly write the data to disk or to some other output stream.
>
> There are two sets of codepoint range, semantics and encoding available, which are controlled by a flag associated with the string called the UTF8 flag. When set this flag indicates that the string can represent codepoints 0 to 0x10FFFF, should have Unicode semantics applied to it, and that its in memory representation is variable-width utf8. When the flag is not set it indicates the string can represent codepoints 0 to 255, has ASCII case-folding semantics, and that its in memory representation is fixed width octets.
>
> In order to be able to combine these two types of strings we need to define some operations:
>
> upgrading/downgrading: converting a string from one set of semantics and encoding to the other while preserving exactly the codepoint level representation. By tradition we call it upgrading when we go from Latin-1 to Unicode with the result being UTF8 on, and we call it downgrading when we go from Unicode to Latin1 with the result being UTF8-off. These operations are NOT symmetrical. It is *not* possible to downgrade every Unicode string to Latin-1, however it is possible to upgrade every Latin-1 string to Unicode. By tradition upgrade and downgrade functions are noops when their input is already in the form expected as the result, but this is by tradition only.
>
> decoding/encoding: converting a string from one form to the other in a way that transforms the codepoints from one form to a potentially different form. Traditional we speak of decode_utf8() taking a latin1 string containing octets that make up a utf8 encoded string, and returning a string which is UTF8 on which represents the Unicode version of those octets. For well formed input this results in no change to the underlying string, but the flag is flipped on. Vice versa we speak of encode_utf8() which converts its input to a utf8 encoded form, regardless of what form it was represented internally.
>
> When we are confronted with combining the two forms of string Perl has little choice but to use the "safe" strategy of "upgrading" the Latin-1 parts to Unicode.
>
> Both the operations of "upgrading" and "decoding" result in Utf8-on strings, and indeed both can result in not changing their input at all, but when they do change their input they change it very differently. Most of the places people get into trouble with strings is when they end up doing upgrade operations when they should have done a decode operation. This is because upgrade operations can happen implicitly based on simple rules and thus can happen "by accident", but decode operations are always explicit so they never happen without the involvement of the developer in some way. This is at least partly because upgrade operations do not have any failure modes but decode operations do.
>
> Most of the time, as long as you are only thinking about codepoints, developers dont have to worry about this stuff. The places where they do are when they are reading or writing data, and in some cases when they are embedding string constants in their code where they want a particular set of semantics and encoding. As long as people are disciplined to use decode_utf8() before they use utf8 string data, and encode_utf8 before they emit it then the complexities above should be transparent to the developer.
>
>
> I was rereading this and I thought of something to add here. Part of the confusion with Perl strings is that we try to hide the flag. We dont really want people to look at it and think about it. Instead we provide a handful of verbs which can be used to force the string to the shape we want, or throw an error if we cant (or sometimes be a no-op).
>
> I mean, if I want to be sure i have a latin-1 string then i would do something like:
>
> eval { utf8::downgrade($str); 1 } or warn "Cant downgrade string!";
>
> And if want to be user I have a utf8 string then I would do something like:
>
> utf8::upgrade($str);
>
> I wonder if we made accessing the flag state more socially acceptable whether people would find this less confusing.

----
use v5.34;
use Sys::Binmode;

# When would I ever need to look at the flag here?
----

(Responses to the other stuff pending.)

-FG

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

grinnz at gmail

Sep 2, 2021, 7:39 AM

Post #26 of 46 (1735 views)

There is way too much written here so I will be responding as I can.

On Thu, Sep 2, 2021 at 9:21 AM demerphq <demerphq@gmail.com> wrote:

> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com>
> wrote:
>
>>
>> > On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
>> >
>> > On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com>
>> wrote:
>> > Per recent IRC discussion …
>> >
>> > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
>> confusion regarding the flag’s significance. Some think it indicates
>> whether a given PV stores text versus binary. Some think it means that the
>> PV is valid UTF-8. Still others likely hold other inaccurate views.
>> >
>> > The problem here is the naming. For example, consider `perl -e'my $foo
>> = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that
>> its code points (assuming use of a UTF-8 terminal) correspond to the bytes
>> that encode “é” in UTF-8.
>> >
>> > Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a
>> square/rectangle relationship. All strings are "rectangles", all "squares"
>> are rectangles, some strings are squares, but unless SQUARE flag is ON perl
>> should assume it is a rectangle, not a square. The SQUARE flag should only
>> be set when the rectangle has been proved conclusively to be a square. That
>> the SQUARE flag is off does not mean the rectangle is not a square, merely
>> that the square has not been proved to be such.
>>
>> You’re defining “a UTF-8 string” as “a string whose PV is marked as
>> UTF-8”. I’m defining it as “a string whose Perl-visible code points happen
>> to be valid UTF-8”.
>>
>
> I dont find your definition to be very useful, nor descriptive of how perl
> manages these matters, so I am not using it. You are confusing different
> levels of abstraction. Your definition also would include cases where the
> data is already encoded and flagged as utf8. So it doesn't make sense to me.
>
> Here is the set of definitions that I am operating from:
>
> A "string" is a programming concept inside of Perl which is used to
> represent "text" buffers of memory. There are three level of abstraction
> for strings, two of which are tightly coupled. The three are the codepoint
> level, semantic level and encoding level.
>
> At the codepoint levels you can think of strings as variable length arrays
> of numbers (codepoints), where the numbers are restricted to 0 to 0x10FFFF.
>
> At the semantics level you can think of these numbers (codepoints) of
> representing characters from some form of text with specific rules for
> certain operations like case-folding, as well as a well defined mapping to
> graphemes which are displayed to our eyes when those numbers are rendered
> by a display device like a terminal.
>
> The encoding level of abstraction addresses how those numbers (codepoints)
> will be represented as bytes (octets) in memory inside of Perl, and when
> you directly write the data to disk or to some other output stream.
>
> There are two sets of codepoint range, semantics and encoding available,
> which are controlled by a flag associated with the string called the UTF8
> flag. When set this flag indicates that the string can represent codepoints
> 0 to 0x10FFFF, should have Unicode semantics applied to it, and that its in
> memory representation is variable-width utf8. When the flag is not set it
> indicates the string can represent codepoints 0 to 255, has ASCII
> case-folding semantics, and that its in memory representation is fixed
> width octets.
>
> In order to be able to combine these two types of strings we need to
> define some operations:
>
> upgrading/downgrading: converting a string from one set of semantics and
> encoding to the other while preserving exactly the codepoint level
> representation. By tradition we call it upgrading when we go from Latin-1
> to Unicode with the result being UTF8 on, and we call it downgrading when
> we go from Unicode to Latin1 with the result being UTF8-off. These
> operations are NOT symmetrical. It is *not* possible to downgrade every
> Unicode string to Latin-1, however it is possible to upgrade every Latin-1
> string to Unicode. By tradition upgrade and downgrade functions are noops
> when their input is already in the form expected as the result, but this is
> by tradition only.
>
> decoding/encoding: converting a string from one form to the other in a way
> that transforms the codepoints from one form to a potentially different
> form. Traditional we speak of decode_utf8() taking a latin1 string
> containing octets that make up a utf8 encoded string, and returning a
> string which is UTF8 on which represents the Unicode version of those
> octets. For well formed input this results in no change to the underlying
> string, but the flag is flipped on. Vice versa we speak of encode_utf8()
> which converts its input to a utf8 encoded form, regardless of what form it
> was represented internally.
>

This is incorrect. Decode converts a string of bytes at the logical level
(upgraded or downgraded does not matter) and returns a string of characters
at the logical level (upgraded or downgraded does not matter). It may
commonly use upgraded or downgraded strings as the input or output for
efficiency but this is not required.

>
>
>> What you call “a UTF-8 string” is what I propose we call, per existing
>> nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with
>> corresponding code changes. Then the term “UTF-8 string” makes sense from a
>> pure-Perl context without requiring Perl programmers to worry about
>> interpreter internals.
>>
>>
> No. The flag does not mean "upgraded" it means "unicode semantics, utf8
> encoding". Upgrading is one way to get such a string, and it might even be
> the most common, but the most important and likely to be correct way is
> explicit decoding.
>
> If we are to rename the flag then we should just rename it as the UNICODE
> flag. Would have saved a world of confusion.
>

This is exactly what we have defined as "upgraded". Decoding does not
define the internal format of the resulting string at all. The only
internal format which is upgraded is when the UTF8 flag is on.

>
>> Whether Perl stores that code point as one byte or as two is Perl’s
>> business alone … right?
>>
>
> Well it would be weird if we stored Unicode data in a form not supported
> by Unicode. Dont you think? There is no single octet representation of the
> codepoint E9 defined by Unicode as far as I know.
>
>
>>
>> > I do not understand your point that only the initiated can understand
>> this flag. It means one and only one thing: that the perl internals should
>> assume that the buffer contains utf8 encoded data and that perl should
>> apply unicode semantics when doing character and case-sensitive operations,
>> and that perl can make certain assumptions when it processing the data (eg
>> that is not malformed).
>>
>> The behaviour you’re talking about is what the unicode_strings and
>> unicode_eval features specifically do away with (i.e., fix), right?
>
>
> Im not familiar with those enough to comment. I assume they relate to what
> assumptions Perl should make about strings which are constructed as
> literals in the source code, where there is a great deal of ambiguity about
> what is going on compared to actual code that constructs such strings,
> where things are exact.
>

They do not. They relate to consistently applying unicode rules to the
logical contents of the strings (in practice, making sure to work with
upgraded strings internally). The only mechanism that affects the
interpretation of literal strings is "use utf8".

>
>
>>
>> You’re omitting what IMO is the most obvious purpose of the flag: to
>> indicate whether the code points that the PV stores are the plain bytes, or
>> are the UTF-8-decoded code points. This is why you can print() the string
>> in either upgraded or downgraded forms, and it comes out the same.
>>
>
> Its hard to say what you are referring to here. If you mean codepoints
> 0-127, then it is unsurprising as the representation of them is equivalent
> in ASCII and UTF8 and Latin1. But if you mean a codepoint above the ASCII
> plane, then no they should not come out the same. If you are piping that
> data to a file I would expect the octets written to that file to be
> different. (assuming a binary filehandle with no layers magically
> transforming things). If your terminal renders them the same then I assume
> it is doing some magic behind the scenes to deal with malformed utf8.
>

Not correct. An upgraded or downgraded string prints identically because
you are printing the logical ordinals which do not change by this
operation. Whether those ordinals are interpreted as bytes or Unicode
characters depends on what you are printing to, but in either case the
internally-stored bytes are irrelevant to the user except to determine what
those logical ordinals are.
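
A minimal sketch of that claim, assuming a handle with no encoding layer (an in-memory filehandle stands in for a file; the helper name printed_octets is illustrative and not from this thread):

```perl
use strict;
use warnings;

# Capture exactly the octets that print() emits on a handle with no
# encoding layer, using an in-memory filehandle.
sub printed_octets {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "cannot close in-memory handle: $!";
    return $buffer;
}

my $downgraded = "\xE9";      # one logical ordinal, 0xE9, flag off
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);     # same ordinal, different internal storage

my $out_down = printed_octets($downgraded);
my $out_up   = printed_octets($upgraded);

printf "flags differ:  %s\n",
    (utf8::is_utf8($upgraded) && !utf8::is_utf8($downgraded)) ? "yes" : "no";
printf "strings equal: %s\n", $downgraded eq $upgraded ? "yes" : "no";
printf "output equal:  %s\n", $out_down eq $out_up ? "yes" : "no";
```

Only the internal storage differs between the two scalars; the comparison and the emitted octets are driven by the logical ordinals.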

>
>>
>> > I also know what happens here:
>> >
>> > my $foo="\x{c3}\x{a9}";
>> > utf8::decode($foo);
>> > Dump($foo);
>> >
>> > SV = PV(0x2303fc0) at 0x2324c98
>> > REFCNT = 1
>> > FLAGS = (POK,IsCOW,pPOK,UTF8)
>> > PV = 0x2328300 "\303\251"\0 [UTF8 "\x{e9}"]
>> > CUR = 2
>> > LEN = 10
>> > COW_REFCNT = 1
>> >
>> > That is, i start off with two octets, C3 - A9, which happens to be the
>> encoding for the codepoint E9, which happens to be é.
>> > I then tell perl to "decode" those octets, which really means I tell
>> perl to check that the octets actually do make up valid utf8. And if perl
>> agrees that indeed these are valid utf8 octets, then it turns the flag on.
>> Now it doesn't matter if you *meant* to construct utf8 in the variable fed
>> to decode, all that matters is that at an octet level those octet happen to
>> make up valid utf8.
>>
>> I think you’re actually breaking the abstraction here by assuming that
>> Perl implements the decode by setting a flag.
>>
>>
> No I am not. The flag is there is there to tell the perl internals how to
> manipulate the string. decode's task is to take arbitrary strings of octets
> and ensure that they can be decoded as valid utf8 and possibly to do some
> conversion (eg for forbidden utf8 sequences or other normalization) as it
> does so and then SETS THE FLAG. Only once decode is done is the string
> "Unicode" and is the string "utf8". Prior to that it was just random
> octets. It doesnt need to do anything BUT set the flag because its internal
> encoding matches the external encoding in this case. If it was decoding
> UTF16LE then it would have do conversion as well.
>

Not correct. The flag is there only to tell Perl internals whether the
internal bytes represent the ordinals directly or via UTF-8-like encoding.
The result of decoding can be downgraded, and an upgraded string can be
decoded; these are perfectly cromulent operations if the logical contents
are as expected. A unicode string can exist without ever having been
decoded, all that is required is to call a function that interprets the
ordinals as a unicode string.
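
As a hedged illustration using only the core utf8:: functions (variable names are illustrative), the following upgrades a string of UTF-8 octets, decodes it in place, and then downgrades the decoded result:

```perl
use strict;
use warnings;

my $string = "\xC3\xA9";          # the two octets that encode U+00E9 in UTF-8
utf8::upgrade($string);           # upgraded storage, still two ordinals

my $upgraded_before = utf8::is_utf8($string) ? 1 : 0;
my $length_before   = length $string;

# Decode the upgraded string in place; returns true on success.
my $decoded_ok   = utf8::decode($string) ? 1 : 0;
my $length_after = length $string;
my $ordinal      = ord $string;

# Downgrade the decoded result; U+00E9 fits in a single byte.
my $downgraded_ok  = utf8::downgrade($string, 1) ? 1 : 0;
my $flag_after     = utf8::is_utf8($string) ? 1 : 0;
my $ordinal_after  = ord $string;

printf "before decode: upgraded=%d length=%d\n", $upgraded_before, $length_before;
printf "after decode:  ok=%d length=%d ordinal=0x%X\n",
    $decoded_ok, $length_after, $ordinal;
printf "after downgrade: ok=%d flag=%d ordinal=0x%X\n",
    $downgraded_ok, $flag_after, $ordinal_after;
```

The logical content after decoding is the single ordinal 0xE9 whether the scalar is stored upgraded or downgraded.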

>
>
>> It would be just as legitimate to mutate the PV to store a single octet,
>> 0xe9, and leave the UTF8 flag off.
>
>
> Nope. That would mean that Perl would use ASCII/Latin-1 case folding rules
> on the result, which would be wrong. It should use Unicode case folding
> rules for codepoint E9 if it was decoded as that codepoint. (Change the
> example to \x{DF} and you can see these issues in the flesh, \x{DF} should
> match "ss" in Unicode, but in ASCII/Latin1 it only matches \x{DF}. The lc()
> version of \x{DF} is "ss" but in Latin-1/Ascii there are no multi-byte case
> folds). Even more suggestive that Perl doing this would be wrong is that
> in fact there is NO valid Unicode encoding of codepoint E9 which is only 1
> octet long. So that would be extremely wrong of Perl to use a non Unicode
> encoding of unicode data dont you think? Also, what would perl do when the
> codepoint doesn't fit into a single octet? Your argument might have some
> merit if you were arguing that Perl could have decoded it into
> "\x{E9}\0\0\0" and set the UTF-32 flag, but as stated it doesn't make sense.
>

Not correct. Under old rules, yes, the UTF8 flag determined whether Unicode
rules are used in various operations; this was an abstraction break, and so
the unicode_strings feature was added to fix the problem, and enabled in
feature bundles since 5.12.
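
A small sketch of the difference, assuming no locale pragma is in effect and a non-EBCDIC platform (variable names are illustrative):

```perl
use strict;
use warnings;

my $downgraded = "\xC9";      # ordinal 0xC9, flag off
my $upgraded   = "\xC9";
utf8::upgrade($upgraded);     # same ordinal, flag on

my ($old_down, $old_up, $new_down, $new_up);
{
    # Old rules: the storage format leaks into case mapping.
    no feature 'unicode_strings';
    $old_down = lc $downgraded;
    $old_up   = lc $upgraded;
}
{
    # With the feature: Unicode rules apply regardless of storage.
    use feature 'unicode_strings';
    $new_down = lc $downgraded;
    $new_up   = lc $upgraded;
}

printf "without feature: downgraded -> 0x%X, upgraded -> 0x%X\n",
    ord $old_down, ord $old_up;
printf "with feature:    downgraded -> 0x%X, upgraded -> 0x%X\n",
    ord $new_down, ord $new_up;
```

Without the feature, the two scalars compare equal yet lowercase differently; with it, both map to 0xE9.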

-Dan

>

Re: Pre-RFC: Rename SVf_UTF8 et al.

grinnz at gmail

Sep 2, 2021, 7:44 AM

Post #27 of 46 (1735 views)

On Thu, Sep 2, 2021 at 9:53 AM demerphq <demerphq@gmail.com> wrote:

> Having said that I have seen a lot of people for one reason or another get
> encoding wrong in various ways, especially with MySQL or other over-wire
> situations. Double encoding errors are common (eg where people accidentally
> upgrade already encoded but flag-off utf8 data). At work we have a function
> called recurse_decode_utf8() which takes a string and does its best to
> "reduce" it to its minimal form by repeatedly turning off the utf8 flag,
> and then executing decode_utf8() on the string and then downgrade until the
> decode operation throws an error. Widespread use of this function o string
> data almost completely eliminated all of our utf8 problems. (Ill post the
> code in another mail.)
>

If it works for this case, fine, but please do not suggest this for general
use. This is guessing, and results in decoding strings which were already
characters (false positives), because there is no way to differentiate a
valid string of UTF-8 bytes from a string of characters whose ordinals
happen to form a valid UTF-8 byte sequence. The correct solution is to fix
your double encoding, and always decode a string the exact number of times
it was encoded. The use of the utf8 flag to "decode" is an unrelated
problem.
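
To make the ambiguity concrete, here is a minimal sketch (variable names are illustrative): a string meant as UTF-8 bytes and a string meant as two characters are the same Perl string, so a guessing decoder cannot tell them apart.

```perl
use strict;
use warnings;

my $utf8_bytes = "\xC3\xA9";                # meant as: the UTF-8 encoding of U+00E9
my $characters = chr(0xC3) . chr(0xA9);     # meant as: the two characters U+00C3 U+00A9

# Nothing in the string records which of the two was intended.
my $indistinguishable = $utf8_bytes eq $characters ? 1 : 0;

# A "decode until it fails" helper happily decodes the character string too.
my $guess    = $characters;
my $guess_ok = utf8::decode($guess) ? 1 : 0;

printf "indistinguishable: %d\n", $indistinguishable;
printf "guess decoded: %d, length now %d, ordinal 0x%X\n",
    $guess_ok, length $guess, ord $guess;
```

The second string was already characters, so the successful decode is a false positive that silently changes its contents.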

-Dan

Re: Pre-RFC: Rename SVf_UTF8 et al.

grinnz at gmail

Sep 2, 2021, 8:47 AM

Post #28 of 46 (1734 views)

On Thu, Sep 2, 2021 at 10:02 AM demerphq <demerphq@gmail.com> wrote:

>
> I was rereading this and I thought of something to add here. Part of the
> confusion with Perl strings is that we try to hide the flag. We dont really
> want people to look at it and think about it. Instead we provide a handful
> of verbs which can be used to force the string to the shape we want, or
> throw an error if we cant (or sometimes be a no-op).
>
> I mean, if I want to be sure i have a latin-1 string then i would do
> something like:
>
> eval { utf8::downgrade($str); 1 } or warn "Cant downgrade string!";
>
> And if want to be user I have a utf8 string then I would do something like:
>
> utf8::upgrade($str);
>
> I wonder if we made accessing the flag state more socially acceptable
> whether people would find this less confusing.
>

This is fine and I often recommend use of these functions to work around
broken abstractions (in Perl, XS or user code mistakenly using the utf8
flag). The problem is relying on the flag state for things it does not
represent, and propagating such issues.

As a side note, latin-1 is a convenient way to refer to downgraded strings
but since we are discussing internals it's important to note that they are
not specifically latin-1 strings, any more than upgraded strings are
specifically Unicode strings. A downgraded string may only consist of
ordinals in the byte range due to being stored that way, but what those
byte ordinals represent (if they even represent bytes) is up to what the
string is used for and whether the unicode_strings feature is in effect.
latin-1 mostly works as a description because the latin-1 code space maps
exactly to the first 256 codepoints of Unicode (0 through 255).

-Dan

Re: Pre-RFC: Rename SVf_UTF8 et al.

kimoto.yuki at gmail

Sep 2, 2021, 6:02 PM

Post #29 of 46 (1734 views)

I want to get the basic knowledge to join this discussion.

Would you tell me the following things?

1. Do the following things mean the same or different?

my $bytes = Encode::encode('UTF-8', $string);

utf8::encode($string);
my $bytes = $string;

2. Do the following things mean the same or different?

my $string = Encode::decode('UTF-8', $bytes);

utf8::decode($bytes);
my $string = $bytes;

3. Do the following things mean the same or different?

# Perl
utf8::decode

# XS
sv_utf8_decode

4. Do the following things mean the same or different?

# Perl
utf8::encode

# XS
sv_utf8_encode

My first interest is the difference between the Perl world and the XS world.

Re: Pre-RFC: Rename SVf_UTF8 et al.

grinnz at gmail

Sep 2, 2021, 6:30 PM

Post #30 of 46 (1734 views)

On Thu, Sep 2, 2021 at 9:03 PM Yuki Kimoto <kimoto.yuki@gmail.com> wrote:

> I want to get the basic knowledge to join this discussion.
>
> Would you tell me the following things?
>
> 1. Do the following things mean the same or different?
>
> my $bytes = Encode::encode('UTF-8', $string);
>
> utf8::encode($string);
> my $bytes = $string;
>

Similar, with some implementation differences: Encode::encode doesn't
modify $string in place (with those arguments), and utf8::encode does;
Encode::encode with UTF-8 will encode invalid codepoints (such as
surrogates, supercharacters) to replacement characters (with those
arguments) and utf8::encode will naively encode them with Perl's internal
encoding method like other codepoints (which can result in bytestrings
which UTF-8 decoders may consider invalid).

> 2. Do the following things mean the same or different?
>
> my $string = Encode::decode('UTF-8', $bytes);
>
> utf8::decode($bytes);
> my $string = $bytes;
>

Similar as above, but additionally, if the bytes cannot be interpreted as
even Perl's lax internal encoding, utf8::decode will return false and leave
the string unchanged; whereas Encode::decode decodes malformed byte
sequences to replacement characters (with those arguments). Encode::decode
will also decode invalid codepoints to replacement characters, but
utf8::decode will naively accept them.

> 3. Do the following things mean the same or different?
>
> # Perl
> utf8::decode
>
> # XS
> sv_utf8_decode
>

These are the same.

> 4. Do the following things mean the same or different?
>
> # Perl
> utf8::encode
>
> # XS
> sv_utf8_encode
>

These are the same.

Overall, all of these change the logical contents of the string from bytes
to the Unicode characters they represent, or from Unicode characters to
representative bytes.
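
A hedged sketch of the differences described above, using the default CHECK argument for Encode (variable names are illustrative; the exact substitution Encode makes for a surrogate is not asserted here, only that it differs from utf8::encode):

```perl
use strict;
use warnings;
use Encode ();

# Decoding malformed input: utf8::decode reports failure and leaves the
# string alone, while Encode::decode substitutes U+FFFD.
my $malformed  = "abc\xFF";
my $copy       = $malformed;
my $core_ok    = utf8::decode($copy) ? 1 : 0;
my $via_encode = Encode::decode('UTF-8', $malformed);

# Encoding: Encode::encode returns a new string, utf8::encode works in place.
my $string   = "\x{E9}";
my $bytes    = Encode::encode('UTF-8', $string);
my $in_place = $string;
utf8::encode($in_place);

# A surrogate: utf8::encode emits Perl's lax internal encoding as-is.
my $surrogate = chr(0xD800);
my $strict    = Encode::encode('UTF-8', $surrogate);
my $lax       = $surrogate;
utf8::encode($lax);

my $hex = sub { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] };

printf "utf8::decode ok=%d, unchanged=%d\n", $core_ok, $copy eq $malformed ? 1 : 0;
printf "Encode::decode last ordinal: 0x%X\n", ord substr($via_encode, -1);
printf "Encode::encode: %s, utf8::encode: %s\n", $hex->($bytes), $hex->($in_place);
printf "surrogate via utf8::encode: %s, same as Encode: %s\n",
    $hex->($lax), $strict eq $lax ? "yes" : "no";
```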

-Dan

Re: Pre-RFC: Rename SVf_UTF8 et al.

kimoto.yuki at gmail

Sep 2, 2021, 8:04 PM

Post #31 of 46 (1734 views)

2021-9-3 10:30 Dan Book <grinnz@gmail.com> wrote :

> On Thu, Sep 2, 2021 at 9:03 PM Yuki Kimoto <kimoto.yuki@gmail.com> wrote:
>
>> I want to get the basic knowledge to join this discussion.
>>
>> Would you tell me the following things?
>>
>> 1. Do the following things mean the same or different?
>>
>> my $bytes = Encode::encode('UTF-8', $string);
>>
>> utf8::encode($string);
>> my $bytes = $string;
>>
>
> Similar, with some implementation differences: Encode::encode doesn't
> modify $string in place (with those arguments), and utf8::encode does;
> Encode::encode with UTF-8 will encode invalid codepoints (such as
> surrogates, supercharacters) to replacement characters (with those
> arguments) and utf8::encode will naively encode them with Perl's internal
> encoding method like other codepoints (which can result in bytestrings
> which UTF-8 decoders may consider invalid).
>
>
>> 2. Do the following things mean the same or different?
>>
>> my $string = Encode::decode('UTF-8', $bytes);
>>
>> utf8::decode($bytes);
>> my $string = $bytes;
>>
>
> Similar as above, but additionally, if the bytes cannot be interpreted as
> even Perl's lax internal encoding, utf8::decode will return false and leave
> the string unchanged; whereas Encode::decode decodes malformed byte
> sequences to replacement characters (with those arguments). Encode::decode
> will also decode invalid codepoints to replacement characters, but
> utf8::decode will naively accept them.
>
>
>> 3. Do the following things mean the same or different?
>>
>> # Perl
>> utf8::decode
>>
>> # XS
>> sv_utf8_decode
>>
>
> These are the same.
>
> 4. Do the following things mean the same or different?
>>
>> # Perl
>> utf8::encode
>>
>> # XS
>> sv_utf8_encode
>>
>
> These are the same.
>
> Overall, all of these change the logical contents of the string from bytes
> to the Unicode characters they represent, or from Unicode characters to
> representative bytes.
>
> -Dan
>

Dan

Thank you.

I have some time to understand this.

Re: Pre-RFC: Rename SVf_UTF8 et al.

demerphq at gmail

Sep 2, 2021, 11:30 PM

Post #32 of 46 (1734 views)

On Thu, 2 Sept 2021 at 16:39, Dan Book <grinnz@gmail.com> wrote:

> There is way too much written here so I will be responding as I can.
>
> On Thu, Sep 2, 2021 at 9:21 AM demerphq <demerphq@gmail.com> wrote:
>
>> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com>
>> wrote:
>>
>>>
>>> > On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
>>> >
>>> > On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com>
>>> wrote:
>>> > Per recent IRC discussion …
>>> >
>>> > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
>>> confusion regarding the flag’s significance. Some think it indicates
>>> whether a given PV stores text versus binary. Some think it means that the
>>> PV is valid UTF-8. Still others likely hold other inaccurate views.
>>> >
>>> > The problem here is the naming. For example, consider `perl -e'my $foo
>>> = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact that
>>> its code points (assuming use of a UTF-8 terminal) correspond to the bytes
>>> that encode “é” in UTF-8.
>>> >
>>> > Nope. It might contain utf8, but it is not UTF8-ON. Think of it like a
>>> square/rectangle relationship. All strings are "rectangles", all "squares"
>>> are rectangles, some strings are squares, but unless SQUARE flag is ON perl
>>> should assume it is a rectangle, not a square. The SQUARE flag should only
>>> be set when the rectangle has been proved conclusively to be a square. That
>>> the SQUARE flag is off does not mean the rectangle is not a square, merely
>>> that the square has not been proved to be such.
>>>
>>> You’re defining “a UTF-8 string” as “a string whose PV is marked as
>>> UTF-8”. I’m defining it as “a string whose Perl-visible code points happen
>>> to be valid UTF-8”.
>>>
>>
>> I dont find your definition to be very useful, nor descriptive of how
>> perl manages these matters, so I am not using it. You are confusing
>> different levels of abstraction. Your definition also would include cases
>> where the data is already encoded and flagged as utf8. So it doesn't make
>> sense to me.
>>
>> Here is the set of definitions that I am operating from:
>>
>> A "string" is a programming concept inside of Perl which is used to
>> represent "text" buffers of memory. There are three level of abstraction
>> for strings, two of which are tightly coupled. The three are the codepoint
>> level, semantic level and encoding level.
>>
>> At the codepoint levels you can think of strings as variable length
>> arrays of numbers (codepoints), where the numbers are restricted to 0 to
>> 0x10FFFF.
>>
>> At the semantics level you can think of these numbers (codepoints) of
>> representing characters from some form of text with specific rules for
>> certain operations like case-folding, as well as a well defined mapping to
>> graphemes which are displayed to our eyes when those numbers are rendered
>> by a display device like a terminal.
>>
>> The encoding level of abstraction addresses how those numbers
>> (codepoints) will be represented as bytes (octets) in memory inside of
>> Perl, and when you directly write the data to disk or to some other output
>> stream.
>>
>> There are two sets of codepoint range, semantics and encoding available,
>> which are controlled by a flag associated with the string called the UTF8
>> flag. When set this flag indicates that the string can represent codepoints
>> 0 to 0x10FFFF, should have Unicode semantics applied to it, and that its in
>> memory representation is variable-width utf8. When the flag is not set it
>> indicates the string can represent codepoints 0 to 255, has ASCII
>> case-folding semantics, and that its in memory representation is fixed
>> width octets.
>>
>> In order to be able to combine these two types of strings we need to
>> define some operations:
>>
>> upgrading/downgrading: converting a string from one set of semantics and
>> encoding to the other while preserving exactly the codepoint level
>> representation. By tradition we call it upgrading when we go from Latin-1
>> to Unicode with the result being UTF8 on, and we call it downgrading when
>> we go from Unicode to Latin1 with the result being UTF8-off. These
>> operations are NOT symmetrical. It is *not* possible to downgrade every
>> Unicode string to Latin-1, however it is possible to upgrade every Latin-1
>> string to Unicode. By tradition upgrade and downgrade functions are noops
>> when their input is already in the form expected as the result, but this is
>> by tradition only.
>>
>> decoding/encoding: converting a string from one form to the other in a
>> way that transforms the codepoints from one form to a potentially different
>> form. Traditional we speak of decode_utf8() taking a latin1 string
>> containing octets that make up a utf8 encoded string, and returning a
>> string which is UTF8 on which represents the Unicode version of those
>> octets. For well formed input this results in no change to the underlying
>> string, but the flag is flipped on. Vice versa we speak of encode_utf8()
>> which converts its input to a utf8 encoded form, regardless of what form it
>> was represented internally.
>>
>
> This is incorrect. Decode converts a string of bytes at the logical level
> (upgraded or downgraded does not matter) and returns a string of characters
> at the logical level (upgraded or downgraded does not matter). It may
> commonly use upgraded or downgraded strings as the input or output for
> efficiency but this is not required.
>

Nope *you* are wrong. Decoding does not use upgrading or downgrading.
Decoding utf8 is logically equivalent to an upgrade operation when the
string contains only codepoints 0-127. For any codepoint ABOVE that it does
something very different.

>
>>
>>
>>> What you call “a UTF-8 string” is what I propose we call, per existing
>>> nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with
>>> corresponding code changes. Then the term “UTF-8 string” makes sense from a
>>> pure-Perl context without requiring Perl programmers to worry about
>>> interpreter internals.
>>>
>>>
>> No. The flag does not mean "upgraded" it means "unicode semantics, utf8
>> encoding". Upgrading is one way to get such a string, and it might even be
>> the most common, but the most important and likely to be correct way is
>> explicit decoding.
>>
>> If we are to rename the flag then we should just rename it as the UNICODE
>> flag. Would have saved a world of confusion.
>>
>
> This is exactly what we have defined as "upgraded". Decoding does not
> define the internal format of the resulting string at all. The only
> internal format which is upgraded is when the UTF8 flag is on.
>

Your definition is wrong then. You seem to have "upgrading" and "decoding"
muddled.

Decoding most definitely DOES define the internal format of the result
string. If you decode utf8 the result is a UTF8 on string. If that string
contained utf8 representing codepoints above 127 then the result will be
different.

If you upgrade the string: "\303\251" you will end up with a utf8 on string
which contains two codepoints, "\303" and "\251". You will NOT end up with
the correct codepoint E9.
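
A short sketch of that contrast, using the core utf8:: functions (variable names are illustrative):

```perl
use strict;
use warnings;

# Upgrading keeps the logical ordinals and only changes the storage.
my $upgraded = "\303\251";
utf8::upgrade($upgraded);
my @upgraded_ordinals = map { ord($_) } split //, $upgraded;

# Decoding interprets the octets as UTF-8 and changes the logical ordinals.
my $decoded = "\303\251";
utf8::decode($decoded);
my @decoded_ordinals = map { ord($_) } split //, $decoded;

printf "upgraded: flag=%d ordinals=%s\n",
    utf8::is_utf8($upgraded) ? 1 : 0,
    join ' ', map { sprintf '0x%X', $_ } @upgraded_ordinals;
printf "decoded:  flag=%d ordinals=%s\n",
    utf8::is_utf8($decoded) ? 1 : 0,
    join ' ', map { sprintf '0x%X', $_ } @decoded_ordinals;
```

Both results carry the flag, but only the decoded one holds the single ordinal 0xE9.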

>
>>
>>> Whether Perl stores that code point as one byte or as two is Perl’s
>>> business alone … right?
>>>
>>
>> Well it would be weird if we stored Unicode data in a form not supported
>> by Unicode. Dont you think? There is no single octet representation of the
>> codepoint E9 defined by Unicode as far as I know.
>>
>>
>>>
>>> > I do not understand your point that only the initiated can understand
>>> this flag. It means one and only one thing: that the perl internals should
>>> assume that the buffer contains utf8 encoded data and that perl should
>>> apply unicode semantics when doing character and case-sensitive operations,
>>> and that perl can make certain assumptions when it processing the data (eg
>>> that is not malformed).
>>>
>>> The behaviour you’re talking about is what the unicode_strings and
>>> unicode_eval features specifically do away with (i.e., fix), right?
>>
>>
>> Im not familiar with those enough to comment. I assume they relate to
>> what assumptions Perl should make about strings which are constructed as
>> literals in the source code, where there is a great deal of ambiguity about
>> what is going on compared to actual code that constructs such strings,
>> where things are exact.
>>
>
> They do not. They relate to consistently applying unicode rules to the
> logical contents of the strings (in practice, making sure to work with
> upgraded strings internally). The only mechanism that affects the
> interpretation of literal strings is "use utf8".
>

I'll read up on this.

>
>
>>
>>
>>>
>>> You’re omitting what IMO is the most obvious purpose of the flag: to
>>> indicate whether the code points that the PV stores are the plain bytes, or
>>> are the UTF-8-decoded code points. This is why you can print() the string
>>> in either upgraded or downgraded forms, and it comes out the same.
>>>
>>
>> Its hard to say what you are referring to here. If you mean codepoints
>> 0-127, then it is unsurprising as the representation of them is equivalent
>> in ASCII and UTF8 and Latin1. But if you mean a codepoint above the ASCII
>> plane, then no they should not come out the same. If you are piping that
>> data to a file I would expect the octets written to that file to be
>> different. (assuming a binary filehandle with no layers magically
>> transforming things). If your terminal renders them the same then I assume
>> it is doing some magic behind the scenes to deal with malformed utf8.
>>
>
> Not correct. An upgraded or downgraded string prints identically because
> you are printing the logical ordinals which do not change by this
> operation. Whether those ordinals are interpreted as bytes or Unicode
> characters depends what you are printing to, but in either case the
> internally-stored bytes are irrelevant to the user except to determine what
> those logical ordinals are
>

Dude, you keep saying I am not correct when what I have said is easily
verifiable.

If you print chr(0xe9) to a filehandle and it does not contain the octet E9
then there is a problem.

If you print chr(0xe9) to a utf8 terminal it should render a Unicode
replacement character for a broken utf8 sequence.

If you print an encoded chr(0xe9) then it should render the glyph for E9.

If you think anything else is happening then prove it with code.
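
One way to check these cases, as a minimal sketch: in-memory filehandles stand in for the file, and the helper name octets_written is illustrative rather than anything from this thread. What a terminal renders is outside what the code can show.

```perl
use strict;
use warnings;
use Encode ();

# Return the octets that print() writes through the given layers.
sub octets_written {
    my ($layers, $string) = @_;
    my $buffer = '';
    open my $fh, ">$layers", \$buffer or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close $fh or die "cannot close in-memory handle: $!";
    return $buffer;
}

my $hex = sub { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] };

my $char = chr(0xE9);

my $raw         = octets_written(':raw', $char);
my $pre_encoded = octets_written(':raw', Encode::encode('UTF-8', $char));
my $layered     = octets_written(':encoding(UTF-8)', $char);

printf "raw handle, chr(0xE9):          %s\n", $hex->($raw);          # E9
printf "raw handle, encoded chr(0xE9):  %s\n", $hex->($pre_encoded);  # C3 A9
printf "UTF-8 layer, chr(0xE9):         %s\n", $hex->($layered);      # C3 A9
```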

>
>>
>>>
>>> > I also know what happens here:
>>> >
>>> > my $foo="\x{c3}\x{a9}";
>>> > utf8::decode($foo);
>>> > Dump($foo);
>>> >
>>> > SV = PV(0x2303fc0) at 0x2324c98
>>> > REFCNT = 1
>>> > FLAGS = (POK,IsCOW,pPOK,UTF8)
>>> > PV = 0x2328300 "\303\251"\0 [UTF8 "\x{e9}"]
>>> > CUR = 2
>>> > LEN = 10
>>> > COW_REFCNT = 1
>>> >
>>> > That is, i start off with two octets, C3 - A9, which happens to be the
>>> encoding for the codepoint E9, which happens to be é.
>>> > I then tell perl to "decode" those octets, which really means I tell
>>> perl to check that the octets actually do make up valid utf8. And if perl
>>> agrees that indeed these are valid utf8 octets, then it turns the flag on.
>>> Now it doesn't matter if you *meant* to construct utf8 in the variable fed
>>> to decode, all that matters is that at an octet level those octet happen to
>>> make up valid utf8.
>>>
>>> I think you’re actually breaking the abstraction here by assuming that
>>> Perl implements the decode by setting a flag.
>>>
>>>
>> No I am not. The flag is there is there to tell the perl internals how to
>> manipulate the string. decode's task is to take arbitrary strings of octets
>> and ensure that they can be decoded as valid utf8 and possibly to do some
>> conversion (eg for forbidden utf8 sequences or other normalization) as it
>> does so and then SETS THE FLAG. Only once decode is done is the string
>> "Unicode" and is the string "utf8". Prior to that it was just random
>> octets. It doesnt need to do anything BUT set the flag because its internal
>> encoding matches the external encoding in this case. If it was decoding
>> UTF16LE then it would have do conversion as well.
>>
>
> Not correct. The flag is there only to tell Perl internals whether the
> internal bytes represent the ordinals directly or via UTF-8-like encoding.
> The result of decoding can be downgraded, and an upgraded string can be
> decoded,
>

Show me the code. As far as I know decode operations do not operate on
unicode strings. Calling decode_utf8 on a string which is utf8 is a noop.

> these are perfectly cromulent operations if the logical contents are as
> expected. A unicode string can exist without ever having been decoded, all
> that is required is to call a function that interprets the ordinals as a
> unicode string.
>

"A unicode string can exist without ever having been decoded, all that is
required is to call a function that interprets the ordinals as a unicode
string."

And that function that does that interpretation is called decode. You just
contradicted yourself.

>>
>>> It would be just as legitimate to mutate the PV to store a single octet,
>>> 0xe9, and leave the UTF8 flag off.
>>
>>
>> Nope. That would mean that Perl would use ASCII/Latin-1 case folding
>> rules on the result, which would be wrong. It should use Unicode case
>> folding rules for codepoint E9 if it was decoded as that codepoint. (Change
>> the example to \x{DF} and you can see these issues in the flesh, \x{DF}
>> should match "ss" in Unicode, but in ASCII/Latin1 it only matches \x{DF}.
>> The lc() version of \x{DF} is "ss" but in Latin-1/Ascii there are no
>> multi-byte case folds). Even more suggestive that Perl doing this would be
>> wrong is that in fact there is NO valid Unicode encoding of codepoint E9
>> which is only 1 octet long. So that would be extremely wrong of Perl to use
>> a non Unicode encoding of unicode data dont you think? Also, what would
>> perl do when the codepoint doesn't fit into a single octet? Your argument
>> might have some merit if you were arguing that Perl could have decoded it
>> into "\x{E9}\0\0\0" and set the UTF-32 flag, but as stated it doesn't make
>> sense.
>>
>
> Not correct. Under old rules, yes, the UTF8 flag determined whether
> Unicode rules are used in various operations; this was an abstraction
> break, and so the unicode_strings feature was added to fix the problem, and
> enabled in feature bundles since 5.12
>

Ah, ok, so if you *change* the default mode of perl it does something
different than I described, and that makes my comments "incorrect"? What I
described is how "normal" perl without any new features enabled works. If
there are features that change what I have said, feel free to use them. But
that doesn't change the fact that what I said is an accurate description of
how the perl internals normally function.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

perl5-porters at perl

Sep 3, 2021, 12:48 AM

Post #33 of 46 (1734 views)

On Fri, 3 Sept 2021 at 14:30, demerphq <demerphq@gmail.com> wrote:

> On Thu, 2 Sept 2021 at 16:39, Dan Book <grinnz@gmail.com> wrote:
>
>> On Thu, Sep 2, 2021 at 9:21 AM demerphq <demerphq@gmail.com> wrote:
>>
>>> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com>
>>> wrote:
>>>
>>>>
>>>> What you call “a UTF-8 string” is what I propose we call, per existing
>>>> nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with
>>>> corresponding code changes. Then the term “UTF-8 string” makes sense from a
>>>> pure-Perl context without requiring Perl programmers to worry about
>>>> interpreter internals.
>>>>
>>>
>>>>
>>> No. The flag does not mean "upgraded" it means "unicode semantics, utf8
>>> encoding". Upgrading is one way to get such a string, and it might even be
>>> the most common, but the most important and likely to be correct way is
>>> explicit decoding.
>>>
>>> If we are to rename the flag then we should just rename it as the
>>> UNICODE flag. Would have saved a world of confusion.
>>>
>>
>> This is exactly what we have defined as "upgraded". Decoding does not
>> define the internal format of the resulting string at all. The only
>> internal format which is upgraded is when the UTF8 flag is on.
>>
>
> Your definition is wrong then. You seem to have "upgrading" and "decoding"
> muddled.
>
> Decoding most definitely DOES define the internal format of the result
> string. If you decode utf8 the result is a UTF8 on string. If that string
> contained utf8 representing codepoints above 127 then the result will be
> different.
>

Given this:

perl -e'use Devel::Peek; use Encode; print Dump(Encode::decode("UTF-8", "example"))'
SV = PV(0x55b88281b2b0) at 0x55b88272e4e0
REFCNT = 2
FLAGS = (TEMP,POK,pPOK,UTF8)
PV = 0x55b8827f45d0 "example"\0 [UTF8 "example"]
CUR = 7
LEN = 10

I think the current behaviour is at least inefficient, if perhaps not
outright *wrong*... why would decoding enforce the UTF8 flag?

Put another way, if the resulting string has only codepoints 0..127, why
not leave the flag off so that string operations can be more efficient?

This extends to common cases such as UTF8-safe filter chains:

echo "example" | perl -CSD -lne'use Devel::Peek; s{e$}{es}; print Dump($_)'
SV = PV(0x556c3aab2000) at 0x556c3aaeb3e8
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x556c3aae41b0 "examples"\0 [UTF8 "examples"]
CUR = 8
LEN = 24

If that's not taking the faster pure-ASCII path for input, this would seem
like an easy optimisation opportunity. If the behaviour only happened with
the non-validating `utf8` decoding, then maybe it could be explained away
by not wanting to walk the entire length of the string... but then I'd at
least expect it to be different with the "UTF-8" encoding layer:

echo "example" | perl -lne'use Devel::Peek; BEGIN { binmode STDIN, ":encoding(UTF-8)" } s{e$}{es}; print Dump($_)'
SV = PV(0x55d8d2b76000) at 0x55d8d2baf4a8
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x55d8d2bad300 "examples"\0 [UTF8 "examples"]
CUR = 8
LEN = 24

So yes, decoding does set the UTF8 flag - but I'd argue that it
*shouldn't*, and the current behaviour is somewhere between a historical
accident and an oversight. To be clear, I'd expect the same non-UTF8 status
in the examples so far, as we see from this:

perl -e'use Devel::Peek; use utf8; my $text = "example"; print Dump($text)'
SV = PV(0x55be3864aff0) at 0x55be3866fe60
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK)
PV = 0x55be3867f020 "example"\0
CUR = 7
LEN = 10
COW_REFCNT = 1

What am I missing here?

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

fawaka at gmail

Sep 3, 2021, 6:09 AM

Post #34 of 46 (1734 views)

On Fri, Sep 3, 2021 at 8:30 AM demerphq <demerphq@gmail.com> wrote:

> On Thu, 2 Sept 2021 at 16:39, Dan Book <grinnz@gmail.com> wrote:
>
>> There is way too much written here so I will be responding as I can.
>>
>> On Thu, Sep 2, 2021 at 9:21 AM demerphq <demerphq@gmail.com> wrote:
>>
>>> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com>
>>> wrote:
>>>
>>>>
>>>> > On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
>>>> >
>>>> > On Wed, 18 Aug 2021 at 19:17, Felipe Gasper <felipe@felipegasper.com>
>>>> wrote:
>>>> > Per recent IRC discussion …
>>>> >
>>>> > PROBLEM: The naming of Perl’s “UTF-8 flag” is a continual source of
>>>> confusion regarding the flag’s significance. Some think it indicates
>>>> whether a given PV stores text versus binary. Some think it means that the
>>>> PV is valid UTF-8. Still others likely hold other inaccurate views.
>>>> >
>>>> > The problem here is the naming. For example, consider `perl -e'my
>>>> $foo = "é"'`. In this code $foo is a “UTF-8 string” by virtue of the fact
>>>> that its code points (assuming use of a UTF-8 terminal) correspond to the
>>>> bytes that encode “é” in UTF-8.
>>>> >
>>>> > Nope. It might contain utf8, but it is not UTF8-ON. Think of it like
>>>> a square/rectangle relationship. All strings are "rectangles", all
>>>> "squares" are rectangles, some strings are squares, but unless SQUARE flag
>>>> is ON perl should assume it is a rectangle, not a square. The SQUARE flag
>>>> should only be set when the rectangle has been proved conclusively to be a
>>>> square. That the SQUARE flag is off does not mean the rectangle is not a
>>>> square, merely that the square has not been proved to be such.
>>>>
>>>> You’re defining “a UTF-8 string” as “a string whose PV is marked as
>>>> UTF-8”. I’m defining it as “a string whose Perl-visible code points happen
>>>> to be valid UTF-8”.
>>>>
>>>
>>> I dont find your definition to be very useful, nor descriptive of how
>>> perl manages these matters, so I am not using it. You are confusing
>>> different levels of abstraction. Your definition also would include cases
>>> where the data is already encoded and flagged as utf8. So it doesn't make
>>> sense to me.
>>>
>>> Here is the set of definitions that I am operating from:
>>>
>>> A "string" is a programming concept inside of Perl which is used to
>>> represent "text" buffers of memory. There are three level of abstraction
>>> for strings, two of which are tightly coupled. The three are the codepoint
>>> level, semantic level and encoding level.
>>>
>>> At the codepoint levels you can think of strings as variable length
>>> arrays of numbers (codepoints), where the numbers are restricted to 0 to
>>> 0x10FFFF.
>>>
>>> At the semantics level you can think of these numbers (codepoints) of
>>> representing characters from some form of text with specific rules for
>>> certain operations like case-folding, as well as a well defined mapping to
>>> graphemes which are displayed to our eyes when those numbers are rendered
>>> by a display device like a terminal.
>>>
>>> The encoding level of abstraction addresses how those numbers
>>> (codepoints) will be represented as bytes (octets) in memory inside of
>>> Perl, and when you directly write the data to disk or to some other output
>>> stream.
>>>
>>> There are two sets of codepoint range, semantics and encoding available,
>>> which are controlled by a flag associated with the string called the UTF8
>>> flag. When set this flag indicates that the string can represent codepoints
>>> 0 to 0x10FFFF, should have Unicode semantics applied to it, and that its in
>>> memory representation is variable-width utf8. When the flag is not set it
>>> indicates the string can represent codepoints 0 to 255, has ASCII
>>> case-folding semantics, and that its in memory representation is fixed
>>> width octets.
>>>
>>> In order to be able to combine these two types of strings we need to
>>> define some operations:
>>>
>>> upgrading/downgrading: converting a string from one set of semantics and
>>> encoding to the other while preserving exactly the codepoint level
>>> representation. By tradition we call it upgrading when we go from Latin-1
>>> to Unicode with the result being UTF8 on, and we call it downgrading when
>>> we go from Unicode to Latin1 with the result being UTF8-off. These
>>> operations are NOT symmetrical. It is *not* possible to downgrade every
>>> Unicode string to Latin-1, however it is possible to upgrade every Latin-1
>>> string to Unicode. By tradition upgrade and downgrade functions are noops
>>> when their input is already in the form expected as the result, but this is
>>> by tradition only.
>>>
>>> decoding/encoding: converting a string from one form to the other in a
>>> way that transforms the codepoints from one form to a potentially different
>>> form. Traditional we speak of decode_utf8() taking a latin1 string
>>> containing octets that make up a utf8 encoded string, and returning a
>>> string which is UTF8 on which represents the Unicode version of those
>>> octets. For well formed input this results in no change to the underlying
>>> string, but the flag is flipped on. Vice versa we speak of encode_utf8()
>>> which converts its input to a utf8 encoded form, regardless of what form it
>>> was represented internally.
>>>
>>
>> This is incorrect. Decode converts a string of bytes at the logical level
>> (upgraded or downgraded does not matter) and returns a string of characters
>> at the logical level (upgraded or downgraded does not matter). It may
>> commonly use upgraded or downgraded strings as the input or output for
>> efficiency but this is not required.
>>
>
> Nope *you* are wrong. Decoding does not use upgrading or downgrading.
> Decoding utf8 is logically equivalent to an upgrade operation when the
> string contains only codepoints 0-127. For any codepoint ABOVE that it does
> something very different.
>
>
>
>>
>>>
>>>
>>>> What you call “a UTF-8 string” is what I propose we call, per existing
>>>> nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with
>>>> corresponding code changes. Then the term “UTF-8 string” makes sense from a
>>>> pure-Perl context without requiring Perl programmers to worry about
>>>> interpreter internals.
>>>>
>>>>
>>> No. The flag does not mean "upgraded" it means "unicode semantics, utf8
>>> encoding". Upgrading is one way to get such a string, and it might even be
>>> the most common, but the most important and likely to be correct way is
>>> explicit decoding.
>>>
>>> If we are to rename the flag then we should just rename it as the
>>> UNICODE flag. Would have saved a world of confusion.
>>>
>>
>> This is exactly what we have defined as "upgraded". Decoding does not
>> define the internal format of the resulting string at all. The only
>> internal format which is upgraded is when the UTF8 flag is on.
>>
>
> Your definition is wrong then. You seem to have "upgrading" and "decoding"
> muddled.
>
> Decoding most definitely DOES define the internal format of the result
> string. If you decode utf8 the result is a UTF8 on string. If that string
> contained utf8 representing codepoints above 127 then the result will be
> different.
>
> If you upgrade the string: "\303\251" you will end up with a utf8 on
> string which contains two codepoints, "\303" and "\251". You will NOT end
> up with the correct codepoint E9
>

It rather sounds to me like your disagreement is mostly about definitions.
This happens a lot when discussing Perl's Unicode support.

>
>>
>>>
>>>
>>>>
>>>> You’re omitting what IMO is the most obvious purpose of the flag: to
>>>> indicate whether the code points that the PV stores are the plain bytes, or
>>>> are the UTF-8-decoded code points. This is why you can print() the string
>>>> in either upgraded or downgraded forms, and it comes out the same.
>>>>
>>>
>>> Its hard to say what you are referring to here. If you mean codepoints
>>> 0-127, then it is unsurprising as the representation of them is equivalent
>>> in ASCII and UTF8 and Latin1. But if you mean a codepoint above the ASCII
>>> plane, then no they should not come out the same. If you are piping that
>>> data to a file I would expect the octets written to that file to be
>>> different. (assuming a binary filehandle with no layers magically
>>> transforming things). If your terminal renders them the same then I assume
>>> it is doing some magic behind the scenes to deal with malformed utf8.
>>>
>>
>> Not correct. An upgraded or downgraded string prints identically because
>> you are printing the logical ordinals which do not change by this
>> operation. Whether those ordinals are interpreted as bytes or Unicode
>> characters depends what you are printing to, but in either case the
>> internally-stored bytes are irrelevant to the user except to determine what
>> those logical ordinals are
>>
>
> Dude, you keep saying I am not correct when what I have said is easily
> verifiable.
>
> If you print chr(0xe9) to a filehandle and it does not contain the octet
> E9 then there is a problem
>
> If you print chr(0xe9) to a utf8 terminal it should render a Unicode
> replacement character for a broken utf8 sequence.
>
> If you print an encoded chr(0xe9) then it should rendr the glyph for E9.
>
> If you think anything else is happening then prove it with code.
>

That is all true in the absence of an :encoding(...) or :utf8 layer.

An upgraded E9 will also still print E9 (and thus be broken utf-8).
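
A minimal sketch for checking this locally. It assumes in-memory
filehandles are an acceptable stand-in for a real file, and
octets_written() is a hypothetical helper name used only for this
illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: print $str to an in-memory filehandle opened with
# $mode and return the octets that ended up in the buffer, as hex.
sub octets_written {
    my ($mode, $str) = @_;
    my $buffer = '';
    open my $fh, $mode, \$buffer or die "open: $!";
    print {$fh} $str;
    close $fh or die "close: $!";
    return join ' ', map { sprintf('%02X', ord $_) } split //, $buffer;
}

my $down = chr 0xE9;       # stored as the single octet E9
my $up   = chr 0xE9;
utf8::upgrade($up);        # same code point, stored as C3 A9

print octets_written('>', $down), "\n";        # E9
print octets_written('>', $up), "\n";          # E9
print octets_written('>:utf8', $down), "\n";   # C3 A9
print octets_written('>:utf8', $up), "\n";     # C3 A9
```

In this sketch the output depends only on the layer, not on whether the
string was upgraded.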

>
>>>
>>>>
>>>> > I also know what happens here:
>>>> >
>>>> > my $foo="\x{c3}\x{a9}";
>>>> > utf8::decode($foo);
>>>> > Dump($foo);
>>>> >
>>>> > SV = PV(0x2303fc0) at 0x2324c98
>>>> > REFCNT = 1
>>>> > FLAGS = (POK,IsCOW,pPOK,UTF8)
>>>> > PV = 0x2328300 "\303\251"\0 [UTF8 "\x{e9}"]
>>>> > CUR = 2
>>>> > LEN = 10
>>>> > COW_REFCNT = 1
>>>> >
>>>> > That is, i start off with two octets, C3 - A9, which happens to be
>>>> the encoding for the codepoint E9, which happens to be é.
>>>> > I then tell perl to "decode" those octets, which really means I tell
>>>> perl to check that the octets actually do make up valid utf8. And if perl
>>>> agrees that indeed these are valid utf8 octets, then it turns the flag on.
>>>> Now it doesn't matter if you *meant* to construct utf8 in the variable fed
>>>> to decode, all that matters is that at an octet level those octet happen to
>>>> make up valid utf8.
>>>>
>>>> I think you’re actually breaking the abstraction here by assuming that
>>>> Perl implements the decode by setting a flag.
>>>>
>>>>
>>> No I am not. The flag is there is there to tell the perl internals how
>>> to manipulate the string. decode's task is to take arbitrary strings of
>>> octets and ensure that they can be decoded as valid utf8 and possibly to do
>>> some conversion (eg for forbidden utf8 sequences or other normalization) as
>>> it does so and then SETS THE FLAG. Only once decode is done is the string
>>> "Unicode" and is the string "utf8". Prior to that it was just random
>>> octets. It doesnt need to do anything BUT set the flag because its internal
>>> encoding matches the external encoding in this case. If it was decoding
>>> UTF16LE then it would have do conversion as well.
>>>
>>
>> Not correct. The flag is there only to tell Perl internals whether the
>> internal bytes represent the ordinals directly or via UTF-8-like encoding.
>> The result of decoding can be downgraded, and an upgraded string can be
>> decoded,
>>
>
> Show me the code. As far as I know decode operations do not operate on
> unicode strings. Calling decode_utf8 on a string which is utf8 is a noop.
>

Almost. It will try to downgrade the string, and if that fails it will
return false (and thus be a no-op). It will decode a Latin-1-safe Unicode
string.

So «my $s = "\303\251"; utf8::upgrade($s); utf8::decode($s)» will result in
$s being equal to "\x{e9}" (and $s will be upgraded).
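
As a runnable sketch of both cases (the variable names are only for
illustration, and the second case assumes a code point above 255 is a
fair example of a string that cannot be downgraded):

```perl
use strict;
use warnings;

my $s = "\303\251";          # two octets: the UTF-8 encoding of U+00E9
utf8::upgrade($s);           # same two code points, now stored upgraded
my $ok = utf8::decode($s);   # succeeds even though the flag was already on

printf "decoded=%s length=%d ord=U+%04X flag=%s\n",
    ($ok ? 'yes' : 'no'), length($s), ord($s),
    (utf8::is_utf8($s) ? 'on' : 'off');
# prints: decoded=yes length=1 ord=U+00E9 flag=on

my $wide = "\x{100}";        # cannot be downgraded to octets
my $ok_wide = utf8::decode($wide);
printf "decoded=%s ord=U+%04X\n", ($ok_wide ? 'yes' : 'no'), ord($wide);
# prints: decoded=no ord=U+0100
```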

Leon

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

felipe at felipegasper

Sep 3, 2021, 6:34 AM

Post #35 of 46 (1734 views)

> On Sep 3, 2021, at 3:48 AM, Tom Molesworth via perl5-porters <perl5-porters@perl.org> wrote:
>
> On Fri, 3 Sept 2021 at 14:30, demerphq <demerphq@gmail.com> wrote:
> On Thu, 2 Sept 2021 at 16:39, Dan Book <grinnz@gmail.com> wrote:
> On Thu, Sep 2, 2021 at 9:21 AM demerphq <demerphq@gmail.com> wrote:
> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com> wrote:
>
> What you call “a UTF-8 string” is what I propose we call, per existing nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with corresponding code changes. Then the term “UTF-8 string” makes sense from a pure-Perl context without requiring Perl programmers to worry about interpreter internals.
>
>
> No. The flag does not mean "upgraded" it means "unicode semantics, utf8 encoding". Upgrading is one way to get such a string, and it might even be the most common, but the most important and likely to be correct way is explicit decoding.
>
> If we are to rename the flag then we should just rename it as the UNICODE flag. Would have saved a world of confusion.
>
> This is exactly what we have defined as "upgraded". Decoding does not define the internal format of the resulting string at all. The only internal format which is upgraded is when the UTF8 flag is on.
>
> Your definition is wrong then. You seem to have "upgrading" and "decoding" muddled.
>
> Decoding most definitely DOES define the internal format of the result string. If you decode utf8 the result is a UTF8 on string. If that string contained utf8 representing codepoints above 127 then the result will be different.
>
> Given this:
>
> perl -e'use Devel::Peek; use Encode; print Dump(Encode::decode("UTF-8", "example"))'
> SV = PV(0x55b88281b2b0) at 0x55b88272e4e0
> REFCNT = 2
> FLAGS = (TEMP,POK,pPOK,UTF8)
> PV = 0x55b8827f45d0 "example"\0 [UTF8 "example"]
> CUR = 7
> LEN = 10
>
> I think the current behaviour is at least inefficient, if perhaps not outright *wrong*... why would decoding enforce the UTF8 flag?
>
> Put another way, if the resulting string has only codepoints 0..127, why not leave the flag off so that string operations can be more efficient?
>
> This extends to common cases such as UTF8-safe filter chains:
>
> echo "example" | perl -CSD -lne'use Devel::Peek; s{e$}{es}; print Dump($_)'
> SV = PV(0x556c3aab2000) at 0x556c3aaeb3e8
> REFCNT = 1
> FLAGS = (POK,pPOK,UTF8)
> PV = 0x556c3aae41b0 "examples"\0 [UTF8 "examples"]
> CUR = 8
> LEN = 24
>
> If that's not taking the faster pure-ASCII path for input, this would seem like an easy optimisation opportunity. If the behaviour only happened with the non-validating `utf8` decoding, then maybe it could be explained away by not wanting to walk the entire length of the string... but then I'd at least expect it to be different with the "UTF-8" encoding layer:
>
> echo "example" | perl -lne'use Devel::Peek; BEGIN { binmode STDIN, ":encoding(UTF-8)" } s{e$}{es}; print Dump($_)'
> SV = PV(0x55d8d2b76000) at 0x55d8d2baf4a8
> REFCNT = 1
> FLAGS = (POK,pPOK,UTF8)
> PV = 0x55d8d2bad300 "examples"\0 [UTF8 "examples"]
> CUR = 8
> LEN = 24
>
> So yes, decoding does set the UTF8 flag - but I'd argue that it *shouldn't*, and the current behaviour is somewhere between a historical accident and an oversight. To be clear, I'd expect the same non-UTF8 status in the examples so far, as we see from this:
>
> perl -e'use Devel::Peek; use utf8; my $text = "example"; print Dump($text)'
> SV = PV(0x55be3864aff0) at 0x55be3866fe60
> REFCNT = 1
> FLAGS = (POK,IsCOW,pPOK)
> PV = 0x55be3867f020 "example"\0
> CUR = 7
> LEN = 10
> COW_REFCNT = 1
>
> What am I missing here?

Try utf8::decode(); it uses Perl’s internal decoder and behaves as you expect (i.e., it leaves invariant strings alone).

Unicode::UTF8 mimics Encode.pm, though.

These are implementation details, though; the only thing the decoding algorithm itself requires is the correct translation of code points.
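
A short sketch of that difference, assuming an Encode that behaves like the
one in the Dump output quoted above (the variable names are illustrative):

```perl
use strict;
use warnings;
use Encode ();

# Encode::decode returns a flagged string even for pure-ASCII input.
my $via_encode = Encode::decode('UTF-8', 'example');

# utf8::decode works in place and leaves invariant (ASCII-only) input alone.
my $via_core = 'example';
utf8::decode($via_core);

printf "Encode::decode flag: %s\n", utf8::is_utf8($via_encode) ? 'on' : 'off';
printf "utf8::decode flag:   %s\n", utf8::is_utf8($via_core)   ? 'on' : 'off';
printf "same code points:    %s\n", $via_encode eq $via_core   ? 'yes' : 'no';
```

Both results compare equal; only the internal storage differs.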

-FG

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

felipe at felipegasper

Sep 3, 2021, 6:44 AM

Post #36 of 46 (1734 views)

> On Sep 3, 2021, at 2:30 AM, demerphq <demerphq@gmail.com> wrote:
>
> On Thu, 2 Sept 2021 at 16:39, Dan Book <grinnz@gmail.com> wrote:
> There is way too much written here so I will be responding as I can.
>
> On Thu, Sep 2, 2021 at 9:21 AM demerphq <demerphq@gmail.com> wrote:
> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com> wrote:
>
> > On Aug 20, 2021, at 1:05 PM, demerphq <demerphq@gmail.com> wrote:
> >
>
> decoding/encoding: converting a string from one form to the other in a way that transforms the codepoints from one form to a potentially different form. Traditional we speak of decode_utf8() taking a latin1 string containing octets that make up a utf8 encoded string, and returning a string which is UTF8 on which represents the Unicode version of those octets. For well formed input this results in no change to the underlying string, but the flag is flipped on. Vice versa we speak of encode_utf8() which converts its input to a utf8 encoded form, regardless of what form it was represented internally.
>
> This is incorrect. Decode converts a string of bytes at the logical level (upgraded or downgraded does not matter) and returns a string of characters at the logical level (upgraded or downgraded does not matter). It may commonly use upgraded or downgraded strings as the input or output for efficiency but this is not required.
>
> Nope *you* are wrong. Decoding does not use upgrading or downgrading. Decoding utf8 is logically equivalent to an upgrade operation when the string contains only codepoints 0-127. For any codepoint ABOVE that it does something very different.

Decoding doesn’t *use* upgrading or downgrading, but it accepts either form as input and may output either.

> Decoding most definitely DOES define the internal format of the result string. If you decode utf8 the result is a UTF8 on string. If that string contained utf8 representing codepoints above 127 then the result will be different.

This is wrong. Example:

> perl -MDevel::Peek -e'my $foo = "e"; utf8::decode($foo); Dump $foo'
SV = PV(0x7fdb0a804c70) at 0x7fdb0b00ccd0
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x7fdb0a501cc0 "e"\0
CUR = 1
LEN = 10

As an *implementation detail*, utf8::decode *happens* to set the flag when given UTF-8 for code points 128-255:

> perl -MDevel::Peek -e'my $foo = "é"; utf8::decode($foo); Dump $foo'
SV = PV(0x7fd600804c70) at 0x7fd6008162d0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x7fd6018026e0 "\303\251"\0 [UTF8 "\x{e9}"]
CUR = 2
LEN = 10

… but it would be just as valid -- and would print() the same way -- if utf8::decode() modified the PV to contain just \xe9.
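
To illustrate, a sketch that builds the "just \xe9" form by hand with utf8::downgrade() and compares it with what utf8::decode() produced (variable names are illustrative):

```perl
use strict;
use warnings;

my $decoded = "\xc3\xa9";
utf8::decode($decoded);           # U+00E9, stored as the two octets C3 A9

my $downgraded = $decoded;
utf8::downgrade($downgraded);     # U+00E9, stored as the single octet E9

printf "flags:   %s vs %s\n",
    (utf8::is_utf8($decoded)    ? 'on' : 'off'),
    (utf8::is_utf8($downgraded) ? 'on' : 'off');
printf "equal:   %s\n", ($decoded eq $downgraded ? 'yes' : 'no');
printf "lengths: %d and %d\n", length($decoded), length($downgraded);
```

The two scalars differ in flag and PV contents, yet eq and length() cannot tell them apart.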

> FG: You’re omitting what IMO is the most obvious purpose of the flag: to indicate whether the code points that the PV stores are the plain bytes, or are the UTF-8-decoded code points. This is why you can print() the string in either upgraded or downgraded forms, and it comes out the same.
>
> Yves: Its hard to say what you are referring to here. If you mean codepoints 0-127, then it is unsurprising as the representation of them is equivalent in ASCII and UTF8 and Latin1. But if you mean a codepoint above the ASCII plane, then no they should not come out the same. If you are piping that data to a file I would expect the octets written to that file to be different. (assuming a binary filehandle with no layers magically transforming things). If your terminal renders them the same then I assume it is doing some magic behind the scenes to deal with malformed utf8.
>
> DB: Not correct. An upgraded or downgraded string prints identically because you are printing the logical ordinals which do not change by this operation. Whether those ordinals are interpreted as bytes or Unicode characters depends what you are printing to, but in either case the internally-stored bytes are irrelevant to the user except to determine what those logical ordinals are
>
> Yves: Dude, you keep saying I am not correct when what I have said is easily verifiable.
>
> If you print chr(0xe9) to a filehandle and it does not contain the octet E9 then there is a problem
>
> If you print chr(0xe9) to a utf8 terminal it should render a Unicode replacement character for a broken utf8 sequence.
>
> If you print an encoded chr(0xe9) then it should rendr the glyph for E9.
>
> If you think anything else is happening then prove it with code.

These illustrate Dan’s point (assuming a UTF-8 terminal):

> perl -e'my $foo = "\xc3\xa9"; print $foo'
é

> perl -e'my $foo = "\xc3\xa9"; utf8::upgrade($foo); print $foo'
é

Upgraded or downgraded doesn’t change the logical content of the string; the important thing is the codepoints.

The cases you’ve mentioned -- pattern matching, system calls, and the like -- where a string’s internal storage *does* matter, e.g.:

> perl -e'my $foo = "é"; exec "echo", $foo'
é

> perl -e'my $foo = "é"; utf8::upgrade($foo); exec "echo", $foo'
Ã©

... are bugs in Perl. This is why the feature bundles enable the features that fix (some of) those bugs. (And why IMO Sys::Binmode should join them.)

> Show me the code. As far as I know decode operations do not operate on unicode strings. Calling decode_utf8 on a string which is utf8 is a noop.

This isn’t true for either definition of “UTF-8 string”. This shows an upgraded string whose codepoints are UTF-8 being decoded:

> perl -MDevel::Peek -e'my $foo = "\xc3\xa9"; utf8::upgrade($foo); Dump $foo; utf8::decode($foo); Dump $foo;'
SV = PV(0x7f93eb804c70) at 0x7f93ec00c8d0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x7f93eb5019c0 "\303\203\302\251"\0 [UTF8 "\x{c3}\x{a9}"]
CUR = 4
LEN = 10
SV = PV(0x7f93eb804c70) at 0x7f93ec00c8d0
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x7f93eb5019c0 "\303\251"\0 [UTF8 "\x{e9}"]
CUR = 2
LEN = 10

> It would be just as legitimate to mutate the PV to store a single octet, 0xe9, and leave the UTF8 flag off.
>
> Nope. That would mean that Perl would use ASCII/Latin-1 case folding rules on the result, which would be wrong. It should use Unicode case folding rules for codepoint E9 if it was decoded as that codepoint. (Change the example to \x{DF} and you can see these issues in the flesh, \x{DF} should match "ss" in Unicode, but in ASCII/Latin1 it only matches \x{DF}. The lc() version of \x{DF} is "ss" but in Latin-1/Ascii there are no multi-byte case folds). Even more suggestive that Perl doing this would be wrong is that in fact there is NO valid Unicode encoding of codepoint E9 which is only 1 octet long. So that would be extremely wrong of Perl to use a non Unicode encoding of unicode data dont you think? Also, what would perl do when the codepoint doesn't fit into a single octet? Your argument might have some merit if you were arguing that Perl could have decoded it into "\x{E9}\0\0\0" and set the UTF-32 flag, but as stated it doesn't make sense.
>
> Not correct. Under old rules, yes, the UTF8 flag determined whether Unicode rules are used in various operations; this was an abstraction break, and so the unicode_strings feature was added to fix the problem, and enabled in feature bundles since 5.12
>
> Ah, ok, so if you *change* the default mode of perl it does something different than I described, and that makes my comments "incorrect"? What i described is how "normal" perl without any new features enabled works. If there are features that change what I have said feel free to use them. But it doesnt change that what I said is an accurate version of how the perl internals normally function.

The problem is that Perl’s default behaviour is inconsistent: when outputting to filehandles, computing length() or ord(), comparing strings, etc., all code points are the same regardless of the internal storage format. But when doing pattern matches, Perl treats upgraded/wide/UTF8-flagged strings differently from downgraded/narrow/non-flagged ones.

The latter behaviour is considered a bug.
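
A minimal sketch of that inconsistency, assuming no feature bundle and no locale are in effect (the `no feature` line only makes the default explicit):

```perl
use strict;
use warnings;
no feature 'unicode_strings';    # the default when no bundle is loaded

my $downgraded = "\xE9";
my $upgraded   = "\xE9";
utf8::upgrade($upgraded);        # same code point, different storage

# Consistent: these do not depend on the storage format.
printf "eq: %s, lengths: %d/%d, ords: %X/%X\n",
    ($downgraded eq $upgraded ? 'yes' : 'no'),
    length($downgraded), length($upgraded),
    ord($downgraded), ord($upgraded);

# Inconsistent: the same pattern gives different answers.
my $down_matches = $downgraded =~ /\w/ ? 1 : 0;
my $up_matches   = $upgraded   =~ /\w/ ? 1 : 0;
print "downgraded matches \\w: $down_matches\n";   # 0
print "upgraded matches \\w:   $up_matches\n";     # 1
```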

-FG

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

leonerd at leonerd

Sep 3, 2021, 7:24 AM

Post #37 of 46 (1734 views)

On Wed, 18 Aug 2021 13:18:34 -0400
Felipe Gasper <felipe@felipegasper.com> wrote:

> PROPOSAL: Rename the following identifiers in code and documentation,
> leaving macros for the old ones as aliases:
> - SVf_UTF8 -> SVf_PVUPGRADED
> - SvUTF8 -> Sv_PVUPGRADED
> - SvUTF8_on -> Sv_PVUPGRADED_on
> - SvUTF8_off -> Sv_PVUPGRADED_off
> - SvPOK_only_UTF8 -> SvPOK_only_UPGRADED

This got briefly mentioned as a side-comment on PSC today.

Thoughts are "What about WIDE"? As in

SVf_WIDE (though really I'd want to call that SVppv_WIDE)
SvWIDE
SvWIDE_on
etc...

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

perl.p5p at rjbs

Sep 3, 2021, 8:39 AM

Post #38 of 46 (1734 views)

On Fri, Sep 3, 2021, at 10:24 AM, Paul "LeoNerd" Evans wrote:
> Thoughts are "What about WIDE"? As in
>
> SVf_WIDE (though really I'd want to call that SVppv_WIDE)
> SvWIDE
> SvWIDE_on

If we don't like wide because of "wide character in", there's always COOKED.

--
rjbs

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

felipe at felipegasper

Sep 3, 2021, 8:44 AM

Post #39 of 46 (1734 views)

> On Sep 3, 2021, at 10:24 AM, Paul LeoNerd Evans <leonerd@leonerd.org.uk> wrote:
>
> On Wed, 18 Aug 2021 13:18:34 -0400
> Felipe Gasper <felipe@felipegasper.com> wrote:
>
>> PROPOSAL: Rename the following identifiers in code and documentation,
>> leaving macros for the old ones as aliases:
>> - SVf_UTF8 -> SVf_PVUPGRADED
>> - SvUTF8 -> Sv_PVUPGRADED
>> - SvUTF8_on -> Sv_PVUPGRADED_on
>> - SvUTF8_off -> Sv_PVUPGRADED_off
>> - SvPOK_only_UTF8 -> SvPOK_only_UPGRADED
>
> This got briefly mentioned as a side-comment on PSC today.
>
> Thoughts are "What about WIDE"? As in
>
> SVf_WIDE (though really I'd want to call that SVppv_WIDE)
> SvWIDE
> SvWIDE_on
> etc...

IMO it would at least improve on the status quo.

My only reservation would be potential confusion with the notion of “wide character” (anything >255); someone might see that flag and think it means there’s a wide character in the string or some such.

-F

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

leonerd at leonerd

Sep 3, 2021, 8:51 AM

Post #40 of 46 (1734 views)

On Fri, 3 Sep 2021 11:44:18 -0400
Felipe Gasper <felipe@felipegasper.com> wrote:

> My only reservation would be potential confusion with the notion of
> “wide character” (anything >255); someone might see that flag and
> think it means there’s a wide character in the string or some such.

I think that's fine. An SVppv_WIDE string might well contain a "wide
character". If the string isn't SVppv_WIDE (i.e. it's "narrow"?), then
it definitely does not.
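
A small sketch of that asymmetry, using today's flag in place of the
proposed SVppv_WIDE name. can_downgrade() is a hypothetical helper for
this illustration only:

```perl
use strict;
use warnings;

# Hypothetical helper: true if a copy of $str can be stored as plain octets.
# The true second argument makes utf8::downgrade return false, not die.
sub can_downgrade {
    my ($str) = @_;
    return utf8::downgrade($str, 1) ? 1 : 0;
}

my $flagged_narrow = "caf\xE9";
utf8::upgrade($flagged_narrow);    # flag on, yet nothing above 255
my $has_wide = "\x{263A}";         # flag on, and a code point above 255

for my $str ($flagged_narrow, $has_wide) {
    printf "flag=%s downgradable=%s\n",
        (utf8::is_utf8($str) ? 'on' : 'off'),
        (can_downgrade($str) ? 'yes' : 'no');
}
# prints: flag=on downgradable=yes
#         flag=on downgradable=no
```

So the flag being on says nothing either way, while the flag being off
rules out any character above 255.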

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

perl.p5p at rjbs

Sep 3, 2021, 8:52 AM

Post #41 of 46 (1734 views)

On Thu, Sep 2, 2021, at 9:20 AM, demerphq wrote:
> No. The flag does not mean "upgraded" it means "unicode semantics, utf8 encoding". Upgrading is one way to get such a string, and it might even be the most common, but the most important and likely to be correct way is explicit decoding.

You wrote a whole lot, but this quote is, I think, at the center of what I have found confusing.

The utf8 flag on a scalar doesn't mean Unicode semantics. That way lies The Unicode Bug. Under the unicode_strings feature, recommended and in the version bundle since v5.12 (2010), all strings have unicode semantics and are treated as a sequence of codepoints when performing textish operations.

perl -E 'say "word" if "\xFF" =~ /\w/'

This string hasn't been upgraded, hasn't been decoded, and prior to unicode_strings, would not have matched.

My take here is that unicode_strings is a *bugfix* (fixing the "Unicode Bug"), and it sounds like you are implying that it is not, and that the correct behavior to learn is that the utf8 flag on a scalar is the *correct* way to know whether Unicode semantics would be applied. This surprises me.

--
rjbs

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

felipe at felipegasper

Sep 15, 2021, 1:02 PM

Post #42 of 46 (1729 views)

> On Sep 3, 2021, at 10:24 AM, Paul LeoNerd Evans <leonerd@leonerd.org.uk> wrote:
>
> On Wed, 18 Aug 2021 13:18:34 -0400
> Felipe Gasper <felipe@felipegasper.com> wrote:
>
>> PROPOSAL: Rename the following identifiers in code and documentation,
>> leaving macros for the old ones as aliases:
>> - SVf_UTF8 -> SVf_PVUPGRADED
>> - SvUTF8 -> Sv_PVUPGRADED
>> - SvUTF8_on -> Sv_PVUPGRADED_on
>> - SvUTF8_off -> Sv_PVUPGRADED_off
>> - SvPOK_only_UTF8 -> SvPOK_only_UPGRADED
>
> This got briefly mentioned as a side-comment on PSC today.
>
> Thoughts are "What about WIDE"? As in
>
> SVf_WIDE (though really I'd want to call that SVppv_WIDE)
> SvWIDE
> SvWIDE_on
> etc...

Now that this thread seems to have “settled” a bit, I wonder where this idea stands in the general mindset:

a) Good idea, worth the overhead of renaming a long-established identifier.

b) Good idea, but *not* worth that overhead.

c) Bad idea; the status quo is better than either of the proposed renames.

d) … some other stance?

To recap, arguments in favour include:

1) More accurate: “wide” encoding allows things that UTF-8 proper forbids, so calling it “UTF8” isn’t quite right.

2) The rename discourages thinking of the flag as indicating a “UTF-8 string”--a widely-held misconception.

3) The upheaval would highlight how the abstraction *should* work and hopefully right some lingering misconceptions out and about.

Thank you!

-FG

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

kimoto.yuki at gmail

Sep 16, 2021, 6:09 PM

Post #43 of 46 (1729 views)

2021-9-16 5:02 Felipe Gasper <felipe@felipegasper.com> wrote:

>
>
> 1) More accurate: “wide” encoding allows things that UTF-8 proper forbids,
> so calling it “UTF8” isn’t quite right.
>
>
>
I am currently studying UTF-8 and Unicode to come up with good ideas.

May I share my categorization of UTF-8?

A. Text - here "text" means Perl's representation of text

1. Loose UTF-8

This is not valid UTF-8.

It can contain:

3-byte surrogates

4-byte super characters (over U+10FFFF)

It doesn't contain:

Latin-1 codes

2. Valid UTF-8

This is valid UTF-8.

It doesn't contain:

3-byte surrogates

4-byte super characters (over U+10FFFF)

3. Valid Minimal UTF-8 (this is for security)

This is valid and minimal UTF-8 (normalized to the minimum number of bytes).

? is ? (? doesn't ? + ")

B. Bytes

Any bytes.

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

neilb at neilb

Nov 12, 2021, 5:00 AM

Post #44 of 46 (1584 views)

Hi Felipe,

We discussed this in a recent PSC meeting, and agreed that we’d like this to progress.

So can you submit this as a formal RFC please?

Cheers,
Neil

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

demerphq at gmail

Nov 12, 2021, 5:21 AM

Post #45 of 46 (1584 views)

I just want to repeat my strong objection to this. I really don't think we
should be forcing this on XS developers or muddying the waters any more.

Please don't do this.

cheers,
Yves

On Fri, 12 Nov 2021 at 14:00, Neil Bowers <neilb@neilb.org> wrote:

> Hi Felipe,
>
> We discussed this in a recent PSC meeting, and agreed that we’d like this
> to progress.
>
> So can you submit this as a formal RFC please?
>
> Cheers,
> Neil
>

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: Pre-RFC: Rename SVf_UTF8 et al. [ In reply to ]

fawaka at gmail

Nov 12, 2021, 6:47 AM

Post #46 of 46 (1584 views)

On Wed, Sep 15, 2021 at 10:02 PM Felipe Gasper <felipe@felipegasper.com>
wrote:

>
> > On Sep 3, 2021, at 10:24 AM, Paul LeoNerd Evans <leonerd@leonerd.org.uk>
> wrote:
> >
> > On Wed, 18 Aug 2021 13:18:34 -0400
> > Felipe Gasper <felipe@felipegasper.com> wrote:
> >
> >> PROPOSAL: Rename the following identifiers in code and documentation,
> >> leaving macros for the old ones as aliases:
> >> - SVf_UTF8 -> SVf_PVUPGRADED
> >> - SvUTF8 -> Sv_PVUPGRADED
> >> - SvUTF8_on -> Sv_PVUPGRADED_on
> >> - SvUTF8_off -> Sv_PVUPGRADED_off
> >> - SvPOK_only_UTF8 -> SvPOK_only_UPGRADED
> >
> > This got briefly mentioned as a side-comment on PSC today.
> >
> > Thoughts are "What about WIDE"? As in
> >
> > SVf_WIDE (though really I'd want to call that SVppv_WIDE)
> > SvWIDE
> > SvWIDE_on
> > etc...
>
> Now that this thread seems to have “settled” a bit, I wonder where this
> idea stands in the general mindset:
>
>
> a) Good idea, worth the overhead of renaming a long-established identifier.
>
> b) Good idea, but *not* worth that overhead.
>
> c) Bad idea; the status quo is better than either of the proposed renames.
>
> d) … some other stance?
>

I strongly believe it's not worth the overhead (from an effort and
confusion POV), and less strongly feel it's not a good idea.

Leon