Mailing List Archive: SvUTF8 predictability

SvUTF8 predictability

May 17, 2023, 5:17 AM

Post #1 of 16

In the somewhat sprawling PerlMonks thread "Seeking Perl docs about how
UTF8 flag propagates" [1] the original question seems rather valid:
given that the Unicode bug exists (and is not going away), one cannot
predict the behaviour of Perl code in some cases without knowing whether
the strings involved have SvUTF8_on in their internal representation.

However there is not to my knowledge any documentation that describes
the status of the flag on results of the various operations that return
or modify strings, nor any guarantee that the results will be the same
from one version to the next.

In testing, I was somewhat surprised to find that substr($foo, 0, 1)
will return a string with UTF8_off if the source string is UTF8_on but
has no bytes greater than 0x7f - determined by doing a potentially
expensive (and potentially unnecessary) length() on the source string.
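For anyone who wants to reproduce this observation, here is a minimal sketch that inspects the flag with the core function utf8::is_utf8(). The variable names are only illustrative, and since nothing guarantees the flag on the result of substr(), the second line of output may differ between perl versions.

```perl
use strict;
use warnings;

# An all-ASCII string, forced into the upgraded (UTF8_on) representation.
my $foo = "aa";
utf8::upgrade($foo);

my $first = substr($foo, 0, 1);

# utf8::is_utf8() reports the internal flag; it says nothing about content.
printf "source flag: %s\n", utf8::is_utf8($foo)   ? "on" : "off";
printf "substr flag: %s\n", utf8::is_utf8($first) ? "on" : "off";
```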

Should we be providing any guarantees in this area, or making explicit
that we offer none? Would it be legitimate, for example, to change
the substr() implementation such that a UTF8_on source always gave
a UTF8_on result? And if we did so, would we document that in the
changelog as a backwards-incompatible change?

My instinct is that we do not want to offer any guarantees (and that
we should state that explicitly). But I don't know where that leaves
someone who wants to look at some existing code (that is not using
any of the various forcing mechanisms described, for example, in
`perldoc -f lc`) and predict how it will behave.

Hugo

[1] https://perlmonks.org/index.pl?node_id=11152194

Re: SvUTF8 predictability

demerphq at gmail

May 17, 2023, 6:31 AM

Post #2 of 16

On Wed, 17 May 2023 at 14:42, <hv@crypt.org> wrote:
>
> In the somewhat sprawling PerlMonks thread "Seeking Perl docs about how
> UTF8 flag propagates" [1] the original question seems rather valid:
> given that the Unicode bug exists (and is not going away), one cannot
> predict the behaviour of Perl code in some cases without knowing whether
> the strings involved have SvUTF8_on in their internal representation.
>
> However there is not to my knowledge any documentation that describes
> the status of the flag on results of the various operations that return
> or modify strings, nor any guarantee that the results will be the same
> from one version to the next.
>
> In testing, I was somewhat surprised to find that substr($foo, 0, 1)
> will return a string with UTF8_off if the source string is UTF8_on but
> has no bytes greater than 0x7f - determined by doing a potentially
> expensive (and potentially unnecessary) length() on the source string.

I consider that a bug. I think you should file a bug report for that actually.

> Should we be providing any guarantees in this area, or making explicit
> that we offer none? Would it be legitimate, for example, to change
> the substr() implementation such that a UTF8_on source always gave
> a UTF8_on result? And if we did so, would we document that in the
> changelog as a backwards-incompatible change?

I would document it as a bug-fix.

> My instinct is that we do not want to offer any guarantees (and that
> we should state that explicitly). But I don't know where that leaves
> someone who wants to look at some existing code (that is not using
> any of the various forcing mechanisms described, for example, in
> `perldoc -f lc`) and predict how it will behave.

I think the rule must be that a substring (and, by implication, each
piece returned by split) should have the same utf8ness as the string it
came from. Frankly, I thought that was the rule already. Otherwise there
would be contradictions. Consider something like:

use Test::More;
my @str = ("a\x{100}", "aa");
for my $str (@str) {
    utf8::upgrade($str);
    # Match against the whole string, then against only its first character.
    my $first_match  = ("\x{DF}" . $str) =~ /ss/i;
    my $second_match = ("\x{DF}" . substr($str, 0, 1)) =~ /ss/i;
    is($first_match, $second_match);
}
done_testing();

It is a logical contradiction if $first_match doesn't equal
$second_match; ergo this must be a bug.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: SvUTF8 predictability

perl5-porters at perl

May 17, 2023, 6:40 AM

Post #3 of 16

This behaviour is within expected parameters.

SvUTF8 is internal to Perl. Applications shouldn’t care about it. Perl’s documentation already notes this:

https://perldoc.perl.org/perlunifaq#What-is-%22the-UTF8-flag%22?

Now, the counterpoint here is that many of Perl’s own built-ins *require* applications to care about SvUTF8. (See my 2022 Perl conference talk for details.) Sys::Binmode on CPAN fixes this problem and (IMO) should be standard boilerplate in new Perl code.

Perl *could*, yes, preserve the flag. But having the flag on for all-ASCII strings (i.e., all characters <= 0x7f) is inefficient because Perl has to parse the string as UTF-8 in order to use it.

-FG

Re: SvUTF8 predictability

demerphq at gmail

May 17, 2023, 6:43 AM

Post #4 of 16

On Wed, 17 May 2023 at 15:41, Felipe Gasper via perl5-porters
<perl5-porters@perl.org> wrote:
>
> This behaviour is within expected parameters.
>
> SvUTF8 is internal to Perl. Applications shouldn’t care about it. Perl’s documentation already notes this:

I really don't agree. It leads to contradiction in fairly simple
operations which should not contradict each other. The only sane
behavior is that the substring of a string should have the same
utf8ness as the string it came from.

cheers,
Yves

Re: SvUTF8 predictability

perl5-porters at perl

May 17, 2023, 7:08 AM

Post #5 of 16

> On May 17, 2023, at 9:43 AM, demerphq <demerphq@gmail.com> wrote:
>
> On Wed, 17 May 2023 at 15:41, Felipe Gasper via perl5-porters
> <perl5-porters@perl.org> wrote:
>>
>> This behaviour is within expected parameters.
>>
>> SvUTF8 is internal to Perl. Applications shouldn’t care about it. Perl’s documentation already notes this:
>
> I really dont agree. It leads to contradiction in fairly simple
> operations which should not contradict each other. The only sane
> behavior is that the substring of a string should have the same
> utf8ness as the string it came from.

The example you gave works if “unicode_strings” is enabled. I assume you know that and think that the vast amount of code out there that leaves “unicode_strings” off is to be considered the norm.

In that vein, though: would changing substr()’s behaviour vis-à-vis SvUTF8 potentially *break* applications that depend on the status quo?

Perl’s documentation seems a bit two-faced here: it says applications shouldn’t care about SvUTF8 but then doesn’t steer people away from Perl’s “sharp edges” where that abstraction breaks. The perlunifaq section I linked before doesn’t mention the “unicode_strings” feature, for example. I’ve previously argued that those inconsistencies should be noted more prominently across the board; e.g., `perldoc exec` should note somewhere that SvUTF8 matters here.
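To make the "sharp edge" concrete: as I understand it, built-ins such as exec and mkdir pass the string's internal buffer to the operating system, and that buffer differs between the two representations even when the code points are identical. A minimal sketch using only core functions (bytes::length() is meant for debugging like this, not for production code; the variable names are illustrative):

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length() without enabling byte semantics

my $down = "caf\xE9";      # 4 code points, stored as 4 bytes
my $up   = $down;
utf8::upgrade($up);        # same 4 code points, now stored as UTF-8

print "equal as strings\n" if $down eq $up;
printf "internal bytes: %d vs %d\n",
    bytes::length($down), bytes::length($up);

# Downgrading first is one way to hand a system call a predictable buffer;
# without a second argument it dies if any code point is above 0xFF.
utf8::downgrade($up);
printf "after downgrade: %d bytes\n", bytes::length($up);
```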

-F

Re: SvUTF8 predictability

demerphq at gmail

May 17, 2023, 8:40 AM

Post #6 of 16

On Wed, 17 May 2023 at 16:08, Felipe Gasper <felipe@felipegasper.com> wrote:
>
>
> > On May 17, 2023, at 9:43 AM, demerphq <demerphq@gmail.com> wrote:
> >
> > On Wed, 17 May 2023 at 15:41, Felipe Gasper via perl5-porters
> > <perl5-porters@perl.org> wrote:
> >>
> >> This behaviour is within expected parameters.
> >>
> >> SvUTF8 is internal to Perl. Applications shouldn’t care about it. Perl’s documentation already notes this:
> >
> > I really dont agree. It leads to contradiction in fairly simple
> > operations which should not contradict each other. The only sane
> > behavior is that the substring of a string should have the same
> > utf8ness as the string it came from.
>
> The example you gave works if “unicode_strings” is enabled. I assume you know that and think that the vast amount of code out there that leaves “unicode_strings” off is to be considered the norm.

I think that unicode_strings is orthogonal to us violating the
principle of least surprise when it is not forced on us. Code should
not be inconsistent in that way.

> In that vein, though: would changing substr()’s behaviour vis-à-vis SvUTF8 potentially *break* applications that depend on the status quo?

Likely /something/ will break if we fix this bug. But that doesn't
mean we shouldn't fix it. I suspect that fixing it will actually fix
some code that is broken and the authors don't know it.

> Perl’s documentation seems a bit two-faced here: it says applications shouldn’t care about SvUTF8 but then doesn’t steer people away from Perl’s “sharp edges” where that abstraction breaks. The perlunifaq section I linked before doesn’t mention the “unicode_strings” feature, for example. I’ve previously argued that those inconsistencies should be noted more prominently across the board; e.g., `perldoc exec` should note somewhere that SvUTF8 matters here.

I think that the question of "when should you care about the flag" is
orthogonal to "the general behavior of a piece of code shouldn't
change because of the code points contained within a string". The bug
hv reported is of that nature.

my $str= "abcd\x{100}";
for (1..2) {
    for my $o (0 .. length($str)-1) {
        my $ok = (+(substr($str, $o, 1)."\x{DF}") =~ /ss/i) ? "yes" : "no";
        print $ok,"\n";
    }
    chop($str);
    print "---\n";
}

$ perl t.pl
yes
yes
yes
yes
yes
---
no
no
no
no
---

My position is that chopping a character from a string should not
result in substr() returning a string with a different flag setting.
To me that is madness. It is action at a distance (the existence of a
codepoint in the string at one position governs the flag of a
substring of another position? wtf?!), and it violates the principle
of least surprise. When something violates two "rules of thumb of good
design" I think Occam's razor says "this is a bug".

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: SvUTF8 predictability

demerphq at gmail

May 17, 2023, 9:03 AM

Post #7 of 16

On Wed, 17 May 2023 at 17:40, demerphq <demerphq@gmail.com> wrote:
> My position is that chopping a character from a string should not
> result in substr() returning a string with a different flag setting.
> To me that is madness. It is action at a distance (the existence of a
> codepoint in the string at one position governs the flag of a
> substring of another position? wtf?!), and it violates the principle
> of least surprise. When something violates two "rules of thumb of good
> design" i think occams razor says "this is a bug".

Another logical contradiction this bug creates is that the substr() of
a string is not equivalent to the chopped version of that string with
the same contents.

my $str1 = "a\x{100}";
my $str2 = substr($str1, 0, 1);
chop $str1;
my $str3 = substr($str1, 0, 1);

I don't see how it is reasonable that $str2 and $str3 are different in
this case. The only reasonable outcome is that all three should be
identical and it shouldn't matter which you choose to use in subsequent
logic.

cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: SvUTF8 predictability

perl5-porters at perl

May 17, 2023, 9:08 AM

Post #8 of 16

> On May 17, 2023, at 11:40 AM, demerphq <demerphq@gmail.com> wrote:
>
> On Wed, 17 May 2023 at 16:08, Felipe Gasper <felipe@felipegasper.com> wrote:
>>
>> In that vein, though: would changing substr()’s behaviour vis-à-vis SvUTF8 potentially *break* applications that depend on the status quo?
>
> Likely /something/ will break if we fix this bug. But that doesn't
> mean we shouldn't fix it. I suspect that fixing it will actually fix
> some code that is broken and the authors don't know it.

This same logic would support the notion of fixing exec, mkdir, et al. Thus far, though, p5p has preferred “the devil we know”.

> My position is that chopping a character from a string should not
> result in substr() returning a string with a different flag setting.
> To me that is madness. It is action at a distance (the existence of a
> codepoint in the string at one position governs the flag of a
> substring of another position? wtf?!), and it violates the principle
> of least surprise. When something violates two "rules of thumb of good
> design" i think occams razor says "this is a bug".

The documentation states that Perl applications shouldn’t care about the flag. If that’s true, then there is no actual change to the string happening except implementation details that aren’t properly of concern to applications.

If it’s wrong, of course, then I agree with you that this is a bug, and the documentation should change. But that would seem quite a big about-face. The “party line” has for some time been that Perl strings are, for applications, just opaque sequences of code points, abstraction leaks notwithstanding.

> my $str1 = "a\x{100}";
> my $str2 = substr($str1, 0, 1);
> chop $str1;
> my $str3 = substr($str1, 0, 1);
>
> I dont see how it is reasonable that $str2 and $str3 are different in
> this case. The only reasonable outcome is that all three should be
> identical and it shouldnt matter which you choose to use in subsequent
> logic.

They *are* identical:

-----
my $str1 = "a\x{100}";
my $str2 = substr($str1, 0, 1);
chop $str1;
my $str3 = substr($str1, 0, 1);

CORE::say $str1 eq $str2;
CORE::say $str2 eq $str3;
CORE::say $str1 eq $str3;
-----
1
1
1
-----

-F

Re: SvUTF8 predictability

demerphq at gmail

May 17, 2023, 9:31 AM

Post #9 of 16

On Wed, 17 May 2023 at 18:08, Felipe Gasper <felipe@felipegasper.com> wrote:
>
>
> > On May 17, 2023, at 11:40 AM, demerphq <demerphq@gmail.com> wrote:
> >
> > On Wed, 17 May 2023 at 16:08, Felipe Gasper <felipe@felipegasper.com> wrote:
> >>
> >> In that vein, though: would changing substr()’s behaviour vis-à-vis SvUTF8 potentially *break* applications that depend on the status quo?
> >
> > Likely /something/ will break if we fix this bug. But that doesn't
> > mean we shouldn't fix it. I suspect that fixing it will actually fix
> > some code that is broken and the authors don't know it.
>
> This same logic would support the notion of fixing exec, mkdir, et al. Thus far, though, p5p has preferred “the devil we know”.
>
> > My position is that chopping a character from a string should not
> > result in substr() returning a string with a different flag setting.
> > To me that is madness. It is action at a distance (the existence of a
> > codepoint in the string at one position governs the flag of a
> > substring of another position? wtf?!), and it violates the principle
> > of least surprise. When something violates two "rules of thumb of good
> > design" i think occams razor says "this is a bug".
>
> The documentation states that Perl applications shouldn’t care about the flag. If that’s true, then there is no actual change to the string happening except implementation details that aren’t properly of concern to applications.

It is not true and has never been true.

> If it’s wrong, of course, then I agree with you that this is a bug, and the documentation should change. But that would seem quite a big about-face. The “party line” has for some time been that Perl strings are, for applications, just opaque sequences of code points, abstraction leaks notwithstanding.

The "party" is wrong. This is an example of something being perceived
to be true just because people have repeated it enough times.

In math, one way you can prove something isn't true is to assume it is
true and then show that if it were true it would lead to logical
inconsistencies. IMO this is such a case. Some people shrug off the
logical inconsistencies and call them "abstraction leaks" or paper
over the issues by calling them a "bug", neither of which is really
scientific. It isn't an abstraction leak, it is a deliberate backwards
compatibility measure.

> > my $str1 = "a\x{100}";
> > my $str2 = substr($str1, 0, 1);
> > chop $str1;
> > my $str3 = substr($str1, 0, 1);
> >
> > I dont see how it is reasonable that $str2 and $str3 are different in
> > this case. The only reasonable outcome is that all three should be
> > identical and it shouldnt matter which you choose to use in subsequent
> > logic.
>
> They *are* identical:
>
> -----
> my $str1 = "a\x{100}";
> my $str2 = substr($str1, 0, 1);
> chop $str1;
> my $str3 = substr($str1, 0, 1);
>
> CORE::say $str1 eq $str2;
> CORE::say $str2 eq $str3;
> CORE::say $str1 eq $str3;

String equivalence doesn't test that something is identical to
something else, it checks string equivalence which is a different
concept. Something can be string equivalent and yet completely unlike
something else.

$ perl -le'package X { use overload "eq" => sub { 1 }; } my $o = bless [], "X"; if ($o eq "foo") { print "string equivalent" }'
string equivalent

$ cat t2.pl
use Devel::Peek;
my $str1 = "a\x{100}";
my $str2 = substr($str1, 0, 1);
chop $str1;
my $str3 = substr($str1, 0, 1);
print +("\x{DF}" . $_)=~/ss/i ? "yes" : "no","\n" for $str1, $str2, $str3;

$ perl t2.pl
yes
yes
no

So not identical.
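The distinction can also be checked directly. In this sketch, same_storage() is a hypothetical helper (not a core function) that compares both the content and the internal flag; the last two lines show where the difference becomes visible when unicode_strings is not in effect.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if content AND internal representation agree.
sub same_storage {
    my ($x, $y) = @_;
    return $x eq $y
        && (utf8::is_utf8($x) ? 1 : 0) == (utf8::is_utf8($y) ? 1 : 0);
}

my $down = "a";
my $up   = "a";
utf8::upgrade($up);    # explicit upgrade, so the flag state is not in doubt

print "eq:           ", ($down eq $up             ? "same" : "different"), "\n";
print "same_storage: ", (same_storage($down, $up) ? "same" : "different"), "\n";

# Without unicode_strings, the flag leaks into case-insensitive matching.
print "down: ", (("\x{DF}" . $down) =~ /ss/i ? "yes" : "no"), "\n";
print "up:   ", (("\x{DF}" . $up)   =~ /ss/i ? "yes" : "no"), "\n";
```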

cheers,
yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"

Re: SvUTF8 predictability

perl5-porters at perl

May 17, 2023, 10:16 AM

Post #10 of 16

On 2023-05-17 15:43, demerphq wrote:
> On Wed, 17 May 2023 at 15:41, Felipe Gasper via perl5-porters
> <perl5-porters@perl.org> wrote:
>> This behaviour is within expected parameters.
>>
>> SvUTF8 is internal to Perl. Applications shouldn’t care about it. Perl’s documentation already notes this:
> I really dont agree. It leads to contradiction in fairly simple
> operations which should not contradict each other. The only sane
> behavior is that the substring of a string should have the same
> utf8ness as the string it came from.

There is sane, and there are expectations. :)

Another way to make things more predictable is to downgrade where feasible.
(some lexical auto::downgrade mechanism?)

See also https://metacpan.org/pod/Sys::Binmode
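As a rough illustration of "downgrade where feasible": the core function utf8::downgrade() accepts a second argument that makes it return false instead of dying when the string cannot be downgraded. The wrapper name downgrade_if_possible() is made up for this sketch; as far as I know, no lexical auto-downgrade pragma of the kind suggested above exists.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy stored as bytes when every code point
# fits in 0x00..0xFF, otherwise return the string unchanged.
sub downgrade_if_possible {
    my ($str) = @_;
    utf8::downgrade($str, 1);    # 1 = FAIL_OK: do not die on wide characters
    return $str;
}

my $ascii = "aa";
utf8::upgrade($ascii);           # all-ASCII, but stored with the flag on
my $wide = "a\x{100}";           # needs the upgraded format

for my $s ($ascii, $wide) {
    my $out = downgrade_if_possible($s);
    printf "%d code points, flag %s\n",
        length($out), utf8::is_utf8($out) ? "on" : "off";
}
```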

-- -- -- -- -- --

Sanest could well be to completely switch to UTF-8 for strings,
and to add raw byte buffers as a separate type.
(raw: also without any layers/localization)

my $b :raw = "\x80"; # 1 byte

my $s = "\x80"; # would then be 2 bytes
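The ":raw" attribute above is proposed syntax, not something perl supports today. The byte counts in the comments can, however, be reproduced with the existing core function utf8::encode(), which converts a string of code points into its UTF-8 bytes in place. A minimal sketch with illustrative variable names:

```perl
use strict;
use warnings;

my $chars = "\x80";          # one code point, U+0080

my $bytes = $chars;
utf8::encode($bytes);        # now a byte string holding the UTF-8 encoding

printf "code points: %d\n", length($chars);    # 1
printf "UTF-8 bytes: %d\n", length($bytes);    # 2
```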

-- Ruud

Re: SvUTF8 predictability

perl5-porters at perl

May 17, 2023, 10:33 AM

Post #11 of 16

> On May 17, 2023, at 1:16 PM, Ruud H.G. van Tol via perl5-porters <perl5-porters@perl.org> wrote:
>
>
> On 2023-05-17 15:43, demerphq wrote:
>> On Wed, 17 May 2023 at 15:41, Felipe Gasper via perl5-porters
>> <perl5-porters@perl.org> wrote:
>>> This behaviour is within expected parameters.
>>>
>>> SvUTF8 is internal to Perl. Applications shouldn’t care about it. Perl’s documentation already notes this:
>> I really dont agree. It leads to contradiction in fairly simple
>> operations which should not contradict each other. The only sane
>> behavior is that the substring of a string should have the same
>> utf8ness as the string it came from.
>
> There is sane, and there are expectations. :)
>
> Another way to make things more predictable, is to downgrade where feasible.
> (some lexical auto::downgrade mechanism?)

Auto-downgrade would seem problematic in terms of performance.

> Sanest could well be to completely switch to UTF-8 for strings,
> and to add raw byte buffers as a separate type.
> (raw: also without any layers/localization)
>
> my $b :raw = "\x80"; # 1 byte
>
> my $s = "\x80"; # would then be 2 bytes

Reliable text vs. byte differentiation would be a wonderful thing! It’s the ideal path forward. It would arguably be even better than native classes. Perl could prevent lots of subtly-wrong behaviours, use Windows’s Unicode API, and probably make other cool improvements.

I’ve wondered before if it’d be possible by snagging two high bits from the refcount to store a string’s text-vs-bytes-vs-unknown state. Alas, I lack the bandwidth (and knowledge) to work on it effectively.

-FG

Re: SvUTF8 predictability

grinnz at gmail

May 17, 2023, 11:26 AM

Post #12 of 16

On Wed, May 17, 2023 at 8:42 AM <hv@crypt.org> wrote:

> In the somewhat sprawling PerlMonks thread "Seeking Perl docs about how
> UTF8 flag propagates" [1] the original question seems rather valid:
> given that the Unicode bug exists (and is not going away), one cannot
> predict the behaviour of Perl code in some cases without knowing whether
> the strings involved have SvUTF8_on in their internal representation.
>
> However there is not to my knowledge any documentation that describes
> the status of the flag on results of the various operations that return
> or modify strings, nor any guarantee that the results will be the same
> from one version to the next.
>
> In testing, I was somewhat surprised to find that substr($foo, 0, 1)
> will return a string with UTF8_off if the source string is UTF8_on but
> has no bytes greater than 0x7f - determined by doing a potentially
> expensive (and potentially unnecessary) length() on the source string.
>
> Should we be providing any guarantees in this area, or making explicit
> that we offer none? Would it be legitimate, for example, to change
> the substr() implementation such that a UTF8_on source always gave
> a UTF8_on result? And if we did so, would we document that in the
> changelog as a backwards-incompatible change?
>
> My instinct is that we do not want to offer any guarantees (and that
> we should state that explicitly). But I don't know where that leaves
> someone who wants to look at some existing code (that is not using
> any of the various forcing mechanisms described, for example, in
> `perldoc -f lc`) and predict how it will behave.
>

Your instinct is correct and patches are welcome to make this clearer.
There are no guarantees as to how the result of a string operation will
be stored, other than that the upgraded format (SvUTF8 on) is used when
it is required because the string contains codepoints greater than 255.
What happens can change, and has changed, between versions of Perl and
with use of the utf8 pragma and the unicode_strings feature; this is
intended and necessary for the abstraction.

-Dan

Re: SvUTF8 predictability

grinnz at gmail

May 17, 2023, 11:29 AM

Post #13 of 16

On Wed, May 17, 2023 at 2:26 PM Dan Book <grinnz@gmail.com> wrote:

> On Wed, May 17, 2023 at 8:42 AM <hv@crypt.org> wrote:
>
>> In the somewhat sprawling PerlMonks thread "Seeking Perl docs about how
>> UTF8 flag propagates" [1] the original question seems rather valid:
>> given that the Unicode bug exists (and is not going away), one cannot
>> predict the behaviour of Perl code in some cases without knowing whether
>> the strings involved have SvUTF8_on in their internal representation.
>>
>> However there is not to my knowledge any documentation that describes
>> the status of the flag on results of the various operations that return
>> or modify strings, nor any guarantee that the results will be the same
>> from one version to the next.
>>
>> In testing, I was somewhat surprised to find that substr($foo, 0, 1)
>> will return a string with UTF8_off if the source string is UTF8_on but
>> has no bytes greater than 0x7f - determined by doing a potentially
>> expensive (and potentially unnecessary) length() on the source string.
>>
>> Should we be providing any guarantees in this area, or making explicit
>> that we offer none? Would it be legitimate, for example, to change
>> the substr() implementation such that a UTF8_on source always gave
>> a UTF8_on result? And if we did so, would we document that in the
>> changelog as a backwards-incompatible change?
>>
>> My instinct is that we do not want to offer any guarantees (and that
>> we should state that explicitly). But I don't know where that leaves
>> someone who wants to look at some existing code (that is not using
>> any of the various forcing mechanisms described, for example, in
>> `perldoc -f lc`) and predict how it will behave.
>>
>
> Your instinct is correct and patches are welcome to make this clearer.
> There are no guarantees to how the result of a string operation will be
> stored other than when the upgraded format (SvUTF8 on) is required because
> the string contains codepoints greater than 255. Perl can and has changed
> what occurs between versions of Perl and usage of the utf8 pragma and the
> unicode_strings feature and this is intended and necessary for the
> abstraction.
>

And "where that leaves someone" is that this property is not relevant
information to anyone who is not investigating the internal storage of Perl
values or debugging the (too common) bugs in XS code and core. Non-buggy
string operations like eq and regex (under unicode_strings) have the same
results regardless of which format a string is stored in. So there is no
need to predict or care.
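A small sketch of that claim, using lc() (the built-in whose documentation was mentioned at the start of the thread). The two wrapper subs are illustrative only; the feature is lexically scoped, so it can be enabled inside a single sub.

```perl
use strict;
use warnings;

sub lc_legacy {                  # no unicode_strings in scope
    my ($s) = @_;
    return lc $s;
}

sub lc_unicode {
    use feature 'unicode_strings';
    my ($s) = @_;
    return lc $s;
}

my $down = "\xC9";               # LATIN CAPITAL LETTER E WITH ACUTE
my $up   = $down;
utf8::upgrade($up);              # same code point, upgraded storage

# With the feature, the storage format makes no difference.
print "unicode_strings agree: ",
    (lc_unicode($down) eq lc_unicode($up) ? "yes" : "no"), "\n";

# Without it, the Unicode bug shows: only the upgraded copy is lowercased.
print "legacy agree:          ",
    (lc_legacy($down) eq lc_legacy($up) ? "yes" : "no"), "\n";
```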

-Dan

Re: SvUTF8 predictability

perl5-porters at perl

May 17, 2023, 11:34 AM

Post #14 of 16

Op 17-05-2023 om 18:31 schreef demerphq:
> $ cat t2.pl
> use Devel::Peek;
> my $str1 = "a\x{100}";
> my $str2 = substr($str1, 0, 1);
> chop $str1;
> my $str3 = substr($str1, 0, 1);
> print +("\x{DF}" . $_)=~/ss/i ? "yes" : "no","\n" for $str1, $str2, $str3;
>
> $ perl t2.pl
> yes
> yes
> no
>

$ cat t/t2.pl
use Devel::Peek;
use feature 'unicode_strings';
my $str1 = "a\x{100}";
my $str2 = substr($str1, 0, 1);
chop $str1;
my $str3 = substr($str1, 0, 1);
print +("\x{DF}" . $_)=~/ss/i ? "yes" : "no","\n" for $str1, $str2, $str3;

$ perl t/t2.pl
yes
yes
yes

We already know about the unicode bug. So there is an argument to be
made here that we should try to avoid it in this instance and classify
the substr behaviour as a bug, but there is also an argument to be made
that it is already fixed, just use unicode_strings.

I'm on the fence, but leaning to the latter.

M4

Re: SvUTF8 predictability

May 18, 2023, 12:27 PM

Post #15 of 16

On Wed, May 17, 2023 at 06:31:48PM +0200, demerphq wrote:
> On Wed, 17 May 2023 at 18:08, Felipe Gasper <felipe@felipegasper.com> wrote:
> > The documentation states that Perl applications shouldn’t care about the flag. If that’s true, then there is no actual change to the string happening except implementation details that aren’t properly of concern to applications.
>
> It is not true and has never been true.

Well, it's partly true, and it's been perl's aspiration for many years. As
in (a hypothetical position statement):

"A well-written perl program should not need to concern itself with how
perl stores a string internally. *Except* for the Unicode Bug, which
says that the semantics of perl can sometimes change depending on how the
string is stored internally, especially the semantics of codepoints
\x80..\xff in things like regexes. Over the years we've tried to reduce
the scope of the Unicode Bug via things like 'use feature
"unicode_strings"' and //u, //a and //aa etc."

The bug in the code you showed is the Unicode Bug rather than a bug in
substr(). Now you could argue that it would help people work round the
Unicode Bug by getting substr() and similar to preserve the UTF8 flag,
but that's not fixing a bug in substr(); that's just working around
a bug in the design of perl circa 5.8.0.
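One of those scope-reducing tools in action: the /u modifier pins a single pattern to Unicode rules regardless of how the target string is stored. A minimal sketch, with illustrative variable names:

```perl
use strict;
use warnings;

my $down = "\xDF";      # LATIN SMALL LETTER SHARP S, stored as one byte
my $up   = $down;
utf8::upgrade($up);     # same code point, UTF8 flag on

for my $s ($down, $up) {
    my $flag    = utf8::is_utf8($s) ? "on " : "off";
    my $default = $s =~ /ss/i  ? "yes" : "no";   # depends on storage
    my $unicode = $s =~ /ss/iu ? "yes" : "no";   # Unicode rules always
    print "flag $flag  /i: $default  /iu: $unicode\n";
}
```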

As for Hugo's original question: I don't think it's reasonable to document
perl's behaviour vis-a-vis UTF8 flag behaviour. It will vary between
releases, and it may well vary between different code paths (for example
hypothetically rvalue and lvalue substr() might differ). It would also
constrict any future bug fixes or optimisations.

--
Wesley Crusher gets beaten up by his classmates for being a smarmy git,
and consequently has a go at making some friends of his own age for a
change.
-- Things That Never Happen in "Star Trek" #18

Re: SvUTF8 predictability

public at khwilliamson

May 18, 2023, 6:08 PM

Post #16 of 16

On 5/17/23 11:33, Felipe Gasper via perl5-porters wrote:
>
>> On May 17, 2023, at 1:16 PM, Ruud H.G. van Tol via perl5-porters <perl5-porters@perl.org> wrote:
>>
>>
>> On 2023-05-17 15:43, demerphq wrote:
>>> On Wed, 17 May 2023 at 15:41, Felipe Gasper via perl5-porters
>>> <perl5-porters@perl.org> wrote:
>>>> This behaviour is within expected parameters.
>>>>
>>>> SvUTF8 is internal to Perl. Applications shouldn’t care about it. Perl’s documentation already notes this:
>>> I really dont agree. It leads to contradiction in fairly simple
>>> operations which should not contradict each other. The only sane
>>> behavior is that the substring of a string should have the same
>>> utf8ness as the string it came from.
>>
>> There is sane, and there are expectations. :)
>>
>> Another way to make things more predictable, is to downgrade where feasible.
>> (some lexical auto::downgrade mechanism?)
>
> Auto-downgrade would seem problematic in terms of performance.
>
>> Sanest could well be to completely switch to UTF-8 for strings,
>> and to add raw byte buffers as a separate type.
>> (raw: also without any layers/localization)
>>
>> my $b :raw = "\x80"; # 1 byte
>>
>> my $s = "\x80"; # would then be 2 bytes
>
> Reliable text vs. byte differentiation would be a wonderful thing! It’s the ideal path forward. It would arguably be even better than native classes. Perl could prevent lots of subtly-wrong behaviours, use Windows’s Unicode API, and probably make other cool improvements.
>
> I’ve wondered before if it’d be possible by snagging two high bits from the refcount to store a string’s text-vs-bytes-vs-unknown state. Alas, I lack the bandwidth (and knowledge) to work on it effectively.
>
> -FG

Here are my current thoughts on a way forward. I'd like to replace this
flag with a 3 bit state with the following values

utf8ness unknown
utf8ness immaterial
not utf8
utf8 but downgradeable
utf8 required
utf8 and unknown if required