CORRECTED Re: Pre-RFC: Rename SVf_UTF8 et al.
> On Sep 3, 2021, at 9:34 AM, Felipe Gasper <felipe@felipegasper.com> wrote:
>
>> On Sep 3, 2021, at 3:48 AM, Tom Molesworth via perl5-porters <perl5-porters@perl.org> wrote:
>>
>> On Fri, 3 Sept 2021 at 14:30, demerphq <demerphq@gmail.com> wrote:
>> On Thu, 2 Sept 2021 at 16:39, Dan Book <grinnz@gmail.com> wrote:
>> On Thu, Sep 2, 2021 at 9:21 AM demerphq <demerphq@gmail.com> wrote:
>> On Fri, 20 Aug 2021 at 19:48, Felipe Gasper <felipe@felipegasper.com> wrote:
>>
>> What you call “a UTF-8 string” is what I propose we call, per existing nomenclature (i.e., utf8::upgrade()) “a PV-upgraded string”, with corresponding code changes. Then the term “UTF-8 string” makes sense from a pure-Perl context without requiring Perl programmers to worry about interpreter internals.
>>
>>
>> No. The flag does not mean "upgraded"; it means "unicode semantics, utf8 encoding". Upgrading is one way to get such a string, and it might even be the most common, but the most important way, and the one most likely to be correct, is explicit decoding.
>>
>> If we are to rename the flag then we should just rename it as the UNICODE flag. That name would have saved a world of confusion.
>>
>> This is exactly what we have defined as "upgraded". Decoding does not define the internal format of the resulting string at all. A string's internal format is upgraded exactly when the UTF8 flag is on.
>>
>> Your definition is wrong then. You seem to have "upgrading" and "decoding" muddled.
>>
>> Decoding most definitely DOES define the internal format of the result string. If you decode utf8, the result is a string with the UTF8 flag on. If that string contained utf8 representing codepoints above 127, then the result will be different.
>>
>> Given this:
>>
>> perl -e'use Devel::Peek; use Encode; print Dump(Encode::decode("UTF-8", "example"))'
>> SV = PV(0x55b88281b2b0) at 0x55b88272e4e0
>>   REFCNT = 2
>>   FLAGS = (TEMP,POK,pPOK,UTF8)
>>   PV = 0x55b8827f45d0 "example"\0 [UTF8 "example"]
>>   CUR = 7
>>   LEN = 10
>>
>> I think the current behaviour is at least inefficient, if perhaps not outright *wrong*... why would decoding enforce the UTF8 flag?
>>
>> Put another way, if the resulting string has only codepoints 0..127, why not leave the flag off so that string operations can be more efficient?
>>
>> This extends to common cases such as UTF8-safe filter chains:
>>
>> echo "example" | perl -CSD -lne'use Devel::Peek; s{e$}{es}; print Dump($_)'
>> SV = PV(0x556c3aab2000) at 0x556c3aaeb3e8
>>   REFCNT = 1
>>   FLAGS = (POK,pPOK,UTF8)
>>   PV = 0x556c3aae41b0 "examples"\0 [UTF8 "examples"]
>>   CUR = 8
>>   LEN = 24
>>
>> If that's not taking the faster pure-ASCII path for input, this would seem like an easy optimisation opportunity. If the behaviour only happened with the non-validating `utf8` decoding, then maybe it could be explained away by not wanting to walk the entire length of the string... but then I'd at least expect it to be different with the "UTF-8" encoding layer:
>>
>> echo "example" | perl -lne'use Devel::Peek; BEGIN { binmode STDIN, ":encoding(UTF-8)" } s{e$}{es}; print Dump($_)'
>> SV = PV(0x55d8d2b76000) at 0x55d8d2baf4a8
>>   REFCNT = 1
>>   FLAGS = (POK,pPOK,UTF8)
>>   PV = 0x55d8d2bad300 "examples"\0 [UTF8 "examples"]
>>   CUR = 8
>>   LEN = 24
>>
>> So yes, decoding does set the UTF8 flag - but I'd argue that it *shouldn't*, and the current behaviour is somewhere between a historical accident and an oversight. To be clear, I'd expect the same non-UTF8 status in the examples so far, as we see from this:
>>
>> perl -e'use Devel::Peek; use utf8; my $text = "example"; print Dump($text)'
>> SV = PV(0x55be3864aff0) at 0x55be3866fe60
>>   REFCNT = 1
>>   FLAGS = (POK,IsCOW,pPOK)
>>   PV = 0x55be3867f020 "example"\0
>>   CUR = 7
>>   LEN = 10
>>   COW_REFCNT = 1
>>
>> What am I missing here?
>
> Try utf8::decode(); it uses Perl’s internal decoder and behaves as you expect (i.e., it leaves invariant strings alone).

CORRECTION/CLARIFICATION: utf8::decode() leaves an invariant *PV* alone (and leaves the flag off).

-FG