In the somewhat sprawling PerlMonks thread "Seeking Perl docs about how
UTF8 flag propagates" [1] the original question seems rather valid:
given that the Unicode bug exists (and is not going away), one cannot
predict the behaviour of Perl code in some cases without knowing whether
the strings involved have SvUTF8_on in their internal representation.
However there is not to my knowledge any documentation that describes
the status of the flag on results of the various operations that return
or modify strings, nor any guarantee that the results will be the same
from one version to the next.
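
For anyone who wants a concrete case of the dependency, here is a minimal sketch, assuming no locale and no unicode_strings feature in scope. The variable names are mine, purely for illustration. Two strings that compare equal match \w differently depending only on the flag:

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # make sure the old semantics are in effect

my $down = "\xe9";              # e-acute, stored as one byte, UTF8 flag off
my $up   = "\xe9";
utf8::upgrade($up);             # same character, stored as UTF-8, flag on

my $same_string  = ($down eq $up)   ? 1 : 0;
my $down_is_word = ($down =~ /\w/)  ? 1 : 0;
my $up_is_word   = ($up   =~ /\w/)  ? 1 : 0;

# The strings are equal, yet only the upgraded one matches \w.
printf "eq: %d, down \\w: %d, up \\w: %d\n",
    $same_string, $down_is_word, $up_is_word;
```
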
In testing, I was somewhat surprised to find that substr($foo, 0, 1)
will return a string with UTF8_off if the source string is UTF8_on but
has no bytes greater than 0x7f - determined by doing a potentially
expensive (and potentially unnecessary) length() on the source string.
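
A rough sketch of the kind of check I mean, using utf8::is_utf8 to inspect the flag (the helper name flag() is just something I made up for this example). I deliberately give no expected output for the substr() results, since whether they are stable across versions is exactly the open question. Note that the result is copied to a variable first, so that substr() is evaluated in rvalue context:

```perl
use strict;
use warnings;

# Report the UTF8 flag of a scalar as 1 or 0.
sub flag { return utf8::is_utf8($_[0]) ? 1 : 0 }

my $ascii = "abc";
utf8::upgrade($ascii);          # flag on, but every byte is <= 0x7f

my $wide = "abc\x{100}";        # flag on, contains a multi-byte character

# Copy the results so substr() runs in rvalue context.
my $ascii_piece = substr($ascii, 0, 1);
my $wide_piece  = substr($wide,  0, 1);

printf "ascii source: %d, substr result: %d\n",
    flag($ascii), flag($ascii_piece);
printf "wide source:  %d, substr result: %d\n",
    flag($wide), flag($wide_piece);
```
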
Should we be providing any guarantees in this area, or making explicit
that we offer none? Would it be legitimate, for example, to change
the substr() implementation such that a UTF8_on source always gave
a UTF8_on result? And if we did so, would we document that in the
changelog as a backwards-incompatible change?
My instinct is that we do not want to offer any guarantees (and that
we should state that explicitly). But I don't know where that leaves
someone who wants to look at some existing code (that is not using
any of the various forcing mechanisms described, for example, in
`perldoc -f lc`) and predict how it will behave.
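
To illustrate what a forcing mechanism buys you, here is a small sketch, again assuming no locale is in effect and with variable names of my own choosing. Without the unicode_strings feature the result of lc depends on the flag; with the feature in scope it does not:

```perl
use strict;
use warnings;
no feature 'unicode_strings';   # start from the old semantics

my $down = "\xc9";              # E-acute, UTF8 flag off
my $up   = "\xc9";
utf8::upgrade($up);             # same character, UTF8 flag on

# Old semantics: the result depends on the internal representation.
my ($lc_down_plain, $lc_up_plain) = (lc $down, lc $up);

# With the feature in scope, both strings get Unicode rules.
my ($lc_down_forced, $lc_up_forced);
{
    use feature 'unicode_strings';
    ($lc_down_forced, $lc_up_forced) = (lc $down, lc $up);
}

printf "plain:  %s\n", ($lc_down_plain  eq $lc_up_plain  ? "same" : "differ");
printf "forced: %s\n", ($lc_down_forced eq $lc_up_forced ? "same" : "differ");
```
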
Hugo
[1] https://perlmonks.org/index.pl?node_id=11152194