Mailing List Archive: Pre-RFC: New C API for converting from UTF-8 to code point

Pre-RFC: New C API for converting from UTF-8 to code point

public at khwilliamson

Jun 28, 2022, 8:17 AM

Post #1 of 8 (514 views)

In response to GH #19897 and GH #19842, I think we need to come up with
a better API to replace the deprecated functions.

One of the issues with the existing API is that the behavior changes
depending on whether warnings are enabled or not; something usually
outside the purview of a module author. There's also the problem in
some cases of having to disambiguate the return being successful or not.

To that end, I'm proposing the following API. For the current draft,
I'm using the name 'next_uvchr'; suggestions welcome

Most code would do the following to process input:
const U8 * s;
const U8 * e;
UV code_point;
PERL_INT_FAST8_T retlen;

while (s < e) {
code_point = next_uvchr(s, e, &retlen);

... process code_point ...

s += retlen;
}

This loop would safely go through the string of bytes s .. e-1, assumed
to be intended to be encoded as Perl extended UTF-8, converting the next
UTF-8 encoded character to its code point equivalent, and storing into
retlen the number of bytes that character occupies.

The program would not have to concern itself with malformed input; the
function would take care of that by itself, returning REPLACEMENT
CHARACTER for each malformed sequence, and setting retlen to be the
offset of the starting position of the next potentially legal character.
If utf8 warnings are on, those would be raised for each iteration that
found a malformation.

If you don't want to consume the input string, just pass NULL instead of
&retlen, or perhaps there could be

peek_uvchr(s, e)

Those few programs that want more control could use

next_uvchr(s, e, &retlen, flags)

'flags' would be any of the ones accepted by utf8n_to_uvchr(), as
documented in perlapi, with the addition of

UTF8_RETURN_NEGATIVE_LENGTH_ON_ERROR (or some such name)

This would change the function to set into retlen the negative value of
how many bytes it consumed, if and only if this character was malformed.
Then the loop innards would look like:

while (s < e) {
code_point = next_uvchr(s, e, &retlen);

if (retlen < 0) {
... process error ...
retlen = -retlen;
}
else {
... process code_point ...
}

s += retlen; // or abs(retlen)
}

The reason for this is that the typical naive handling is safe, and the
same function signature would work for more refined handling.

If you wanted to have complete control of error handling, there would be

AV *msgs;
next_uvchr_msgs(s, e, &retlen, flags, &msgs)

corresponding to the existing utf8n_to_uvchr_msgs(),

ppport.h would port all these back to 5.6.1

Re: Pre-RFC: New C API for converting from UTF-8 to code point [ In reply to ]

tony at develop-help

Jun 28, 2022, 6:55 PM

Post #2 of 8 (514 views)

On Tue, Jun 28, 2022 at 09:17:50AM -0600, Karl Williamson wrote:
> In response to GH #19897 and GH #19842, I think we need to come up with a
> better API to replace the deprecated functions.
>
> One of the issues with the existing API is that the behavior changes
> depending on whether warnings are enabled or not; something usually outside
> the purview of a module author. There's also the problem in some cases of
> having to disambiguate the return being successful or not.

Are these intended to replace just the deprecated functions, or also
as an easier to use version of utf8n_to_uvchr() and its variants?

It might be worth describing how the new APIs differ from the existing
non-deprecated APIs. The obvious difference is start/end vs
start/length, but I think error reporting is handled differently too.

> The program would not have to concern itself with malformed input; the
> function would take care of that by itself, returning REPLACEMENT CHARACTER
> for each malformed sequence, and setting retlen to be the offset of the
> starting position of the next potentially legal character. If utf8 warnings
> are on, those would be raised for each iteration that found a malformation.

Would there be a simple way to prevent this API producing warnings?

Tony

Re: Pre-RFC: New C API for converting from UTF-8 to code point [ In reply to ]

Jun 28, 2022, 7:27 PM

Post #3 of 8 (514 views)

Karl Williamson <public@khwilliamson.com> wrote:
: Then the loop innards would look like:
:
: while (s < e) {
: code_point = next_uvchr(s, e, &retlen);
:
: if (retlen < 0) {

I'd recommend making this `if (retlen <= 0) {` or otherwise handling
the retlen == 0 case, even if you only expect that when s >= e: the
function should be capable of doing something reasonable on an empty
string, which probably means not croaking.

: ... process error ...
: retlen = -retlen;
: }

Hugo

Re: Pre-RFC: New C API for converting from UTF-8 to code point [ In reply to ]

public at khwilliamson

Jun 29, 2022, 3:10 PM

Post #4 of 8 (514 views)

On 6/28/22 20:27, hv@crypt.org wrote:
> Karl Williamson <public@khwilliamson.com> wrote:
> : Then the loop innards would look like:
> :
> : while (s < e) {
> : code_point = next_uvchr(s, e, &retlen);
> :
> : if (retlen < 0) {
>
> I'd recommend making this `if (retlen <= 0) {` or otherwise handling
> the retlen == 0 case, even if you only expect that when s >= e: the
> function should be capable of doing something reasonable on an empty
> string, which probably means not croaking.
>
> : ... process error ...
> : retlen = -retlen;
> : }
>
> Hugo

It is currently illegal to call the existing functions with zero length
input. The functions don't now croak on zero length input, but they do
assert against it, which on DEBUGGING builds is pretty much the same thing.

IIRC the reason is performance to avoid an essentially useless
conditional, which otherwise would penalize everyone. Since this is XS
code calling us, there are plenty of other attack vectors available for
a malicious call.

retlen is guaranteed to never be 0. Even is s==e, the code would
attempt to read a byte, and if it didn't segfault, that byte is
guaranteed to indicate a non-zero length

Re: Pre-RFC: New C API for converting from UTF-8 to code point [ In reply to ]

public at khwilliamson

Jun 29, 2022, 3:39 PM

Post #5 of 8 (514 views)

On 6/28/22 19:55, Tony Cook wrote:
> On Tue, Jun 28, 2022 at 09:17:50AM -0600, Karl Williamson wrote:
>> In response to GH #19897 and GH #19842, I think we need to come up with a
>> better API to replace the deprecated functions.
>>
>> One of the issues with the existing API is that the behavior changes
>> depending on whether warnings are enabled or not; something usually outside
>> the purview of a module author. There's also the problem in some cases of
>> having to disambiguate the return being successful or not.
>
> Are these intended to replace just the deprecated functions, or also
> as an easier to use version of utf8n_to_uvchr() and its variants?

I would want to replace core calls with these, which would simplify the
logic in most places.
>
> It might be worth describing how the new APIs differ from the existing
> non-deprecated APIs. The obvious difference is start/end vs
> start/length, but I think error reporting is handled differently too.

I can add such a description. Some existing functions already have
start/end pointers rather than start/length parameters. I have found
through unhappy experience that it is quite possible for naive functions
to end up calling another with something that should be negative, but
ends up being treated as an extremely large positive number. Using
SSize_t avoids that, but cuts the permissible max length in half.

And some of the current functions that do take a Size_t parameter
promptly add it to the start so they can do 'while (s < e)', which leads
to more elegant loops. So it is slightly more efficient generally to
use the start/end form. If you need to keep the original starting
position, creating a s0 variable initialized to that works well.

The main difference would be that the return code doesn't have to be
disambiguated. Right now you can't tell whether the input was a NUL
character or an error, without testing each return. cpan modules don't
tend to do this, leaving them vulnerable. This API is so that you can
be naive and still be safe.

The other major difference is that the state of 'use warnings' doesn't
affect what the function returns; that it now does is quite problematic.

>
>> The program would not have to concern itself with malformed input; the
>> function would take care of that by itself, returning REPLACEMENT CHARACTER
>> for each malformed sequence, and setting retlen to be the offset of the
>> starting position of the next potentially legal character. If utf8 warnings
>> are on, those would be raised for each iteration that found a malformation.
>
> Would there be a simple way to prevent this API producing warnings?

I had been considering adding a flag to override the state of 'use
warnings' to turn them off unconditionally (but not the inverse). But
the API does already offer that capability without the flag, and you can
get the inverse as well. If you use the '_msgs' form, it doesn't raise
any warnings, but returns a hash of the ones it otherwise would have.
You can discard the hash, or translate it into the language of your
choice, or output it yourself, regardless of 'use warnings'. The hash
also lets you know precisely what the malformation was.

There certainly could be a flag to just not raise any warnings.

>
> Tony

Re: Pre-RFC: New C API for converting from UTF-8 to code point [ In reply to ]

Jun 29, 2022, 4:56 PM

Post #6 of 8 (514 views)

Karl Williamson <public@khwilliamson.com> wrote:
:On 6/28/22 20:27, hv@crypt.org wrote:
:> Karl Williamson <public@khwilliamson.com> wrote:
:> : Then the loop innards would look like:
:> :
:> : while (s < e) {
:> : code_point = next_uvchr(s, e, &retlen);
:> :
:> : if (retlen < 0) {
:>
:> I'd recommend making this `if (retlen <= 0) {` or otherwise handling
:> the retlen == 0 case, even if you only expect that when s >= e: the
:> function should be capable of doing something reasonable on an empty
:> string, which probably means not croaking.
:>
:> : ... process error ...
:> : retlen = -retlen;
:> : }
:
:It is currently illegal to call the existing functions with zero length
:input. The functions don't now croak on zero length input, but they do
:assert against it, which on DEBUGGING builds is pretty much the same thing.

Ok, let's hope module writers habitually develop with a debug build.

And in another reply:
:The main difference would be that the return code doesn't have to be
:disambiguated. Right now you can't tell whether the input was a NUL
:character or an error, without testing each return. cpan modules don't
:tend to do this, leaving them vulnerable. This API is so that you can
:be naive and still be safe.

How naive do you want? Anyone that fails to check for negative retlen
can fail quite unsafely.

I think we've established in previous discussion that you are more
comfortable with this sort of overloaded interface than I am, so I
do not wish to belabour the point. But this does rather remind me of
https://github.com/Perl/perl5/issues/14498, where pretty much every
use of grok_atou in core was buggy. I do recommend reading through
that discussion again to see if there's stuff worth learning that
could be applicable to this API.

Hugo

Re: Pre-RFC: New C API for converting from UTF-8 to code point [ In reply to ]

public at khwilliamson

Jul 1, 2022, 2:31 PM

Post #7 of 8 (511 views)

On 6/29/22 17:56, hv@crypt.org wrote:
> Karl Williamson <public@khwilliamson.com> wrote:
> :On 6/28/22 20:27, hv@crypt.org wrote:
> :> Karl Williamson <public@khwilliamson.com> wrote:
> :> : Then the loop innards would look like:
> :> :
> :> : while (s < e) {
> :> : code_point = next_uvchr(s, e, &retlen);
> :> :
> :> : if (retlen < 0) {
> :>
> :> I'd recommend making this `if (retlen <= 0) {` or otherwise handling
> :> the retlen == 0 case, even if you only expect that when s >= e: the
> :> function should be capable of doing something reasonable on an empty
> :> string, which probably means not croaking.
> :>
> :> : ... process error ...
> :> : retlen = -retlen;
> :> : }
> :
> :It is currently illegal to call the existing functions with zero length
> :input. The functions don't now croak on zero length input, but they do
> :assert against it, which on DEBUGGING builds is pretty much the same thing.
>
> Ok, let's hope module writers habitually develop with a debug build.

I would be amenable to croaking if called improperly outside debugging
builds. I'm not open to the function ever returning 0 length. It's
better to croak than loop indefinitely. I've been there, done that,
with the current interface.
>
> And in another reply:
> :The main difference would be that the return code doesn't have to be
> :disambiguated. Right now you can't tell whether the input was a NUL
> :character or an error, without testing each return. cpan modules don't
> :tend to do this, leaving them vulnerable. This API is so that you can
> :be naive and still be safe.
>
> How naive do you want? Anyone that fails to check for negative retlen
> can fail quite unsafely.

I'm not clear as to what you are thinking here. By safe, I mean
sanitized, returning a value that won't be interpreted wrongly due to
invalid UTF-8. This could be from a noisy line, an unintentional error,
or a deliberate attack. The function very deliberately is designed so
that, if called in the prescribed manner, the caller doesn't have to
concern itself with those things. The input is not what it should be,
but it has been sanitized to not be misinterpretable. Unlike parsing a
string to calculate a number, the REPLACEMENT CHARACTER is a value
reserved by Unicode to have just this particular meaning. It has no
other use.

Most code wouldn't know what to do in the event of an error.
Serializers, for example, just pass through the data. It's going to not
be the original, but it will be sanitized. There's generally no real
added value to them knowing there was an error, and having to deal with
it. Discarding an invalid value has been shown to make one vulnerable
to attack. The only real option is then to use the REPLACEMENT
CHARACTER, which is what this interface does with no muss or fuss on the
part of the caller.

Code that reads from a buffer that is being filled by another process
does however want to look for the specific error of the final character
so far being incomplete. Such code would call the _msgs form of the
function and if it is that specific error, do something like sleep and
try again later. For all other errors, it would just accept the
proffered REPLACEMENT CHARACTER.

Or higher level code could ask for a retransmission. The interface
allows for code to get the necessary information. But most code really
can't do anything. Take pattern matching. If the pattern is looking
for a particular character, and it gets a malformed one, it doesn't know
what the intended one was, nor does it have a way to ask for the value
again. If the pattern was to match '.' at this point, it will match,
because the whatever the intended character was would match. The code
in the regex matching engine needn't have to be made more complicated to
deal with something that it can't do any better

>
> I think we've established in previous discussion that you are more
> comfortable with this sort of overloaded interface than I am, so I
> do not wish to belabour the point. But this does rather remind me of
> https://github.com/Perl/perl5/issues/14498, where pretty much every
> use of grok_atou in core was buggy. I do recommend reading through
> that discussion again to see if there's stuff worth learning that
> could be applicable to this API.

As a counter example, in that discussion you refer approvingly to the
grok_bslash_x API. IIRC, it was me who came up with that interface,
precisely to solve the problem of getting the wrong number silently, or
even with a warning, but still wrongly acted upon.

But that is a different problem. The REPLACEMENT CHARACTER is a
sentinel that blares out that this string is problematic.

From reading that discussion, I think you would be more comfortable
with something like this:

while (s < e) {
if (! next_uvchr(s, e, &ret_cp, &ret_len)) {
... process error ...
}
else {
... process code_point ...
}

s += ret_len;
}

or if the caller didn't care to handle errors

while (s < e) {
(void) next_uvchr(s, e, &ret_cp, &ret_len);

... process code_point ...

s += ret_len;
}

I would consider this. I designed the original proposal to be more of a
drop-in replacement for the current one. But this one gets rid of the
flag to say you want errors, and might be close enough to drop-in. I
await more people to weigh-in.

>
> Hugo

Re: Pre-RFC: New C API for converting from UTF-8 to code point [ In reply to ]

Jul 1, 2022, 3:29 PM

Post #8 of 8 (511 views)

Karl Williamson <public@khwilliamson.com> wrote:
: From reading that discussion, I think you would be more comfortable
:with something like this:
:
: while (s < e) {
: if (! next_uvchr(s, e, &ret_cp, &ret_len)) {
: ... process error ...
: }
: else {
: ... process code_point ...
: }
:
: s += ret_len;
: }

That certainly feels like a cleaner design to me, by avoiding the
overloading of ret_len to have multiple meanings.

:I would consider this. I designed the original proposal to be more of a
:drop-in replacement for the current one. But this one gets rid of the
:flag to say you want errors, and might be close enough to drop-in. I
:await more people to weigh-in.

Thanks for considering it.

Hugo