Mailing List Archive

Re: PEP 624: Remove Py_UNICODE encoder APIs
On Tue, Feb 2, 2021 at 9:40 PM Emily Bowman <silverbacknet@gmail.com> wrote:
>
> On Tue, Feb 2, 2021 at 3:47 AM Inada Naoki <songofacandy@gmail.com> wrote:
>>
>> But when wchar_t* is UTF-16, ucs2_utf8_encoder() can not handle
>> surrogate escape.
>> We need to use a temporary Unicode object. That is what "inefficient" means.
>
>
> Since real UCS-2 is effectively dead, maybe it should be flipped around: Make UTF-16 be the efficient path and UCS-2 be the path that needs to round-trip through Unicode. But I suppose that's out of scope for this PEP.
>
> -Em

Note that ucs2_utf8_encoder() is currently used only for encoding
Python Unicode objects.
A Unicode object is latin1, UCS2, or UCS4. It is never UTF-16.
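
A minimal sketch of the three storage widths, assuming CPython 3.3 or
later (PEP 393). The helper name bytes_per_char() is made up for this
illustration, and sys.getsizeof() reports a CPython implementation
detail, so treat the numbers as indicative only:

```python
import sys


def bytes_per_char(ch):
    """Estimate per-character storage by growing a string of `ch` by one."""
    return sys.getsizeof(ch * 101) - sys.getsizeof(ch * 100)


# One sample character per PEP 393 storage kind.
for label, ch in [("latin1", "a"), ("UCS2", "\u20ac"), ("UCS4", "\U0001F600")]:
    print(label, bytes_per_char(ch))

# A non-BMP character is one code point in a str object,
# but two 16-bit code units once encoded as UTF-16.
smiley = "\U0001F600"
print(len(smiley), len(smiley.encode("utf-16-le")) // 2)
```

The str object holds the non-BMP character as a single UCS4 code
point; the surrogate pair appears only after encoding to UTF-16.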

So if we add UTF-16 support to ucs2_utf8_encoder(), we would have to
add and maintain that code solely for PyUnicode_EncodeUTF8 (encoding
from wchar_t* into char*).
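
To make the difference concrete, here is a rough Python sketch. It is
not the CPython C implementation; all function names below are
hypothetical. It compares encoding each 16-bit unit on its own (what a
UCS2-only encoder effectively does) with an encoder that first
combines surrogate pairs:

```python
def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def encode_code_point(cp):
    """Encode one integer code point as UTF-8 bytes (no validity checks)."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def naive_ucs2_encode(units):
    """Treat every 16-bit unit as a complete code point (UCS2 view)."""
    return b"".join(encode_code_point(u) for u in units)


def utf16_aware_encode(units):
    """Combine high/low surrogate pairs before encoding (UTF-16 view)."""
    out = bytearray()
    i = 0
    while i < len(units):
        cp = units[i]
        if (0xD800 <= cp <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 1  # skip the low surrogate consumed above
        out += encode_code_point(cp)
        i += 1
    return bytes(out)


text = "a\U0001F600"
units = utf16_units(text)
print(naive_ucs2_encode(units) == text.encode("utf-8"))   # False
print(utf16_aware_encode(units) == text.encode("utf-8"))  # True
```

The naive version emits two 3-byte sequences for the surrogate pair
instead of one 4-byte sequence, which is not valid UTF-8. The extra
branch in utf16_aware_encode() is the kind of code that would exist
only for the wchar_t* path, and this sketch still ignores unpaired
surrogates and error handlers.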

I don't think it is a good deal. As described in the PEP, encoder APIs
are used very rarely.
We must not add any maintenance costs for them.

Regards,
--
Inada Naoki <songofacandy@gmail.com>
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/KDYTBQDA4UFE6XWYENOV32ZRTCTAYEPC/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: PEP 624: Remove Py_UNICODE encoder APIs
On Tue, Feb 2, 2021 at 11:47 PM Inada Naoki <songofacandy@gmail.com> wrote:
> So if we add UTF-16 support to ucs2_utf8_encoder(), we would have to
> add and maintain that code solely for PyUnicode_EncodeUTF8 (encoding
> from wchar_t* into char*).
>
> I don't think it is a good deal. As described in the PEP, encoder APIs
> are used very rarely.
> We must not add any maintenance costs for them.

I fixed tons of bugs in the Python 2.7 and Python 3 codecs before
PEP 393 (compact strings) to properly handle 16-bit wchar_t, that is,
to properly handle surrogate characters. The implementation was
complex and slow. I would prefer not to move backwards to that :-(

If you are curious, look at the PyUnicode_FromWideChar() implementation
and search for find_maxchar_surrogates() to get an idea of the cost of
handling UTF-16 surrogate pairs. For a full codec, it's way more
complex, and painful to write and to maintain. I'm happy that we were
able to remove that thanks to PEP 393!
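
As a rough idea of that extra pass, here is a Python sketch loosely
modeled on what such a scan has to compute. It is not the actual C
code, and the name scan_utf16_units() is made up for this example.
Before the string can even be allocated, the whole buffer has to be
walked once to find the widest code point and count surrogate pairs:

```python
def scan_utf16_units(units):
    """Return (max_code_point, surrogate_pairs) for a list of 16-bit units."""
    max_cp = 0
    pairs = 0
    i = 0
    n = len(units)
    while i < n:
        cp = units[i]
        # A high surrogate followed by a low surrogate forms one code point.
        if (0xD800 <= cp <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            pairs += 1
            i += 1  # skip the low surrogate
        max_cp = max(max_cp, cp)
        i += 1
    return max_cp, pairs


units = [0x61, 0xD83D, 0xDE00]      # "a" followed by U+1F600 in UTF-16
max_cp, pairs = scan_utf16_units(units)
print(hex(max_cp), pairs)
print(len(units) - pairs)           # number of code points in the result
```

A real codec needs this kind of pair detection in every encode and
decode loop, plus error handling for unpaired surrogates, which this
sketch leaves out.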

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/OAPVKJAU6QZCMEWRQSYEDTGO6VAO5ZAN/
Re: PEP 624: Remove Py_UNICODE encoder APIs
On Tue, Feb 2, 2021 at 8:40 PM Inada Naoki <songofacandy@gmail.com> wrote:
>
> On Tue, Feb 2, 2021 at 7:37 PM M.-A. Lemburg <mal@egenix.com> wrote:
> >
> > BTW: I don't understand this comment:
> > "They are inefficient on platforms wchar_t* is UTF-16. It is because
> > built-in codecs supports only UCS-1, UCS-2, and UCS-4 input."
> >
> > Windows is one such platform. Java (indirectly) is another. They both
> > store UTF-16LE in those arrays and Python's codecs handle this just
> > fine.
> >
>
> I'm sorry that the section is not clear.
>
> For example, if wchar_t* is UCS4, ucs4_utf8_encoder() can encode
> wchar_t* into UTF-8.
>
> But when wchar_t* is UTF-16, ucs2_utf8_encoder() can not handle
> surrogate escape.
> We need to use a temporary Unicode object. That is what "inefficient" means.
>
> I will update the section to explain this in more detail.
>

I updated the "Alternative Ideas" section of the PEP.
https://www.python.org/dev/peps/pep-0624/#alternative-ideas

They replace `Py_UNICODE*` with `PyObject*`, `Py_UCS4*`, and `wchar_t*`.
I explicitly noted that some codecs can bypass temporary Unicode objects:

"""
UTF-8, UTF-16, UTF-32 encoders support Py_UCS4 internally. So
PyUnicode_EncodeUTF8(), PyUnicode_EncodeUTF16(), and
PyUnicode_EncodeUTF32() can avoid to create a temporary Unicode
object.
"""

--
Inada Naoki <songofacandy@gmail.com>
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/AD7YKV33JAQXIXDTGUMH7UDSMQUEKVMG/
