Mailing List Archive

PEP 624: Remove Py_UNICODE encoder APIs
Hi, folks.

Since the previous discussion was suspended without consensus, I wrote
a new PEP for it. (Thank you Victor for reviewing it!)

This PEP looks very similar to PEP 623 "Remove wstr from Unicode",
but for encoder APIs, not for Unicode object APIs.

URL (not available yet): https://www.python.org/dev/peps/pep-0624/

---

PEP: 624
Title: Remove Py_UNICODE encoder APIs
Author: Inada Naoki <songofacandy@gmail.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 06-Jul-2020
Python-Version: 3.11


Abstract
========

This PEP proposes to remove deprecated ``Py_UNICODE`` encoder APIs in
Python 3.11:

* ``PyUnicode_Encode()``
* ``PyUnicode_EncodeASCII()``
* ``PyUnicode_EncodeLatin1()``
* ``PyUnicode_EncodeUTF7()``
* ``PyUnicode_EncodeUTF8()``
* ``PyUnicode_EncodeUTF16()``
* ``PyUnicode_EncodeUTF32()``
* ``PyUnicode_EncodeUnicodeEscape()``
* ``PyUnicode_EncodeRawUnicodeEscape()``
* ``PyUnicode_EncodeCharmap()``
* ``PyUnicode_TranslateCharmap()``
* ``PyUnicode_EncodeDecimal()``
* ``PyUnicode_TransformDecimalToASCII()``

.. note::

   `PEP 623 <https://www.python.org/dev/peps/pep-0623/>`_ proposes to remove
   Unicode object APIs relating to ``Py_UNICODE``. This PEP, on the other hand,
   does not relate to the Unicode object. The PEPs are split because they have
   different motivations and need separate discussions.


Motivation
==========

In general, reducing the number of APIs that have been deprecated for
a long time and have few users is a good idea: it not only improves
the maintainability of CPython, but also helps API users and other
Python implementations.


Rationale
=========

Deprecated since Python 3.3
---------------------------

``Py_UNICODE`` and the APIs using it have been deprecated since Python 3.3.


Inefficient
-----------

All of these APIs are implemented using ``PyUnicode_FromWideChar``,
so they are inefficient when the user wants to encode a Unicode
object.


Not used widely
---------------

A search of the top 4000 PyPI packages [1]_ found that only pyodbc uses
these APIs:

* ``PyUnicode_EncodeUTF8()``
* ``PyUnicode_EncodeUTF16()``

pyodbc uses these APIs to encode a Unicode object into a bytes object,
so it is easy to fix. [2]_


Alternative APIs
================

There are alternative APIs that accept ``PyObject *unicode`` instead of
``Py_UNICODE *``. Users can migrate to them.


=========================================  ==========================================
Deprecated API                             Alternative APIs
=========================================  ==========================================
``PyUnicode_Encode()``                     ``PyUnicode_AsEncodedString()``
``PyUnicode_EncodeASCII()``                ``PyUnicode_AsASCIIString()`` \(1)
``PyUnicode_EncodeLatin1()``               ``PyUnicode_AsLatin1String()`` \(1)
``PyUnicode_EncodeUTF7()``                 \(2)
``PyUnicode_EncodeUTF8()``                 ``PyUnicode_AsUTF8String()`` \(1)
``PyUnicode_EncodeUTF16()``                ``PyUnicode_AsUTF16String()`` \(3)
``PyUnicode_EncodeUTF32()``                ``PyUnicode_AsUTF32String()`` \(3)
``PyUnicode_EncodeUnicodeEscape()``        ``PyUnicode_AsUnicodeEscapeString()``
``PyUnicode_EncodeRawUnicodeEscape()``     ``PyUnicode_AsRawUnicodeEscapeString()``
``PyUnicode_EncodeCharmap()``              ``PyUnicode_AsCharmapString()`` \(1)
``PyUnicode_TranslateCharmap()``           ``PyUnicode_Translate()``
``PyUnicode_EncodeDecimal()``              \(4)
``PyUnicode_TransformDecimalToASCII()``    \(4)
=========================================  ==========================================

Notes:

(1)
   The ``const char *errors`` parameter is missing.

(2)
   There is no public alternative API, but users can use the generic
   ``PyUnicode_AsEncodedString()`` instead.

(3)
   The ``const char *errors, int byteorder`` parameters are missing.

(4)
   There is no direct replacement, but ``Py_UNICODE_TODECIMAL``
   can be used instead. CPython uses
   ``_PyUnicode_TransformDecimalAndSpaceToASCII`` for converting
   from Unicode to numbers instead.


Plan
====

Python 3.9
----------

Add ``Py_DEPRECATED(3.3)`` to the following APIs. This change has already
been committed [3]_. All other APIs were already marked
``Py_DEPRECATED(3.3)``.

* ``PyUnicode_EncodeDecimal()``
* ``PyUnicode_TransformDecimalToASCII()``

Document all APIs as "will be removed in version 3.11".


Python 3.11
-----------

These APIs are removed.

* ``PyUnicode_Encode()``
* ``PyUnicode_EncodeASCII()``
* ``PyUnicode_EncodeLatin1()``
* ``PyUnicode_EncodeUTF7()``
* ``PyUnicode_EncodeUTF8()``
* ``PyUnicode_EncodeUTF16()``
* ``PyUnicode_EncodeUTF32()``
* ``PyUnicode_EncodeUnicodeEscape()``
* ``PyUnicode_EncodeRawUnicodeEscape()``
* ``PyUnicode_EncodeCharmap()``
* ``PyUnicode_TranslateCharmap()``
* ``PyUnicode_EncodeDecimal()``
* ``PyUnicode_TransformDecimalToASCII()``


Alternative ideas
=================

Instead of just removing deprecated APIs, we may be able to use their
names with different signatures.


Make some private APIs public
------------------------------

``PyUnicode_EncodeUTF7()`` doesn't have public alternative APIs.

Some APIs have alternative public APIs. But they are missing
``const char *errors`` or ``int byteorder`` parameters.

We can rename some private APIs and make them public to cover missing
APIs and parameters.

=============================  ================================
Rename to                      Rename from
=============================  ================================
``PyUnicode_EncodeASCII()``    ``_PyUnicode_AsASCIIString()``
``PyUnicode_EncodeLatin1()``   ``_PyUnicode_AsLatin1String()``
``PyUnicode_EncodeUTF7()``     ``_PyUnicode_EncodeUTF7()``
``PyUnicode_EncodeUTF8()``     ``_PyUnicode_AsUTF8String()``
``PyUnicode_EncodeUTF16()``    ``_PyUnicode_EncodeUTF16()``
``PyUnicode_EncodeUTF32()``    ``_PyUnicode_EncodeUTF32()``
=============================  ================================

Pros:

* We have a more consistent API set.

Cons:

* We have more public APIs to maintain.
* Existing public APIs are enough for most use cases, and
  ``PyUnicode_AsEncodedString()`` can be used in other cases.


Replace ``Py_UNICODE*`` with ``Py_UCS4*``
-----------------------------------------

We can replace ``Py_UNICODE`` (a typedef of ``wchar_t``) with
``Py_UCS4``. Since builtin codecs support UCS-4, we don't need to
convert a ``Py_UCS4*`` string to a Unicode object.


Pros:

* We have a more consistent API set.
* User can encode UCS-4 string in C without creating Unicode object.

Cons:

* We have more public APIs to maintain.
* Applications which use UTF-8 or UTF-16 can not use these APIs
  anyway.
* Other Python implementations may not have builtin codec for UCS-4.
* If we change the Unicode internal representation to UTF-8, we need
  to keep UCS-4 support only for these APIs.
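
To make the "encode a UCS-4 string in C without creating a Unicode object"
idea concrete, here is a standalone sketch of a strict UTF-8 encoder that
reads a UCS-4 buffer directly. The function ``ucs4_encode_utf8`` and its
signature are assumptions made for illustration; this is not CPython code and
not a proposed API, and it uses ``uint32_t`` in place of ``Py_UCS4`` so that
it compiles without ``Python.h``.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a CPython API: encode `len` UCS-4 code points
 * from `src` as UTF-8 into `dst`, which has room for `cap` bytes.
 * Returns the number of bytes written, or -1 if a code point is invalid
 * (a surrogate or above U+10FFFF) or if `dst` is too small. */
static ptrdiff_t
ucs4_encode_utf8(const uint32_t *src, size_t len,
                 unsigned char *dst, size_t cap)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t ch = src[i];
        size_t need;

        /* Work out how many bytes this code point needs. */
        if (ch < 0x80) {
            need = 1;
        }
        else if (ch < 0x800) {
            need = 2;
        }
        else if (ch < 0x10000) {
            if (ch >= 0xD800 && ch <= 0xDFFF) {
                return -1;  /* lone surrogate: strict mode rejects it */
            }
            need = 3;
        }
        else if (ch <= 0x10FFFF) {
            need = 4;
        }
        else {
            return -1;      /* outside the Unicode range */
        }
        if (cap - out < need) {
            return -1;      /* output buffer too small */
        }

        switch (need) {
        case 1:
            dst[out++] = (unsigned char)ch;
            break;
        case 2:
            dst[out++] = (unsigned char)(0xC0 | (ch >> 6));
            dst[out++] = (unsigned char)(0x80 | (ch & 0x3F));
            break;
        case 3:
            dst[out++] = (unsigned char)(0xE0 | (ch >> 12));
            dst[out++] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
            dst[out++] = (unsigned char)(0x80 | (ch & 0x3F));
            break;
        default:
            dst[out++] = (unsigned char)(0xF0 | (ch >> 18));
            dst[out++] = (unsigned char)(0x80 | ((ch >> 12) & 0x3F));
            dst[out++] = (unsigned char)(0x80 | ((ch >> 6) & 0x3F));
            dst[out++] = (unsigned char)(0x80 | (ch & 0x3F));
            break;
        }
    }
    return (ptrdiff_t)out;
}
```

A real API of this shape would also need an ``errors`` parameter and would
return a bytes object, which is part of the maintenance cost listed above.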


Replace ``Py_UNICODE*`` with ``wchar_t*``
-----------------------------------------

We can replace ``Py_UNICODE`` with ``wchar_t``.

Pros:

* We have a more consistent API set.
* Backward compatible.

Cons:

* We have more public APIs to maintain.
* They are inefficient on platforms where ``wchar_t`` is UTF-16, because
  built-in codecs support only UCS-1, UCS-2, and UCS-4 input.
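
The inefficiency on UTF-16 platforms comes from an extra pass over the input:
surrogate pairs have to be combined into single code points before a codec
that expects UCS-1, UCS-2, or UCS-4 can consume the data. The sketch below
shows that kind of pass in isolation. The function name ``utf16_to_ucs4`` is
hypothetical, and this is an illustration of the technique rather than
CPython's actual conversion code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a CPython API: decode `len` UTF-16 code units
 * from `src` into UCS-4 code points in `dst`. `dst` must have room for
 * `len` entries. Returns the number of code points written.
 * Unpaired surrogates are copied through unchanged. */
static size_t
utf16_to_ucs4(const uint16_t *src, size_t len, uint32_t *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t ch = src[i];
        /* A high surrogate followed by a low surrogate forms one
         * code point above U+FFFF. */
        if (ch >= 0xD800 && ch <= 0xDBFF && i + 1 < len
            && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            ch = 0x10000 + ((ch - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            i++;  /* consume the low surrogate as well */
        }
        dst[out++] = ch;
    }
    return out;
}
```

The output may be shorter than the input, so the length cannot be known
without scanning the whole buffer first or over-allocating.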


Rejected ideas
==============

Using runtime warning
---------------------

These APIs don't release the GIL for now. Emitting a warning from
such APIs is not safe. See this example.

.. code-block::

   PyObject *u = PyList_GET_ITEM(list, i);  // u is borrowed reference.
   PyObject *b = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u),
           PyUnicode_GET_SIZE(u), NULL);
   // Assumes u is still living reference.
   PyObject *t = PyTuple_Pack(2, u, b);
   Py_DECREF(b);
   return t;

If we emit a Python warning from ``PyUnicode_EncodeUTF8()``, warning
filters and other threads may change the ``list``, and ``u`` can be
a dangling reference after ``PyUnicode_EncodeUTF8()`` returns.

Additionally, since we are not changing behavior but removing C APIs,
a runtime ``DeprecationWarning`` might not be helpful for Python
developers. We should warn extension developers instead.


Discussions
===========

* `Plan to remove Py_UNICODE APIs except PEP 623
  <https://mail.python.org/archives/list/python-dev@python.org/thread/S7KW2U6IGXZFBMGS6WSJB26NZIBW4OLE/#S7KW2U6IGXZFBMGS6WSJB26NZIBW4OLE>`_
* `bpo-41123: Remove Py_UNICODE APIs except PEP 623
  <https://bugs.python.org/issue41123>`_


References
==========

.. [1] Source package list chosen from top 4000 PyPI packages.
   (https://github.com/methane/notes/blob/master/2020/wchar-cache/package_list.txt)

.. [2] pyodbc -- Don't use PyUnicode_Encode API #792
   (https://github.com/mkleehammer/pyodbc/pull/792)

.. [3] Uncomment Py_DEPRECATED for Py_UNICODE APIs (GH-21318)
   (https://github.com/python/cpython/commit/9c3840870814493fed62e140cfa43c2883e12181)


Copyright
=========

This document has been placed in the public domain.

--
Inada Naoki <songofacandy@gmail.com>
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/THXVM7FZVT56B7CPEDIYKJG6VMAYIEK5/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: PEP 624: Remove Py_UNICODE encoder APIs
On Tue, Jul 7, 2020 at 17:21, Inada Naoki <songofacandy@gmail.com> wrote:
> This PEP proposes to remove deprecated ``Py_UNICODE`` encoder APIs in
> Python 3.11:

Overall, I like the plan. IMHO 3.11 is a reasonable target version,
since among the top 4000 projects, only 2 are affected and it is easy to
fix them.


> Python 3.9
> ----------
>
> Add ``Py_DEPRECATED(3.3)`` to following APIs. This change is committed
> already [3]_. All other APIs have been marked ``Py_DEPRECATED(3.3)``
> already.
>
> * ``PyUnicode_EncodeDecimal()``
> * ``PyUnicode_TransformDecimalToASCII()``.
>
> Document all APIs as "will be removed in version 3.11".

I guess that if the release manager is not OK with adding the two remaining
Py_DEPRECATED() warnings, they can be added to 3.10 instead.


> Make some private APIs public
>
> ``PyUnicode_EncodeUTF7()`` doesn't have public alternative APIs.
>
> Some APIs have alternative public APIs. But they are missing
> ``const char *errors`` or ``int byteorder`` parameters.

If needed, new functions can be added independently of this PEP.



> Using runtime warning
> ---------------------
>
> These APIs doesn't release GIL for now. Emitting a warning from
> such APIs is not safe. See this example.
>
> .. code-block::
>
> PyObject *u = PyList_GET_ITEM(list, i); // u is borrowed reference.
> PyObject *b = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u),
> PyUnicode_GET_SIZE(u), NULL);
> // Assumes u is still living reference.
> PyObject *t = PyTuple_Pack(2, u, b);
> Py_DECREF(b);
> return t;
>
> If we emit Python warning from ``PyUnicode_EncodeUTF8()``, warning
> filters and other threads may change the ``list`` and ``u`` can be
> a dangling reference after ``PyUnicode_EncodeUTF8()`` returned.
>
> Additionally, since we are not changing behavior but removing C APIs,
> runtime ``DeprecationWarning`` might not helpful for Python
> developers. We should warn to extension developers instead.

DeprecationWarning is hidden by default: users would not be impacted.

I don't think that encoding functions are special enough to skip these
warnings. I think that it's reasonable to change the behavior on these
deprecated functions to emit a warning.

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/M523JOR2B36QYIWO4LMS4QPUDFF23E3T/
Re: PEP 624: Remove Py_UNICODE encoder APIs
Hi Inada-san,

I am currently too busy with EuroPython to participate in longer
discussions. FWIW: I intend to continue after EuroPython.

In any case, thanks for writing up the PEP. Could you please add my
points about:

- the fact that the encode APIs encode from a Unicode buffer
  to a bytes object; this is an important fact, since the removal
  removes access to this codec functionality for extensions

- PyUnicode_AsEncodedString() is not a proper alternative, since
  it requires creating a temporary PyUnicode object, which is
  inefficient and wastes memory

- the maintenance effect mentioned in the PEP does not really
  materialize, since the underlying functionality still exists
  in the codecs - only access to the functionality is removed

- keeping just the generic PyUnicode_Encode() API would be a
  compromise

- if we remove the codec specific PyUnicode_Encode*() APIs, why
  are we still keeping the special PyUnicode_Decode*() APIs ?

- the deprecations were just done because the Py_UNICODE data
  type was replaced by a hybrid type. Using this as an argument
  for removing functionality is not really good practice, when
  there are ways to continue exposing the functionality using other
  data types.

I am still strongly -1 on removing all encoding APIs without
at least some upgrade path for existing code to use and keeping
the API symmetric.

Cheers,
--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts
>>> Python Projects, Coaching and Consulting ... http://www.egenix.com/
>>> Python Database Interfaces ... http://products.egenix.com/
>>> Plone/Zope Database Interfaces ... http://zope.egenix.com/
________________________________________________________________________

::: We implement business ideas - efficiently in both time and costs :::

eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/
http://www.malemburg.com/


Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/QT7QVAKF36Y2GOXNPXZ5AGKWGKZI3XT7/
Re: PEP 624: Remove Py_UNICODE encoder APIs
On Thu, Jul 9, 2020 at 5:46 AM M.-A. Lemburg <mal@egenix.com> wrote:
> - the fact that the encode APIs encoding from a Unicode buffer
> to a bytes object; this is an important fact, since the removal
> removes access to this codec functionality for extensions
>
> - PyUnicode_AsEncodedString() is not a proper alternative, since
> it requires to create a temporary PyUnicode object, which is
> inefficient and wastes memory

I covered your points in the "Alternative ideas > Replace Py_UNICODE*
with Py_UCS4*" section, where I wrote "User can encode UCS-4 string in C
without creating Unicode object."

https://www.python.org/dev/peps/pep-0624/#replace-py-unicode-with-py-ucs4

Note that the current Py_UNICODE* encoder APIs create temporary
PyUnicode objects, so they are already inefficient and waste memory.
Py_UNICODE* may be UTF-16 on some platforms (e.g. Windows), and builtin
codecs don't support UTF-16 input.


>
> - the maintenance effect mentioned in the PEP does not really
> materialize, since the underlying functionality still exists
> in the codecs - only access to the functionality is removed
>

In the same section, I described the maintenance cost as follows:

* Other Python implementations may not have builtin codec for UCS-4.
* If we change the Unicode internal representation to UTF-8, we need
to keep UCS-4 support only for these APIs.

> - keeping just the generic PyUnicode_Encode() API would be a
> compromise
>
> - if we remove the codec specific PyUnicode_Encode*() APIs, why
> are we still keeping the specisl PyUnicde_Decode*() APIs ?
>

OK, I will add a "Discussions" section. (I don't like "FAQ" because some
questions are important even if they are not "frequently" asked.)

The quick answer is:

* They are stable ABI. (Py_UNICODE is excluded from the stable ABI.)
* Decoding from char* is a more common and generic use case than encoding
  from Py_UNICODE*.
* Other Python implementations using UTF-8 as their internal representation
  can implement them easily.

But I'm not opposed to removing them (especially for the minor UTF-7 codec).
It is just out of scope for this PEP.


> - the deprecations were just done because the Py_UNICODE data
> type was replaced by a hybrid type. Using this as an argument
> for removing functionality is not really good practice, when
> these are ways to continue exposing the functionality using other
> data types.

I hope the "Replace Py_UNICODE* with Py_UCS4*" section describes this.

Regards,

--
Inada Naoki <songofacandy@gmail.com>
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/N4F5WLSNYUWQO4FEPIOOUCHG4ZFLQVLI/
Re: PEP 624: Remove Py_UNICODE encoder APIs
Unless I'm missing something, part of M.-A. Lemburg's objection is:

1. The wchar_t type is itself an important interoperability story in C. (I'm not sure if this includes the ability, at compile time, to define wchar_t as either of two widths.)

2. The ability to work directly with wchar_t without a round-trip in/out of python format is an important feature that CPython has provided for C integrators.

3. The above support can be kept even without the wchar_t* member ... so saving the extra space on each string instance does not require dropping this support.

-jJ
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/L3UWQN553EAR7KQSMG4KPI4PP3M6Y4ZX/
Re: PEP 624: Remove Py_UNICODE encoder APIs
On Thu, Jul 9, 2020 at 10:13 PM Jim J. Jewett <jimjjewett@gmail.com> wrote:
>
> Unless I'm missing something, part of M.-A. Lemburg's objection is:
>
> 1. The wchar_t type is itself an important interoperability story in C. (I'm not sure if this includes the ability, at compile time, to define wchar_t as either of two widths.)
>

Of course. But wchar_t* is not the only way to use Unicode in C.
UTF-8 is the most common way to use Unicode in C these days
(except for Java, .NET, and the Windows API).
So the importance of the wchar_t* APIs is relative, not absolute.

In other words, why don't we have an encode API with direct UTF-8 input?
Is there any evidence wchar_t* is much more important than UTF-8?


> 2. The ability to work directly with wchar_t without a round-trip in/out of python format is an important feature that CPython has provided for C integrators.
>

Note that the current API *does* the round-trip. For example:
https://github.com/python/cpython/blob/61bb24a270d15106decb1c7983bf4c2831671a75/Objects/unicodeobject.c#L5631-L5644

Users cannot use the API without initializing the Python VM.
Users cannot avoid the time and space cost of the round-trip.
So removing these APIs doesn't take away any capability.


> 3. The above support can be kept even without the wchar_t* member ... so saving the extra space on each string instance does not require dropping this support.
>

This is why I split PEP 623 and PEP 624. I never said removing the
wchar_t* member is the motivation for PEP 624.

Regards,
--
Inada Naoki <songofacandy@gmail.com>
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/UOE2ZYNSB7UEUTEGH27LB5IWPDYO5IDY/
Re: PEP 624: Remove Py_UNICODE encoder APIs
Hi, Lemburg.

Thank you for organizing EuroPython 2020.
I enjoyed watching some sessions from home.

I think the current PEP 624 covers all your points and is ready for
Steering Council discussion.
Would you like to review the PEP before that?

Regards,


On Thu, Jul 9, 2020 at 8:19 AM Inada Naoki <songofacandy@gmail.com> wrote:
>
> On Thu, Jul 9, 2020 at 5:46 AM M.-A. Lemburg <mal@egenix.com> wrote:
> > - the fact that the encode APIs encoding from a Unicode buffer
> > to a bytes object; this is an important fact, since the removal
> > removes access to this codec functionality for extensions
> >
> > - PyUnicode_AsEncodedString() is not a proper alternative, since
> > it requires to create a temporary PyUnicode object, which is
> > inefficient and wastes memory
>
> I wrote your points in the "Alternative Idea > Replace Py_UNICODE*
> with Py_UCS4* "
> section. I wrote "User can encode UCS-4 string in C without creating
> Unicode object." in it.
>
> https://www.python.org/dev/peps/pep-0624/#replace-py-unicode-with-py-ucs4
>
> Note that the current Py_UNICODE* encoder APIs create temporary
> PyUnicode objects.
> They are inefficient and wastes memory now. Py_UNICODE* may be UTF-16 on some
> platforms (e.g. Windows) and builtin codecs don't support UTF-16 input.
>
>
> >
> > - the maintenance effect mentioned in the PEP does not really
> > materialize, since the underlying functionality still exists
> > in the codecs - only access to the functionality is removed
> >
>
> In the same section, I described the maintenance cost as below.
>
> * Other Python implementations may not have builtin codec for UCS-4.
> * If we change the Unicode internal representation to UTF-8, we need
> to keep UCS-4 support only for these APIs.
>
> > - keeping just the generic PyUnicode_Encode() API would be a
> > compromise
> >
> > - if we remove the codec specific PyUnicode_Encode*() APIs, why
> > are we still keeping the specisl PyUnicde_Decode*() APIs ?
> >
>
> OK, I will add "Discussions" section. (I don't like "FAQ" because some question
> are important even if it is not "frequently" asked.)
>
> Quick answer is:
>
> * They are stable ABI. (Py_UNICODE is excluded from stable ABI).
> * Decoding from char* is more common and generic use case than encoding from
> Py_UNICODE*.
> * Other Python implementations using UTF-8 as internal representation
> can implement
> it easily.
>
> But I'm not opposite to remove it (especially for minor UTF-7 codec).
> It is just out of scope of this PEP.
>
>
> > - the deprecations were just done because the Py_UNICODE data
> > type was replaced by a hybrid type. Using this as an argument
> > for removing functionality is not really good practice, when
> > there are ways to continue exposing the functionality using other
> > data types.
>
> I hope the "Replace Py_UNICODE* with Py_UCS4* " section describes this.
>
> Regards,
>
> --
> Inada Naoki <songofacandy@gmail.com>



--
Inada Naoki <songofacandy@gmail.com>
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/LXS6SXGX3HADR2GHWWC3C4Q3UGN4M2CR/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: PEP 624: Remove Py_UNICODE encoder APIs [ In reply to ]
Hi Inada-san,

thanks for attending EuroPython. I won't be back online until
next Wednesday. Would it be possible to wait until then to continue
the discussion ?

Thanks,
--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts
>>> Python Projects, Coaching and Consulting ... http://www.egenix.com/
>>> Python Database Interfaces ... http://products.egenix.com/
>>> Plone/Zope Database Interfaces ... http://zope.egenix.com/
________________________________________________________________________

::: We implement business ideas - efficiently in both time and costs :::

eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/
http://www.malemburg.com/
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/WZYG5X3MMJX6B7LWO6FXIJEORSYJSQYK/
Re: PEP 624: Remove Py_UNICODE encoder APIs [ In reply to ]
On Tue, Aug 4, 2020 at 3:31 PM M.-A. Lemburg <mal@egenix.com> wrote:
>
> Hi Inada-san,
>
> thanks for attending EuroPython. I won't be back online until
> next Wednesday. Would it be possible to wait until then to continue
> the discussion ?
>

Of course. The PEP is for Python 3.11. We have a lot of time.

Bests,
--
Inada Naoki <songofacandy@gmail.com>
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/YSRQWCOGXHFL6BOYBAFGW72YOTRII5AR/
Re: PEP 624: Remove Py_UNICODE encoder APIs [ In reply to ]
Hi, Lemburg.

I want to send the PEP to the SC.
I think I have covered all your points in the PEP. Would you review it?

Regards,


--
Inada Naoki <songofacandy@gmail.com>
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/E2KSOHWSI5H2YAUP7LLLRUABBYAH64BW/
Re: PEP 624: Remove Py_UNICODE encoder APIs [ In reply to ]
Hi Inada-san,

thank you for adding some comments, but they are not really capturing
what I think is missing:

"""
Removing these APIs removes ability to use codec without temporary Unicode.

Codecs can not encode Unicode buffer directly without temporary Unicode
object since Python 3.3. All these APIs creates temporary Unicode object for
now. So removing them doesn't reduce any abilities.
"""

The point is that while the decoders allow going from a C object
to a Python object directly, we are missing a way to do the same
for the encoders, since the Python 3.3 change in the Unicode internals.

At the very least, we should have such APIs for going from wchar_t*
to a Python object.

The alternatives you provide all require creating an intermediate
Python object for this purpose. The APIs you want to remove do that
as well, but that's not the point. The point is to expose the codecs'
encode mechanism which is available in the C code, but currently
not exposed via C APIs, e.g. ucs4lib_utf8_encode().

It would be a breaking change, but those APIs in your list could
simply be changed from using Py_UNICODE to using wchar_t instead
and then interface directly to the internal functions we have for
the encoders.

That would keep extensions working after a recompile, since
Py_UNICODE is already a typedef to wchar_t.

Thanks,
--
Marc-Andre Lemburg
eGenix.com
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/VT2J6GC7ED4PTUCU5QO6SLL4PAQ6XEKL/
Re: PEP 624: Remove Py_UNICODE encoder APIs [ In reply to ]
On Mon, Feb 1, 2021 at 4:47 PM M.-A. Lemburg <mal@egenix.com> wrote:
> At the very least, we should have such APIs for going from wchar_t*
> to a Python object.
>
> The alternatives you provide all require creating an intermediate
> Python object for this purpose.

We cannot optimize all use cases. IMO we should only optimize
conversions between char* and Python object.

I don't see the need for two conversions (char* => Python and then
Python => wchar_t*) as an issue if you need wchar_t*.

Objects/unicodeobject.c is already very complex with specialization
for ASCII, Py_UCS1 (latin1), Py_UCS2 and Py_UCS4 kinds: 16k lines of C
code. I would prefer to make it simpler rather than more complex.

Internally, functions like PyUnicode_EncodeLatin1() already do the two
conversions. So it's not like the PEP has any impact on performance.
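
As a rough illustration of those two conversions, here is a minimal sketch driven from Python through ctypes. Using ctypes is only a convenience for this example; a C extension would call the same two functions directly. It assumes CPython, and the helper name encode_wide_latin1 is hypothetical. PyUnicode_FromWideChar() builds the temporary str from the wchar_t buffer, and PyUnicode_AsLatin1String() then encodes it.

```python
import ctypes

api = ctypes.pythonapi  # handle to the C API of the running CPython

# PyObject *PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
api.PyUnicode_FromWideChar.argtypes = [ctypes.c_wchar_p, ctypes.c_ssize_t]
api.PyUnicode_FromWideChar.restype = ctypes.py_object

# PyObject *PyUnicode_AsLatin1String(PyObject *unicode)
api.PyUnicode_AsLatin1String.argtypes = [ctypes.py_object]
api.PyUnicode_AsLatin1String.restype = ctypes.py_object


def encode_wide_latin1(wide_buffer, length):
    """Hypothetical helper: wchar_t* -> temporary str -> Latin-1 bytes."""
    temporary = api.PyUnicode_FromWideChar(wide_buffer, length)  # conversion 1
    return api.PyUnicode_AsLatin1String(temporary)               # conversion 2


text = "h\u00e9llo"
buf = ctypes.create_unicode_buffer(text)  # wchar_t array, NUL-terminated
encoded = encode_wide_latin1(buf, len(text))
```

The same two-step shape applies to the other codecs, for example through PyUnicode_AsUTF8String() or PyUnicode_AsEncodedString().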


> That would keep extensions working after a recompile, since
> Py_UNICODE is already a typedef to wchar_t.

Extensions should not use Py_UNICODE*/wchar_t*.

Can you explain where the wchar_t* type is appropriate and how two
conversions are a performance bottleneck?

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/U6V6XWMLPTSNMLDQWRWBVPNTVG6SF5F6/
Re: PEP 624: Remove Py_UNICODE encoder APIs [ In reply to ]
On 01.02.2021 17:10, Victor Stinner wrote:
> On Mon, Feb 1, 2021 at 4:47 PM M.-A. Lemburg <mal@egenix.com> wrote:
>> At the very least, we should have such APIs for going from wchar_t*
>> to a Python object.
>>
>> The alternatives you provide all require creating an intermediate
>> Python object for this purpose.
>
> We cannot optimize all use cases. IMO we should only optimize
> conversions between char* and Python object.
>
> I don't see the need for two conversions (char* => Python and then
> Python => wchar_t*) as an issue if you need wchar_t*.

The C code is already there, but it got hidden away in the
Python 3.3 change to new internals.

All that needs to be done is remove the intermediate Python
Unicode object creation and have those encoder APIs again
interface to the native C code.

> Objects/unicodeobject.c is already very complex with specialization
> for ASCII, Py_UCS1 (latin1), Py_UCS2 and Py_UCS4 kinds: 16k lines of C
> code. I would prefer to make it simpler than more complex.
>
> Internally, functions like PyUnicode_EncodeLatin1() already do the two
> conversions. So it's not like the PEP has any impact on performance.

Before Python 3.3 all those APIs interfaced directly to the
C codec functions. The introduction of an intermediate Python
Unicode object was just done as a quick work-around, even
though it was not really needed, since Python 3.3 did not
remove the C code of the encoders.

>> That would keep extensions working after a recompile, since
>> Py_UNICODE is already a typedef to wchar_t.
>
> Extensions should not use Py_UNICODE*/wchar_t*.

They should not use Py_UNICODE.

wchar_t is standard C and is in widespread use in C code for
storing Unicode data. This was one of the main reasons for
introducing UCS4 Python versions for Linux in the mid 2000s,
since Linux apps used 4 byte wchar_t as native storage format.

My point is that extensions would just need a recompile
with the change from Py_UNICODE to wchar_t, since Py_UNICODE
and wchar_t are already the same thing in Python 3.3+.

> Can you explain where wchar_t* type is appropriate and how two
> conversions is a performance bottleneck?

If an extension has a wchar_t* string, it should be easy
to convert this into a Python bytes object for use in Python.

Just like it should be easy to go from a char* string to
a Python str object.

The PEP breaks this symmetry by removing access to the
encoder implementations.
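
For comparison, the decoding direction mentioned above can be sketched as a single call. This is only an illustration using ctypes against a running CPython; the byte string stands in for a char* buffer that an extension would hold, and PyUnicode_DecodeUTF8() is one of the decoder APIs that PEP 624 leaves in place.

```python
import ctypes

api = ctypes.pythonapi  # handle to the C API of the running CPython

# PyObject *PyUnicode_DecodeUTF8(const char *s, Py_ssize_t size,
#                                const char *errors)
api.PyUnicode_DecodeUTF8.argtypes = [
    ctypes.c_char_p,
    ctypes.c_ssize_t,
    ctypes.c_char_p,
]
api.PyUnicode_DecodeUTF8.restype = ctypes.py_object

raw = b"caf\xc3\xa9"  # stands in for a char* buffer owned by an extension
decoded = api.PyUnicode_DecodeUTF8(raw, len(raw), b"strict")  # one step to str
```

No intermediate Python object is needed here, whereas going from wchar_t* to bytes currently passes through a temporary str.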

--
Marc-Andre Lemburg
eGenix.com
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/FSUPT6B26VJT7S6UCW4RYWRQ3LYLUINU/
Re: PEP 624: Remove Py_UNICODE encoder APIs [ In reply to ]
On Mon, Feb 1, 2021 at 5:39 PM M.-A. Lemburg <mal@egenix.com> wrote:
> The C code is already there, but it got hidden away in the
> Python 3.3 change to new internals.

Well, we are not in agreement and it's ok. Your objection is written
in the PEP. IMO it's now up to the Steering Council to decide if the
overall PEP is ok or not. The PEP itself is now complete and lists
advantages and drawbacks.

Victor
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/VUT2T2VJUFXE57YN4VFHSTHTDWR6MRHP/
Re: PEP 624: Remove Py_UNICODE encoder APIs [ In reply to ]
On Mon, 1 Feb 2021 17:39:16 +0100
"M.-A. Lemburg" <mal@egenix.com> wrote:
>
> They should not use Py_UNICODE.
>
> wchar_t is standard C and is in wide spread use in C code for
> storing Unicode data.

Do you have any data points about "wide spread use"?

I work in C++ daily and don't see any "wide spread use" of wchar_t (or
its C++ cousin std::wstring). Modern APIs assume bytestrings and UTF-8
encoding.

Regards

Antoine.

_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/QGSPEEYFOYZR6PVPH5NOQWF4HMHVNTP6/
Re: PEP 624: Remove Py_UNICODE encoder APIs [ In reply to ]
On 01.02.2021 17:51, Victor Stinner wrote:
> On Mon, Feb 1, 2021 at 5:39 PM M.-A. Lemburg <mal@egenix.com> wrote:
>> The C code is already there, but it got hidden away in the
>> Python 3.3 change to new internals.
>
> Well, we are not in agreement and it's ok. Your objection is written
> in the PEP. IMO it's now up to the Steering Council to decide if the
> overall PEP is ok or not. The PEP itself is now complete and lists
> advantages and drawbacks.

Please read my reply to Inada-san. If the PEP were complete and ok, I
would not have written the email.

The fix is pretty simple, doesn't add a lot more code and gets
us the symmetry back that I had put into the Unicode C API when
I created this back in 2000.

--
Marc-Andre Lemburg
eGenix.com
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/P5I3S4KKM3FMIMGQAGO67PPEX5VIEL6X/
Re: PEP 624: Remove Py_UNICODE encoder APIs [ In reply to ]
On Mon, Feb 1, 2021 at 5:58 PM M.-A. Lemburg <mal@egenix.com> wrote:
> The fix is pretty simple, doesn't add a lot more code and gets
> us the symmetry back that I had put into the Unicode C API when
> I created this back in 2000.

This sounds like a completely different PEP than PEP 624 (which aims
to remove code, not add code). I suggest you propose your own PEP.

Victor
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/VC6E7JMITO27PTYEUFAAD2KOH7BNAWNA/
Re: PEP 624: Remove Py_UNICODE encoder APIs [ In reply to ]
On 01/02/2021 17.39, M.-A. Lemburg wrote:
>> Can you explain where wchar_t* type is appropriate and how two
>> conversions is a performance bottleneck?
>
> If an extension has a wchar_t* string, it should be easy
> to convert this in to a Python bytes object for use in Python.

How much software actually uses wchar_t these days and interfaces with
Python? Do you have examples for software that uses wchar_t and would
benefit from wchar_t support in Python?

I did a quick search for wcslen in all shared libraries and binaries on
my system. It's a good indicator how many programs actually use wchar_t.
126 out of more than 9,000 shared libraries and binaries contain the
string "wcslen". The only hit for PyUnicode_AsWideCharString was
libpypy3-c.so...

(Fedora has unified /usr and /lib64, e.g. /bin -> /usr/bin)

$ ls /usr/bin/ /usr/sbin/ | grep -v python | wc -l
4264
$ grep -R wcslen /usr/bin/ /usr/sbin/ | grep -v python | wc -l
92

$ find /usr/lib64/ -name '*.so' -not -name '*python*' | wc -l
5478
$ find /usr/lib64/ -name '*.so' -not -name '*python*' | xargs grep wcslen | wc -l
34
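
The same kind of rough survey can be sketched in Python for systems without these shell tools. The function name count_files_containing is hypothetical, and like the grep commands above it only checks whether the byte string appears somewhere in a file, which is a coarse indicator rather than proof that wchar_t is used.

```python
import os


def count_files_containing(root, needle):
    """Return (files_scanned, files_with_needle) for regular files under root."""
    scanned = 0
    hits = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as handle:
                    data = handle.read()  # whole file in memory: fine for a sketch
            except OSError:
                continue  # unreadable file: leave it out of both counts
            scanned += 1
            if needle in data:
                hits += 1
    return scanned, hits
```

Calling count_files_containing("/usr/lib64", b"wcslen") would give numbers comparable to the ones above, although it does not apply the '*.so' and python name filters.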

Christian
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/M6JC5XCXL4ENTMTFR7SUKM7PDQO5KZPT/
Re: PEP 624: Remove Py_UNICODE encoder APIs [ In reply to ]
On Mon, 1 Feb 2021 at 17:19, Christian Heimes <christian@python.org> wrote:
> How much software actually uses wchar_t these days and interfaces with
> Python? Do you have examples for software that uses wchar_t and would
> benefit from wchar_t support in Python?

This is very much a drive-by comment (I haven't been following this
thread) so ignore me if this is already covered, but Windows APIs use
wchar_t extensively. I routinely work with wchar_t when interfacing
Windows API code and Python. But I have no idea what this PEP is
proposing to drop, so as long as someone has ensured that the PEP
won't adversely affect working with Windows APIs, I'm happy.

Paul
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/STBXVKV7SB7M55AIL7D34IYKXGTMFWCM/
Re: PEP 624: Remove Py_UNICODE encoder APIs [ In reply to ]
On 2/1/2021 5:16 PM, Christian Heimes wrote:
> On 01/02/2021 17.39, M.-A. Lemburg wrote:
>>> Can you explain where wchar_t* type is appropriate and how two
>>> conversions is a performance bottleneck?
>>
>> If an extension has a wchar_t* string, it should be easy
>> to convert this in to a Python bytes object for use in Python.
>
> How much software actually uses wchar_t these days and interfaces with
> Python? Do you have examples for software that uses wchar_t and would
> benefit from wchar_t support in Python?
>
> I did a quick search for wcslen in all shared libraries and binaries on
> my system....

Yeah, you searched the wrong kind of system ;)

Pick up a Windows machine, cross-platform code that originated on
Windows, anything that interoperates with Java or .NET as well, or uses
wxWidgets.

I'm not defending the choice of wchar_t over UTF-8 (but I can: most of
these systems chose Unicode before UTF-8 was invented and never took the
backwards-incompatible change because they were so popular), but if we
want to pragmatically weigh the needs of our users above our desire for
purity, then we should try and support both equally wherever possible.

Cheers,
Steve
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/GYUWANE7IMPU45A257UYQD4ZGUDE6QUX/
Re: PEP 624: Remove Py_UNICODE encoder APIs [ In reply to ]
On Tue, Feb 2, 2021 at 12:43 AM M.-A. Lemburg <mal@egenix.com> wrote:
>
> Hi Inada-san,
>
> thank you for adding some comments, but they are not really capturing
> what I think is missing:
>
> """
> Removing these APIs removes ability to use codec without temporary Unicode.
>
> Codecs can not encode Unicode buffer directly without temporary Unicode
> object since Python 3.3. All these APIs creates temporary Unicode object for
> now. So removing them doesn't reduce any abilities.
> """
>
> The point is that while the decoders allow going from a C object
> to a Python object directly, we are missing a way to do the same
> for the encoders, since the Python 3.3 change in the Unicode internals.
>
> At the very least, we should have such APIs for going from wchar_t*
> to a Python object.

We already have PyUnicode_FromWideChar(). So I assume you mean
"wchar_t* to Python bytes object".

>
> The alternatives you provide all require creating an intermediate
> Python object for this purpose. The APIs you want to remove do that
> as well, but that's not the point. The point is to expose the codecs'
> encode mechanism which is available in the C code, but currently
> not exposed via C APIs, e.g. ucs4lib_utf8_encode().
>
> It would be a breaking change, but those APIs in your list could
> simply be changed from using Py_UNICODE to using wchar_t instead
> and then interface directly to the internal functions we have for
> the encoders.
>

OK, I see codecs.h has three encoders.

* utf8_encode
* utf16_encode
* utf32_encode

But there are 13 encoders in my PEP:

PyUnicode_Encode()
PyUnicode_EncodeASCII()
PyUnicode_EncodeLatin1()
PyUnicode_EncodeUTF7()
PyUnicode_EncodeUTF8()
PyUnicode_EncodeUTF16()
PyUnicode_EncodeUTF32()
PyUnicode_EncodeUnicodeEscape()
PyUnicode_EncodeRawUnicodeEscape()
PyUnicode_EncodeCharmap()
PyUnicode_TranslateCharmap()
PyUnicode_EncodeDecimal()
PyUnicode_TransformDecimalToASCII()

Do you want to keep all of the encoders, or only these three?


> That would keep extensions working after a recompile, since
> Py_UNICODE is already a typedef to wchar_t.
>

That idea is written in the PEP already.
https://www.python.org/dev/peps/pep-0624/#replace-py-unicode-with-wchar-t

Regards,
--
Inada Naoki <songofacandy@gmail.com>
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/USUH2YDEXW64NQYGJPG2OOLEJS3NJLXG/
Re: PEP 624: Remove Py_UNICODE encoder APIs [ In reply to ]
On Tue, Feb 2, 2021 at 4:28 AM Steve Dower <steve.dower@python.org> wrote:
>
>
> I'm not defending the choice of wchar_t over UTF-8 (but I can: most of
> these systems chose Unicode before UTF-8 was invented and never took the
> backwards-incompatible change because they were so popular), but if we
> want to pragmatically weigh the needs of our users above our desire for
> purity, then we should try and support both equally wherever possible.
>

Note that we don't have a direct "utf8 (char*) to Python bytes object"
encoder API either.
If PEP 624 is accepted, utf8 and wchar_t* are treated equally.

So please don't think that PEP 624 neglects only wchar_t*.

Regards,
--
Inada Naoki <songofacandy@gmail.com>
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ZZLY6AFXYEQQ7PI6IXRNU3FWQ23MXPZU/
Re: PEP 624: Remove Py_UNICODE encoder APIs [ In reply to ]
On 02.02.2021 00:33, Inada Naoki wrote:
> On Tue, Feb 2, 2021 at 12:43 AM M.-A. Lemburg <mal@egenix.com> wrote:
>>
>> Hi Inada-san,
>>
>> thank you for adding some comments, but they are not really capturing
>> what I think is missing:
>>
>> """
>> Removing these APIs removes ability to use codec without temporary Unicode.
>>
>> Codecs can not encode Unicode buffer directly without temporary Unicode
>> object since Python 3.3. All these APIs creates temporary Unicode object for
>> now. So removing them doesn't reduce any abilities.
>> """
>>
>> The point is that while the decoders allow going from a C object
>> to a Python object directly, we are missing a way to do the same
>> for the encoders, since the Python 3.3 change in the Unicode internals.
>>
>> At the very least, we should have such APIs for going from wchar_t*
>> to a Python object.
>
> We already have PyUnicode_FromWideChar(). So I assume you mean
> "wchar_t* to Python bytes object".

Yes, that's what I meant. Encoding from wchar_t* to a Python bytes
object. This is what the encoder APIs all implement. They have become
less efficient with Python 3.3, but this can be resolved, while
at the same time removing Py_UNICODE and replacing it with wchar_t
in those encoder APIs.

>>
>> The alternatives you provide all require creating an intermediate
>> Python object for this purpose. The APIs you want to remove do that
>> as well, but that's not the point. The point is to expose the codecs'
>> encode mechanism which is available in the C code, but currently
>> not exposed via C APIs, e.g. ucs4lib_utf8_encode().
>>
>> It would be a breaking change, but those APIs in your list could
>> simply be changed from using Py_UNICODE to using wchar_t instead
>> and then interface directly to the internal functions we have for
>> the encoders.
>>
>
> OK, I see codecs.h has three encoders.
>
> * utf8_encode
> * utf16_encode
> * utf32_encode
>
> But there are 13 encoders in my PEP:
>
> PyUnicode_Encode()
> PyUnicode_EncodeASCII()
> PyUnicode_EncodeLatin1()
> PyUnicode_EncodeUTF7()
> PyUnicode_EncodeUTF8()
> PyUnicode_EncodeUTF16()
> PyUnicode_EncodeUTF32()
> PyUnicode_EncodeUnicodeEscape()
> PyUnicode_EncodeRawUnicodeEscape()
> PyUnicode_EncodeCharmap()
> PyUnicode_TranslateCharmap()
> PyUnicode_EncodeDecimal()
> PyUnicode_TransformDecimalToASCII()
>
> Do you want to keep all encoders? or 3 encoders?

We could keep all encoders, replacing Py_UNICODE with wchar_t
in the API.

For the ones where we have separate implementations
as private functions, we can move back to direct encoding.

For the others, we can keep using the temporary Unicode object
or refactor the code to expose the native encoders working
directly on the internal buffers as private functions
and then use those in the same way for direct encoding.

The Unicode API was meant and designed as a rich API, making
it easy to use and providing a complete set for extension
writers and CPython to use. I believe we should keep it that
way.

>> That would keep extensions working after a recompile, since
>> Py_UNICODE is already a typedef to wchar_t.
>>
>
> That idea is written in the PEP already.
> https://www.python.org/dev/peps/pep-0624/#replace-py-unicode-with-wchar-t

Right and I think this is a more workable approach than removing
APIs.

BTW: I don't understand this comment:
"They are inefficient on platforms wchar_t* is UTF-16. It is because
built-in codecs supports only UCS-1, UCS-2, and UCS-4 input."

Windows is one such platform. Java (indirectly) is another. They both
store UTF-16LE in those arrays and Python's codecs handle this just
fine.

--
Marc-Andre Lemburg
eGenix.com
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/PRFDSXHVNITI5PKQPI7DJJJ6DPIKRYM5/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: PEP 624: Remove Py_UNICODE encoder APIs
On Tue, Feb 2, 2021 at 7:37 PM M.-A. Lemburg <mal@egenix.com> wrote:
>
> >> That would keep extensions working after a recompile, since
> >> Py_UNICODE is already a typedef to wchar_t.
> >>
> >
> > That idea is written in the PEP already.
> > https://www.python.org/dev/peps/pep-0624/#replace-py-unicode-with-wchar-t
>
> Right and I think this is a more workable approach than removing
> APIs.
>
> BTW: I don't understand this comment:
> "They are inefficient on platforms wchar_t* is UTF-16. It is because
> built-in codecs supports only UCS-1, UCS-2, and UCS-4 input."
>
> Windows is one such platform. Java (indirectly) is another. They both
> store UTF-16LE in those arrays and Python's codecs handle this just
> fine.
>

I'm sorry that the section is not clear.

For example, if wchar_t* is UCS4, ucs4_utf8_encoder() can encode
wchar_t* into UTF-8.

But when wchar_t* is UTF-16, ucs2_utf8_encoder() can not handle
surrogate escape.
We need to use a temporary Unicode object. That is what "inefficient" means.
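
The difference can be sketched in Python, assuming the problem is the handling of UTF-16 surrogate pairs. The helpers `utf8_from_code_point`, `encode_per_unit`, and `encode_utf16_aware` are hypothetical names written for this example. They are not CPython's `ucs2_utf8_encoder()` or `ucs4_utf8_encoder()`, which are C functions with their own error handling.

```python
def utf8_from_code_point(cp):
    """Encode one code point to UTF-8 by hand (no surrogate checks)."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def encode_per_unit(units):
    """UCS-2 style: treat every 16-bit unit as a complete code point."""
    return b"".join(utf8_from_code_point(u) for u in units)


def encode_utf16_aware(units):
    """UTF-16 style: combine a high/low surrogate pair before encoding."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u < 0xDC00
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] < 0xE000:
            # Merge the pair into a single code point above U+FFFF.
            u = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 1
        out += utf8_from_code_point(u)
        i += 1
    return bytes(out)


# "A" followed by U+1F600 as UTF-16 code units (sample data for illustration).
units = [0x41, 0xD83D, 0xDE00]
per_unit = encode_per_unit(units)        # each surrogate encoded separately
pair_aware = encode_utf16_aware(units)   # well-formed UTF-8
```

The per-unit output encodes each surrogate on its own, which a strict UTF-8 decoder rejects. Only the pair-aware version yields valid UTF-8, which is why a plain UCS-2 encoder cannot be applied directly to UTF-16 input.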

I will update the section to explain this in more detail.

Regards,
--
Inada Naoki <songofacandy@gmail.com>
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/QUGBVLQNBFVNX25AEIL77WSFOHQES6LJ/
Re: PEP 624: Remove Py_UNICODE encoder APIs
On Tue, Feb 2, 2021 at 3:47 AM Inada Naoki <songofacandy@gmail.com> wrote:

> But when wchar_t* is UTF-16, ucs2_utf8_encoder() can not handle
> surrogate escape.
> We need to use a temporary Unicode object. That is what "inefficient"
> means.
>

Since real UCS-2 is effectively dead, maybe it should be flipped around:
Make UTF-16 be the efficient path and UCS-2 be the path that needs to
round-trip through Unicode. But I suppose that's out of scope for this PEP.

-Em
