New submission from Ma Lin <malincns@163.com>:
CJK encode/decode functions only have three error-handler fast-paths:
replace
ignore
strict
See the code: [1][2]
If use other built-in error-handlers, need to get the error-handler object, and call it with an Unicode Exception argument. See the code: [3]
But the error-handler object is not cached, it needs to be looked up from a dict every time, which is very inefficient.
Another possible optimization is to write fast-path for common error-handlers, Python has these built-in error-handlers:
strict
replace
ignore
backslashreplace
xmlcharrefreplace
namereplace
surrogateescape
surrogatepass (only for utf-8/utf-16/utf-32 family)
For example, maybe `xmlcharrefreplace` is heavily used in Web application, it can be implemented as a fast-path, so that no need to call the error-handler object every time.
Just like the `xmlcharrefreplace` fast-path in `PyUnicode_EncodeCharmap` [4].
[1] encode function:
https://github.com/python/cpython/blob/v3.9.0b4/Modules/cjkcodecs/multibytecodec.c#L192
[2] decode function:
https://github.com/python/cpython/blob/v3.9.0b4/Modules/cjkcodecs/multibytecodec.c#L347
[3] `call_error_callback` function:
https://github.com/python/cpython/blob/v3.9.0b4/Modules/cjkcodecs/multibytecodec.c#L82
[4] `xmlcharrefreplace` fast-path in `PyUnicode_EncodeCharmap`:
https://github.com/python/cpython/blob/v3.9.0b4/Objects/unicodeobject.c#L8662
----------
components: Unicode
messages: 373871
nosy: ezio.melotti, malin, vstinner
priority: normal
severity: normal
status: open
title: Inefficient error-handle for CJK encodings
type: performance
versions: Python 3.10
_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue41330>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com
CJK encode/decode functions only have three error-handler fast-paths:
replace
ignore
strict
See the code: [1][2]
If use other built-in error-handlers, need to get the error-handler object, and call it with an Unicode Exception argument. See the code: [3]
But the error-handler object is not cached, it needs to be looked up from a dict every time, which is very inefficient.
Another possible optimization is to write fast-path for common error-handlers, Python has these built-in error-handlers:
strict
replace
ignore
backslashreplace
xmlcharrefreplace
namereplace
surrogateescape
surrogatepass (only for utf-8/utf-16/utf-32 family)
For example, maybe `xmlcharrefreplace` is heavily used in Web application, it can be implemented as a fast-path, so that no need to call the error-handler object every time.
Just like the `xmlcharrefreplace` fast-path in `PyUnicode_EncodeCharmap` [4].
[1] encode function:
https://github.com/python/cpython/blob/v3.9.0b4/Modules/cjkcodecs/multibytecodec.c#L192
[2] decode function:
https://github.com/python/cpython/blob/v3.9.0b4/Modules/cjkcodecs/multibytecodec.c#L347
[3] `call_error_callback` function:
https://github.com/python/cpython/blob/v3.9.0b4/Modules/cjkcodecs/multibytecodec.c#L82
[4] `xmlcharrefreplace` fast-path in `PyUnicode_EncodeCharmap`:
https://github.com/python/cpython/blob/v3.9.0b4/Objects/unicodeobject.c#L8662
----------
components: Unicode
messages: 373871
nosy: ezio.melotti, malin, vstinner
priority: normal
severity: normal
status: open
title: Inefficient error-handle for CJK encodings
type: performance
versions: Python 3.10
_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue41330>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com