Mailing List Archive: [issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only)

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only)

Jan 16, 2009, 7:58 PM

Post #1 of 19 (866 views)

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

Patch fixing PyUnicode_FromWideChar() for UCS-2 build: create
surrogates for character > U+FFFF like PyUnicode_FromOrdinal() does.

----------
keywords: +patch
Added file: http://bugs.python.org/file12773/unicode_fromwidechar_surrogate.patch

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) [ In reply to ]

Jan 16, 2009, 8:00 PM

Post #2 of 19 (859 views)

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

Note: I wrote my patch against py3k r68646.

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) [ In reply to ]

Jan 17, 2009, 1:41 AM

Post #3 of 19 (842 views)

Mark Dickinson <dickinsm@gmail.com> added the comment:

Thanks for the patch, Victor!

Looks pretty good at first glance, except that it seems that the UTF-32 to
UTF-16 translation is skipped if HAVE_USABLE_WCHAR_T is defined. Is that
deliberate?

A test would be good, too.

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) [ In reply to ]

Jan 17, 2009, 5:00 AM

Post #4 of 19 (862 views)

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

> Looks pretty good at first glance, except that it seems that the UTF-32 to
> UTF-16 translation is skipped if HAVE_USABLE_WCHAR_T is defined. Is that
> deliberate?

#ifdef HAVE_USABLE_WCHAR_T
memcpy(unicode->str, w, size * sizeof(wchar_t));
#else
...
#endif

I understand this code as: sizeof(wchar_t) == sizeof(Py_UNICODE). If I
misunderstood the code, it's a a heap overflow :-) So there is no not
conversion from UTF-32 to UTF-16 using memcpy if HAVE_USABLE_WCHAR_T is
defined, right?

> A test would be good, too.

PyUnicode_FromWideChar() is not a public API. Should I write a function in
_testcapi?

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) [ In reply to ]

Jan 17, 2009, 5:18 AM

Post #5 of 19 (847 views)

Mark Dickinson <dickinsm@gmail.com> added the comment:

> I understand this code as: sizeof(wchar_t) == sizeof(Py_UNICODE). If I
> misunderstood the code, it's a a heap overflow :-)

Yep, sorry. You're right.

>> A test would be good, too.
>
>PyUnicode_FromWideChar() is not a public API. Should I write a function
>in _testcapi?

I was actually thinking of a test for the "No such file or directory"
bug that's mentioned at the start of the discussion.

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) [ In reply to ]

Jan 17, 2009, 7:12 AM

Post #6 of 19 (855 views)

Marc-Andre Lemburg <mal@egenix.com> added the comment:

On 2009-01-17 14:00, STINNER Victor wrote:
>> A test would be good, too.
>
> PyUnicode_FromWideChar() is not a public API. Should I write a function in
> _testcapi?

It is a public C API. Regardless of this aspect, we should always
add tests for bugs in order to make sure they don't pop up again.

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) [ In reply to ]

Jan 17, 2009, 7:19 AM

Post #7 of 19 (836 views)

Marc-Andre Lemburg <mal@egenix.com> added the comment:

On 2009-01-17 14:00, STINNER Victor wrote:
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
>> Looks pretty good at first glance, except that it seems that the UTF-32 to
>> UTF-16 translation is skipped if HAVE_USABLE_WCHAR_T is defined. Is that
>> deliberate?
>
> #ifdef HAVE_USABLE_WCHAR_T
> memcpy(unicode->str, w, size * sizeof(wchar_t));
> #else
> ...
> #endif
>
> I understand this code as: sizeof(wchar_t) == sizeof(Py_UNICODE). If I
> misunderstood the code, it's a a heap overflow :-) So there is no not
> conversion from UTF-32 to UTF-16 using memcpy if HAVE_USABLE_WCHAR_T is
> defined, right?

If HAVE_USABLE_WCHAR_T is defined, Py_UNICODE is defined as wchar_t,
so a memcpy can be used. Note that this does not provide any information
about sizeof(wchar_t), e.g. with GLIBC, wchar_t is 4 bytes. MS C lib defines
it as 2 bytes.

That said, if Py_UNICODE is the same as wchar_t, no conversion is
necessary and that's why the function simply copies over the data.

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) [ In reply to ]

Jan 17, 2009, 8:53 AM

Post #8 of 19 (839 views)

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

Updated patch including a test in _testcapi module: create two
PyUnicode objects from wide string (PyUnicode_FromWideChar) and
another from utf-8 (PyUnicode_FromString) and compare the value. Patch
is still for py3k branch and can be ported easily on python trunk.

Added file: http://bugs.python.org/file12776/unicode_fromwidechar_surrogate-2.patch

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) [ In reply to ]

Jan 17, 2009, 9:07 AM

Post #9 of 19 (850 views)

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

I run my test on py3k on Linux with 32 bits wchar_t:
- 16 bits Py_UNICODE: test is failing without
PyUnicode_FromWideChar() patch
- 32 bits Py_UNICODE: test pass without the patch, so the issue only
impact 16 bits Py_UNICODE

Can someone test with 16 bits wchar_t (eg. Windows)? I think that the
bug only impact 16 bits Py_UNICODE with 32 bits wchar_t.

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) [ In reply to ]

Jan 17, 2009, 9:07 AM

Post #10 of 19 (843 views)

Changes by STINNER Victor <victor.stinner@haypocalc.com>:

Removed file: http://bugs.python.org/file12773/unicode_fromwidechar_surrogate.patch

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) [ In reply to ]

Jan 17, 2009, 9:09 AM

Post #11 of 19 (845 views)

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

(with the full patch, all tests pass with 16 or 32 bits Py_UNICODE)

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) [ In reply to ]

Jan 18, 2009, 1:59 PM

Post #12 of 19 (820 views)

Mark Dickinson <dickinsm@gmail.com> added the comment:

Looks good to me.

I'm not in a position to test with 16-bit wchar_t, but I can't see why
anything would go wrong. I think we can take our chances: check this in
and watch the buildbots for signs of trouble.

Some minor whitespace issues in the unicodeobject.c part of the patch
(mixing of tabs and spaces, one brace indented oddly), but those can
easily be taken care of before committing; not worth regenerating the
patch for.

Marc-AndrÃ©, is it okay with you to check this in?

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) [ In reply to ]

Jan 19, 2009, 1:52 AM

Post #13 of 19 (820 views)

Marc-Andre Lemburg <mal@egenix.com> added the comment:

On 2009-01-18 22:59, Mark Dickinson wrote:
> Mark Dickinson <dickinsm@gmail.com> added the comment:
>
> Looks good to me.
>
> I'm not in a position to test with 16-bit wchar_t, but I can't see why
> anything would go wrong. I think we can take our chances: check this in
> and watch the buildbots for signs of trouble.
>
> Some minor whitespace issues in the unicodeobject.c part of the patch
> (mixing of tabs and spaces, one brace indented oddly), but those can
> easily be taken care of before committing; not worth regenerating the
> patch for.
>
> Marc-AndrÃ©, is it okay with you to check this in?

I'd structure the patch differently, ie. put the whole support code
into a single #ifndef Py_UNICODE_WIDE section as part of the
#ifdef HAVE_USABLE_WCHAR_T pre-processor statement.

Also note that on platforms with 16-bit wchar_t, the comparison
(0xffff < *w) will always be false, so an additional check for
(Py_UNICODE_SIZE > 2) is needed.

BTW: Please always use upper-case hex literals, or at leat don't
mix the case within the same function.

Thanks,
--
Marc-Andre Lemburg
eGenix.com

________________________________________________________________________

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) [ In reply to ]

Jan 20, 2009, 6:05 PM

Post #14 of 19 (817 views)

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

> Also note that on platforms with 16-bit wchar_t, the comparison
> (0xffff < *w) will always be false, so an additional check for
> (Py_UNICODE_SIZE > 2) is needed.

Yes, but the right test is (SIZEOF_WCHAR_T > 2). I wrote a new test:

#if (Py_UNICODE_SIZE == 2) && (SIZEOF_WCHAR_T > 2)
#define USE_WCHAR_SURROGATE
const wchar_t *orig_w;
#endif

> BTW: Please always use upper-case hex literals, or at leat don't
> mix the case within the same function.

I try, but it would be easier if the rule was already respected: they
are many tabs and many lower-case hex literals. I used copy/paste from
existing code of unicodeobject.c...

Patch version 3:
- disable the UTF-16 surrogate for 16 bits wchar_t: so my patch is
only used for 16 bits Py_UNICODE and 32 bits wchar_t... which is the
default case for python 2.6 and 3.0 on Linux
- replace tabulation by spaces (in existing code)
- use upper case literals

Added file: http://bugs.python.org/file12822/unicode_fromwidechar_surrogate-3.patch

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) [ In reply to ]

Jan 20, 2009, 6:05 PM

Post #15 of 19 (818 views)

Changes by STINNER Victor <victor.stinner@haypocalc.com>:

Removed file: http://bugs.python.org/file12776/unicode_fromwidechar_surrogate-2.patch

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) [ In reply to ]

Jan 26, 2009, 8:56 AM

Post #16 of 19 (806 views)

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

@marketdickinson, @lemburg: ping! I updated the patch, does it look
better?

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) [ In reply to ]

Jan 27, 2009, 12:49 AM

Post #17 of 19 (813 views)

Marc-Andre Lemburg <mal@egenix.com> added the comment:

On 2009-01-26 17:56, STINNER Victor wrote:
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
> @marketdickinson, @lemburg: ping! I updated the patch, does it look
> better?

Yes, but there are a few things that still need fixing:

* SIZEOF_WCHAR_T is not defined for Windows builds, so needs
to be added to PC/pyconfig.h (OTOH wchar_t is 2 bytes on Windows)
* USE_WCHAR_SURROGATE should be #defined just before the
function and #undef'ed right after it; I'd also use a more
accurate name
* please use pre-processor indents, e.g.
#ifdef ...
# define ...
#endif

I'd write

#if (Py_UNICODE_SIZE == 2) && defined((SIZEOF_WCHAR_T) && (SIZEOF_WCHAR_T > 2)
# define CONVERT_WCHAR_TO_SURROGATES
#endif

...

#undef CONVERT_WCHAR_TO_SURROGATES

Thanks.

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) [ In reply to ]

Jan 29, 2009, 3:36 PM

Post #18 of 19 (793 views)

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

For lemburg, updated patch:
- Move USE_WCHAR_SURROGATE define outside PyUnicode_FromWideChar()
(and indent the defines, sorry)
- Add "#define SIZEOF_WCHAR_T 2" to PC/pyconfig.h

Added file: http://bugs.python.org/file12890/unicode_fromwidechar_surrogate-4.patch

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue4474] PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) [ In reply to ]

Jan 29, 2009, 3:36 PM

Post #19 of 19 (797 views)

Changes by STINNER Victor <victor.stinner@haypocalc.com>:

Removed file: http://bugs.python.org/file12822/unicode_fromwidechar_surrogate-3.patch

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue4474>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com