Mailing List Archive: [issue5127] UnicodeEncodeError - I can't even see license

[issue5127] UnicodeEncodeError - I can't even see license

Feb 1, 2009, 5:31 PM

Post #1 of 14 (586 views)

New submission from Venusaur <bupjae@hotmail.com>:

>>> license
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python30\lib\site.py", line 372, in __repr__
self.__setup()
File "C:\Python30\lib\site.py", line 359, in __setup
data = fp.read()
File "C:\Python30\lib\io.py", line 1724, in read
decoder.decode(self.buffer.read(), final=True))
File "C:\Python30\lib\io.py", line 1295, in decode
output = self.decoder.decode(input, final=final)
UnicodeDecodeError: 'cp949' codec can't decode bytes in position 15164-15165: illegal multibyte sequence
>>> chr(0x10000)
'\U00010000'
>>> chr(0x11000)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python30\lib\io.py", line 1491, in write
b = encoder.encode(s)
UnicodeEncodeError: 'cp949' codec can't encode character '\ud804' in position 1: illegal multibyte sequence
>>>

I also can't understand why chr(0x10000) and chr(0x11000) have different
behavior.

----------
components: Unicode
messages: 80924
nosy: bupjae
severity: normal
status: open
title: UnicodeEncodeError - I can't even see license
versions: Python 3.0

_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue5127>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[issue5127] UnicodeEncodeError - I can't even see license [ In reply to ]

Feb 1, 2009, 5:56 PM

Post #2 of 14 (564 views)

Ezio Melotti <ezio.melotti@gmail.com> added the comment:

Here (winxpsp2, Py3, cp850-terminal) the license works fine:
>>> license
Type license() to see the full license text

and license() works as well.

I get this output for the chr()s:
>>> chr(0x10000)
'\U00010000'
>>> chr(0x11000)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Programs\Python30\lib\io.py", line 1491, in write
b = encoder.encode(s)
File "C:\Programs\Python30\lib\encodings\cp850.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1-2: character maps to <undefined>

I believe that chr(0x10000) and chr(0x11000) should have the opposite
behavior.
U+10000 (LINEAR B SYLLABLE B008 A) belongs to the 'Lo' category and
should be printed (and possibly raise a UnicodeError, see issue5110
[1]), U+11000 belongs to the 'Cn' category and should be escaped[2].
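
The escaping rule referred to in [2] can be checked from Python itself. This is only a minimal sketch, assuming a current Python 3 where strings are not split into surrogate pairs. U+FFFF is used as a stand-in for a 'Cn' code point because it is a permanent noncharacter, and the helper name `describe` is made up for this illustration:

```python
import unicodedata

def describe(code_point):
    """Return (category, isprintable, repr) for a single code point."""
    ch = chr(code_point)
    return unicodedata.category(ch), ch.isprintable(), repr(ch)

lo_info = describe(0x10000)   # LINEAR B SYLLABLE B008 A, category 'Lo'
cn_info = describe(0xFFFF)    # noncharacter, category 'Cn'

# ascii() keeps the output safe on terminals that cannot encode U+10000.
print(ascii(lo_info))
print(ascii(cn_info))
```

The printable 'Lo' character is left as-is by repr(), while the 'Cn' one comes back escaped.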

On Linux with Py3 and a UTF-8 terminal, chr(0x10000) prints '\U00010000'
and chr(0x11000) prints the char (actually I see two boxes, but it
shouldn't be a problem of Python). The license() works fine too.

Also note that with cp850 the error message is 'character maps to
<undefined>' and with cp949 it is 'illegal multibyte sequence'.
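
The two messages come from two different codec families, which can be seen without a terminal by encoding directly. A small sketch, assuming a current Python 3; `encode_error_reason` is a hypothetical helper, not part of any library:

```python
def encode_error_reason(text, encoding):
    """Return the codec's error reason, or None if encoding succeeds."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc.reason
    return None

sample = "\U00010000"  # not representable in either code page

cp850_reason = encode_error_reason(sample, "cp850")   # charmap-based codec
cp949_reason = encode_error_reason(sample, "cp949")   # multibyte CJK codec
print(cp850_reason)
print(cp949_reason)
```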

[1]: http://bugs.python.org/issue5110
[2]: http://www.python.org/dev/peps/pep-3138/#specification

----------
nosy: +ezio.melotti

[issue5127] UnicodeEncodeError - I can't even see license [ In reply to ]

Feb 2, 2009, 10:05 AM

Post #3 of 14 (554 views)

Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:

There were non-ascii characters in the Windows license file. This was
corrected with r67860.
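
The failure mode can be reproduced without the license file. A rough sketch only: the actual offending bytes are not shown in this thread, so the sample below (a Latin-1 copyright sign, byte 0xA9) is an assumption made for illustration:

```python
# Hypothetical stand-in for the license text: a Latin-1 copyright sign
# (byte 0xA9) followed by a space. The real offending bytes are not
# shown in this thread.
raw = b"Copyright \xa9 2009"

try:
    raw.decode("cp949")
    cp949_failed = False
except UnicodeDecodeError as exc:
    # 0xA9 is read as the first byte of a two-byte sequence, and the
    # following space is not a valid second byte.
    cp949_failed = True
    print(exc)

# Decoding with the encoding assumed for this sample works.
text = raw.decode("latin-1")
print(ascii(text))
```

A pure-ASCII file decodes identically under both encodings, which is presumably why removing the non-ASCII characters was enough.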

> I believe that chr(0x10000) and chr(0x11000) should have the
> opposite behavior.

This other problem is because on a narrow unicode build,
Py_UNICODE_ISPRINTABLE takes a 16bit integer.
And indeed,

>>> unicodedata.category(chr(0x10000 % 65536))
'Cc'
>>> unicodedata.category(chr(0x11000 % 65536))
'Lo'
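
The same check written as a small function, as a sketch. It assumes that passing a code point above 0xFFFF to a 16-bit parameter simply keeps the low 16 bits, which is what the modulo above models; the function name is made up here:

```python
import unicodedata

def category_after_16bit_truncation(code_point):
    """Category of the character obtained by keeping only the low 16 bits."""
    return unicodedata.category(chr(code_point & 0xFFFF))

for cp in (0x10000, 0x11000):
    truncated = cp & 0xFFFF
    print(hex(cp), "->", hex(truncated), category_after_16bit_truncation(cp))
```

U+10000 collapses to U+0000 (a control character, so not printable) and U+11000 collapses to U+1000 (a letter, so printable), which would explain the swapped behavior.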

----------
nosy: +amaury.forgeotdarc

[issue5127] UnicodeEncodeError - I can't even see license [ In reply to ]

Feb 3, 2009, 3:34 AM

Post #4 of 14 (552 views)

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

I don't understand the behaviour of unichr():

Python 2.7a0 (trunk:68963M, Jan 30 2009, 00:49:28)
>>> import unicodedata
>>> unicodedata.category(u"\U00010000")
'Lo'
>>> unicodedata.category(u"\U00011000")
'Cn'
>>> unicodedata.category(unichr(0x10000))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

Why does unichr() fail whereas \Uxxxxxxxx works?

>>> len(u"\U00010000")
2
>>> ord(u"\U00010000")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ord() expected a character, but string of length 2 found

----------
nosy: +haypo

[issue5127] UnicodeEncodeError - I can't even see license [ In reply to ]

Feb 3, 2009, 4:26 AM

Post #5 of 14 (555 views)

Ezio Melotti <ezio.melotti@gmail.com> added the comment:

FWIW, on Python3 it seems to work:
>>> import unicodedata
>>> unicodedata.category("\U00010000")
'Lo'
>>> unicodedata.category("\U00011000")
'Cn'
>>> unicodedata.category(chr(0x10000))
'Lo'
>>> unicodedata.category(chr(0x11000))
'Cn'
>>> ord(chr(0x10000)), 0x10000
(65536, 65536)
>>> ord(chr(0x11000)), 0x11000
(69632, 69632)

I'm using a narrow build too:
>>> import sys
>>> sys.maxunicode
65535
>>> len('\U00010000')
2
>>> ord('\U00010000')
65536

On Python2 unichr() is supposed to raise a ValueError on a narrow build
if the value is greater than 0xFFFF [1], but if the characters above
0xFFFF can be represented with u"\Uxxxxxxxx" there should be a way to
fix unichr so it can return them. Python3 already does it with chr().
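
For reference, this is the standard UTF-16 arithmetic a chr() needs in order to return a character above 0xFFFF as two 16-bit units. It is a sketch of the calculation only, not the actual CPython code, and `to_surrogate_pair` is a hypothetical name:

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary planes")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)      # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits of the offset
    return high, low

print([hex(u) for u in to_surrogate_pair(0x10000)])
print([hex(u) for u in to_surrogate_pair(0x11000)])
```

The second result starts with 0xd804, the same '\ud804' that appears in the cp949 error message in the original report.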

Maybe we should open a new issue for this if it's not present already.

[1]: http://docs.python.org/library/functions.html#unichr

[issue5127] UnicodeEncodeError - I can't even see license [ In reply to ]

Feb 3, 2009, 4:39 AM

Post #6 of 14 (551 views)

Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:

Since r56395, ord() and chr() accept and return surrogate pairs even in
narrow builds.

The goal is to remove most differences between narrow and wide unicode
builds (except for string lengths, indices or slices)
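
What "accepting a surrogate pair" means for ord() can be sketched in pure Python. This is an illustration of the idea under the usual UTF-16 rules, not the code from r56395, and `ord_with_surrogates` is a made-up name:

```python
def ord_with_surrogates(units):
    """ord()-like helper: accept one character or a high+low surrogate pair."""
    if len(units) == 1:
        return ord(units)
    if len(units) == 2:
        high, low = ord(units[0]), ord(units[1])
        if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
            # Recombine the two 10-bit halves and add back the 0x10000 offset.
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    raise TypeError("expected a character or a surrogate pair")

print(ord_with_surrogates("\ud800\udc00"))
print(ord_with_surrogates("A"))
```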

To address this problem, I suggest to change all functions in
unicodectype.c so that they accept Py_UCS4 characters (instead of
Py_UNICODE).
This would be a binary-incompatible change; and --with-wctype-functions
would have an effect only if sizeof(wchar_t)==4 (instead of the current
condition sizeof(wchar_t)==sizeof(PY_UNICODE_TYPE))

[issue5127] UnicodeEncodeError - I can't even see license [ In reply to ]

Feb 3, 2009, 5:12 AM

Post #7 of 14 (547 views)

Marc-Andre Lemburg <mal@egenix.com> added the comment:

On 2009-02-03 13:39, Amaury Forgeot d'Arc wrote:
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
>
> Since r56395, ord() and chr() accept and return surrogate pairs even in
> narrow builds.
>
> The goal is to remove most differences between narrow and wide unicode
> builds (except for string lengths, indices or slices)
>
> To address this problem, I suggest to change all functions in
> unicodectype.c so that they accept Py_UCS4 characters (instead of
> Py_UNICODE).

-1.

That would cause major breakage in the C API and is not in line with the
intention of having a Py_UNICODE type in the first place.

Users who are interested in UCS4 builds should simply use UCS4 builds.

> This would be a binary-incompatible change; and --with-wctype-functions
> would have an effect only if sizeof(wchar_t)==4 (instead of the current
> condition sizeof(wchar_t)==sizeof(PY_UNICODE_TYPE))

--with-wctype-functions was scheduled for removal many releases ago,
but I never got around to it. The only reason it's still there is
that some Linux distributions use this config option (AFAIR, RedHat).
I'd be +1 on removing the option in 3.0.1 or deprecating it in
3.0.1 and removing it in 3.1.

It's not useful in any way, and causes compatibility problems
with regular builds.

----------
nosy: +lemburg

[issue5127] UnicodeEncodeError - I can't even see license [ In reply to ]

Feb 3, 2009, 5:14 AM

Post #8 of 14 (553 views)

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

Note: My examples are made with Python 2.x.

> The goal is to remove most differences between narrow and wide unicode
> builds (except for string lengths, indices or slices)

It would be nice to get the same behaviour in Python 2.x and 3.x to help
migration from Python2 to Python3 ;-)

unichr() (in Python 2.x) documentation is correct. But I would appreciate
support for surrogates in unichr(), which also means changing ord() behaviour.

> To address this problem, I suggest to change all functions in
> unicodectype.c so that they accept Py_UCS4 characters (instead of
> Py_UNICODE).

Why? Using surrogates, you can use 16-bits Py_UNICODE to store non-BMP
characters (code > 0xffff).

--

I can open a new issue if you agree that we can change unichr() / ord()
behaviour on narrow build. We may ask on the mailing list?

[issue5127] UnicodeEncodeError - I can't even see license [ In reply to ]

Feb 3, 2009, 5:18 AM

Post #9 of 14 (550 views)

Marc-Andre Lemburg <mal@egenix.com> added the comment:

On 2009-02-03 14:14, STINNER Victor wrote:
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
>
> amaury> Since r56395, ord() and chr() accept and return surrogate pairs
> amaury> even in narrow builds.
>
> Note: My examples are made with Python 2.x.
>
>> The goal is to remove most differences between narrow and wide unicode
>> builds (except for string lengths, indices or slices)
>
> It would be nice to get the same behaviour in Python 2.x and 3.x to help
> migration from Python2 to Python3 ;-)
>
> unichr() (in Python 2.x) documentation is correct. But I would appreciate
> support for surrogates in unichr(), which also means changing ord() behaviour.

This is not possible for unichr() in Python 2.x, since applications
always expect len(unichr(x)) == 1.

Changing ord() in Python 2.x would be easier, since
this would only extend the range of returned values for UCS2
builds.

>> To address this problem, I suggest to change all functions in
>> unicodectype.c so that they accept Py_UCS4 characters (instead of
>> Py_UNICODE).
>
> Why? Using surrogates, you can use 16-bits Py_UNICODE to store non-BMP
> characters (code > 0xffff).
>
> --
>
> I can open a new issue if you agree that we can change unichr() / ord()
> behaviour on narrow build. We may ask on the mailing list?

[issue5127] UnicodeEncodeError - I can't even see license [ In reply to ]

Feb 3, 2009, 5:50 AM

Post #10 of 14 (559 views)

Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:

> That would cause major breakage in the C API

Not if you recompile. I don't see how this breaks the API at the C level.

> and is not in line with the intention of having a Py_UNICODE
> type in the first place.

Py_UNICODE is still used as the allocation unit for unicode strings.

To get correct results, we need a way to access the whole unicode
database even on ucs2 builds; it's possible with the unicodedata module,
why not from C?

My motivation for the change is this post:
http://mail.python.org/pipermail/python-dev/2008-July/080900.html

[issue5127] UnicodeEncodeError - I can't even see license [ In reply to ]

Feb 3, 2009, 6:18 AM

Post #11 of 14 (553 views)

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

lemburg> This is not possible for unichr() in Python 2.x, since applications
lemburg> always expect len(unichr(x)) == 1

Oh, ok.

lemburg> Changing ord() in Python 2.x would be easier, since
lemburg> this would only extend the range of returned values for UCS2
lemburg> builds.

ord() of Python3 (narrow build) rejects surrogate characters:

'\U00010000'
>>> len(chr(0x10000))
2
>>> ord(0x10000)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ord() expected string of length 1, but int found

---

It looks like narrow builds with surrogates have some more problems...

Test with U+10000: "LINEAR B SYLLABLE B008 A", category: Letter, Other.

Correct result (Python 2.5, wide build):

$ python
Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
>>> unichr(0x10000)
u'\U00010000'
>>> unichr(0x10000).isalpha()
True

Error in Python3 (narrow build):

marge$ ./python
Python 3.1a0 (py3k:69105M, Feb 3 2009, 15:04:35)
>>> chr(0x10000).isalpha()
False
>>> list(chr(0x10000))
['\ud800', '\udc00']
>>> chr(0xd800).isalpha()
False
>>> chr(0xdc00).isalpha()
False

Unicode ranges, all in the category "Other, Surrogate":
- U+D800..U+DB7F: Non Private Use High Surrogate
- U+DB80..U+DBFF: Private Use High Surrogate
- U+DC00..U+DFFF: Low Surrogate
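
The difference between testing each 16-bit unit and testing whole code points can be sketched as follows. This assumes a current Python 3, where chr(0x10000) is a single character, and it models a narrow-build string as a plain list of code units; `iter_code_points` is a hypothetical helper:

```python
def iter_code_points(units):
    """Yield code points from a sequence of UTF-16 code units, joining pairs."""
    i = 0
    while i < len(units):
        unit = units[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # High surrogate followed by low surrogate: one code point.
            yield 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield unit
            i += 1

units = [0xD800, 0xDC00, 0x41]   # U+10000 followed by "A"
per_unit = [chr(u).isalpha() for u in units]
per_code_point = [chr(cp).isalpha() for cp in iter_code_points(units)]
print(per_unit)
print(per_code_point)
```

Checking unit by unit reports the two surrogates as non-alphabetic, while joining the pair first gives the answer expected for U+10000.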

[issue5127] UnicodeEncodeError - I can't even see license [ In reply to ]

Feb 3, 2009, 10:47 AM

Post #12 of 14 (548 views)

Marc-Andre Lemburg <mal@egenix.com> added the comment:

On 2009-02-03 14:50, Amaury Forgeot d'Arc wrote:
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
>
>> That would cause major breakage in the C API
>
> Not if you recompile. I don't see how this breaks the API at the C level.

Well, then try to look at such a change from a C extension
writer's perspective.

They'd have to change all their function calls and routines to work
with Py_UCS4.

Supporting both the old API and the new one would
be nearly impossible and require either an adapter API or a lot
of #ifdef'ery.

Please remember that the public Python C API is not only meant for
Python developers. Its main purpose is for it to be used by
other developers extending or embedding Python and those developers
use different release cycles and want to support more than just the
bleeding edge Python version.

Python has a long history of providing very stable APIs, both in
C and in Python.

FWIW: The last major change in the C API (the change to Py_ssize_t
from Python 2.4 to 2.5) has not even propagated to all major C
extensions yet. It's only now that people start to realize problems
with this, since their extensions start failing with segfaults
on 64-bit machines.

That said, we can of course provide additional UCS4 APIs for
certain things and also provide conversion helpers between
Py_UNICODE and Py_UCS4 where needed.

>> and is not in line with the intention of having a Py_UNICODE
>> type in the first place.
>
> Py_UNICODE is still used as the allocation unit for unicode strings.
>
> To get correct results, we need a way to access the whole unicode
> database even on ucs2 builds; it's possible with the unicodedata module,
> why not from C?

I must be missing some detail, but what does the Unicode database
have to do with the unicodeobject.c C API ?

> My motivation for the change is this post:
> http://mail.python.org/pipermail/python-dev/2008-July/080900.html

There are certainly other ways to make Python deal with surrogates
in more cases than the ones we already support.

[issue5127] UnicodeEncodeError - I can't even see license [ In reply to ]

Feb 3, 2009, 11:19 AM

Post #13 of 14 (549 views)

Ezio Melotti <ezio.melotti@gmail.com> added the comment:

haypo> ord() of Python3 (narrow build) rejects surrogate characters:
haypo> '\U00010000'
haypo> >>> len(chr(0x10000))
haypo> 2
haypo> >>> ord(0x10000)
haypo> TypeError: ord() expected string of length 1, but int found

ord() works fine on Py3, you probably meant to do
>>> ord('\U00010000')
65536
or
>>> ord(chr(0x10000))
65536

In Py3 it is also documented (help(ord)) that it accepts surrogate pairs.
Py2 instead doesn't support them:
>>> ord(u'\U00010000')
TypeError: ord() expected a character, but string of length 2 found

[issue5127] UnicodeEncodeError - I can't even see license [ In reply to ]

Feb 3, 2009, 3:25 PM

Post #14 of 14 (550 views)

Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:

> I must be missing some detail, but what does the Unicode database
> have to do with the unicodeobject.c C API ?

Ah, now I understand your concerns. My suggestion is to change only the 20 functions in
unicodectype.c: _PyUnicode_IsAlpha, _PyUnicode_ToLowercase... and no change in
unicodeobject.c at all.
They all take a single code point as argument, some also return a single code point.
Changing these functions is backwards compatible.

I attach a patch so we can argue on concrete code (tests are missing).

Another effect of the patch: unicodedata.numeric('\N{AEGEAN NUMBER TWO}') can return 2.0.
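
A quick way to check this lookup, as a sketch assuming a current Python 3 (on the narrow build discussed here, the result depends on the patch being applied):

```python
import unicodedata

ch = "\N{AEGEAN NUMBER TWO}"   # a numeric character above 0xFFFF
print(hex(ord(ch)), unicodedata.category(ch), unicodedata.numeric(ch))
```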

The str.isalpha() (and others) methods did not change: they still split the surrogate pairs.

----------
keywords: +patch
Added file: http://bugs.python.org/file12934/unicodectype_ucs4.patch
