Mailing List Archive: Re: Python 1.6a2 Unicode bug (was Re: comparing strings and ints)

Re: Python 1.6a2 Unicode bug (was Re: comparing strings and ints)

Apr 26, 2000, 5:04 AM

Post #1 of 8

Fredrik Lundh replied to himself in c.l.py:
>> as far as I can tell, it's supposed to be a feature.
>>
>> if you mix 8-bit strings with unicode strings, python 1.6a2
>> attempts to interpret the 8-bit string as an utf-8 encoded
>> unicode string.
>>
>> but yes, I also think it's a bug. but thus far, my attempts
>> to get someone else to fix it have failed. might have to do
>> it myself... ;-)
>
>postscript: the powers-that-be have decided that this is not
>a bug. if you thought that strings were just sequences of
>characters, just as in Perl and Tcl, you're in for one big
>surprise in Python 1.6...

I just read the last few posts of the powers-that-be-list on this subject
(Thanks to Christian for pointing out the archives in c.l.py ;-), and I
must say I completely agree with Fredrik. The current situation sucks. A
string should always be a sequence of characters. A utf-8-encoded 8-bit
string in Python is *not* a string, but a "ByteArray". An 8-bit string
should never be assumed to be utf-8 because of that distinction. (The
default encoding for the builtin unicode() function may be another story.)
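
The character/byte distinction can be sketched in present-day Python 3, where
the two roles have separate types (str and bytes). This is an illustration
under that assumption, not 1.6a2 behaviour, and the sample word is an
arbitrary choice:

```python
# Python 3 illustration (an assumption; the thread discusses Python 1.6a2).
text = "na\u00efve"              # five characters; U+00EF is i with diaeresis
encoded = text.encode("utf-8")   # the same text as a UTF-8 byte sequence

char_count = len(text)      # counts characters
byte_count = len(encoded)   # counts bytes; U+00EF needs two bytes in UTF-8

print(char_count, byte_count)
```

The two lengths differ, which is why treating the encoded form as "the
string" gives surprising answers for len() and indexing.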

Just

Re: Python 1.6a2 Unicode bug (was Re: comparing strings and ints)

just at letterror

Apr 26, 2000, 7:13 AM

Post #2 of 8

I wrote:
>A utf-8-encoded 8-bit string in Python is *not* a string, but a "ByteArray".

Another way of putting this is:
- utf-8 in an 8-bit string is to a unicode string what a pickle is to an
object.
- defaulting to utf-8 upon coercing is like implicitly trying to unpickle
an 8-bit string when comparing it to an instance. Bad idea.

Defaulting to Latin-1 is the only logical choice, no matter how
western-culture-centric this may seem.
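
A rough sketch of why the two defaults behave so differently, written in
present-day Python 3 rather than 1.6a2 (an assumption), with arbitrary example
bytes. Latin-1 maps every byte value N to code point N, so decoding cannot
fail and preserves length; UTF-8 is a structured format that arbitrary 8-bit
data may simply not satisfy, much like feeding non-pickle data to an
unpickler:

```python
# Python 3 illustration; the byte values below are an arbitrary example.
raw = bytes([0x63, 0x61, 0x66, 0xE9])  # "cafe" with an acute e, as Latin-1 bytes

# Latin-1: byte value N becomes code point N, so this always succeeds
# and the result has exactly one character per byte.
as_latin1 = raw.decode("latin-1")
same_length = len(as_latin1) == len(raw)

# UTF-8: 0xE9 announces a multi-byte sequence that never arrives,
# so these bytes are not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(same_length, utf8_ok)
```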

Just

Re: Re: Python 1.6a2 Unicode bug (was Re: comparing strings and ints)

mal at lemburg

Apr 26, 2000, 11:01 AM

Post #3 of 8

Just van Rossum wrote:
>
> I wrote:
> >A utf-8-encoded 8-bit string in Python is *not* a string, but a "ByteArray".
>
> Another way of putting this is:
> - utf-8 in an 8-bit string is to a unicode string what a pickle is to an
> object.
> - defaulting to utf-8 upon coercing is like implicitly trying to unpickle
> an 8-bit string when comparing it to an instance. Bad idea.
>
> Defaulting to Latin-1 is the only logical choice, no matter how
> western-culture-centric this may seem.

Please note that the support for mixing strings and Unicode
objects is really only there to aid porting applications
to Unicode.

New code should use Unicode directly and apply all needed
conversions explicitly using one of the many ways to
encode or decode Unicode data. The auto-conversions are
only there to help out and provide some convenience.
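
As a minimal sketch of what "explicit conversions" can look like, here is the
decode-at-input, encode-at-output pattern in present-day Python 3 (an
assumption; the 1.6a2 API differs). The helper names read_text and write_text
are hypothetical, invented for this illustration:

```python
# Python 3 illustration; read_text and write_text are hypothetical helpers.

def read_text(raw: bytes, encoding: str) -> str:
    """Decode incoming bytes once, at the boundary, naming the encoding."""
    return raw.decode(encoding)

def write_text(text: str, encoding: str) -> bytes:
    """Encode outgoing text once, at the boundary, naming the encoding."""
    return text.encode(encoding)

incoming = b"caf\xe9"                         # Latin-1 bytes from some source
text = read_text(incoming, "latin-1")         # now a real character string
outgoing = write_text(text.upper(), "utf-8")  # string work, then encode

print(outgoing)
```

Nothing here relies on an implicit default encoding: every conversion states
which encoding it uses.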

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

Re: Re: Python 1.6a2 Unicode bug (was Re: comparing strings and ints)

effbot at telia

Apr 26, 2000, 2:29 PM

Post #4 of 8

(forwarded from c.l.py, on request)

> New code should use Unicode directly and apply all needed
> conversions explicitly using one of the many ways to
> encode or decode Unicode data. The auto-conversions are
> only there to help out and provide some convenience.

does this mean that the 8-bit string type is deprecated ???

</F>

RE: Re: Python 1.6a2 Unicode bug (was Re: comparing strings and ints)

mhammond at skippinet

Apr 26, 2000, 5:08 PM

Post #5 of 8

Is it necessary for us to also have this scrag-fight in public?
Most of the thread on c.l.py is filled in by people who are also
py-dev members!

[MAL writes]

> Please note that the support for mixing strings and Unicode
> objects is really only there to aid porting applications
> to Unicode.
>
> New code should use Unicode directly and apply all needed
> conversions explicitly using one of the many ways to
> encode or decode Unicode data.

This will _never_ happen. The Python programmer should never need
to be aware they have a Unicode string versus a standard string -
just a "string"! The fact there are 2 string types should be
considered an implementation detail, and not a conceptual model for
people to work within.

I think we will be mixing Unicode and strings for ever! The only
way to avoid it would be a unified type - possibly Py3k. Until
then, people will still generally use strings as literals in their
code, and should not even be aware they are mixing. I'm never going
to prefix my ascii-only strings with u"" just to avoid the
possibility of mixing!

Listening to the arguments, I've got to say I'm coming down squarely
on the side of Fredrik and Just. Strings must be sequences of
characters, whose length is the number of characters. A string
holding an encoding should be considered logically a byte array, and
conversions should be explicit.

> The auto-conversions are only there to help out and provide some
> convenience.

Doesn't sound like it is working :-(

Mark.

RE: Re: Python 1.6a2 Unicode bug (was Re: comparing strings and ints)

tim_one at email

Apr 26, 2000, 7:27 PM

Post #6 of 8

[Just van Rossum]
> ...
> Defaulting to Latin-1 is the only logical choice, no matter how
> western-culture-centric this may seem.

Indeed, if someone from an inferior culture wants to chime in, let them find
Python-Dev with their own beady little eyes <wink>.

western-culture-is-better-than-none-&-at-least-*we*-understand-it-ly
y'rs - tim

RE: Re: Python 1.6a2 Unicode bug (was Re: comparing strings and ints)

tim_one at email

Apr 26, 2000, 10:08 PM

Post #7 of 8

[Just van Rossum]
> All irony aside, I think you've nailed one of the problems spot on:
> - most core Python developers seem to be too busy to read
> *anything* at all in c.l.py
> - most people that care about the issues are not on python-dev

But they're not on c.l.py either, are they? I still read everything there,
although that's gotten so time-consuming I rarely reply anymore. In any
case, I've seen almost nothing useful about Unicode issues on c.l.py that
wasn't also on Python-Dev; perhaps I missed something.

ask-10-more-people-&-you'll-get-20-more-opinions-ly y'rs - tim

RE: Re: Python 1.6a2 Unicode bug (was Re: comparing strings and ints)

just at letterror

Apr 26, 2000, 10:42 PM

Post #8 of 8

At 10:27 PM -0400 26-04-2000, Tim Peters wrote:
>Indeed, if someone from an inferior culture wants to chime in, let them find
>Python-Dev with their own beady little eyes <wink>.

All irony aside, I think you've nailed one of the problems spot on:
- most core Python developers seem to be too busy to read *anything* at all
in c.l.py
- most people that care about the issues are not on python-dev

Just