Mailing List Archive

Encoding of 8-bit strings and Python source code
After the discussion about #pragmas two weeks ago and some
interesting ideas in the direction of source code encodings
and ways to implement them, I would like to restart the
talk about encodings in source code and runtime
auto-conversions.

Fredrik recently posted patches to the patches list which
loosen the currently hard-coded default encoding used throughout
the Unicode design and add a layer of abstraction which would
make it easily possible to change the default encoding at some
later point. While making things more abstract is certainly
a wise thing to do, I am not sure whether this particular
case fits into the design decisions made a few months ago.

Here's a short summary of what was discussed recently:

1. Fredrik posted the idea of changing the default encoding
from UTF-8 to Latin-1 (he calls this 8-bit Unicode, which
points to the motivation behind it: 8-bit strings should
behave like 8-bit Unicode). His recent patches work in
this direction.

2. Fredrik also posted an interesting idea which enables
writing Python source code in any supported encoding by
having the Python tokenizer read Py_UNICODE data instead
of char data. A preprocessor would take care of converting
the input to Py_UNICODE; the parser would ensure that
8-bit string data gets converted back to char data (using
e.g. UTF-8 or Latin-1 as the encoding).

3. Regarding the addition of pragmas to allow specifying
the source code encoding, several possibilities were
mentioned:
- addition of a keyword "pragma" to define pragma dictionaries
- usage of a "global" as the basis for this
- adding a new keyword "decl" which also allows defining other
things such as type information
- XML-like syntax embedded in Python comments

Some comments:

Ad 1. UTF-8 is used as the basis in many other languages,
such as Tcl and Perl. It is not an intuitive way of
writing strings and causes problems because a single
character can span 1-6 bytes. Still, the world seems to
be moving in this direction, so going the same way can't
be all wrong... Note that stream IO can be recoded in a way
which allows Python to print and read e.g. Latin-1
(see below). The general idea behind the fixed default
encoding design was to give all the power to the user,
since she eventually knows best which encoding to
use or expect.
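The 1-6 byte issue is easy to demonstrate. A small sketch in today's
Python (which postdates this thread), contrasting the UTF-8 and
Latin-1 byte counts for the same string:

```python
# UTF-8 uses a variable number of bytes per character,
# while Latin-1 always uses exactly one.
text = "héllo"            # five characters, one of them non-ASCII

utf8 = text.encode("utf-8")
latin1 = text.encode("latin-1")

print(len(text))    # 5 characters
print(len(utf8))    # 6 bytes: "é" takes two bytes in UTF-8
print(len(latin1))  # 5 bytes: one byte per character
```

This byte/character mismatch is exactly what makes indexing and
length computations awkward when UTF-8 data is treated as an 8-bit
string.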

Ad 2. I like this idea because it enables writing Unicode-
aware programs *in* Unicode... the only problem which remains
is again the encoding to use for the classic 8-bit strings.

Ad 3. For 2. to work, the encoding would have to appear
close to the top of the file. The preprocessor would have
to be BOM-aware in order to tell whether UTF-16 or some
ASCII extension is used by the file.

Guido asked me for some code which demonstrates Latin-1
recoding using the existing mechanisms. I've attached
a simple script to this mail. It is not much tested yet,
so please give it a try.

You can also change it to use any other encoding you like.
Together with the Japanese codecs provided by Tamito Kajiyama
(http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/tmp/japanese-codecs.tar.gz)
you should be able to type Shift-JIS at the raw_input()
or interactive prompt, have it stored as UTF-8 and then
printed back as Shift-JIS, provided you add a recoder
similar to the attached one for Latin-1 to your
PYTHONSTARTUP or site.py script.

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: Encoding of 8-bit strings and Python source code
I'll follow up with a longer reply later; just one correction:

M.-A. Lemburg <mal@lemburg.com> wrote:
> Ad 1. UTF-8 is used as basis in many other languages such
> as TCL or Perl. It is not an intuitive way of
> writing strings and causes problems due to one character
> spanning 1-6 bytes. Still, the world seems to be moving
> into this direction, so going the same way can't be all
> wrong...

the problem here is that the current Python implementation
doesn't use UTF-8 in the same way as Perl and Tcl. Perl
and Tcl expose only one string type, and that type
behaves exactly as it should:

"The Tcl string functions properly handle multi-
byte UTF-8 characters as single characters."

"By default, Perl now thinks in terms of Unicode
characters instead of simple bytes. /.../ All the
relevant built-in functions (length, reverse, and
so on) now work on a character-by-character
basis instead of byte-by-byte, and strings are
represented internally in Unicode."

or in other words, both languages guarantee that given a
string s:

- s is a sequence of characters (not bytes)
- len(s) is the number of characters in the string
- s[i] is the i'th character
- len(s[i]) is 1

and as I've pointed out a zillion times, Python 1.6a2 doesn't. this
should be solved, and I see (at least) four ways to do that:
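for reference, the four guarantees above can be checked mechanically.
a sketch in today's Python, where the built-in string type is
character-based, so the checks pass (the helper name is made up):

```python
def check_string_guarantees(s):
    """verify the four guarantees: a string is a sequence of
    characters, not bytes."""
    assert len(s) == sum(1 for _ in s)   # len counts characters
    for i, ch in enumerate(s):
        assert s[i] == ch                # s[i] is the i'th character
        assert len(s[i]) == 1            # each character has length 1
    return True

# non-ASCII text; still one character per index position
print(check_string_guarantees("blåbærgrød"))  # True
```

the complaint about 1.6a2 is precisely that an 8-bit string holding
UTF-8 data would fail these checks: len would count bytes, and
indexing could land in the middle of a multi-byte character.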

-- the Tcl 8.1 way: make 8-bit strings UTF-8 aware. operations
like len and getitem usually have to search from the start of the
string.

to handle binary data, introduce a special ByteArray type. when
mixing ByteArrays and strings, treat each byte in the array as an
8-bit unicode character (conversions from strings to byte arrays
are lossy).

[imho: lots of code, and seriously affects performance, even when
unicode characters are never used. this approach was abandoned
in Tcl 8.2]

-- the Tcl 8.2 way: use a unified string type, which stores data as
UTF-8 and/or 16-bit unicode:

struct {
    char* bytes;           /* 8-bit representation (utf-8) */
    Tcl_UniChar* unicode;  /* 16-bit representation */
}

if one of the representations is modified, the other is regenerated
on demand. operations like len, slice and getitem always convert
to 16-bit first.

still need a ByteArray type, similar to the one described above.

[imho: faster than before, but still not as good as a pure 8-bit
string type. and the need for a separate byte array type would break
a lot of existing Python code]
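(the regenerate-on-demand scheme can be sketched in a few lines of
today's Python; the DualString class below is a made-up toy model of
the idea, not the actual Tcl implementation:)

```python
class DualString:
    """toy model of Tcl 8.2's unified string: keep a UTF-8 byte
    representation and a character representation, regenerating
    each lazily from the other."""

    def __init__(self, chars=None, utf8=None):
        self._chars = chars   # character view (may be None)
        self._utf8 = utf8     # UTF-8 byte view (may be None)

    @property
    def chars(self):
        if self._chars is None:            # regenerate on demand
            self._chars = self._utf8.decode("utf-8")
        return self._chars

    @property
    def utf8(self):
        if self._utf8 is None:             # regenerate on demand
            self._utf8 = self._chars.encode("utf-8")
        return self._utf8

    def __len__(self):                     # len always counts characters
        return len(self.chars)

s = DualString(utf8=b"bl\xc3\xa5b\xc3\xa6r")   # UTF-8 for "blåbær"
print(len(s), s.chars)                          # 6 blåbær
```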

-- the Perl 5.6 way? (haven't looked at the implementation, but I'm
pretty sure someone told me it was done this way). essentially the
same as Tcl 8.2, but with an extra encoding field (to avoid
conversions if data is just passed through).

struct {
    int encoding;
    char* bytes;           /* 8-bit representation */
    Tcl_UniChar* unicode;  /* 16-bit representation */
}

[imho: see Tcl 8.2]

-- my proposal: expose both types, but let them contain characters
from the same character set -- at least when used as strings.

as before, 8-bit strings can be used to store binary data, so we
don't need a separate ByteArray type. in an 8-bit string, there's
always one character per byte.

[imho: small changes to the existing code base, about as efficient
as can be, no attempt to second-guess the user, fully backwards
compatible, fully compliant with the definition of strings in the
language reference, patches are available, etc...]
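(the one-character-per-byte rule can be illustrated with today's
types -- bytes standing in for the old 8-bit string, decoded as
latin-1 to get the character view; a sketch:)

```python
raw = bytes(range(256))            # every possible byte value
as_chars = raw.decode("latin-1")   # the proposal: byte i *is* U+00i

assert len(as_chars) == len(raw)   # always one character per byte
assert all(ord(as_chars[i]) == raw[i] for i in range(256))
print("byte value == character code for all 256 byte values")
```

this is the "same character set" property: the first 256 unicode
code points coincide with latin-1, so no byte value is ambiguous.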

</F>
Re: Encoding of 8-bit strings and Python source code
Fredrik Lundh wrote:
>
> I'll follow up with a longer reply later; just one correction:
>
> M.-A. Lemburg <mal@lemburg.com> wrote:
> > Ad 1. UTF-8 is used as basis in many other languages such
> > as TCL or Perl. It is not an intuitive way of
> > writing strings and causes problems due to one character
> > spanning 1-6 bytes. Still, the world seems to be moving
> > into this direction, so going the same way can't be all
> > wrong...
>
> the problem here is that the current Python implementation
> doesn't use UTF-8 in the same way as Perl and Tcl. Perl
> and Tcl expose only one string type, and that type
> behaves exactly as it should:
>
> "The Tcl string functions properly handle multi-
> byte UTF-8 characters as single characters."
>
> "By default, Perl now thinks in terms of Unicode
> characters instead of simple bytes. /.../ All the
> relevant built-in functions (length, reverse, and
> so on) now work on a character-by-character
> basis instead of byte-by-byte, and strings are
> represented internally in Unicode."
>
> or in other words, both languages guarantee that given a
> string s:
>
> - s is a sequence of characters (not bytes)
> - len(s) is the number of characters in the string
> - s[i] is the i'th character
> - len(s[i]) is 1
>
> and as I've pointed out a zillion times, Python 1.6a2 doesn't.

Just a side note: we never discussed turning the native
8-bit strings into any encoding-aware type.

> this
> should be solved, and I see (at least) four ways to do that:
>
> ...
> -- the Perl 5.6 way? (haven't looked at the implementation, but I'm
> pretty sure someone told me it was done this way). essentially the
> same as Tcl 8.2, but with an extra encoding field (to avoid
> conversions if data is just passed through).
>
> struct {
>     int encoding;
>     char* bytes;           /* 8-bit representation */
>     Tcl_UniChar* unicode;  /* 16-bit representation */
> }
>
> [imho: see Tcl 8.2]
>
> -- my proposal: expose both types, but let them contain characters
> from the same character set -- at least when used as strings.
>
> as before, 8-bit strings can be used to store binary data, so we
> don't need a separate ByteArray type. in an 8-bit string, there's
> always one character per byte.
>
> [imho: small changes to the existing code base, about as efficient
> as can be, no attempt to second-guess the user, fully backwards
> compatible, fully compliant with the definition of strings in the
> language reference, patches are available, etc...]

Why not name the beast?! In your proposal, the old 8-bit
strings simply use Latin-1 as their native encoding.

The current version doesn't make any encoding assumption as
long as the 8-bit strings do not get auto-converted. In that case
they are interpreted as UTF-8 -- which will (usually) fail
for Latin-1-encoded strings using the 8th bit, but hey, at least
you get an error message telling you what is going wrong.

The key to these problems is using explicit conversions where
8-bit strings meet Unicode objects.

Some more ideas along the convenience path:

Perhaps changing just the way 8-bit strings are coerced
to Unicode would help: strings would then be interpreted
as Latin-1. str(Unicode) and "t" would still return
UTF-8 to assure loss-less conversion.

Another way to tackle this would be to first try UTF-8
conversion during auto-conversion and then fall back to
Latin-1 in case it fails. Has anyone tried this? Guido
mentioned that Tcl does something along these lines...
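The fallback described above is easy to sketch. Shown here with
today's bytes/str API, purely to make the idea concrete (the helper
name is made up), not as an endorsement of guessing:

```python
def guess_decode(raw: bytes) -> str:
    """Try strict UTF-8 first; if that fails, fall back to Latin-1.
    Latin-1 can never fail, since every byte value maps to a
    character."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

print(guess_decode(b"abc"))        # plain ASCII is valid UTF-8
print(guess_decode(b"caf\xe9"))    # invalid UTF-8; Latin-1 gives "café"
```

Note the ambiguity: some Latin-1 byte sequences are also valid UTF-8,
in which case the UTF-8 interpretation silently wins.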

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: Encoding of 8-bit strings and Python source code
M.-A. Lemburg wrote:
> > and as I've pointed out a zillion times, Python 1.6a2 doesn't.
>
> Just a side note: we never discussed turning the native
> 8-bit strings into any encoding aware type.

hey, you just argued that we should use UTF-8 because Tcl and
Perl use it, didn't you?

my point is that they don't use it the way Python 1.6a2 uses it,
and that their design is correct, while our design is slightly broken.

so let's fix it !

> Why not name the beast ?! In your proposal, the old 8-bit
> strings simply use Latin-1 as native encoding.

in my proposal, there's an important distinction between character
sets and character encodings. unicode is a character set. latin 1
is one of many possible encodings of (portions of) that set.

maybe it's easier to grok if we get rid of the term "character set"?

http://www.hut.fi/u/jkorpela/chars.html suggests the following
replacements:

character repertoire

    A set of distinct characters.

character code

    A mapping, often presented in tabular form, which defines
    one-to-one correspondence between characters in a character
    repertoire and a set of nonnegative integers.

character encoding

    A method (algorithm) for presenting characters in digital form
    by mapping sequences of code numbers of characters into
    sequences of octets.

now, in my proposal, the *repertoire* contains all characters
described by the unicode standard. the *codes* are defined
by the same standard.

but strings are sequences of characters, not sequences of
octets:

strings have *no* encoding.

(the encoding used for the internal string storage is an
implementation detail).

(but sure, given the current implementation, the internal storage
for an 8-bit string happens to use Latin-1, just as the internal
storage for a 16-bit string happens to use UCS-2 stored in
native byte order. but from the outside, they're just character
sequences).

> The current version doesn't make any encoding assumption as
> long as the 8-bit strings do not get auto-converted. In that case
> they are interpreted as UTF-8 -- which will (usually) fail
> for Latin-1 encoded strings using the 8th bit, but hey, at least
> you get an error message telling you what is going wrong.

sure, but I don't think you get the right message, or that you
get it at the right time. consider this:

if you're going from 8-bit strings to unicode using implicit
conversion, the current design can give you:

"UnicodeError: UTF-8 decoding error: unexpected code byte"

if you go from unicode to 8-bit strings, you'll never get an error.

however, the result is not always a string -- if the unicode string
happened to contain any characters larger than 127, the result
is a binary buffer containing encoded data. you cannot use string
methods on it, you cannot use regular expressions on it. indexing
and slicing won't work.
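(this distinction is still observable in today's Python, where the
encoded result is an explicit bytes object rather than a string;
a sketch:)

```python
s = "naïve"                # 5 characters
data = s.encode("utf-8")   # encoding yields bytes, not a string

print(len(s))              # 5 -- character count
print(len(data))           # 6 -- byte count; "ï" became two bytes
print(type(data).__name__) # bytes: an encoded binary buffer
print(data[2])             # indexing gives an int, not a character
```

the encoded buffer has byte semantics throughout: its length,
indexing and slicing all operate on octets, not on characters.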

unlike earlier versions of Python, and unlike unicode-aware
versions of Tcl and Perl, the fundamental assumption that
a string is a sequence of characters no longer holds.

in my proposal, going from 8-bit strings to unicode always works.
a character is a character, no matter what string type you're using.

however, going from unicode to an 8-bit string may give you an
OverflowError, say:

"OverflowError: unicode character too large to fit in a byte"

the important thing here is that if you don't get an exception, the
result is *always* a string. string methods always work. etc.

[8. Special cases aren't special enough to break the rules.]
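(a sketch of that rule with today's API, where UnicodeEncodeError
plays the role of the proposed OverflowError; the helper name
to_8bit is made up for illustration:)

```python
def to_8bit(s: str) -> bytes:
    """sketch of the rule: characters with code <= 255 fit in a
    byte; anything larger is an error rather than silent mojibake."""
    # latin-1 raises UnicodeEncodeError for any char above U+00FF
    return s.encode("latin-1")

print(to_8bit("grün"))      # every character fits in one byte

try:
    to_8bit("€")            # U+20AC does not fit in a byte
except UnicodeEncodeError as exc:
    print("refused:", exc.reason)
```

the key property: whenever no exception is raised, the result
really is one byte per character, so string-like operations on
the 8-bit side stay meaningful.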

> The key to these problems is using explicit conversions where
> 8-bit strings meet Unicode objects.

yeah, but the flaw in the current design is the implicit conversions,
not the explicit ones.

[2. Explicit is better than implicit.]

(of course, the 8-bit string type also needs an "encode" method
under my proposal, but that's just a detail ;-)

> Some more ideas along the convenience path:
>
> Perhaps changing just the way 8-bit strings are coerced
> to Unicode would help: strings would then be interpreted
> as Latin-1.

ok.

> str(Unicode) and "t" would still return UTF-8 to assure loss-
> less conversion.

maybe.

or maybe str(Unicode) should return a unicode string?

think about it!

(after all, I'm pretty sure that ord() and chr() should do the right
thing, also for character codes above 127)

> Another way to tackle this would be to first try UTF-8
> conversion during auto-conversion and then fallback to
> Latin-1 in case it fails. Has anyone tried this ? Guido
> mentioned that TCL does something along these lines...

haven't found any traces of that in the source code. hmm, you're
right -- it looks like it attempts to "fix" invalid UTF-8 data (on a
character-by-character basis) instead of choking on it. scary.

[12. In the face of ambiguity, refuse the temptation to guess.]

more tomorrow.

</F>
Re: Encoding of 8-bit strings and Python source code
[Fredrik]
> -- my proposal: expose both types, but let them contain characters
> from the same character set -- at least when used as strings.
>
> as before, 8-bit strings can be used to store binary data, so we
> don't need a separate ByteArray type. in an 8-bit string, there's
> always one character per byte.
>
> [.imho: small changes to the existing code base, about as efficient as
> can be, no attempt to second-guess the user, fully backwards com-
> patible, fully compliant with the definition of strings in the language
> reference, patches are available, etc...]

Sorry, all this proposal does is change the default encoding on
conversions from UTF-8 to Latin-1. That's very
western-culture-centric.

You already have control over the encoding: use unicode(s,
"latin-1"). If there are places where you don't have enough control
(e.g. file I/O), let's add control there.
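The point in today's terms: the encoding is available as an explicit
parameter both on conversions and at the I/O boundary. A sketch
(using the modern bytes/str API, which postdates this thread):

```python
import io

# Explicit per-call control over the encoding.
raw = b"K\xf8benhavn"            # Latin-1 bytes for "København"
print(raw.decode("latin-1"))     # explicit choice, no default involved

# The same control at the I/O boundary; Python eventually grew an
# encoding parameter on text streams for exactly this reason.
buf = io.BytesIO()
wrapper = io.TextIOWrapper(buf, encoding="latin-1")
wrapper.write("København")
wrapper.flush()
print(buf.getvalue())            # the Latin-1 encoded bytes
```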

--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: Encoding of 8-bit strings and Python source code
> Sorry, all this proposal does is change the default encoding on
> conversions from UTF-8 to Latin-1. That's very
> western-culture-centric.

That decision was made by ISO and the Unicode consortium, not
me. I don't know why, and I don't really care -- I'm arguing that
strings should contain characters, just like the language reference
says, and that all characters should be from the same character
repertoire and use the same character codes.

From the user's perspective, that's the way it's done in Perl, Tcl,
XML, Java, and Windows.

But alright, I give up. I've wasted way too much time on this, my
patches were rejected, and nobody seems to care. Not exactly
inspiring.

</F>
RE: Encoding of 8-bit strings and Python source code
[/F]
> ...
> But alright, I give up. I've wasted way too much time on this, my
> patches were rejected, and nobody seems to care. Not exactly
> inspiring.

I lost track of this stuff months ago, and since I use only 7-bit ASCII in
my own source code and file names and etc etc, UTF-8 and Latin-1 are
identical to me <0.5 wink>.

[Guido]
> Sorry, all this proposal does is change the default encoding on
> conversions from UTF-8 to Latin-1. That's very
> western-culture-centric.

Well, if you talk with an Asian, they'll probably tell you that Unicode
itself is Eurocentric, and especially UTF-8 (UTF-7 introduces less bloat for
non-Latin-1 Unicode characters). Most everyone likes their own national
gimmicks best. Or, as Andy once said (paraphrasing), the virtue of UTF-8 is
that it annoys everyone.

I do expect that the vast bulk of users would be less surprised if Latin-1
*were* the default encoding. Then the default would be usable as-is for
many more people; UTF-8 is usable as-is only for me (i.e., 7-bit Americans).
The non-Euros are in for a world of pain no matter what.

just-because-some-groups-can't-win-doesn't-mean-everyone-must-
lose-ly y'rs - tim
Re: Encoding of 8-bit strings and Python source code
Tim Peters wrote:
>
> [Guido about going Latin-1]
> > Sorry, all this proposal does is change the default encoding on
> > conversions from UTF-8 to Latin-1. That's very
> > western-culture-centric.
>
> Well, if you talk with an Asian, they'll probably tell you that Unicode
> itself is Eurocentric, and especially UTF-8 (UTF-7 introduces less bloat for
> non-Latin-1 Unicode characters). Most everyone likes their own national
> gimmicks best. Or, as Andy once said (paraphrasing), the virtue of UTF-8 is
> that it annoys everyone.
>
> I do expect that the vast bulk of users would be less surprised if Latin-1
> *were* the default encoding. Then the default would be usable as-is for
> many more people; UTF-8 is usable as-is only for me (i.e., 7-bit Americans).
> The non-Euros are in for a world of pain no matter what.
>
> just-because-some-groups-can't-win-doesn't-mean-everyone-must-
> lose-ly y'rs - tim

People tend to forget that UTF-8 is a loss-less Unicode
encoding while Latin-1 reduces Unicode to its lower 8 bits:
conversion from non-Latin-1 Unicode to strings would simply
not work, conversion from non-Latin-1 strings to Unicode
would only be possible via unicode().
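The loss-lessness claim is easy to verify with today's codecs:
UTF-8 round-trips any Unicode string, while Latin-1 only covers the
first 256 code points. An illustrative sketch:

```python
s = "日本語 + Ärger"        # mixes CJK and Latin-1 characters

# UTF-8 round-trips everything.
assert s.encode("utf-8").decode("utf-8") == s

# Latin-1 round-trips only its own repertoire (U+0000..U+00FF).
ok = "Ärger"
assert ok.encode("latin-1").decode("latin-1") == ok

try:
    s.encode("latin-1")     # CJK characters don't fit in 8 bits
except UnicodeEncodeError:
    print("CJK characters do not survive a Latin-1 round trip")
```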

Thus mixing Unicode and strings would then run perfectly in all
western countries using Latin-1 while the rest of the
world would need to convert all their strings to Unicode...
giving them an advantage over the western world we couldn't
possibly accept ;-)

FYI, here's a summary of which conversions take place (going Latin-1
would disable most of the Unicode integration in favour of conversion
errors):

Python:
-------
string + unicode:       unicode(string,'utf-8') + unicode
string.method(unicode): unicode(string,'utf-8').method(unicode)
print unicode:          print unicode.encode('utf-8'); with stdout
                        redirection this can be changed to any
                        other encoding
str(unicode):           unicode.encode('utf-8')
repr(unicode):          repr(unicode.encode('unicode-escape'))


C (PyArg_ParseTuple):
----------------------
"s" + unicode: same as "s" + unicode.encode('utf-8')
"s#" + unicode: same as "s#" + unicode.encode('unicode-internal')
"t" + unicode: same as "t" + unicode.encode('utf-8')
"t#" + unicode: same as "t#" + unicode.encode('utf-8')

This affects all C modules and builtins. In case a C module
wants to receive a certain predefined encoding, it can
use the new "es" and "es#" parser markers.


Ways to enter Unicode:
----------------------
u'' + string                same as unicode(string,'utf-8')
unicode(string,encname)     any supported encoding
u'...unicode-escape...'     unicode-escape currently accepts
                            Latin-1 chars as single-char input;
                            using escape sequences any Unicode
                            char can be entered (*)
codecs.open(filename,mode,encname)
                            opens an encoded file for reading
                            and writing Unicode directly
raw_input() + stdin redirection (see one of my earlier posts for code)
                            returns UTF-8 strings based on the
                            input encoding

Hmm, perhaps a codecs.raw_input(encname) which returns Unicode
directly wouldn't be a bad idea either?!

(*) This should probably be changed to be source code
encoding dependent, so that u"...data..." matches
"...data..." in appearance in the Python source code
(see below).


IO:
---
open(file,'w').write(unicode)
    same as open(file,'w').write(unicode.encode('utf-8'))
open(file,'wb').write(unicode)
    same as open(file,'wb').write(unicode.encode('unicode-internal'))
codecs.open(file,'wb',encname).write(unicode)
    same as open(file,'wb').write(unicode.encode(encname))
codecs.open(file,'rb',encname).read()
    same as unicode(open(file,'rb').read(),encname)
stdin + stdout
    can be redirected using StreamRecoders to handle any
    of the supported encodings

The Python parser should probably also be extended to read
encoded Python source code using some hint at the start of
the source file (perhaps only allowing a small subset of the
supported encodings, e.g. ASCII, Latin-1, UTF-8 and UTF-16).
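This is essentially what later happened: Python eventually
standardized a source-encoding declaration near the top of the file
(PEP 263, well after this thread). For illustration, the hint looks
like this:

```python
# -*- coding: utf-8 -*-
# The declaration above tells the tokenizer which encoding the
# source file uses, so non-ASCII literals mean what the author
# sees in the editor. (Syntax per PEP 263, which postdates this
# thread; the variable below is just an example.)
grüße = "Grüße aus Kiel"
print(grüße)
```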


--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/