Mailing List Archive

Re #pragmas in Python source code
M.-A. Lemburg <mal@lemburg.com> wrote:
> > but they won't -- if you don't use an encoding directive, and
> > don't use 8-bit characters in your string literals, everything
> > works as before.
> >
> > (that's why the default is "none" and not "utf-8")
> >
> > if you use 8-bit characters in your source code and wish to
> > add an encoding directive, you need to add the right encoding
> > directive...
>
> Fair enough, but this would render all the auto-coercion
> code currently in 1.6 useless -- all string to Unicode
> conversions would have to raise an exception.

I thought it was rather clear by now that I think the auto-
conversion stuff *is* useless...

but no, that doesn't mean that all string to unicode conversions
need to raise exceptions -- any 8-bit unicode character obviously
fits into a 16-bit unicode character, just like any integer fits in a
long integer.

if you convert the other way, you might get an OverflowError, just
like converting from a long integer to an integer may give you an
exception if the long integer is too large to be represented as an
ordinary integer. after all,

i = int(long(v))

doesn't always raise an exception...
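
to make the analogy concrete, here's a sketch of the proposed
behaviour (not necessarily what the 1.6 alphas do today):

    u = unicode("hello")    # widening: always works, like long(i)
    s = str(u"hello")       # narrowing: works, every char fits in 8 bits
    s = str(u"\u1234")      # narrowing: fails, like int() on a huge long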

> > > > why keep on pretending that strings and strings are two
> > > > different things? it's an artificial distinction, and it only
> > > > causes problems all over the place.
> > >
> > > Sure. The point is that we can't just drop the old 8-bit
> > > strings... not until Py3K at least (and as Fred already
> > > said, all standard editors will have native Unicode support
> > > by then).
> >
> > I discussed that in my original "all characters are unicode
> > characters" proposal. in my proposal, the standard string
> > type will have two roles: a string either contains unicode
> > characters, or binary bytes.
> >
> > -- if it contains unicode characters, python guarantees that
> > methods like strip, lower (etc), and regular expressions work
> > as expected.
> >
> > -- if it contains binary data, you can still use indexing, slicing,
> > find, split, etc. but they then work on bytes, not on chars.
> >
> > it's still up to the programmer to keep track of what a certain
> > string object is (a real string, a chunk of binary data, an en-
> > coded string, a jpeg image, etc). if the programmer wants
> > to convert between a unicode string and an external encoding
> > to use a certain unicode encoding, she needs to spell it out.
> > the codecs are never called "under the hood".
> >
> > (note that if you encode a unicode string into some other
> > encoding, the result is a binary buffer. operations like strip,
> > lower et al do *not* work on encoded strings).
>
> Huh ? If the programmer already knows that a certain
> string uses a certain encoding, then he can just as well
> convert it to Unicode by hand using the right encoding
> name.

I thought that was what I said, but the text was garbled. let's
try again:

if the programmer wants to convert between a unicode
string and a buffer containing encoded text, she needs
to spell it out. the codecs are never called "under the
hood"

> The whole point we are talking about here is that when
> having the implementation convert a string to Unicode all
> by itself it needs to know which encoding to use. This is
> where we have decided long ago that UTF-8 should be
> used.

does "long ago" mean that the decision cannot be
questioned? what's going on here?

face it, I don't want to guess when and how the interpreter
will convert strings for me. after all, this is Python, not Perl.

if I want to convert from a "string of characters" to a byte
buffer using a certain character encoding, let's make that
explicit.

Python doesn't convert between other data types for me, so
why should strings be a special case?

> The pragma discussion is about a totally different
> issue: pragmas could make it possible for the programmer
> to tell the *compiler* which encoding to use for literal
> u"unicode" strings -- nothing more. Since "8-bit" strings
> currently don't have an encoding attached to them we store
> them as-is.

what do I have to do to make you read my proposal?

shout?

okay, I'll try:

THERE SHOULD BE JUST ONE INTERNAL CHARACTER
SET IN PYTHON 1.6: UNICODE.

for consistency, let this be true for both 8-bit and 16-bit
strings (as well as Py3K's 31-bit strings ;-).

there are many possible external string encodings, just like there
are many possible external integer encodings. but for integers,
that's not something that the core implementation cares much
about. why are strings different?

> I don't want to get into designing a completely new
> character container type here... this can all be done for Py3K,
> but not now -- it breaks things at too many ends (even though
> it would solve the issues with strings being used in different
> contexts).

you don't need to -- you only need to define how the *existing*
string type should be used. in my proposal, it can be used in two
ways:

-- as a string of unicode characters (restricted to the
0-255 subset, for obvious reasons). given a string 's',
len(s) is always the number of characters, s[i] is the
i'th character, etc.

or

-- as a buffer containing binary bytes. given a buffer 'b',
len(b) is always the number of bytes, b[i] is the i'th
byte, etc.

this is one flavour less than in the 1.6 alphas -- where strings sometimes
contain UTF-8 (and methods like upper etc don't work), sometimes an
8-bit character set (and upper works), and sometimes binary buffers (for
which upper doesn't work).
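
here's what the two roles look like in practice (a sketch of the
proposed semantics, using a non-ascii character to show the
difference):

    s = u"na\u00efve"       # a string: five unicode characters
    len(s)                  # -> 5
    s.upper()               # works as expected

    b = s.encode("utf-8")   # a buffer: the same text as bytes
    len(b)                  # -> 6 (the i-diaeresis takes two bytes)
    b[2]                    # the third *byte*, not the third character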

(hmm. I've said all this before, haven't I?)

> > > > -- we still need an encoding marker for ascii supersets (how about
> > > > <?python encoding="utf-8" version="1.6"?> ;-). however, it's up to
> > > > the tokenizer to detect that one, not the parser. the parser only
> > > > sees unicode strings.
> > >
> > > Hmm, the tokenizer doesn't do any string -> object conversion.
> > > That's a task done by the parser.
> >
> > "unicode string" meant Py_UNICODE*, not PyUnicodeObject.
> >
> > whether the tokenizer does the actual conversion doesn't really matter;
> > the point is that once the code has passed through the tokenizer,
> > it's unicode.
>
> The tokenizer would have to know which parts of the
> input string to convert to Unicode and which not... plus there
> are different encodings to be applied, e.g. UTF-8, Unicode-Escape,
> Raw-Unicode-Escape, etc.

sigh. why do you insist on taking a very simple thing and making
it very very complicated? will anyone out there ever use an editor
that supports different encodings for different parts of the file?

why not just assume that the *ENTIRE SOURCE FILE* uses a single
encoding, and let the tokenizer (or more likely, a conversion stage
before the tokenizer) convert the whole thing to unicode.

let the rest of the compiler work on Py_UNICODE* strings only, and
all your design headaches will just disappear.

...

frankly, I'm beginning to feel like John Skaller. do I have to write my
own interpreter to get this done right? :-(

</F>
Re: Re #pragmas in Python source code
Fredrik Lundh writes:
> if the programmer wants to convert between a unicode
> string and a buffer containing encoded text, she needs
> to spell it out. the codecs are never called "under the
> hood"

Watching the successive weekly Unicode patchsets, each one fixing some
obscure corner case that turned out to be buggy -- '%s' % ustr,
concatenating literals, int()/float()/long(), comparisons -- I'm
beginning to agree with Fredrik. Automatically making Unicode strings
and regular strings interoperate looks like it requires many changes
all over the place, and I worry whether it's possible to catch them
all in time.

Maybe we should consider being more conservative, and just having the
Unicode built-in type, the unicode() built-in function, and the u"..."
notation, and then leaving all responsibility for conversions up to
the user. On the other hand, *some* default conversion seems needed,
because it seems draconian to make open(u"abcfile") fail with a
TypeError.

(While I want to see Python 1.6 expedited, I'd also not like to see it
saddled with a system that proves to have been a mistake, or one
that's a maintenance burden. If forced to choose between delaying and
getting it right, the latter wins.)

>why not just assume that the *ENTIRE SOURCE FILE* uses a single
>encoding, and let the tokenizer (or more likely, a conversion stage
>before the tokenizer) convert the whole thing to unicode.

To reinforce Fredrik's point here, note that XML only supports
encodings at the level of an entire file (or external entity). You
can't tell an XML parser that a file is in UTF-8, except for this one
element whose contents are in Latin1.

--
A.M. Kuchling http://starship.python.net/crew/amk/
Dream casts a human shadow, when it occurs to him to do so.
-- From SANDMAN: "Season of Mists", episode 0
Re: Re #pragmas in Python source code
Fredrik Lundh wrote:
>
> M.-A. Lemburg <mal@lemburg.com> wrote:
> > > but they won't -- if you don't use an encoding directive, and
> > > don't use 8-bit characters in your string literals, everything
> > > works as before.
> > >
> > > (that's why the default is "none" and not "utf-8")
> > >
> > > if you use 8-bit characters in your source code and wish to
> > > add an encoding directive, you need to add the right encoding
> > > directive...
> >
> > Fair enough, but this would render all the auto-coercion
> > code currently in 1.6 useless -- all string to Unicode
> > conversions would have to raise an exception.
>
> I thought it was rather clear by now that I think the auto-
> conversion stuff *is* useless...
>
> but no, that doesn't mean that all string to unicode conversions
> need to raise exceptions -- any 8-bit unicode character obviously
> fits into a 16-bit unicode character, just like any integer fits in a
> long integer.
>
> if you convert the other way, you might get an OverflowError, just
> like converting from a long integer to an integer may give you an
> exception if the long integer is too large to be represented as an
> ordinary integer. after all,
>
> i = int(long(v))
>
> doesn't always raise an exception...

This is exactly the same as proposing to change the default
encoding to Latin-1.

I don't have anything against that (being a native Latin-1
user :), but I would assume that other native language
writers surely do: e.g. all the programmers not using Latin-1
as their native encoding (and there are lots of them).
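
To see why the two are the same: taking each 8-bit string element
to be the Unicode character with the same ordinal is precisely what
the Latin-1 codec does (a sketch):

    s = "\xe9"                  # one byte, value 0xE9
    u = unicode(s, "latin-1")   # -> u"\xe9" (e with acute accent)
    assert ord(s[0]) == ord(u[0])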

> > > > > why keep on pretending that strings and strings are two
> > > > > different things? it's an artificial distinction, and it only
> > > > > causes problems all over the place.
> > > >
> > > > Sure. The point is that we can't just drop the old 8-bit
> > > > strings... not until Py3K at least (and as Fred already
> > > > said, all standard editors will have native Unicode support
> > > > by then).
> > >
> > > I discussed that in my original "all characters are unicode
> > > characters" proposal. in my proposal, the standard string
> > > type will have two roles: a string either contains unicode
> > > characters, or binary bytes.
> > >
> > > -- if it contains unicode characters, python guarantees that
> > > methods like strip, lower (etc), and regular expressions work
> > > as expected.
> > >
> > > -- if it contains binary data, you can still use indexing, slicing,
> > > find, split, etc. but they then work on bytes, not on chars.
> > >
> > > it's still up to the programmer to keep track of what a certain
> > > string object is (a real string, a chunk of binary data, an en-
> > > coded string, a jpeg image, etc). if the programmer wants
> > > to convert between a unicode string and an external encoding
> > > to use a certain unicode encoding, she needs to spell it out.
> > > the codecs are never called "under the hood".
> > >
> > > (note that if you encode a unicode string into some other
> > > encoding, the result is a binary buffer. operations like strip,
> > > lower et al do *not* work on encoded strings).
> >
> > Huh ? If the programmer already knows that a certain
> > string uses a certain encoding, then he can just as well
> > convert it to Unicode by hand using the right encoding
> > name.
>
> I thought that was what I said, but the text was garbled. let's
> try again:
>
> if the programmer wants to convert between a unicode
> string and a buffer containing encoded text, she needs
> to spell it out. the codecs are never called "under the
> hood"

Again and again...

The original intent of the Unicode integration was to make
Unicode and 8-bit strings interoperate without too much user
intervention. This comes at a cost (the UTF-8 encoding), but
if you do use this encoding (and that is not far-fetched,
since there are input sources which do return UTF-8, e.g.
Tcl), the Unicode implementation will apply all its knowledge
to keep you satisfied.

If you don't like this, you can always apply explicit
conversion calls wherever needed. Latin-1 and UTF-8 are not
compatible, so the conversion is very likely to cause an
exception, and the user will indeed be informed about this
failure.
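
For example, with the current UTF-8 default (a sketch):

    s = "\xe4\xf6\xfc"   # "äöü" as Latin-1 bytes
    u = unicode(s)       # implicit UTF-8 decode
    # -> UnicodeError: these bytes are not a valid UTF-8 sequence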

> > The whole point we are talking about here is that when
> > having the implementation convert a string to Unicode all
> > by itself it needs to know which encoding to use. This is
> > where we have decided long ago that UTF-8 should be
> > used.
>
> does "long ago" mean that the decision cannot be
> questioned? what's going on here?
>
> face it, I don't want to guess when and how the interpreter
> will convert strings for me. after all, this is Python, not Perl.
>
> if I want to convert from a "string of characters" to a byte
> buffer using a certain character encoding, let's make that
> explicit.

Hey, there's nothing which prevents you from doing so
explicitly.

> Python doesn't convert between other data types for me, so
> why should strings be a special case?

Sure it does: 1.5 + 2 == 3.5, 2L + 3 == 5L, etc...

> > The pragma discussion is about a totally different
> > issue: pragmas could make it possible for the programmer
> > to tell the *compiler* which encoding to use for literal
> > u"unicode" strings -- nothing more. Since "8-bit" strings
> > currently don't have an encoding attached to them we store
> > them as-is.
>
> what do I have to do to make you read my proposal?
>
> shout?
>
> okay, I'll try:
>
> THERE SHOULD BE JUST ONE INTERNAL CHARACTER
> SET IN PYTHON 1.6: UNICODE.

Please don't shout... simply read on...

Note that you are again arguing for using Latin-1 as
default encoding -- why don't you simply make this fact
explicit ?

> for consistency, let this be true for both 8-bit and 16-bit
> strings (as well as Py3K's 31-bit strings ;-).
>
> there are many possible external string encodings, just like there
> are many possible external integer encodings. but for integers,
> that's not something that the core implementation cares much
> about. why are strings different?
>
> > I don't want to get into designing a completely new
> > character container type here... this can all be done for Py3K,
> > but not now -- it breaks things at too many ends (even though
> > it would solve the issues with strings being used in different
> > contexts).
>
> you don't need to -- you only need to define how the *existing*
> string type should be used. in my proposal, it can be used in two
> ways:
>
> -- as a string of unicode characters (restricted to the
> 0-255 subset, for obvious reasons). given a string 's',
> len(s) is always the number of characters, s[i] is the
> i'th character, etc.
>
> or
>
> -- as a buffer containing binary bytes. given a buffer 'b',
> len(b) is always the number of bytes, b[i] is the i'th
> byte, etc.
>
> this is one flavour less than in the 1.6 alphas -- where strings sometimes
> contain UTF-8 (and methods like upper etc don't work), sometimes an
> 8-bit character set (and upper works), and sometimes binary buffers (for
> which upper doesn't work).

Strings always contain data -- there's no encoding attached
to them. If the user calls .upper() on a binary string, the
output will most probably no longer be usable... but that's
the programmer's fault, not the string type's fault.

> (hmm. I've said all this before, haven't I?)

You know as well as I do that the existing string type
is used for both binary and text data. You cannot simply change
this by introducing some new definition of what should
be stored in buffers and what in strings... not until we
officially redefine these things, say in Py3K ;-)

> frankly, I'm beginning to feel like John Skaller. do I have to write my
> own interpreter to get this done right? :-(

No, but you should have started this discussion in late
November last year... not now, when everything has already
been implemented and people are starting to use the
code that's there with great success.

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: Re #pragmas in Python source code
"Andrew M. Kuchling" wrote:
>
> >why not just assume that the *ENTIRE SOURCE FILE* uses a single
> >encoding, and let the tokenizer (or more likely, a conversion stage
> >before the tokenizer) convert the whole thing to unicode.
>
> To reinforce Fredrik's point here, note that XML only supports
> encodings at the level of an entire file (or external entity). You
> can't tell an XML parser that a file is in UTF-8, except for this one
> element whose contents are in Latin1.

Hmm, this would mean that someone who writes:

"""
#pragma script-encoding utf-8

u = u"\u1234"
print u
"""

would suddenly see "\u1234" as output. If that's ok, fine with me...
it would make things easier on the compiler side (even though
I'm pretty sure that people won't like this).

BTW: I will be offline for the next week... I'm looking forward
to seeing where this discussion will be heading.

Have fun,
--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
RE: Re #pragmas in Python source code
I can see the dilemma, but...

> Maybe we should consider being more conservative, and just having
> the Unicode built-in type, the unicode() built-in function, and the
> u"..." notation, and then leaving all responsibility for conversions
> up to the user.

Win32 and COM have been doing exactly this for the last couple of
years. And it sucked.

> On the other hand, *some* default conversion seems needed, because
> it seems draconian to make open(u"abcfile") fail with a TypeError.

For exactly this reason. The end result is that the first thing you
ever do with a Unicode object is convert it to a string.


> (While I want to see Python 1.6 expedited, I'd also not like to see
> it saddled with a system that proves to have been a mistake, or one
> that's a maintenance burden. If forced to choose between delaying
> and getting it right, the latter wins.)

Agreed. I thought this implementation stemmed from Guido's desire
to do it this way in the 1.x family, and move towards Fredrik's
proposal for Py3k.

As a general comment:

I'm a little confused and disappointed here. We are all bickering
like children while our parents are away. All we are doing is
creating a _huge_ pile of garbage for Guido to ignore when he
returns.

We are going to be presenting Guido with around 400 messages at my
estimate. He can't possibly read them all. So the end result is
that all the posturing and flapping going on here is for naught, and
he is just going to do whatever he wants anyway - as he always has
done, and as has worked so well for Python.

Sheesh - we should all consider how we can be the most effective,
not the most loud or aggressive!

Mark.
Re: Re #pragmas in Python source code
Mark Hammond wrote:
>
> I thought this implementation stemmed from Guido's desire
> to do it this way in the 1.x family, and move towards Fredrik's
> proposal for Py3k.

Right. Let's do this step by step and get some experience first.
With that gained experience we can still polish up the design
towards a compromise which best suits all our needs.

The integration of Unicode into Python is comparable to the
addition of floats to an interpreter which previously only
understood integers -- things are obviously going to be a
little different than before. Our goal should be to make
it as painless as possible and at least IMHO this can
only be achieved by gaining practical experience in this new
field first.

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: #pragmas in Python source code
> This is exactly the same as proposing to change the default
> encoding to Latin-1.

no, it isn't. here's what I'm proposing:

-- the internal character set is unicode, and nothing but
unicode. in 1.6, this applies to strings. in 1.7 or later,
it applies to source code as well.

-- the default source encoding is "unknown"

-- there is no other default encoding. all strings use the
unicode character set.

to give you some background, let's look at section 3.2 of
the existing language definition:

[Sequences] represent finite ordered sets indexed
by natural numbers.

The built-in function len() returns the number of
items of a sequence.

When the length of a sequence is n, the index set
contains the numbers 0, 1, ..., n-1.

Item i of sequence a is selected by a[i].

An object of an immutable sequence type cannot
change once it is created.

The items of a string are characters.

There is no separate character type; a character is
represented by a string of one item.

Characters represent (at least) 8-bit bytes.

The built-in functions chr() and ord() convert between
characters and nonnegative integers representing the
byte values.

Bytes with the values 0-127 usually represent the corresponding
ASCII values, but the interpretation of values is up to the
program.

The string data type is also used to represent arrays
of bytes, e.g., to hold data read from a file.

(in other words, given a string s, len(s) is the number of characters
in the string. s[i] is the i'th character. len(s[i]) is 1. etc. the
existing string type doubles as byte arrays, where given an array
b, len(b) is the number of bytes, b[i] is the i'th byte, etc).

my proposal boils down to a few small changes to the last three
sentences in the definition. basically, change "byte value" to
"character code" and "ascii" to "unicode":

The built-in functions chr() and ord() convert between
characters and nonnegative integers representing the
character codes.

Character codes usually represent the corresponding
unicode values.

The 8-bit string data type is also used to represent arrays
of bytes, e.g., to hold data read from a file.

that's all. the rest follows from this.
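
concretely, under the revised definition (a sketch):

    c = chr(228)   # the character with code 228
    ord(c)         # -> 228, now a unicode character code
                   # (latin small letter a with diaeresis),
                   # not just "some byte value"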

...

just a few quickies to sort out common misconceptions:

> I don't have anything against that (being a native Latin-1
> user :), but I would assume that other native language
> writer sure do: e.g. all programmers not using Latin-1
> as native encoding (and there are lots of them).

the unicode folks have already made that decision. I find it
very strange that we should use *another* model for the
first 256 characters, just to "equally annoy everyone".

(if people have a problem with the first 256 unicode characters
having the same internal representation as the ISO 8859-1 set,
tell them to complain to the unicode folks).

> (and that is not far-fetched, since there are input sources
> which do return UTF-8, e.g. Tcl), the Unicode implementation
> will apply all its knowledge to keep you satisfied.

there are all sorts of input sources. major platforms like
windows and java use 16-bit unicode.

and Tcl has an internal unicode string type, since they
realized that storing UTF-8 in 8-bit strings was horridly
inefficient (they tried to do it right, of course). the
internal type looks like this:

typedef unsigned short Tcl_UniChar;

typedef struct String {
    int numChars;            /* number of characters in the string */
    size_t allocated;        /* space allocated for the UTF-8 rep */
    size_t uallocated;       /* space allocated for the unicode rep */
    Tcl_UniChar unicode[2];  /* the character data (grows as needed) */
} String;

(Tcl uses dual-ported objects, where each object can
have a UTF-8 string representation in addition to the
internal representation. if you change one of them, the
other is recalculated on demand)

in fact, it's Tkinter that converts the return value to
UTF-8, not Tcl. that can be fixed.

> > Python doesn't convert between other data types for me, so
> > why should strings be a special case?
>
> Sure it does: 1.5 + 2 == 3.5, 2L + 3 == 5L, etc...

but that's the key point: 2L and 3 are both integers, from the
same set of integers. if you convert a long integer to an integer,
it still contains an integer from the same set.

(maybe someone can fill me in here: what's the formally
correct word here? set? domain? category? universe?)

also, if you convert every item in a sequence of long integers to
ordinary integers, all items are still members of the same integer
set.

in contrast, the UTF-8 design converts between strings of
characters, and arrays of bytes.

unless you change the 8-bit string type to know about UTF-8,
that means that you change string items from one domain
(characters) to another (bytes).
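
here's the domain change in two lines (a sketch):

    u = u"\u00e4"           # one character
    b = u.encode("utf-8")   # two bytes: "\xc3\xa4"
    len(u), len(b)          # -> 1, 2. same text, different domains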

> Note that you are again arguing for using Latin-1 as
> default encoding -- why don't you simply make this fact
> explicit ?

nope. I'm standardizing on a character set, not an encoding.

character sets are mappings between integers and characters.
in this case, we use the unicode character set.

encodings are ways to store strings of text as bytes in a byte
array.

> not now, when everything has already been implemented and
> people are starting to use the code that's there with great
> success.

the positive reports I've seen all rave about the codec framework.
that's a great piece of work. without that, it would have
been impossible to do what I'm proposing. (so what are you
complaining about? it's all your fault -- if you hadn't done such
a great job on that part of the code, I wouldn't have noticed
the warts ;-)

if you look at my proposal from a little distance, you'll realize
that it doesn't really change much. all that needs to be done
is to change some of the conversion stuff. if we decide to
do this, I can do the work for you, free of charge.

</F>
Re: #pragmas in Python source code
M.-A. Lemburg wrote:
> Right. Let's do this step by step and get some experience first.
> With that gained experience we can still polish up the design
> towards a compromise which best suits all our needs.

so practical experience from other languages, other designs,
and playing with the python alphas doesn't count?

> The integration of Unicode into Python is comparable to the
> addition of floats to an interpreter which previously only
> understood integers.

use "long integers" instead of "floats", and you'll get closer to
the actual case.

but where's the problem? python has solved this problem for
numbers, and what's more important: the language reference
tells us how strings are supposed to work:

"The items of a string are characters." (see previous mail)

"Strings are compared lexicographically using the numeric
equivalents (the result of the built-in function ord()) of
their characters."

this solves most of the issues. to handle the rest, look at the
language reference description of integer:

[Integers] represent elements from the mathematical set
of whole numbers.

Borrowing the "elements from a single set" concept, define
characters as

Characters represent elements from the unicode character
set.

and let all mixed-string operations use string coercion, just like
numbers.

can it be much simpler?
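
in code, the coercion would work just like it does for numbers
(a sketch of the proposed behaviour):

    1 + 2L             # int coerced to long -> 3L
    "abc" + u"def"     # 8-bit string coerced to unicode -> u"abcdef"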

</F>
Re: #pragmas in Python source code
M.-A. Lemburg wrote:
> > To reinforce Fredrik's point here, note that XML only supports
> > encodings at the level of an entire file (or external entity). You
> > can't tell an XML parser that a file is in UTF-8, except for this one
> > element whose contents are in Latin1.
>
> Hmm, this would mean that someone who writes:
>
> """
> #pragma script-encoding utf-8
>
> u = u"\u1234"
> print u
> """
>
> would suddenly see "\u1234" as output.

not necessarily. consider this XML snippet:

<?xml version='1.0' encoding='utf-8'?>
<body>&#x1234;</body>

if I run this through an XML parser and write it
out as UTF-8, I get:

<body>á^´</body>

in other words, the parser processes "&#x" after
decoding to unicode, not before.

I see no reason why Python cannot do the same.
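
(you can check this with any conforming parser. a sketch, using a
modern library that obviously wasn't around in 1.6:)

    from xml.etree import ElementTree as ET
    body = ET.fromstring("<body>&#x1234;</body>")
    body.text   # -> u'\u1234': the character reference is resolved
                # after the encoding layer, as one character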

</F>