Mailing List Archive: [I18n-sig] Re: Unicode debate

Just van Rossum writes:
> How will other parts of a program know which encoding was used for
> non-unicode string literals?

This is the exact reason that Unicode should be used for all string
literals: from a language design perspective I don't understand the
rationale for providing "traditional" and "unicode" strings.

> It seems to me that an encoding attribute for 8-bit strings solves this
> nicely. The attribute should only be set automatically if the encoding of
> the source file was specified or when the string has been encoded from a
> unicode string. The attribute should *only* be used when converting to
> unicode. (Hm, it could even be used when calling unicode() without the
> encoding argument.) It should *not* be used when comparing (or adding,
> etc.) 8-bit strings to each other, since they still may contain binary
> goop, even in a source file with a specified encoding!
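
As an illustration of the proposal quoted above, here is a minimal
sketch in present-day Python. The class name TaggedBytes and its
methods from_text and to_text are hypothetical, invented for this
example; nothing like them is part of the proposal or of Python
itself. The tag is consulted only when converting to text, and
comparison and concatenation ignore it.

```python
class TaggedBytes(bytes):
    """Hypothetical 8-bit string that remembers its encoding, if known."""

    def __new__(cls, data, encoding=None):
        obj = super().__new__(cls, data)
        # None means "unknown": the bytes may be arbitrary binary data.
        obj.encoding = encoding
        return obj

    @classmethod
    def from_text(cls, text, encoding):
        # The tag is set automatically when encoding from a text string.
        return cls(text.encode(encoding), encoding)

    def to_text(self, encoding=None):
        # The tag is used only here, when converting back to text.
        enc = encoding or self.encoding
        if enc is None:
            raise ValueError("no encoding known for this byte string")
        return self.decode(enc)


tagged = TaggedBytes.from_text("caf\u00e9", "latin-1")
round_trip = tagged.to_text()

# Comparison looks at the raw bytes only; the tag plays no part.
same_bytes = tagged == "caf\u00e9".encode("latin-1")

# Concatenation yields plain bytes, so the tag is dropped, not guessed.
joined = tagged + b"\x00\xff"
```

Dropping the tag on concatenation is one possible reading of "should
not be used when adding"; other designs could keep it when both
operands agree.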

In Dylan there is an explicit split between 'characters' (which are
always Unicode) and 'bytes'.

What are the compelling reasons to not use UTF-8 as the (source)
document encoding? In the past the usual response has been, "the tools
aren't there for authoring UTF-8 documents". This argument becomes more
specious as more OSes move towards Unicode. I firmly believe this can
be done without Java's bloat.

One off-the-cuff solution is this:

All character strings are Unicode (UTF-8 encoding). Language terminals
and operators are restricted to US-ASCII, whose characters are encoded
identically in UTF-8. The contents of comments are not interpreted in
any way.
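
The property this relies on, that US-ASCII text has the same bytes in
UTF-8, can be checked with a short sketch in present-day Python. The
helper name same_in_ascii_and_utf8 is made up for this example.

```python
def same_in_ascii_and_utf8(text):
    """Return True if text is pure ASCII and its UTF-8 bytes match its ASCII bytes."""
    try:
        ascii_bytes = text.encode("ascii")
    except UnicodeEncodeError:
        # Not representable in US-ASCII at all.
        return False
    return ascii_bytes == text.encode("utf-8")


# Keywords and operators drawn from US-ASCII are byte-for-byte identical.
keywords_ok = same_in_ascii_and_utf8("if x == 1: return y")

# A non-ASCII character takes more than one byte in UTF-8, and none of
# those bytes falls in the ASCII range, so it cannot be mistaken for an
# operator or keyword character.
encoded = "\u00ef".encode("utf-8")
high_bytes_only = all(b >= 0x80 for b in encoded)
```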

> >- We need a way to indicate the encoding of input and output data
> >files, and we need shortcuts to set the encoding of stdin, stdout and
> >stderr (and maybe all files opened without an explicit encoding).
>
> Can you open a file *with* an explicit encoding?

If you cannot, you lose. You absolutely must be able to specify the
encoding of a file when opening it, so that the runtime can transcode
into the native encoding as you read it. This should otherwise be
transparent to the user.
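
A minimal sketch of what that looks like, assuming present-day Python,
where the built-in open() accepts an encoding argument (this is not
the Python 1.6 interface under discussion in this thread). The helper
name read_text and the sample data are made up for the example.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a whole file, decoding its bytes from the given encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Write a few Latin-1 bytes to a temporary file to stand in for real input.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write("caf\u00e9".encode("latin-1"))

    # The caller names the encoding once, at open time; after that the
    # program only ever sees decoded text.
    text = read_text(path, "latin-1")
finally:
    os.remove(path)
```

Naming the wrong encoding either raises an error or silently yields
the wrong characters, which is why the encoding has to be stated
rather than guessed.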

-tree

--
Tom Emerson Basis Technology Corp.
Language Hacker http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"

> This is the exact reason that Unicode should be used for all string
> literals: from a language design perspective I don't understand the
> rationale for providing "traditional" and "unicode" strings.

In Python 3000, you would have a point. In current Python, there
simply are too many programs and extensions written in other languages
that manipulate 8-bit strings to ignore their existence. We're
trying to add Unicode support to Python 1.6 without breaking code that
used to run under Python 1.5.x; practicalities just make it impossible
to go with Unicode for everything.

I think that if Python didn't have so many extension modules (many
maintained by third parties) it would be a lot easier to switch to
Unicode for all strings (I think JavaScript has done this).

In Python 3000, we'll have to seriously consider having separate
character string and byte array objects, along the lines of Java's
model. Note that I say "seriously consider." We'll first have to see
how well the current solution works *in practice*. There's time
before we fix Py3k in stone. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)