Mailing List Archive

Minidom and Unicode
While trying the minidom parser from the current CVS, I found that
repr apparently does not work for nodes:

Python 2.0b1 (#29, Jun 30 2000, 10:48:11) [GCC 2.95.2 19991024 (release)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
Copyright 1995-2000 Corporation for National Research Initiatives (CNRI)
>>> from xml.dom.minidom import parse
>>> d=parse("/usr/src/python/Doc/tools/sgmlconv/conversion.xml")
>>> d.childNodes
[Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: __repr__ returned non-string (type unicode)

The problem here is that __repr__ is computed as

def __repr__( self ):
    return "<DOM Element:"+self.tagName+" at "+`id( self )` +" >"

and that self.tagName is u'conversion', so the resulting string is a
unicode string.

I'm not sure whose fault that is: either __repr__ should accept
unicode strings, or minidom.Element.__repr__ should be changed to
return a plain string, e.g. by converting tagname to UTF-8. In any
case, I believe __repr__ should 'work' for these objects.
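The failure is easy to reproduce in isolation. The sketch below is written for present-day Python 3, which is an assumption (the thread concerns Python 2.0b1): the check is the same, but the roles are swapped, so `__repr__` must return `str` and `bytes` is the rejected type. The names `Node` and `try_repr` are hypothetical.

```python
class Node:
    """Hypothetical stand-in for a DOM node with a misbehaving __repr__."""

    def __init__(self, tag_name):
        self.tagName = tag_name

    def __repr__(self):
        # Returns bytes, which is not the string type that repr() insists on.
        return ("<DOM Element:" + self.tagName + ">").encode("utf-8")


def try_repr(obj):
    """Return repr(obj), or the TypeError message if repr() rejects the result."""
    try:
        return repr(obj)
    except TypeError as exc:
        return "TypeError: %s" % exc


message = try_repr(Node("conversion"))
print(message)
```

As in the session above, the error comes from the interpreter's check on the return type, not from anything minidom-specific.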

Regards,
Martin
Re: Minidom and Unicode
"Martin v. Loewis" wrote:
>
> While trying the minidom parser from the current CVS, I found that
> repr apparently does not work for nodes:
>
> Python 2.0b1 (#29, Jun 30 2000, 10:48:11) [GCC 2.95.2 19991024 (release)] on linux2
> Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> Copyright 1995-2000 Corporation for National Research Initiatives (CNRI)
> >>> from xml.dom.minidom import parse
> >>> d=parse("/usr/src/python/Doc/tools/sgmlconv/conversion.xml")
> >>> d.childNodes
> [Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: __repr__ returned non-string (type unicode)
>
> The problem here is that __repr__ is computed as
>
> def __repr__( self ):
>     return "<DOM Element:"+self.tagName+" at "+`id( self )` +" >"
>
> and that self.tagName is u'conversion', so the resulting string is a
> unicode string.
>
> I'm not sure whose fault that is: either __repr__ should accept
> unicode strings, or minidom.Element.__repr__ should be changed to
> return a plain string, e.g. by converting tagname to UTF-8. In any
> case, I believe __repr__ should 'work' for these objects.

Note that __repr__ has to return a string object (and IIRC
this is checked in object.c or abstract.c). The correct way
to get there is to simply return str(...) or to have a
switch on the type of self.tagName and then call .encode().

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: Minidom and Unicode
> Note that __repr__ has to return a string object (and IIRC
> this is checked in object.c or abstract.c). The correct way
> to get there is to simply return str(...) or to have a
> switch on the type of self.tagName and then call .encode().

Ok. I believe tagName will be always a Unicode object (as mandated by
the DOM API), so I propose patch 100706
(http://sourceforge.net/patch/?func=detailpatch&patch_id=100706&group_id=5470)

Regards,
Martin
Re: Minidom and Unicode
mal wrote:
> > I'm not sure whose fault that is: either __repr__ should accept
> > unicode strings, or minidom.Element.__repr__ should be changed to
> > return a plain string, e.g. by converting tagname to UTF-8. In any
> > case, I believe __repr__ should 'work' for these objects.
>
> Note that __repr__ has to return a string object (and IIRC
> this is checked in object.c or abstract.c). The correct way
> to get there is to simply return str(...) or to have a
> switch on the type of self.tagName and then call .encode().

assuming that the goal is to get rid of this restriction in future
versions (a string is a string is a string), how about special-
casing this in PyObject_Repr:

    PyObject *res;
    res = (*v->ob_type->tp_repr)(v);
    if (res == NULL)
        return NULL;
---
    if (PyUnicode_Check(res)) {
        PyObject* str;
        str = PyUnicode_AsEncodedString(res, NULL, NULL);
        if (str) {
            Py_DECREF(res);
            res = str;
        }
    }
---
    if (!PyString_Check(res)) {
        PyErr_Format(PyExc_TypeError,
                     "__repr__ returned non-string (type %.200s)",
                     res->ob_type->tp_name);
        Py_DECREF(res);
        return NULL;
    }
    return res;

in this way, people can "do the right thing" in their code,
and have it work better in future versions...

(just say "+1", and the mad patcher will update the repository)
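The control flow of the proposed change can be modelled in a few lines of Python. This is only a sketch under stated assumptions: it is written for present-day Python 3, with `str` standing in for the Unicode object and `bytes` for the 8-bit string, and the names `patched_repr`, `Plain`, and `Katakana` are hypothetical. The `default_encoding` parameter stands in for the `NULL` encoding argument in the C code, and its `"ascii"` default is an assumption.

```python
def patched_repr(obj, default_encoding="ascii"):
    """Model of the proposed PyObject_Repr special case."""
    res = type(obj).__repr__(obj)
    if isinstance(res, str):
        # Try to convert the Unicode result using the default encoding.
        try:
            res = res.encode(default_encoding)
        except UnicodeEncodeError:
            # Like the C code, keep the unconverted result and fall through.
            pass
    if not isinstance(res, bytes):
        raise TypeError(
            "__repr__ returned non-string (type %s)" % type(res).__name__
        )
    return res


class Plain:
    def __repr__(self):
        return "hallo"


class Katakana:
    def __repr__(self):
        return "\u30b9"


plain_result = patched_repr(Plain())
utf8_result = patched_repr(Katakana(), "utf-8")
```

In this model the conversion only helps when the default encoding can represent the result: `patched_repr(Katakana())` with the `"ascii"` default still ends in the `TypeError` branch.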

</F>
Re: Minidom and Unicode
"M.-A. Lemburg" wrote:
>
> Note that __repr__ has to return a string object (and IIRC
> this is checked in object.c or abstract.c). The correct way
> to get there is to simply return str(...) or to have a
> switch on the type of self.tagName and then call .encode().
> ...

I prefer the former solution and unless someone screams I will check
that in in a few hours.

Why can't repr have a special case that converts Unicode strings to
"Python strings" automatically? This case is going to byte other people.

> Ok. I believe tagName will be always a Unicode object (as mandated by
> the DOM API), so I propose patch 100706
> (http://sourceforge.net/patch/?func=detailpatch&patch_id=100706&group_id=5470)

I would like Unicode usage to be a userland option for reasons of
performance and backwards compatibility.

--
Paul Prescod - Not encumbered by corporate consensus
The calculus and the rich body of mathematical analysis to which it
gave rise made modern science possible, but it was the algorithm that
made the modern world possible.
- The Advent of the Algorithm (pending), by David Berlinski
Re: Minidom and Unicode
Fredrik Lundh wrote:
>
> ...
>
> assuming that the goal is to get rid of this restriction in future
> versions (a string is a string is a string), how about special-
> casing this in PyObject_Repr:

This is my preferred solution. +1 from me.

--
Paul Prescod - Not encumbered by corporate consensus
Re: Minidom and Unicode
> mal wrote:
> > > I'm not sure whose fault that is: either __repr__ should accept
> > > unicode strings, or minidom.Element.__repr__ should be changed to
> > > return a plain string, e.g. by converting tagname to UTF-8. In any
> > > case, I believe __repr__ should 'work' for these objects.
> >
> > Note that __repr__ has to return a string object (and IIRC
> > this is checked in object.c or abstract.c). The correct way
> > to get there is to simply return str(...) or to have a
> > switch on the type of self.tagName and then call .encode().

[/F]
> assuming that the goal is to get rid of this restriction in future
> versions (a string is a string is a string), how about special-
> casing this in PyObject_Repr:
>
>     PyObject *res;
>     res = (*v->ob_type->tp_repr)(v);
>     if (res == NULL)
>         return NULL;
> ---
>     if (PyUnicode_Check(res)) {
>         PyObject* str;
>         str = PyUnicode_AsEncodedString(res, NULL, NULL);
>         if (str) {
>             Py_DECREF(res);
>             res = str;
>         }
>     }
> ---
>     if (!PyString_Check(res)) {
>         PyErr_Format(PyExc_TypeError,
>                      "__repr__ returned non-string (type %.200s)",
>                      res->ob_type->tp_name);
>         Py_DECREF(res);
>         return NULL;
>     }
>     return res;
>
> in this way, people can "do the right thing" in their code,
> and have it work better in future versions...
>
> (just say "+1", and the mad patcher will update the repository)

+1

--Guido van Rossum (home page: http://dinsdale.python.org/~guido/)
Re: Minidom and Unicode
Fredrik Lundh wrote:
>
> mal wrote:
> > > I'm not sure whose fault that is: either __repr__ should accept
> > > unicode strings, or minidom.Element.__repr__ should be changed to
> > > return a plain string, e.g. by converting tagname to UTF-8. In any
> > > case, I believe __repr__ should 'work' for these objects.
> >
> > Note that __repr__ has to return a string object (and IIRC
> > this is checked in object.c or abstract.c). The correct way
> > to get there is to simply return str(...) or to have a
> > switch on the type of self.tagName and then call .encode().
>
> assuming that the goal is to get rid of this restriction in future
> versions (a string is a string is a string), how about special-
> casing this in PyObject_Repr:
>
>     PyObject *res;
>     res = (*v->ob_type->tp_repr)(v);
>     if (res == NULL)
>         return NULL;
> ---
>     if (PyUnicode_Check(res)) {
>         PyObject* str;
>         str = PyUnicode_AsEncodedString(res, NULL, NULL);
>         if (str) {
>             Py_DECREF(res);
>             res = str;
>         }
>     }
> ---
>     if (!PyString_Check(res)) {
>         PyErr_Format(PyExc_TypeError,
>                      "__repr__ returned non-string (type %.200s)",
>                      res->ob_type->tp_name);
>         Py_DECREF(res);
>         return NULL;
>     }
>     return res;
>
> in this way, people can "do the right thing" in their code,
> and have it work better in future versions...
>
> (just say "+1", and the mad patcher will update the repository)

I'd say +0, since the auto-conversion can fail if the default
encoding doesn't have room for the tagName characters.

Either way, I'd still prefer the DOM code to use an explicit
.encode() together with some lossless encoding, e.g.
unicode-escape.
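A minimal sketch of what an explicit, lossless `.encode()` in the DOM code could look like follows. It is written for present-day Python 3 (an assumption), so the escaped bytes are decoded back to an ASCII-only `str` before being returned. The `Element` class here is a hypothetical stand-in, not minidom's actual implementation.

```python
class Element:
    """Hypothetical stand-in for minidom.Element."""

    def __init__(self, tag_name):
        self.tagName = tag_name

    def __repr__(self):
        # unicode_escape turns every non-ASCII character into a \uXXXX
        # escape, so the result is printable whatever the default encoding.
        safe_name = self.tagName.encode("unicode_escape").decode("ascii")
        return "<DOM Element: %s at %d>" % (safe_name, id(self))


ascii_repr = repr(Element("conversion"))
katakana_repr = repr(Element("\u30b9"))
print(ascii_repr)
print(katakana_repr)
```

The trade-off is readability: a tag name outside ASCII appears as escapes rather than as the characters themselves.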

--
Marc-Andre Lemburg
Re: Minidom and Unicode
"M.-A. Lemburg" wrote:
>
> ...
>
> I'd say +0, since the auto-conversion can fail if the default
> encoding doesn't have room for the tagName characters.
>
> Either way, I'd still prefer the DOM code to use an explicit
> .encode() together with some lossless encoding, e.g.
> unicode-escape.

If we want to use a hard-coded lossless encoding, we should do so in
repr. Rather than having us fix a dozen modules with problems like this,
we should fix repr once and for all.

--
Paul Prescod - Not encumbered by corporate consensus
Re: Minidom and Unicode
paul wrote:

> If we want to use a hard-coded lossless encoding, we should do so in
> repr. Rather than having us fix a dozen modules with problems like this,
> we should fix repr once and for all.

how about allowing str and repr to actually return
unicode strings?

or in other words:

    PyObject *res;
    res = (*v->ob_type->tp_repr)(v);
    if (res == NULL)
        return NULL;
    if (!PyString_Check(res) && !PyUnicode_Check(res)) {
        PyErr_Format(PyExc_TypeError,
                     "__repr__ returned non-string (type %.200s)",
                     res->ob_type->tp_name);
        Py_DECREF(res);
        return NULL;
    }
    return res;

(strings are strings are strings, etc)

</F>
Minidom and Unicode
> the repository has been updated.

In what way?

Python 2.0b1 (#1, Jul 3 2000, 09:12:07) [GCC 2.95.2 19991024 (release)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
Copyright 1995-2000 Corporation for National Research Initiatives (CNRI)
>>> class X:
...     def __repr__(self):
...         if hasattr(self,"u"):
...             return u'\u30b9'
...         else:
...             return u'hallo'
...
>>> x=X()
>>> x
hallo
>>> x.u=1
>>> x
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: __repr__ returned non-string (type unicode)

> no need to change minidom.

The need still appears to exist, both for minidom.Element (KATAKANA
LETTER SU is a letter, and thus a valid tag name), as well as
minidom.Text. A string is a string is a string.

Regards,
Martin
Re: Minidom and Unicode
Paul Prescod wrote:
>
> "M.-A. Lemburg" wrote:
> >
> > ...
> >
> > I'd say +0, since the auto-conversion can fail if the default
> > encoding doesn't have room for the tagName characters.
> >
> > Either way, I'd still prefer the DOM code to use an explicit
> > .encode() together with some lossless encoding, e.g.
> > unicode-escape.
>
> If we want to use a hard-coded lossless encoding, we should do so in
> repr. Rather than having us fix a dozen modules with problems like this,
> we should fix repr once and for all.

I think it's ok to auto-convert to the default encoding
as intermediate solution, but the applications wanting to
return Unicode as __repr__ or __str__ should really
use .encode() to make sure the output that is produced
matches their (or their user's) expectations.

--
Marc-Andre Lemburg
Re: Minidom and Unicode
Fredrik Lundh wrote:
>
> paul wrote:
>
> > If we want to use a hard-coded lossless encoding, we should do so in
> > repr. Rather than having us fix a dozen modules with problems like this,
> > we should fix repr once and for all.
>
> how about allowing str and repr to actually return
> unicode strings?
>
> or in other words:
>
>     PyObject *res;
>     res = (*v->ob_type->tp_repr)(v);
>     if (res == NULL)
>         return NULL;
>     if (!PyString_Check(res) && !PyUnicode_Check(res)) {
>         PyErr_Format(PyExc_TypeError,
>                      "__repr__ returned non-string (type %.200s)",
>                      res->ob_type->tp_name);
>         Py_DECREF(res);
>         return NULL;
>     }
>     return res;
>
> (strings are strings are strings, etc)

-1: This breaks code, since it is expected that PyObject_Str()
returns a string object.

--
Marc-Andre Lemburg
Re: Minidom and Unicode
mal wrote:
> > (strings are strings are strings, etc)
>
> -1: This breaks code, since it is expected that PyObject_Str()
> returns a string object.

unicode strings are also strings, right?

the interesting question here is to figure out who's expecting that,
and figure out if that code can be changed.

</F>
Re: Minidom and Unicode
Fredrik Lundh wrote:
>
> mal wrote:
> > > (strings are strings are strings, etc)
> >
> > -1: This breaks code, since it is expected that PyObject_Str()
> > returns a string object.
>
> unicode strings are also strings, right?

Well, I usually refer to them as Unicode objects to make it
clear that they are different from the standard notion of a
string in Python.

If we were to use classes for Python's basic types, I would
make them have the same base class though...

> the interesting question here is to figure out who's expecting that,
> and figure out if that code can be changed.

We might proceed in that direction for Py3K, but I don't
think it's a good idea to make such changes just now.

IMHO, it's better to provide other means of getting at the Unicode
data, e.g. instances could provide a __unicode__ method
hook which the builtin unicode() queries and then uses to convert
to Unicode.
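The hook idea can be sketched as follows, assuming present-day Python 3, where `str` plays the role of the Unicode object. The function name `to_unicode` and the class `Tag` are hypothetical; the sketch only illustrates the lookup order of such a hook, not how a builtin `unicode()` would be implemented.

```python
def to_unicode(obj):
    """Convert obj to text, preferring a __unicode__ hook when one exists."""
    hook = getattr(type(obj), "__unicode__", None)
    if hook is not None:
        result = hook(obj)
        if not isinstance(result, str):
            raise TypeError(
                "__unicode__ returned non-text (type %s)" % type(result).__name__
            )
        return result
    # No hook defined: fall back to the ordinary string conversion.
    return str(obj)


class Tag:
    """Hypothetical object with separate escaped and Unicode views."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        # Escaped form, safe for any output device.
        return self.name.encode("unicode_escape").decode("ascii")

    def __unicode__(self):
        # Full Unicode form, for callers that can handle it.
        return self.name


tag = Tag("\u30b9")
escaped_view = str(tag)
unicode_view = to_unicode(tag)
fallback_view = to_unicode(42)
```

With this split, `__str__` and `__repr__` keep returning plain strings, and code that wants the Unicode data asks for it explicitly.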

--
Marc-Andre Lemburg
Re: Minidom and Unicode
martin wrote:
> In what way?
>
> Python 2.0b1 (#1, Jul 3 2000, 09:12:07) [GCC 2.95.2 19991024 (release)] on linux2
> Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> Copyright 1995-2000 Corporation for National Research Initiatives (CNRI)
> >>> class X:
> ...     def __repr__(self):
> ...         if hasattr(self,"u"):
> ...             return u'\u30b9'
> ...         else:
> ...             return u'hallo'
> ...
> >>> x=X()
> >>> x
> hallo
> >>> x.u=1
> >>> x
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: __repr__ returned non-string (type unicode)
>
> > no need to change minidom.
>
> The need still appears to exist, both for minidom.Element (KATAKANA
> LETTER SU is a letter, and thus a valid tag name), as well as
> minidom.Text. A string is a string is a string.

works for me:

$ export LANG=posix.utf8
$ python
>>> import sys
>>> sys.getdefaultencoding()
'utf8'
>>> class X:
...     def __repr__(self):
...         return u"\u30b9"
...
>>> x = X()
>>> x
ã,¹

(or to put it another way, I'm not sure the repr/str fix is
the real culprit here...)

</F>
Re: Minidom and Unicode
> works for me:
>
> $ export LANG=posix.utf8
[...]
> (or to put it another way, I'm not sure the repr/str fix is
> the real culprit here...)

I think it is. My understanding is that repr always returns something
printable - if possible even something that can be passed to eval. I'd
certainly expect that a minidom Node can be printed always, no matter
what the default encoding is.

Consequently, I'd prefer if the conversion uses some fixed, repr-style
encoding, e.g. unicode-escape (just as repr of a unicode object does).
If it is deemed unacceptable to put this into the interpreter proper,
I'd prefer if minidom is changed to allow representation of all Nodes
on all systems.
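The two properties asked for here, output that is always printable and output that can be passed back to eval, can be illustrated with a short sketch. It assumes present-day Python 3, whose `ascii()` builtin produces an escaped, ASCII-only repr of a string; `ast.literal_eval` is used in place of `eval` as a safer way to evaluate the literal.

```python
import ast

tag_name = "\u30b9"  # KATAKANA LETTER SU

# ascii() escapes every non-ASCII character, so the result is
# printable regardless of the default or terminal encoding.
escaped = ascii(tag_name)

# The escaped form is a valid string literal, so evaluating it
# recovers the original text without loss.
round_tripped = ast.literal_eval(escaped)

print(escaped)
```

Because the escaping is fixed rather than tied to the default encoding, the same text is produced on every system.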

Regards,
Martin
Re: Minidom and Unicode
martin wrote:
> >
> > $ export LANG=posix.utf8
> [...]
> > (or to put it another way, I'm not sure the repr/str fix is
> > the real culprit here...)
>
> I think it is. My understanding is that repr always returns something
> printable - if possible even something that can be passed to eval. I'd
> certainly expect that a minidom Node can be printed always, no matter
> what the default encoding is.
>
> Consequently, I'd prefer if the conversion uses some fixed, repr-style
> encoding, e.g. unicode-escape (just as repr of a unicode object does).

oh, you're right. repr should of course use unicode-escape, not
the default encoding. my fault.

I'll update the repository soonish.

> If it is deemed unacceptable to put this into the interpreter proper,
> I'd prefer if minidom is changed to allow representation of all Nodes
> on all systems.

the reason for this patch was to avoid forcing everyone to deal with
this in their own code, by providing some kind of fallback behaviour.

</F>
Re: Minidom and Unicode
Fredrik Lundh wrote:
>
> martin wrote:
> > >
> > > $ export LANG=posix.utf8
> > [...]
> > > (or to put it another way, I'm not sure the repr/str fix is
> > > the real culprit here...)
> >
> > I think it is. My understanding is that repr always returns something
> > printable - if possible even something that can be passed to eval. I'd
> > certainly expect that a minidom Node can be printed always, no matter
> > what the default encoding is.
> >
> > Consequently, I'd prefer if the conversion uses some fixed, repr-style
> > encoding, e.g. unicode-escape (just as repr of a unicode object does).
>
> oh, you're right. repr should of course use unicode-escape, not
> the default encoding. my fault.
>
> I'll update the repository soonish.

I'd rather have some more discussion about this...

IMHO, all auto-conversions should use the default encoding. The
main point here is not to confuse the user with even more magic
happening under the hood.

If the programmer knows that he'll have to deal with Unicode
then he should make sure that the proper encoding is used
and document it that way, e.g. use unicode-escape for Minidom's
__repr__ methods.

BTW, any takers for __unicode__ to complement __str__ ?

> > If it is deemed unacceptable to put this into the interpreter proper,
> > I'd prefer if minidom is changed to allow representation of all Nodes
> > on all systems.
>
> the reason for this patch was to avoid forcing everyone to deal with
> this in their own code, by providing some kind of fallback behaviour.

That's what your patch does; I don't see a reason to change it :-)

--
Marc-Andre Lemburg
Re: Minidom and Unicode
"M.-A. Lemburg" wrote:
>
> ...
>
> IMHO, all auto-conversions should use the default encoding. The
> main point here is not to confuse the user with even more magic
> happening under the hood.

I don't see anything confusing about having unicode-escape be the
appropriate escape used for repr. Maybe we need to differentiate between
lossless and lossy encodings. If the default encoding is lossless then
repr could use it. Otherwise it could use unicode-escape.
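One way to read this suggestion is sketched below, assuming present-day Python 3 (`str` for Unicode, `bytes` for the 8-bit string). The sketch simplifies the idea: instead of classifying encodings as lossless or lossy up front, it tests whether the default encoding can represent the particular string and falls back to unicode-escape when it cannot. The function name `encode_for_repr` is hypothetical.

```python
def encode_for_repr(text, default_encoding):
    """Encode text for repr, falling back to a lossless escape encoding."""
    try:
        # Use the default encoding whenever it can represent the text.
        return text.encode(default_encoding)
    except UnicodeEncodeError:
        # Otherwise escape the characters instead of failing.
        return text.encode("unicode_escape")


ascii_default = encode_for_repr("\u30b9", "ascii")
utf8_default = encode_for_repr("\u30b9", "utf-8")
plain_name = encode_for_repr("conversion", "ascii")
```

A cost of this design is that the output format depends on the default encoding, so the same object can print differently on different systems.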

Anyhow, why would it be wrong for Fredrik to hard-code an encoding in
repr but right for me to hard-code one in minidom? Users should not need
to comb through the hundreds of modules in the library figuring out what
kind of Unicode handling they should expect. It should be as centralized
as possible.

> If the programmer knows that he'll have to deal with Unicode
> then he should make sure that the proper encoding is used
> and document it that way, e.g. use unicode-escape for Minidom's
> __repr__ methods.

One of the major goals of our current Unicode auto-conversion
"compromise" is that modules like xmllib and minidom should work with
Unicode out of the box without any special enhancements. According to
Guido, that's the primary reason we have Unicode auto-conversions at
all.

http://www.python.org/pipermail/i18n-sig/2000-May/000173.html

I'm going to fight very hard to make basic Unicode support in Python
modules "just work" without a bunch of internationalization knowledge
from the programmer. __repr__ is pretty basic.

> > the reason for this patch was to avoid forcing everyone to deal with
> > this in their own code, by providing some kind of fallback behaviour.
>
> That's what your patch does; I don't see a reason to change it :-)

If you're still proposing that I should deal with it in a particular
module's domain-specific code then the patch isn't done yet!

--
Paul Prescod - Not encumbered by corporate consensus
Re: Minidom and Unicode
"M.-A. Lemburg" wrote:
>
> ...
>
> I think it's ok to auto-convert to the default encoding
> as intermediate solution, but the applications wanting to
> return Unicode as __repr__ or __str__ should really
> use .encode() to make sure the output that is produced
> matches their (or their user's) expectations.

If my users have expectations, I don't know them. I could allow them to
tell me what encoding to use, but surely they would rather do that in a
Python-wide fashion.

--
Paul Prescod - Not encumbered by corporate consensus
Re: Minidom and Unicode
Paul Prescod wrote:
>
> "M.-A. Lemburg" wrote:
> >
> > ...
> >
> > IMHO, all auto-conversions should use the default encoding. The
> > main point here is not to confuse the user with even more magic
> > happening under the hood.
>
> I don't see anything confusing about having unicode-escape be the
> appropriate escape used for repr. Maybe we need to differentiate between
> lossless and lossy encodings. If the default encoding is lossless then
> repr could use it. Otherwise it could use unicode-escape.

Simply because auto-conversion should use one single encoding
throughout the code.

> Anyhow, why would it be wrong for Fredrik to hard-code an encoding in
> repr but right for me to hard-code one in minidom?

Because hardcoding the encoding into the core Python API touches
all programs. Hardcoded encodings should be userland options
wherever possible.

Besides, we're talking about __repr__ which is mainly a
debug tool and doesn't affect program flow or interfacing
in any way. The format used is a userland decision and the
encoding used for it is too.

> Users should not need
> to comb through the hundreds of modules in the library figuring out what
> kind of Unicode handling they should expect. It should be as centralized
> as possible.

True.

> > If the programmer knows that he'll have to deal with Unicode
> > then he should make sure that the proper encoding is used
> > and document it that way, e.g. use unicode-escape for Minidom's
> > __repr__ methods.
>
> One of the major goals of our current Unicode auto-conversion
> "compromise" is that modules like xmllib and minidom should work with
> Unicode out of the box without any special enhancements. According to
> Guido, that's the primary reason we have Unicode auto-conversions at
> all.
>
> http://www.python.org/pipermail/i18n-sig/2000-May/000173.html
>
> I'm going to fight very hard to make basic Unicode support in Python
> modules "just work" without a bunch of internationalization knowledge
> from the programmer.

Great :-)

The next big project ought to be getting the standard lib
to work with Unicode input. A good way to test drive this, is
running Python with -U option.

> __repr__ is pretty basic.
>
> > > the reason for this patch was to avoid forcing everyone to deal with
> > > this in their own code, by providing some kind of fallback behaviour.
> >
> > That's what your patch does; I don't see a reason to change it :-)
>
> If you're still proposing that I should deal with it in a particular
> module's domain-specific code then the patch isn't done yet!

You don't have to: a user who uses Latin-1 tag names will see
the output of __repr__ as Latin-1... pretty straightforward
if you ask me. If you want to make sure that __repr__ output
is printable everywhere you should use an explicit lossless
encoding for your application.

Again, this is a userland decision which you'll have to make.

--
Marc-Andre Lemburg
Re: Minidom and Unicode
Paul Prescod wrote:
>
> "M.-A. Lemburg" wrote:
> >
> > ...
> >
> > I think it's ok to auto-convert to the default encoding
> > as intermediate solution, but the applications wanting to
> > return Unicode as __repr__ or __str__ should really
> > use .encode() to make sure the output that is produced
> > matches their (or their user's) expectations.
>
> If my users have expectations, I don't know them. I could allow them to
> tell me what encoding to use, but surely they would rather do that in a
> Python-wide fashion.

They can choose the encoding by setting their LANG variable
or you could make the setting application specific by using
.encode where needed.

BTW, are tag names using non-ASCII really used in practice ?
I can understand values being Unicode, but Unicode
tag names don't really make all that much sense to me (ok,
that's a personal opinion).

--
Marc-Andre Lemburg
Re: Minidom and Unicode
mal wrote:

> > Anyhow, why would it be wrong for Fredrik to hard-code an encoding in
> > repr but right for me to hard-code one in minidom?
>
> Because hardcoding the encoding into the core Python API touches
> all programs. Hardcoded encodings should be userland options
> wherever possible.

the problem is that the existing design breaks people's
expectations: first, minidom didn't work because people
expected to be able to return the result of:

"8 bit string" + something + "8 bit string"

or

"8 bit string %s" % something

from __repr__. that's a reasonable expectation (just look
in the python standard library).

after my fix, minidom still didn't work because people expected
the conversion to work on all strings, on all platforms. that's
also a reasonable expectation (read on).

> Besides, we're talking about __repr__ which is mainly a
> debug tool and doesn't affect program flow or interfacing
> in any way.

exactly. this is the whole point: __repr__ is a debug tool,
and therefore it must work on all platforms, for all strings.

if it's true that repr() cannot be changed to return unicode
strings (in which case the conversion will be done on the
way out to the user, by a file object or a user-interface
library which might actually know what encoding to use),
using a lossless encoding is the second best thing.

on the other hand, if we can change repr/str, this is a non-
issue. maybe someone could tell me exactly what code we'll
break if we do that change?

</F>
Re: Minidom and Unicode
Fredrik Lundh wrote:
>
> ...
>
> exactly. this is the whole point: __repr__ is a debug tool,
> and therefore it must work on all platforms, for all strings.

As a debugging tool, it would probably help rather than hurt things to
have repr be consistent on all platforms. If it is going to do a
conversion, I vote for unicode-escape everywhere.

> on the other hand, if we can change repr/str, this is a non-
> issue. maybe someone could tell me exactly what code we'll
> break if we do that change?

I agree that we want to move to a world where unicode strings and 8-bit
strings are accepted equally throughout Python. We do need some
information about whether moving there quickly will break code or not.
We need to know what Idle, PythonWin, Zope and other such environments
do with the results of repr.

--
Paul Prescod - Not encumbered by corporate consensus
