Mailing List Archive

ElementTree: How to return only unicode?
Hello!

I parse an XML file with ElementTree and get the contents with
the .attrib, .text, .get etc. methods of the tree's nodes.
Additionally, I use the "find" and "findtext" methods.

My problem is that if there is only ASCII, these methods return
ordinary strings instead of unicode. So sometimes I get str,
sometimes I get unicode. Can one change this globally so that they
only return unicode?

Bye,
Torsten.

--
Torsten Bronger, aquisgrana, europa vetus
Jabber ID: torsten.bronger@jabber.rwth-aachen.de
--
http://mail.python.org/mailman/listinfo/python-list
Re: ElementTree: How to return only unicode?
Torsten Bronger wrote:
> I parse an XML file with ElementTree and get the contents with
> the .attrib, .text, .get etc. methods of the tree's nodes.
> Additionally, I use the "find" and "findtext" methods.
>
> My problem is that if there is only ASCII, these methods return
> ordinary strings instead of unicode. So sometimes I get str,
> sometimes I get unicode. Can one change this globally so that they
> only return unicode?

That's a convenience measure to reduce memory and processing overhead.
Could you explain why this is a problem for you?

Stefan
--
Re: ElementTree: How to return only unicode?
Hello!

Stefan Behnel writes:

> Torsten Bronger wrote:
>
>> [...]
>>
>> My problem is that if there is only ASCII, these methods return
>> ordinary strings instead of unicode. So sometimes I get str,
>> sometimes I get unicode. Can one change this globally so that
>> they only return unicode?
>
> That's a convenience measure to reduce memory and processing
> overhead.

But is this really worth the inconsistency of having partly str and
partly unicode, given that the common origin is unicode XML data?

> Could you explain why this is a problem for you?

I feed ElementTree's output to functions in the unicodedata module.
And they want unicode input. While it's not a big deal to write
e.g. unicodedata.category(unicode(my_character)), I find this rather
wasteful.

Bye,
Torsten.
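
The workaround mentioned above, coercing each value before it reaches unicodedata, can be sketched as a small helper. This is only an illustration and not code from the thread: the names as_text and category_of are hypothetical, and the sketch assumes that any byte string coming out of ElementTree is pure ASCII (the only case in which Python 2 hands back str). It relies on the bytes name, so it needs Python 2.6 or later.

```python
import unicodedata


def as_text(value):
    """Return value as a text (unicode) string.

    Hypothetical helper, not part of ElementTree. Byte strings are
    assumed to be pure ASCII; anything else is passed through as-is.
    """
    if isinstance(value, bytes):
        return value.decode("ascii")
    return value


def category_of(character):
    """Look up the Unicode general category of a single character."""
    return unicodedata.category(as_text(character))


print(category_of(b"A"))        # byte string input is decoded first
print(category_of(u"\u00f6"))   # text input is passed through unchanged
```

Wrapping the conversion in one place keeps the call sites short, but it still pays for a type check on every lookup, which is the overhead the question is about.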

--
Re: ElementTree: How to return only unicode?
Torsten Bronger wrote:
> Hello!

And hello back!


> Stefan Behnel writes:
>
>> Torsten Bronger wrote:
>>
>>> [...]
>>>
>>> My problem is that if there is only ASCII, these methods return
>>> ordinary strings instead of unicode. So sometimes I get str,
>>> sometimes I get unicode. Can one change this globally so that
>>> they only return unicode?
>> That's a convenience measure to reduce memory and processing
>> overhead.
>
> But is this really worth the inconsistency of having partly str and
> partly unicode, given that the common origin is unicode XML data?

Yes. It makes no difference in almost all use cases, as long as you assume Py2
string handling semantics. In Py3, you will always get Unicode strings anyway.


>> Could you explain why this is a problem for you?
>
> I feed ElementTree's output to functions in the unicodedata module.
> And they want unicode input. While it's not a big deal to write
> e.g. unicodedata.category(unicode(my_character)), I find this rather
> wasteful.

I just looked at the code. It seems that you can use your own
XMLTreeBuilder subclass and override the "._fixtext()" method like this:

    def _fixtext(self, text):
        return text

Then pass an instance of that as "parser" when parsing in ElementTree. That
should do the trick.

Stefan
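
A fuller sketch of the subclass approach described above might look like the following. It is not a tested recipe from the thread: the class name UnicodeTreeBuilder and the helper parse_bytes are made up for illustration, and the Python 2 branch relies on the private "_fixtext()" hook named in the message, which may differ between ElementTree versions. On Python 3 the hook is not needed, so the sketch falls back to the stock XMLParser there.

```python
import io
import sys
from xml.etree import ElementTree

if sys.version_info[0] == 2:
    # Python 2 only: XMLTreeBuilder and its private _fixtext() hook are
    # taken from the message above; being private, they may change.
    class UnicodeTreeBuilder(ElementTree.XMLTreeBuilder):
        def _fixtext(self, text):
            # Skip the ASCII down-conversion and keep the unicode string.
            return text
else:
    # Python 3 always returns text strings, so the stock parser suffices.
    UnicodeTreeBuilder = ElementTree.XMLParser


def parse_bytes(data):
    """Parse XML given as bytes and return the root element."""
    tree = ElementTree.parse(io.BytesIO(data), parser=UnicodeTreeBuilder())
    return tree.getroot()


root = parse_bytes(b"<doc name='ascii'><item>plain</item></doc>")
print(type(root.findtext("item")), type(root.get("name")))
```

A fresh parser instance is created for every call because a parser object cannot be reused once it has finished a document.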
--
Re: ElementTree: How to return only unicode?
Hello!

Stefan Behnel writes:

> Torsten Bronger wrote:
>
>> Stefan Behnel writes:
>>
>>> Torsten Bronger wrote:
>>>
>>>> [...]
>>>>
>>>> My problem is that if there is only ASCII, these methods return
>>>> ordinary strings instead of unicode. So sometimes I get str,
>>>> sometimes I get unicode. Can one change this globally so that
>>>> they only return unicode?
>
> [...]
>
> I just looked at the code. It seems that you can use your own
> XMLTreeBuilder subclass and override the "._fixtext()" method
> like this:
>
>     def _fixtext(self, text):
>         return text

Great. Thus, the following monkeypatch seems to do the trick:

from xml.etree import ElementTree
# FixMe: Must go away with Python 3
ElementTree.XMLTreeBuilder._fixtext = lambda self, text: text

Thank you!

Bye,
Torsten.
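
Since the monkeypatch above is meant to disappear with Python 3, one way to keep it from breaking later is to guard it by interpreter version. The following is a hedged sketch along those lines, not part of the original exchange: the function name force_unicode_results is hypothetical, and the patch itself still depends on the private "_fixtext" attribute discussed in this thread.

```python
import sys
from xml.etree import ElementTree


def force_unicode_results():
    """Apply the _fixtext monkeypatch only where it is needed.

    Returns True if the patch was applied, False otherwise.
    """
    builder = getattr(ElementTree, "XMLTreeBuilder", None)
    if sys.version_info[0] >= 3 or builder is None:
        # Python 3 already returns text strings everywhere, and newer
        # releases no longer provide XMLTreeBuilder at all.
        return False
    # Python 2: bypass the ASCII down-conversion (private hook).
    builder._fixtext = lambda self, text: text
    return True


patched = force_unicode_results()
print("monkeypatch applied:", patched)
```

Calling the function once at import time has the same global effect as the bare assignment, while leaving Python 3 installations untouched.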
