Mailing List Archive

preserving entities with lxml
I have a puzzle over how lxml & entities should be 'preserved'; the code below illustrates. To preserve I change & --> &amp;
in the source and add resolve_entities=False to the parser definition. The escaping means we only have one kind of
entity, &amp;, which means lxml will preserve it. For whatever reason lxml won't preserve character entities, e.g. &#33;.

The simple parse from string and conversion tostring shows that the parsing at least took notice of it.

However, I want to create a tuple tree so have to use tree.text, tree.getchildren() and tree.tail for access.

When I use those I expected to have to undo the escaping to get back the original entities, but it seems they are
already done.

Good for me, but if the tree knows how it was created (tostring shows that) why is it ignored with attribute access?

if __name__=='__main__':
    from lxml import etree as ET
    #initial xml
    xml = b'<a attr="&mysym; &lt; &amp; &gt; &#33;">aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA</a>'
    #escaped xml
    xxml = xml.replace(b'&',b'&amp;')

    myparser = ET.XMLParser(resolve_entities=False)
    tree = ET.fromstring(xxml,parser=myparser)

    #use tostring
    print(f'using tostring\n{xxml=!r}\n{ET.tostring(tree)=!r}\n')

    #now access the items using text & children & text
    print(f'using attributes\n{tree.text=!r}\n{tree.getchildren()=!r}\n{tree.tail=!r}')

when run I see this

$ python tmp/tlp.py
using tostring
xxml=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt;
&amp;#33; AAAAA</a>'
ET.tostring(tree)=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp;
&amp;gt; &amp;#33; AAAAA</a>'

using attributes
tree.text='aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA'
tree.getchildren()=[]
tree.tail=None
--
Robin Becker
--
https://mail.python.org/mailman/listinfo/python-list
Re: preserving entities with lxml
Robin Becker wrote at 2022-1-12 10:22 +0000:
>I have a puzzle over how lxml & entities should be 'preserved' code below illustrates. To preserve I change & --> &amp;
>in the source and add resolve_entities=False to the parser definition. The escaping means we only have one kind of
>entity &amp; which means lxml will preserve it. For whatever reason lxml won't preserve character entities eg &#33;.
>
>The simple parse from string and conversion tostring shows that the parsing at least took notice of it.
>
>However, I want to create a tuple tree so have to use tree.text, tree.getchildren() and tree.tail for access.
>
>When I use those I expected to have to undo the escaping to get back the original entities, but it seems they are
>already done.
>
>Good for me, but if the tree knows how it was created (tostring shows that) why is it ignored with attribute access?
>
>if __name__=='__main__':
> from lxml import etree as ET
> #initial xml
> xml = b'<a attr="&mysym; &lt; &amp; &gt; &#33;">aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA</a>'
> #escaped xml
> xxml = xml.replace(b'&',b'&amp;')
>
> myparser = ET.XMLParser(resolve_entities=False)
> tree = ET.fromstring(xxml,parser=myparser)
>
> #use tostring
> print(f'using tostring\n{xxml=!r}\n{ET.tostring(tree)=!r}\n')
>
> #now access the items using text & children & text
> print(f'using attributes\n{tree.text=!r}\n{tree.getchildren()=!r}\n{tree.tail=!r}')
>
>when run I see this
>
>$ python tmp/tlp.py
>using tostring
>xxml=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt;
>&amp;#33; AAAAA</a>'
>ET.tostring(tree)=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp;
>&amp;gt; &amp;#33; AAAAA</a>'
>
>using attributes
>tree.text='aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA'
>tree.getchildren()=[]
>tree.tail=None

Apparently, the `resolve_entities=False` was not effective: otherwise,
your tree content should have more structure (especially some
entity reference children).

`&#<value>` is not an entity reference but a character reference.
It may rightfully be treated differently from entity references.
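
A minimal sketch of the extra structure described above, under two assumptions that are not in the original post: the input is parsed without the & --> &amp; pre-escaping, and a DOCTYPE declaring mysym is added (without a declaration, an undefined entity would be expected to fail the parse). It also shows the character reference being expanded regardless of resolve_entities.

```python
from lxml import etree

# Assumption: a DOCTYPE declaring mysym is added for illustration;
# the XML in the original post has no DTD.
xml = (b'<!DOCTYPE a [<!ENTITY mysym "SYM">]>'
       b'<a>aaaaa &mysym; &lt; &#33; AAAAA</a>')

parser = etree.XMLParser(resolve_entities=False)
root = etree.fromstring(xml, parser)

# The text stops at the unresolved reference, which becomes a child node.
children = list(root)
entity = children[0]
is_entity = entity.tag is etree.Entity

print(repr(root.text))                      # text before the reference
print(is_entity, entity.name, entity.text)  # the preserved reference
print(repr(entity.tail))                    # &lt; and &#33; arrive already expanded
print(etree.tostring(root))                 # the reference is written back as &mysym;
```

With the pre-escaped input from the original post there is no entity reference left for the parser to preserve, which is why getchildren() came back empty.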
--
Re: preserving entities with lxml
On 12/01/2022 20:49, Dieter Maurer wrote:
.......
>>
>> when run I see this
>>
>> $ python tmp/tlp.py
>> using tostring
>> xxml=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp; &amp;gt;
>> &amp;#33; AAAAA</a>'
>> ET.tostring(tree)=b'<a attr="&amp;mysym; &amp;lt; &amp;amp; &amp;gt; &amp;#33;">aaaaa &amp;mysym; &amp;lt; &amp;amp;
>> &amp;gt; &amp;#33; AAAAA</a>'
>>
>> using attributes
>> tree.text='aaaaa &mysym; &lt; &amp; &gt; &#33; AAAAA'
>> tree.getchildren()=[]
>> tree.tail=None
>
> Apparently, the `resolve_entities=False` was not effective: otherwise,
> your tree content should have more structure (especially some
> entity reference children).
>
except that the tree knows not to expand the entities using ET.tostring so in some circumstances resolve_entities=False
does work.

I expected that the tree would contain the parsed (unexpanded) values, but referencing the actual tree.text/tail/attrib
doesn't give the expected results. There's no criticism here, it makes my life a bit easier. If I had wanted the
unexpanded values in the attrib/text/tail it would be more of a problem.


> `&#<value>` is not an entity reference but a character reference.
> It may rightfully be treated differently from entity references.
I understand the difference, but lxml (and perhaps libxml2) doesn't provide a way to turn off character reference
expansion. This makes using lxml for source transformation a bit harder since the original text is not preserved.

--
Re: preserving entities with lxml
Robin Becker wrote at 2022-1-13 09:13 +0000:
>On 12/01/2022 20:49, Dieter Maurer wrote:
> ...
>> Apparently, the `resolve_entities=False` was not effective: otherwise,
>> your tree content should have more structure (especially some
>> entity reference children).
>>
>except that the tree knows not to expand the entities using ET.tostring so in some circumstances resolve_entities=False
>does work.

I think this is a misunderstanding: `tostring` will represent the text character `&` as `&amp;`.
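
This can be checked with no parser involved. In the sketch below the element is built by hand (the variable names are illustrative), so whatever tostring does to the & cannot come from resolve_entities. The fallback import is an assumption for environments without lxml; the standard library escapes text the same way in this case.

```python
try:
    from lxml import etree as ET
except ImportError:
    # Assumption: the standard library behaves the same for this example.
    import xml.etree.ElementTree as ET

el = ET.Element('a')
el.text = 'aaaaa &mysym; < AAAAA'   # plain characters, no entities involved

serialized = ET.tostring(el)
print(serialized)   # b'<a>aaaaa &amp;mysym; &lt; AAAAA</a>'

# Parsing the serialized form gives back the same plain characters.
round_trip = ET.fromstring(serialized).text
print(round_trip == el.text)   # True
```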
--
Re: preserving entities with lxml
On 13/01/2022 09:29, Dieter Maurer wrote:
> Robin Becker wrote at 2022-1-13 09:13 +0000:
>> On 12/01/2022 20:49, Dieter Maurer wrote:
>> ...
>>> Apparently, the `resolve_entities=False` was not effective: otherwise,
>>> your tree content should have more structure (especially some
>>> entity reference children).
>>>
>> except that the tree knows not to expand the entities using ET.tostring so in some circumstances resolve_entities=False
>> does work.
>
> I think this is a misunderstanding: `tostring` will represent the text character `&` as `&amp;`.

aaahhhh,

thanks I see now. So tostring is actually restoring some of the entities which on input are normally expanded. If that
means resolve_entities=False does not work at all then I guess there's no need to use it at all. The initial transform

& --> &amp;

does what I need as it is reversed on output of the tree fragments.
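
A minimal sketch of that approach, assuming a simple (tag, attributes, children) tuple layout; the to_tuple_tree helper and the nested <b> element are hypothetical additions for illustration, not code from this thread. Escaping every & before the parse means text, tail and attrib come back holding the original source spelling of each reference.

```python
try:
    from lxml import etree as ET
except ImportError:
    # Assumption: the standard library exposes the same text/tail/attrib API.
    import xml.etree.ElementTree as ET


def to_tuple_tree(elem):
    """Hypothetical helper: return (tag, attributes, children).

    children is a list mixing plain strings and nested tuples.
    """
    children = []
    if elem.text:
        children.append(elem.text)
    for child in elem:
        children.append(to_tuple_tree(child))
        if child.tail:
            children.append(child.tail)
    return (elem.tag, dict(elem.attrib), children)


xml = b'<a attr="&mysym; &#33;">aaaaa &mysym; <b>&lt; &#33;</b> AAAAA</a>'
escaped = xml.replace(b'&', b'&amp;')   # every & becomes &amp;

tuple_tree = to_tuple_tree(ET.fromstring(escaped))
print(tuple_tree)
```

No parser option is needed here: the only entity the parser sees is &amp;, and expanding it yields the original reference text.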

Wonder what resolve_entities is actually used for then? All the docs seem to say

> resolve_entities - replace entities by their text value (default: True)

I assumed False would mean that they would pass through the parse unresolved.
--
Robin Becker
