Mailing List Archive: Re: [Python-checkins] bpo-41944: No longer call eval() on content received via HTTP in the UnicodeNames tests (GH-22575)

walter at livinglogic

Oct 6, 2020, 7:36 AM

On 6 Oct 2020, at 16:22, Florian Bruhin wrote:

> https://github.com/python/cpython/commit/a8bf44d04915f7366d9f8dfbf84822ac37a4bab3
> commit: a8bf44d04915f7366d9f8dfbf84822ac37a4bab3
> branch: master
> author: Florian Bruhin <me@the-compiler.org>
> committer: GitHub <noreply@github.com>
> date: 2020-10-06T16:21:56+02:00
> summary:
>
> bpo-41944: No longer call eval() on content received via HTTP in the
> UnicodeNames tests (GH-22575)
>
> Similarly to GH-22566, those tests called eval() on content received
> via HTTP in test_named_sequences_full. This likely isn't exploitable
> because unicodedata.lookup(seqname) is called before
> self.checkletter(seqname, None) - thus any string which isn't a valid
> unicode character name wouldn't ever reach the checkletter method.
>
> Still, it's probably better to be safe than sorry.
>
> files:
> M Lib/test/test_ucn.py
> [...]
>      # Helper that put all \N escapes inside eval'd raw strings,
>      # to make sure this script runs even if the compiler
>      # chokes on \N escapes
> -    res = eval(r'"\N{%s}"' % name)
> +    res = ast.literal_eval(r'"\N{%s}"' % name)
>      self.assertEqual(res, code)
>      return res
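
A minimal sketch of what the switch from eval() to ast.literal_eval() buys. The template string is the one from the diff above; the helper names and the crafted "name" are hypothetical, made up for this illustration, and do not come from test_ucn.py.

```python
import ast

TEMPLATE = r'"\N{%s}"'  # same template as in the diff above


def via_eval(name):
    # Old behaviour: evaluates whatever expression the name produces.
    return eval(TEMPLATE % name)


def via_literal_eval(name):
    # New behaviour: accepts Python literals only.
    return ast.literal_eval(TEMPLATE % name)


# Hypothetical crafted "name" that closes the escape and smuggles in a call.
crafted = 'DIGIT ZERO}" + str(6 * 7) + "\\N{DIGIT ZERO'

print(via_literal_eval("DIGIT ZERO"))  # well-formed names still work
print(via_eval(crafted))               # the injected str(6 * 7) is executed
try:
    via_literal_eval(crafted)
except ValueError as exc:
    print("rejected:", exc)
```

With eval() the injected call runs; ast.literal_eval() refuses the expression because it is not a plain literal.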

It would be even simpler to use unicodedata.lookup() which returns the
unicode character when passed the name of the character, e.g.

>>> unicodedata.lookup("NO-BREAK SPACE")
'\xa0'

Servus,
Walter

Re: [Python-checkins] bpo-41944: No longer call eval() on content received via HTTP in the UnicodeNames tests (GH-22575)

vstinner at python

Oct 6, 2020, 4:27 PM

Hi Walter,

> https://github.com/python/cpython/commit/a8bf44d04915f7366d9f8dfbf84822ac37a4bab3

On Tue, 6 Oct 2020 at 17:02, Walter Dörwald <walter@livinglogic.de> wrote:
> It would be even simpler to use unicodedata.lookup() which returns the unicode character when passed the name of the character

That was my first idea as well when I reviewed the change, but the
function contains this comment:

    def checkletter(self, name, code):
        # Helper that put all \N escapes inside eval'd raw strings,
        # to make sure this script runs even if the compiler
        # chokes on \N escapes

test_named_sequences_full() checks that unicodedata.lookup() works,
but that checkletter() raises a SyntaxError. Look at the code ;-)
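
The shape of that check can be sketched roughly as follows. This is a simplified illustration and not the actual test from Lib/test/test_ucn.py: the class name and the single hard-coded sequence are assumptions, whereas the real test iterates over a downloaded data file.

```python
import ast
import unittest
import unicodedata


class NamedSequenceSketch(unittest.TestCase):
    def checkletter(self, name, code):
        # Build the \N escape at run time, as the helper quoted above does.
        res = ast.literal_eval(r'"\N{%s}"' % name)
        self.assertEqual(res, code)
        return res

    def test_one_named_sequence(self):
        seqname = "LATIN CAPITAL LETTER A WITH MACRON AND GRAVE"
        # lookup() resolves the named sequence to two code points ...
        self.assertEqual(unicodedata.lookup(seqname), "\u0100\u0300")
        # ... but the \N{...} escape does not accept it.
        with self.assertRaises(SyntaxError):
            self.checkletter(seqname, None)


# Run the sketch explicitly so it also works when pasted into a session.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NamedSequenceSketch)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Replacing the body of checkletter() with unicodedata.lookup() would make the assertRaises(SyntaxError) part fail, since lookup() resolves named sequences.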

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/LNZSHNATY7MSO5JKHELR433TCJ7ZZ5YR/
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-checkins] bpo-41944: No longer call eval() on content received via HTTP in the UnicodeNames tests (GH-22575)

walter at livinglogic

Oct 7, 2020, 3:59 AM

On 7 Oct 2020, at 1:27, Victor Stinner wrote:

> Hi Walter,
>
>> https://github.com/python/cpython/commit/a8bf44d04915f7366d9f8dfbf84822ac37a4bab3
>
> On Tue, 6 Oct 2020 at 17:02, Walter Dörwald <walter@livinglogic.de>
> wrote:
>> It would be even simpler to use unicodedata.lookup() which returns
>> the unicode character when passed the name of the character
>
> That was my first idea as well when I reviewed the change, but the
> function contains this comment:
>
>     def checkletter(self, name, code):
>         # Helper that put all \N escapes inside eval'd raw strings,
>         # to make sure this script runs even if the compiler
>         # chokes on \N escapes
>
> test_named_sequences_full() checks that unicodedata.lookup() works,

OK, that change would then have checked unicodedata.lookup() twice.

However I'm puzzled by the fact that the "\N{}" escape sequence is
supposed to raise a SyntaxError. And indeed it does in some cases:

Python 3.8.5 (default, Jul 21 2020, 10:48:26)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import unicodedata
>>> unicodedata.lookup("DIGIT ZERO")
'0'
>>> "\N{DIGIT ZERO}"
'0'
>>> "\N{EURO SIGN}"
'€'
>>> unicodedata.lookup("EURO SIGN")
'€'
>>> unicodedata.lookup("KEYCAP NUMBER SIGN")
'#️⃣'
>>> "\N{KEYCAP NUMBER SIGN}"
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-21: unknown Unicode character name
>>> unicodedata.lookup("LATIN CAPITAL LETTER A WITH MACRON AND GRAVE")
'Ā̀'
>>> "\N{LATIN CAPITAL LETTER A WITH MACRON AND GRAVE}"
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-47: unknown Unicode character name

It seems that unicodedata.lookup() honors "Code point sequences", but
\N{} does not.

Indeed
https://docs.python.org/3/library/unicodedata.html#unicodedata.lookup
mentions that fact:

Changed in version 3.3: Support for name aliases and named sequences
has been added.

https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

doesn't mention anything. It simply states

Escape Sequence: \N{name}
Meaning: Character named name in the Unicode database

with the footnote "Changed in version 3.3: Support for name aliases has
been added.".
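
For contrast, name aliases are accepted by both unicodedata.lookup() and \N{}. A small sketch, assuming the alias "LATIN SMALL LETTER GHA" for U+01A3 (chosen here for illustration; it is not mentioned elsewhere in this thread):

```python
import ast
import unicodedata

alias = "LATIN SMALL LETTER GHA"      # formal alias for U+01A3
primary = unicodedata.name("\u01a3")  # name recorded in the database

from_lookup = unicodedata.lookup(alias)
# Build the escape at run time, as checkletter() does.
from_escape = ast.literal_eval(r'"\N{%s}"' % alias)

print(primary)
print(from_lookup == from_escape == "\u01a3")
```

So aliases are handled consistently by both paths; only named sequences behave differently.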

Which leads to the question:

Should \N{} be updated to support "Code point sequences"?

Furthermore it states: "Unlike Standard C, all unrecognized escape
sequences are left in the string unchanged", which could be interpreted
as meaning that "\N{BAD}" results in "\\N{BAD}".
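
The difference can be checked with a short sketch. The helper name is made up for this example; the warnings filter is only there because newer Python versions warn about unrecognized escapes.

```python
import ast
import warnings


def parse_literal(source):
    # Parse a string literal from source text, silencing the warning that
    # newer Python versions emit for unrecognized escapes.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        return ast.literal_eval(source)


# An unrecognized escape is left in the string unchanged.
print(repr(parse_literal(r'"\q"')))

# \N{...} with an unknown name is not left unchanged; it is an error.
try:
    parse_literal(r'"\N{BAD}"')
except SyntaxError as exc:
    print("SyntaxError:", exc)
```

So "\N{BAD}" is not treated as an unrecognized escape sequence: \N is recognized, and the unknown name is reported as an error.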

> but that checkletter() raises a SyntaxError. Look at the code ;-)

That would have helped. ;)

> Victor

Servus,
Walter