Mailing List Archive

Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)
On 01. 11. 21 18:32, Serhiy Storchaka wrote:
> This is excellent!
>
> 01.11.21 14:17, Petr Viktorin wrote:
>>> CPython treats the control character NUL (``\0``) as end of input,
>>> but many editors simply skip it, possibly showing code that Python
>>> will not
>>> run as a regular part of a file.
>
> It is an implementation detail and we will get rid of it. It only
> happens when you read the Python script from a file. If you import it as
> a module or run with runpy, the NUL character is an error.

That brings us to possible changes in Python in this area, which is an
interesting topic.

As for \0, can we ban all ASCII & C1 control characters except
whitespace? I see no place for them in source code.


For homoglyphs/confusables, should there be a SyntaxWarning when an
identifier looks like ASCII but isn't?

For right-to-left text: does anyone actually name identifiers in
Hebrew/Arabic? AFAIK, we should allow a few non-printing
"joiner"/"non-joiner" characters to make it possible to use all Arabic
words. But it would be great to consult with users/teachers of the
languages.
Should Python run the bidi algorithm when parsing and disallow reordered
tokens? Maybe optionally?
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
02.11.21 16:16, Petr Viktorin wrote:
> As for \0, can we ban all ASCII & C1 control characters except
> whitespace? I see no place for them in source code.

All control characters except CR, LF, TAB and FF are banned outside
comments and string literals. I think it is worth banning them in
comments and string literals too. In string literals you can use
backslash-escape sequences, and comments should be human-readable; there
is no reason to include control characters in them. There is precedent
for emitting warnings about questionable escape sequences in strings.
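
For instance, a quick check of the current behaviour (a sketch; the exact
error message differs across CPython versions):

# A control character inside a string literal or comment is accepted today,
# but the same character outside them is rejected by the tokenizer.
ok = compile("x = 'a\x01b'  # comment containing \x01", "<test>", "exec")
print(ok)   # compiles

try:
    compile("x\x01y = 1", "<test>", "exec")
except SyntaxError as err:
    print("rejected:", err)   # e.g. "invalid non-printable character U+0001"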


> For homoglyphs/confusables, should there be a SyntaxWarning when an
> identifier looks like ASCII but isn't?

It would virtually ban Cyrillic. There are a lot of Cyrillic letters
that look like Latin letters, and there are complete words written in
Cyrillic which by accident look like other words written in Latin.

It is work for linters, which can have many options for configuring
acceptable scripts, use spelling dictionaries and dictionaries of
homoglyphs, etc.
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Serhiy Storchaka wrote:
> 02.11.21 16:16, Petr Viktorin wrote:
> > As for \0, can we ban all ASCII & C1 control characters except
> > whitespace? I see no place for them in source code.

> All control characters except CR, LF, TAB and FF are banned outside
> comments and string literals. I think it is worth to ban them in
> comments and string literals too. In string literals you can use
> backslash-escape sequences, and comments should be human readable, there
> are no reason to include control characters in them.

If escape sequences were also allowed in comments (or at least in strings within comments), this would make sense. I don't like banning them otherwise, since odd characters are often a good reason to need a comment, but it is definitely a "mention, not use" situation.

> > For homoglyphs/confusables, should there be a SyntaxWarning when an
> > identifier looks like ASCII but isn't?
> > It would virtually ban Cyrillic. There is a lot of Cyrillic letters
> which look like Latin letters, and there are complete words written in
> Cyrillic which by accident look like other words written in Latin.

At the time, we considered it, and we also considered a narrower restriction on using multiple scripts in the same identifier, or at least the same identifier portion (so it was OK if separated by _).

Simplicity won, in part because of existing practice in EMACS scripting, particularly with some Asian languages.

> It is a work for linters, which can have many options for configuring
> acceptable scripts, use spelling dictionaries and dictionaries of
> homoglyphs, etc.

It might be time for the documentation to mention a specific linter/configuration that does this. It also might be reasonable to do by default in IDLE or even the interactive shell.

-jJ
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Serhiy Storchaka writes:

> All control characters except CR, LF, TAB and FF are banned outside
> comments and string literals. I think it is worth to ban them in
> comments and string literals too.

+1

> > For homoglyphs/confusables, should there be a SyntaxWarning when an
> > identifier looks like ASCII but isn't?
>
> It would virtually ban Cyrillic.

+1 (for the comment and for the implied -1 on SyntaxWarning, let's
keep the Cyrillic repertoire in Python!)

> It is a work for linters,

+1

Aside from the reasons Serhiy presents, I'd rather not tie
this kind of rather ambiguous improvement in Unicode handling to the
release cycle.

It might be worth having a pep9999 module/script in Python (perhaps
more likely, PyPI but maintained by whoever does the work to make
these improvements + Petr or somebody Petr trusts to do it), that
lints scripts specifically for confusables and other issues.

Steve
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Jim J. Jewett writes:

> At the time, we considered it, and we also considered a narrower
> restriction on using multiple scripts in the same identifier, or at
> least the same identifier portion (so it was OK if separated by
> _).

This would ban "????", aka "pango". That's arguably a good idea
(IMO, 0.9 wink), but might make some GTK/GNOME folks sad.

> Simplicity won, in part because of existing practice in EMACS
> scripting, particularly with some Asian languages.

Interesting. I maintained a couple of Emacs libraries (dictionaries
and input methods) for Japanese in XEmacs, and while hyphen-separated
mixtures of ASCII and Japanese are common, I don't recall ever seeing
an identifier with ASCII and Japanese glommed together without a
separator. It was almost always of the form "English verb - Japanese
lexical component". Or do you consider that "relatively complicated"?

> It might be time for the documentation to mention a specific
> linter/configuration that does this. It also might be reasonable
> to do by default in IDLE or even the interactive shell.

It would have to be easy to turn off, perhaps even provide
instructions in the messages. I would guess that for code that uses
it at all, it would be common. So the warnings would likely make
those tools somewhere between really annoying and unusable.

Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
02.11.21 18:49, Jim J. Jewett wrote:
> If escape sequences were also allowed in comments (or at least in strings within comments), this would make sense. I don't like banning them otherwise, since odd characters are often a good reason to need a comment, but it is definitely a "mention, not use" situation.

If you mean backslash-escape sequences like \uXXXX, there is no reason
to ban them in comments. Unlike in Java, they do not have special meaning
outside of string literals. But if you mean terminal control sequences
(which change color or move the cursor), they should not be allowed in comments.
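
For example (a small sketch; the Java contrast is that Java decodes \uXXXX
escapes everywhere in the source, even in comments):

# In Python, \uXXXX is interpreted only inside (non-raw) string literals.
src = 'x = "\\u0041"  # a comment that also mentions \\u0041'
ns = {}
exec(src, ns)
print(ns["x"])                 # 'A'  -- the escape was processed inside the literal
print(src.count("\\u0041"))    # 2    -- in the source text itself nothing was substituted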

> At the time, we considered it, and we also considered a narrower restriction on using multiple scripts in the same identifier, or at least the same identifier portion (so it was OK if separated by _).

I implemented such restrictions in one of my projects. The character set
was limited, and even that did not solve all the issues with homoglyphs.

I think we should not introduce such arbitrary limitations at the
parser level and should leave it to linters.

Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Tue, Nov 02, 2021 at 05:55:55PM +0200, Serhiy Storchaka wrote:

> All control characters except CR, LF, TAB and FF are banned outside
> comments and string literals. I think it is worth to ban them in
> comments and string literals too. In string literals you can use
> backslash-escape sequences, and comments should be human readable, there
> are no reason to include control characters in them. There is a
> precedence of emitting warnings for some superficial escapes in strings.

Agreed. I don't think there is any good reason for including control
characters (apart from whitespace) in comments.

In strings, I would consider allowing VT (vertical tab) as well, since it
counts as whitespace:

>>> '\v'.isspace()
True

But I don't have a strong opinion on that.


[Petr]
> > For homoglyphs/confusables, should there be a SyntaxWarning when an
> > identifier looks like ASCII but isn't?

Let's not enshrine as a language "feature" that non Western European
languages are dangerous second-class citizens.


> It would virtually ban Cyrillic. There is a lot of Cyrillic letters
> which look like Latin letters, and there are complete words written in
> Cyrillic which by accident look like other words written in Latin.

Agreed.


> It is a work for linters, which can have many options for configuring
> acceptable scripts, use spelling dictionaries and dictionaries of
> homoglyphs, etc.

Linters and editors. I have no objection to people using editors that
highlight non-ASCII characters in blinking red letters, so long as I can
turn that option off :-)



--
Steve
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
We seem to agree that this is work for linters. That's reasonable; I'd
generalize it to "tools and policies". But even so, discussing what we'd
expect linters to do is on topic here.
Perhaps we can even find ways for the language to support linters --
type checking is also for external tools, but has language support.

For example: should the parser emit a lightweight audit event if it
finds a non-ASCII identifier? (See below for why ASCII is special.)
Or for encoding declarations?

On 03. 11. 21 6:26, Stephen J. Turnbull wrote:
> Serhiy Storchaka writes:
>
> > All control characters except CR, LF, TAB and FF are banned outside
> > comments and string literals. I think it is worth to ban them in
> > comments and string literals too.
>
> +1
>
> > > For homoglyphs/confusables, should there be a SyntaxWarning when an
> > > identifier looks like ASCII but isn't?
> >
> > It would virtually ban Cyrillic.
>
> +1 (for the comment and for the implied -1 on SyntaxWarning, let's
> keep the Cyrillic repertoire in Python!)

I don't think this would actually ban Cyrillic/Greek.
(My suggestion is not vanilla confusables detection; it might require
careful reading: "should there be a [linter] warning when an identifier
looks like ASCII but isn't?")

I am not a native speaker, but I did try a bit to find an actual
ASCII-like word in a language that uses Cyrillic. I didn't succeed; I
think they might be very rare.
Even if there was such a word -- or a one-letter abbreviation used as a
variable name -- it would be confusing to use. Removing the possibility
of confusion could *help* Cyrillic users. (I can't speak for them; this
is just a brainstorming idea.)
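
A rough sketch of such a check (the lookalike table is a tiny hand-made
sample; a real linter would use the Unicode confusables data from TR39):

# Warn only when a name is *not* ASCII yet every character has an ASCII lookalike.
CYRILLIC_LOOKALIKES = str.maketrans("аеорсухАВЕКМНОРСТХ",
                                    "aeopcyxABEKMHOPCTX")

def looks_like_ascii_but_isnt(name: str) -> bool:
    if name.isascii():
        return False                      # plain ASCII names are fine
    skeleton = name.translate(CYRILLIC_LOOKALIKES)
    return skeleton.isascii()             # every character had an ASCII lookalike

print(looks_like_ascii_but_isnt("size"))    # False: already ASCII
print(looks_like_ascii_but_isnt("сорок"))   # False: "forty" has letters with no ASCII lookalike
print(looks_like_ascii_but_isnt("оса"))     # True:  "wasp" is entirely ASCII-lookalike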

Steven adds:
> Let's not enshrine as a language "feature" that non Western European
> languages are dangerous second-class citizens.

That would be going too far, yes, but the fact is that non-English
languages *are* second-class citizens. Code that uses Python keywords
and stdlib must use English, and possibly another language. It is the
mixing of languages that can be dangerous/confusing, not the languages
themselves.


>
> > It is a work for linters,
>
> +1
>
> Aside from the reasons Serhiy presents, I'd rather not tie
> this kind of rather ambiguous improvement in Unicode handling to the
> release cycle.
>
> It might be worth having a pep9999 module/script in Python (perhaps
> more likely, PyPI but maintained by whoever does the work to make
> these improvements + Petr or somebody Petr trusts to do it), that
> lints scripts specifically for confusables and other issues.

If I have any say in it, the name definitely won't include a PEP number ;)
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
03.11.21 14:31, Petr Viktorin wrote:
> For example: should the parser emit a lightweight audit event if it
> finds a non-ASCII identifier? (See below for why ASCII is special.)
> Or for encoding declarations?

There are audit events for import and compile. You can also register
import hooks if you want fancier preprocessing than just handling the
source encoding. I do not think we need to add more specific audit
events; they were not designed for this.
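
For illustration, a sketch of what hooking the existing "compile" audit
event could look like (the reporting is purely illustrative, not a proposal):

import io
import sys
import tokenize

def report_non_ascii_names(event, args):
    # The documented "compile" audit event receives (source, filename).
    if event != "compile" or len(args) < 2 or not isinstance(args[0], str):
        return
    source, filename = args[0], args[1]
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NAME and not tok.string.isascii():
                print(f"{filename}:{tok.start[0]}: non-ASCII identifier {tok.string!r}",
                      file=sys.stderr)
    except (tokenize.TokenError, SyntaxError):
        pass   # malformed source; let compile() report it

sys.addaudithook(report_non_ascii_names)
compile("раycheck = 1", "<example>", "exec")   # the hook reports the Cyrillic/Latin mix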

And I think it is too late to detect suspicious code at the time of its
execution. It should be detected before adding that code to the code
base (review tools, pre-commit hooks).

> I don't think this would actually ban Cyrillic/Greek.
> (My suggestion is not vanilla confusables detection; it might require
> careful reading: "should there be a [linter] warning when an identifier
> looks like ASCII but isn't?")

Yes, but it should be optional and configurable and not be part of
the Python compiler. This is not our business as Python core developers.

> I am not a native speaker, but I did try a bit to find an actual
> ASCII-like word in a language that uses Cyrillic. I didn't succeed; I
> think they might be very rare.

With a simple script I found 62 words common to English and
Ukrainian: ????/racy, ????/rope, ????/puma, ???/mix, etc. But there are
many more English and Ukrainian words which contain only letters that
can be confused with letters from the other script. And identifiers can
contain abbreviations and shortenings, not all of which can be found in
dictionaries.

> Even if there was such a word -- or a one-letter abbreviation used as a
> variable name -- it would be confusing to use. Removing the possibility
> of confusion could *help* Cyrillic users. (I can't speak for them; this
> is just a brainstorming idea.)

I never used non-Latin identifiers in Python, but I guess that where
they are used (in schools?) there is a mix of English and non-English
identifiers, and identifiers consisting of parts of English and
non-English words without even an underscore between them. I know
because in other languages they just use inconsistent transliteration.
Emitting any warning by default would discriminate against non-English
users. It would be better not to add support for non-ASCII identifiers
in the first place.

Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Tue, Nov 2, 2021 at 7:21 AM Petr Viktorin <encukou@gmail.com> wrote:

> That brings us to possible changes in Python in this area, which is an
> interesting topic.


Is there a use case or need for allowing the comment-starting character “#”
to occur when text is still in the right-to-left direction? Disallowing
that would prevent Petr’s examples in which active code is displayed after
the comment mark, which to me seems to be one of the more egregious
examples. Or maybe this case is no worse than others and isn’t worth
singling out.

—Chris




Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Stephen J. Turnbull wrote:
> Jim J. Jewett writes:
> > At the time, we considered it, and we also considered a narrower
> > restriction on using multiple scripts in the same identifier, or at
> > least the same identifier portion (so it was OK if separated by
> > _).

> > This would ban "????", aka "pango". That's arguably a good idea
> (IMO, 0.9 wink), but might make some GTK/GNOME folks sad.

I am not quite motivated enough to search the archives, but I'm pretty sure the examples actually found were less prominent than that. There seemed to be at least one or two fora where it was something of a local idiom.

>... I don't recall ever seeing
> an identifier with ASCII and Japanese glommed together without a
> separator. It was almost always of the form "English verb - Japanese
> lexical component".

The problem was that some were written without a "-" or "_" to separate the halves. It looked fine -- the script change was obvious even to someone who didn't speak the non-English language. But having to support that meant any remaining restriction on mixed scripts would be either too weak to be worthwhile, or too complicated to write into the Python language specification.

-jJ
Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
I’ve not been following the thread, but Steve Holden forwarded me the email from Petr Viktorin so that I might share some of the info I found while recently diving into this topic.



As part of working on the next edition of “Python in a Nutshell” with Steve, Alex Martelli, and Anna Ravenscroft, Alex suggested that I add a cautionary section on homoglyphs, specifically citing “A” (LATIN CAPITAL LETTER A) and “Α” (GREEK CAPITAL LETTER ALPHA) as an example problem pair. I wanted to look a little further at the use of characters in identifiers beyond the standard 7-bit ASCII, and so I found some of these same issues dealing with Unicode NFKC normalization. The first discovery was the overlapping normalization of “ªº” with “ao”. This was quite a shock to me, since I assumed that the inclusion of Unicode for identifier characters would preserve the uniqueness of the different code points. Even ligatures can be used, and will overlap with their multi-character ASCII forms. So we have added a second note in the upcoming edition on the risks of using these “homonorms” (which is a word I just made up for the occasion).
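
For example (a quick sketch of the behaviour described above):

import unicodedata

# The two spellings below are different strings but the same identifier,
# because identifiers are NFKC-normalized while parsing (PEP 3131).
print(unicodedata.normalize("NFKC", "ªº"))    # ao
print(unicodedata.normalize("NFKC", "ﬁle"))   # file  (the "fi" ligature expands)

ns = {}
exec("ªº = 1\nprint(ao)", ns)                 # prints 1: both spellings name the same variable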



To explore the extreme case, I wrote a pyparsing transformer to convert identifiers in a body of Python source to mixed font, equivalent to the original source after NFKC normalization. Here are hello.py, and a snippet from unittest/util.py:



def ????????????????????():

try:

????e????????????? = "Hello"

????????r?????? = "World"

?????????????????(f"{?????????????º_}, {?????????l??}!")

except ??????????????????????????? as ?????c:

????r???("failed: {}".??????????ª?(?????????))



if _??????????????__ == "__main__":

????e??????()





# snippet from unittest/util.py

_??????????????????????L?????????????????????? = 12

def _??????????????????????(????, p?????????????????????????, ???????????????????????????):

?????????? = ?????????(????) - ?r????????????x????????? - ???????????????????????

if s?i???? > _????????????????????H??????????????????L????????:

???? = '%s[%d chars]%s' % (????[:??????????????????????????????], ?????????p, ????[????????????(????) - ?????????????x?????????:])

return ?





You should be able to paste these into your local UTF-8-aware editor or IDE and execute them as-is.



(If this doesn’t come through, you can also see this as a GitHub gist at Hello, World rendered in a variety of Unicode characters (github.com) <https://gist.github.com/ptmcg/bf35d5ada416080d481d789988b6b466> . I have a second gist containing the transformer, but it is still a private gist atm.)
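
A minimal sketch of the underlying trick (not the actual pyparsing transformer): respell one name with MATHEMATICAL BOLD letters, which NFKC folds back to ASCII.

import unicodedata

def embolden(name):
    """Respell an ASCII identifier with MATHEMATICAL BOLD letters.

    The result is a different sequence of code points that NFKC-normalizes
    back to the original, so Python treats both spellings as the same name.
    """
    def swap(ch):
        if "a" <= ch <= "z":
            return chr(0x1D41A + ord(ch) - ord("a"))   # MATHEMATICAL BOLD SMALL A..Z
        if "A" <= ch <= "Z":
            return chr(0x1D400 + ord(ch) - ord("A"))   # MATHEMATICAL BOLD CAPITAL A..Z
        return ch
    return "".join(map(swap, name))

fancy = embolden("shorten")
print(unicodedata.normalize("NFKC", fancy) == "shorten")   # True
exec(fancy + " = 42\nprint(shorten)")                      # prints 42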





Some other discoveries:

“·” (U+00B7, MIDDLE DOT) is a valid identifier body character, making “_···” a valid Python identifier. This could actually be another security attack point, in which “s·join(‘x’)” could be easily misread as “s.join(‘x’)”, but would actually be a call to a potentially malicious method “s·join”.

“_” seems to be a special case for normalization. Only the ASCII “_” character is valid as a leading identifier character; the Unicode characters that normalize to “_” (any of the characters in “︳︴﹍﹎﹏＿”) can only be used as identifier body characters. “︳” especially could be misread as “|” followed by a space, when it actually normalizes to “_”.
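
A quick check of these two observations (a sketch, using escape sequences for the characters in question):

import unicodedata

mdot = "\u00b7"    # MIDDLE DOT
vlow = "\ufe33"    # PRESENTATION FORM FOR VERTICAL LOW LINE, resembles "|"

print(("_" + mdot * 3).isidentifier())       # True:  the middle dot is a valid body character
print(unicodedata.normalize("NFKC", vlow))   # "_"
print((vlow + "x").isidentifier())           # False: only ASCII "_" may start a name
print(("x" + vlow + "y").isidentifier())     # True:  allowed after the first character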





Potential beneficial uses:

I am considering taking my transformer code and experimenting with an orthogonal approach to syntax highlighting, using Unicode groups instead of colors. Module names using characters from one group, builtins from another, program variables from another, maybe distinguish local from global variables. Colorizing has always been an obvious syntax highlight feature, but is an accessibility issue for those with difficulty distinguishing colors. Unlike the “ransom note” code above, code highlighted in this way might even be quite pleasing to the eye.





-- Paul McGuire
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
This is my favourite version of the issue:

е = lambda е, e: е if е > e else e   # "е" here is CYRILLIC SMALL LETTER IE, a homoglyph of Latin "e"
print(е(2, 1), е(1, 2)) # python 3 outputs: 2 2

https://twitter.com/stestagg/status/685239650064162820?s=21

Steve

Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On 11/13/2021 4:35 PM, ptmcg@austin.rr.com wrote:
> I’ve not been following the thread, but Steve Holden forwarded me the

> To explore the extreme case, I wrote a pyparsing transformer to convert
> identifiers in a body of Python source to mixed font, equivalent to the
> original source after NFKC normalization. Here are hello.py, and a
> snippet from unittest/utils.py:
>
> def ????????????????????():
>
>     try:
>
> ????e????????????? = "Hello"
>
> ????????r?????? = "World"
>
>         ?????????????????(f"{?????????????º_}, {?????????l??}!")
>
>     except ??????????????????????????? as ?????c:
>
> ????r???("failed: {}".??????????ª?(?????????))
>
> if _??????????????__ == "__main__":
>
> ????e??????()
>
> # snippet from unittest/util.py
>
> _??????????????????????L?????????????????????? = 12
>
> def _??????????????????????(????, p?????????????????????????, ???????????????????????????):
>
>     ?????????? = ?????????(????) - ?r????????????x????????? - ???????????????????????
>
>     if s?i???? > _????????????????????H??????????????????L????????:
>
> ???? = '%s[%d chars]%s' % (????[:??????????????????????????????], ?????????p, ????[????????????(????) -
> ?????????????x?????????:])
>
>     return ?
>
> You should able to paste these into your local UTF-8-aware editor or IDE
> and execute them as-is.

Wow. After pasting the util.py snippet into current IDLE, which on my
Windows machine* displays the complete text:

>>> dir()
['_PLACEHOLDER_LEN', '__annotations__', '__builtins__', '__doc__',
'__loader__', '__name__', '__package__', '__spec__', '_shorten']
>>> _shorten('abc', 1, 1)
'abc'
>>> _shorten('abcdefghijklmnopqrw', 2, 2)
'ab[15 chars]rw'

* Does not at all work in CommandPrompt, even after supposedly changing
to a utf-8 codepage with 'chcp 65000'.

--
Terry Jan Reedy
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Sat, Nov 13, 2021 at 2:03 PM <ptmcg@austin.rr.com> wrote:

> def ????????????????????():
>
> try:
>
> ????e????????????? = "Hello"
>
> ????????r?????? = "World"
>
> ?????????????????(f"{?????????????º_}, {?????????l??}!")
>
> except ??????????????????????????? as ?????c:
>
> ????r???("failed: {}".??????????ª?(?????????))
>

Wow. Just Wow.

So why does Python apply NFKC normalization to variable names?? I can't
for the life of me figure out why that would be helpful at all.

The string methods, sure, but names?

And, in fact, the normalization is not used for string comparisons or
hashes as far as I can tell.

In [36]: weird
Out[36]: '?????????????????'

In [37]: normal
Out[37]: 'print'

In [38]: eval(weird + "('yup, that worked')")
yup, that worked

In [39]: weird == normal
Out[39]: False

In [40]: weird[0] in normal
Out[40]: False

This seems very odd (and dangerous) to me.
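
A sketch of where the normalization does and does not apply (the bold-letter
spelling of "print" is just an illustrative stand-in):

import unicodedata

weird = "\U0001d429\U0001d42b\U0001d422\U0001d427\U0001d42d"   # "print" in MATHEMATICAL BOLD letters
print(weird == "print")                                        # False: plain string comparison
print(unicodedata.normalize("NFKC", weird) == "print")         # True:  what the parser does to names

ns = {}
exec(weird + " = 'rebound'", ns)
print(ns["print"])                                             # rebound -- stored under the normalized name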

Is there a good reason? and is it too late to change it?

-CHB











--
Christopher Barker, PhD (Chris)

Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
ptmcg@austin.rr.com wrote:

> ... add a cautionary section on homoglyphs, specifically citing
> “A” (LATIN CAPITAL LETTER A) and “Α” (GREEK CAPITAL LETTER ALPHA)
> as an example problem pair.

There is a Unicode tech report about confusables, but it is never clear where to stop. Are I (upper case I), l (lower case l) and 1 (numeric 1) from ASCII already a problem? And if we do it at all, is there any way to avoid making Cyrillic languages second-class?

I'm not quickly finding the contemporary report, but these should be helpful if you want to go deeper:

http://www.unicode.org/reports/tr36/
http://unicode.org/reports/tr36/confusables.txt
https://util.unicode.org/UnicodeJsps/confusables.jsp


> I wanted to look a little further at the use of characters in identifiers
> beyond the standard 7-bit ASCII, and so I found some of these same
> issues dealing with Unicode NFKC normalization. The first discovery was
> the overlapping normalization of “ªº” with “ao”.

Here I don't see the problem. Things that look slightly different are really the same, and you can write it either way. So you can use what looks like a funny font, but the closest it comes to a security risk is that maybe you could access something without a casual reader realizing that you are doing so. They would know that you *could* access it, just not that you *did*.

> Some other discoveries:
> “·” (ASCII 183) is a valid identifier body character, making “_···” a valid
> Python identifier.

That and the apostrophe are Unicode consortium regrets, because they are normally punctuation, but there are also languages that use them as letters.
The apostrophe is (supposedly only) used by Afrikaans; I asked a native speaker about where/how often it was used, and the similarity to Dutch was enough that Guido felt comfortable excluding it. (It *may* have been similar to using the apostrophe for a contraction in English, and saying it therefore represents a letter, but the scope was clearly smaller.) But the dot is used in Catalan, and ... we didn't find anyone ready to say it wouldn't be needed for sensible identifiers. It is worth listing as a warning, and linters should probably complain.

> “_” seems to be a special case for normalization. Only the ASCII “_”
> character is valid as a leading identifier character; the Unicode
> characters that normalize to “_” (any of the characters in “︳︴﹍﹎﹏＿”)
> can only be used as identifier body characters. “︳” especially could be
> misread as “|” followed by a space, when it actually normalizes to “_”.

So go ahead and warn, but it isn't clear how that could be abused to look like something other than a syntax error, except maybe through soft keywords. (Ha! I snuck in a call to async︳def that had been imported with *, and you didn't worry about the import *, or the apparently wild cursor position marker, or the strange async definition that was never used! No way I could have just issued a call to _flush and done the same thing!)

> Potential beneficial uses:
> I am considering taking my transformer code and experimenting with an
> orthogonal approach to syntax highlighting, using Unicode groups
> instead of colors. Module names using characters from one group,
> builtins from another, program variables from another, maybe
> distinguish local from global variables. Colorizing has always been an
> obvious syntax highlight feature, but is an accessibility issue for those
> with difficulty distinguishing colors.

I kind of like the idea, but ... if you're doing it on-the-fly in the editor, you could just use different fonts. If you're actually saving those changes, it seems likely to lead to a lot of spurious diffs if anyone uses a different editor.

-jJ
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On 2021-11-14 17:17, Christopher Barker wrote:
> On Sat, Nov 13, 2021 at 2:03 PM <ptmcg@austin.rr.com
> <mailto:ptmcg@austin.rr.com>> wrote:
>
> def ????????????????????():
>
> __
>
>     try:____
>
> ????e????????????? = "Hello"____
>
> ????????r?????? = "World"____
>
>         ?????????????????(f"{?????????????º_}, {?????????l??}!")____
>
>     except ??????????????????????????? as ?????c:____
>
> ????r???("failed: {}".??????????ª?(?????????))
>
>
> Wow. Just Wow.
>
> So why does Python apply  NFKC normalization to variable names?? I can't
> for the life of me figure out why that would be helpful at all.
>
> The string methods, sure, but names?
>
> And, in fact, the normalization is not used for string comparisons or
> hashes as far as I can tell.
>
[snip]

It's probably to deal with "e?" vs "é", i.e. "\N{LATIN SMALL LETTER
E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH ACUTE}",
which are different ways of writing the same thing.

Unfortunately, it goes too far, because it's unlikely that we want "ᵖ"
("\N{MODIFIER LETTER SMALL P}") to be equivalent to "P" ("\N{LATIN
CAPITAL LETTER P}").
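
For example (a quick sketch comparing the two forms):

import unicodedata

decomposed = "e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
composed   = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
sup_p      = "\u1d56"     # MODIFIER LETTER SMALL P

print(unicodedata.normalize("NFC", decomposed) == composed)    # True:  NFC composes
print(unicodedata.normalize("NFC", sup_p) == sup_p)            # True:  NFC leaves compatibility chars alone
print(unicodedata.normalize("NFKC", sup_p))                    # p:     NFKC also folds them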
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Indeed, normative annex https://www.unicode.org/reports/tr31/tr31-35.html
section 5 says: "if the programming language has case-sensitive
identifiers, then Normalization Form C is appropriate" (vs NFKC for a
language with case-insensitive identifiers) so to follow the standard we
should have used NFC rather than NFKC. Not sure if it's too late to fix
this "oops" in future Python versions.

Alex

Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Sun, Nov 14, 2021 at 10:27 AM MRAB <python@mrabarnett.plus.com> wrote:

> > So why does Python apply NFKC normalization to variable names??



> It's probably to deal with "e?" vs "é", i.e. "\N{LATIN SMALL LETTER
> E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH ACUTE}",
> which are different ways of writing the same thing.
>

sure, but this is code, written by humans (or meta-programming). Maybe I'm
showing my English bias, but would it be that limiting to have identifiers
be based on codepoints, period?

Why does someone who wants to use, e.g., "e?" in an identifier have to be
able to represent it two different ways in a code file?

But if so ...


> Unfortunately, it goes too far, because it's unlikely that we want "ᵖ"
> ("\N{MODIFIER LETTER SMALL P}") to be equivalent to "P" ("\N{LATIN
> CAPITAL LETTER P}".
>

Is it possible to only capture things like the combining characters and not
the "equivalent" ones like the above?

-CHB

--
Christopher Barker, PhD (Chris)

Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Sun, 14 Nov 2021, 19:07 Christopher Barker, <pythonchb@gmail.com> wrote:

> On Sun, Nov 14, 2021 at 10:27 AM MRAB <python@mrabarnett.plus.com> wrote:
>
>> Unfortunately, it goes too far, because it's unlikely that we want "ᵖ"
>> ("\N{MODIFIER LETTER SMALL P}") to be equivalent to "P" ("\N{LATIN
>> CAPITAL LETTER P}".
>>
>
> Is it possible to only capture things like the combining characters and
> not the "equivalent" ones like the above?
>

Yes, that is NFC. NFKC converts to equivalent characters and also composes;
NFC just composes.

>
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On 11/14/21 2:07 PM, Christopher Barker wrote:
> Why does someone that wants to use, .e.g. "e?" in an identifier have
> to be able to represent it two different ways in a code file?
>
The issue here is that, fundamentally, some editors will produce composed
characters and some decomposed characters to represent the same actual
'character'.

These two methods are defined by Unicode to really represent the same
'character'; it is just that some defined sequences of combining
codepoints happen to have a composed 'abbreviation' defined as well.

Having to exactly match the byte sequence means that some people will
have a VERY hard time entering usable code if their tools support
Unicode but use the other convention.

--
Richard Damon

Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Sun, Nov 14, 2021, 2:14 PM Christopher Barker

> It's probably to deal with "e?" vs "é", i.e. "\N{LATIN SMALL LETTER
>> E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH ACUTE}",
>> which are different ways of writing the same thing.
>>
>
> Why does someone that wants to use, .e.g. "e?" in an identifier have to be
> able to represent it two different ways in a code file?
>

Imagine that two different programmers work with the same code base, and
their text editors or keystrokes enter "é" in different ways.

Or imagine just one programmer doing so on two different
machines/environments.

As an example, I wrote this reply on my Android tablet (with such-and-such
OS version). I have no idea what actual codepoint(s) are entered when I
press and hold the "e" key for a couple seconds to pop up character
variations.

If I wrote it on OSX, I'd probably press "alt-e e" on my US International
key layout. Again, no idea what codepoints actually are entered. If I did
it on Linux, I'd use "ctrl-shift u 00e9". In that case, I actually know the
codepoint.
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On 11/14/21 2:36 PM, David Mertz, Ph.D. wrote:
> On Sun, Nov 14, 2021, 2:14 PM Christopher Barker
>
> It's probably to deal with "e?" vs "é", i.e. "\N{LATIN SMALL
> LETTER
> E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH
> ACUTE}",
> which are different ways of writing the same thing.
>
>
> Why does someone that wants to use, .e.g. "e?" in an identifier
> have to be able to represent it two different ways in a code file?
>
>
> Imagine that two different programmers work with the same code base,
> and their text editors or keystrokes enter "é" in different ways.
>
> Or imagine just one programmer doing so on two different
> machines/environments.
>
> As an example, I wrote this reply on my Android tablet (with
> such-and-such OS version). I have no idea what actual codepoint(s) are
> entered when I press and hold the "e" key for a couple seconds to pop
> up character variations.
>
> If I wrote it on OSX, I'd probably press "alt-e e" on my US
> International key layout. Again, no idea what codepoints actually are
> entered. If I did it on Linux, I'd use "ctrl-shift u 00e9". In that
> case, I actually know the codepoint.

But you would have to look up the actual numbers to enter them.

Imagine if ALL your source code had to be entered via code-point numbers.

BTW, you should be able to enable 'composing' under Linux too, just like
under OSX with the right input driver loaded.

--
Richard Damon

Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Out of all the approximately thousand bazillion ways to write obfuscated
Python code, which may or may not be malicious, why are Unicode
confusables worth this level of angst and concern?

I looked up "Unicode homoglyph" on CVE, and found a grand total of seven
hits:

https://www.cvedetails.com/google-search-results.php?q=unicode+homoglyph

all of which appear to be related to impersonation of account names. I
daresay if I expanded my search terms, I would probably find some more,
but it is clear that Unicode homoglyphs are not exactly a major threat.

In my opinion, the other Steve's (Stestagg) example of obfuscated code
with homoglyphs for e (as well as a few similar cases, such as
homoglyphs for A) mostly makes for an amusing curiosity, perhaps worth a
plugin for Pylint and other static checkers, but not much more. I'm not
entirely sure what Paul's more lurid examples are supposed to indicate.
If your threat relies on a malicious coder smuggling in identifiers like
"????????????????????" or "ªº" and having the reader not notice, then I'm not going to
lose much sleep over it.

Confusable account names and URL spoofing are proven, genuine threats.
Beyond that, IMO the actual threat window from confusables is pretty
small. Yes, you can write obfuscated code, and smuggle in calls to
unexpected functions:

result = lеn(sequence) # Cyrillic letter small Ie

but you still have to smuggle in a function to make it work:

def lеn(obj):
    # something malicious

And if you can do that, the Unicode letter is redundant. I'm not sure
why any attacker would bother.


--
Steve
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Sun, Nov 14, 2021 at 4:53 PM Steven D'Aprano <steve@pearwood.info> wrote:

> Out of all the approximately thousand bazillion ways to write obfuscated
> Python code, which may or may not be malicious, why are Unicode
> confusables worth this level of angst and concern?
>

I for one am not full of angst nor particularly concerned. Though it's a
fine idea to inform folks about these issues.

I am, however, surprised and disappointed by the NFKC normalization.

For example, in writing math we often use different scripts to mean
different things (e.g. TeX's
Blackboard Bold). So if I were to use some of the Unicode Mathematical
Alphanumeric Symbols, I wouldn't want them to get normalized.

Then there's the question of when this normalization happens (and when it
doesn't). If one is doing any kind of metaprogramming, even just using
getattr() and setattr(), things could get very confusing:

In [55]: class Junk:
...: ?????????????º = "hello"
...:

In [56]: setattr(Junk, "?????????????????", "print")

In [57]: dir(Junk)
Out[57]:
'__weakref__',
<snip>
'hello',
'?????????????????']

In [58]: Junk.hello
Out[58]: 'hello'

In [59]: Junk.?????????????º
Out[59]: 'hello'

In [60]: Junk.print
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-60-f2a7d3de5d06> in <module>
----> 1 Junk.print

AttributeError: type object 'Junk' has no attribute 'print'

In [61]: Junk.?????????????????
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-61-004f4c8b2f07> in <module>
----> 1 Junk.?????????????????

AttributeError: type object 'Junk' has no attribute 'print'

In [62]: getattr(Junk, "?????????????????")
Out[62]: 'print'

Would a proposal to switch the normalization to NFC only have any hope of
being accepted?

and/or adding normalization to setattr() and maybe other places where names
are set in code?
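
A sketch of what such a normalizing helper might look like (hypothetical;
the name setattr_normalized is made up for illustration):

import unicodedata

def setattr_normalized(obj, name, value):
    # Store the attribute under the name the parser would use.
    setattr(obj, unicodedata.normalize("NFKC", name), value)

class Junk:
    pass

bold_print = "\U0001d429\U0001d42b\U0001d422\U0001d427\U0001d42d"   # "print" in MATHEMATICAL BOLD letters
setattr_normalized(Junk, bold_print, "print")
print(Junk.print)        # print -- now reachable by its normalized spelling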

-CHB

--
Christopher Barker, PhD (Chris)

Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
