Mailing List Archive

Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)
On 01. 11. 21 18:32, Serhiy Storchaka wrote:
> This is excellent!
>
> 01.11.21 14:17, Petr Viktorin wrote:
>>> CPython treats the control character NUL (``\0``) as end of input,
>>> but many editors simply skip it, possibly showing code that Python
>>> will not run as a regular part of a file.
>
> It is an implementation detail and we will get rid of it. It only
> happens when you read the Python script from a file. If you import it as
> a module or run with runpy, the NUL character is an error.

That brings us to possible changes in Python in this area, which is an
interesting topic.

As for \0, can we ban all ASCII & C1 control characters except
whitespace? I see no place for them in source code.


For homoglyphs/confusables, should there be a SyntaxWarning when an
identifier looks like ASCII but isn't?

For right-to-left text: does anyone actually name identifiers in
Hebrew/Arabic? AFAIK, we should allow a few non-printing
"joiner"/"non-joiner" characters to make it possible to use all Arabic
words. But it would be great to consult with users/teachers of the
languages.
Should Python run the bidi algorithm when parsing and disallow reordered
tokens? Maybe optionally?
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/TGB377QWGIDPUWMAJSZLT22ERGPNZ5FZ/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
02.11.21 16:16, Petr Viktorin wrote:
> As for \0, can we ban all ASCII & C1 control characters except
> whitespace? I see no place for them in source code.

All control characters except CR, LF, TAB and FF are banned outside
comments and string literals. I think it is worth banning them in
comments and string literals too. In string literals you can use
backslash-escape sequences, and comments should be human-readable; there
is no reason to include control characters in them. There is precedent
in the warnings already emitted for some invalid escape sequences in
strings.
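That precedent is easy to demonstrate; the exact warning category has varied across releases (DeprecationWarning, later SyntaxWarning), so the sketch below only checks that *a* warning is raised:

```python
import warnings

# Compiling a string literal with an unrecognized backslash escape
# already routes a warning through the warnings machinery.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    compile(r"s = '\d'", "<demo>", "exec")

assert caught and issubclass(caught[-1].category, Warning)
```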


> For homoglyphs/confusables, should there be a SyntaxWarning when an
> identifier looks like ASCII but isn't?

It would virtually ban Cyrillic. There are a lot of Cyrillic letters
which look like Latin letters, and there are complete words written in
Cyrillic which by accident look like other words written in Latin.

It is a job for linters, which can have many options for configuring
acceptable scripts, use spelling dictionaries and dictionaries of
homoglyphs, etc.
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/DN24FK3A2DSO4HBGEDGJXERSAUYK6VK6/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Serhiy Storchaka wrote:
> 02.11.21 16:16, Petr Viktorin wrote:
> > As for \0, can we ban all ASCII & C1 control characters except
> > whitespace? I see no place for them in source code.

> All control characters except CR, LF, TAB and FF are banned outside
> comments and string literals. I think it is worth banning them in
> comments and string literals too. In string literals you can use
> backslash-escape sequences, and comments should be human-readable; there
> is no reason to include control characters in them.

If escape sequences were also allowed in comments (or at least in strings within comments), this would make sense. I don't like banning them otherwise, since odd characters are often a good reason to need a comment, but it is definitely a "mention, not use" situation.

> > For homoglyphs/confusables, should there be a SyntaxWarning when an
> > identifier looks like ASCII but isn't?
> It would virtually ban Cyrillic. There are a lot of Cyrillic letters
> which look like Latin letters, and there are complete words written in
> Cyrillic which by accident look like other words written in Latin.

At the time, we considered it, and we also considered a narrower restriction on using multiple scripts in the same identifier, or at least the same identifier portion (so it was OK if separated by _).

Simplicity won, in part because of existing practice in Emacs scripting, particularly with some Asian languages.
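As a sketch, the "one script per identifier portion" rule described above could look like this (the helper names are invented; the first word of a character's Unicode name serves as a crude script bucket):

```python
import unicodedata

def script_of(ch):
    """Crude script bucket: the first word of the character's Unicode name."""
    if ch == "_" or (ch.isascii() and not ch.isalpha()):
        return None  # underscores and digits don't pin down a script
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def mixed_script_portions(identifier):
    """Yield (portion, scripts) for '_'-separated portions that mix scripts."""
    for portion in identifier.split("_"):
        scripts = {s for s in map(script_of, portion) if s}
        if len(scripts) > 1:
            yield portion, scripts
```

So "ok_кот" passes (each portion is single-script), while "pаge" with a Cyrillic "а" glued into Latin letters is flagged.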

> It is a job for linters, which can have many options for configuring
> acceptable scripts, use spelling dictionaries and dictionaries of
> homoglyphs, etc.

It might be time for the documentation to mention a specific linter/configuration that does this. It also might be reasonable to do by default in IDLE or even the interactive shell.

-jJ
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/BCZI6HCZJ34XABFFZETJMWFQWOUG4UB4/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Serhiy Storchaka writes:

> All control characters except CR, LF, TAB and FF are banned outside
> comments and string literals. I think it is worth banning them in
> comments and string literals too.

+1

> > For homoglyphs/confusables, should there be a SyntaxWarning when an
> > identifier looks like ASCII but isn't?
>
> It would virtually ban Cyrillic.

+1 (for the comment and for the implied -1 on SyntaxWarning, let's
keep the Cyrillic repertoire in Python!)

> It is a job for linters,

+1

Aside from the reasons Serhiy presents, I'd rather not tie
this kind of rather ambiguous improvement in Unicode handling to the
release cycle.

It might be worth having a pep9999 module/script in Python (perhaps
more likely, PyPI but maintained by whoever does the work to make
these improvements + Petr or somebody Petr trusts to do it), that
lints scripts specifically for confusables and other issues.

Steve
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/Z62GMKAJLHZJD3YSEOJKKBWUZSBYEIVA/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Jim J. Jewett writes:

> At the time, we considered it, and we also considered a narrower
> restriction on using multiple scripts in the same identifier, or at
> least the same identifier portion (so it was OK if separated by
> _).

This would ban "????", aka "pango". That's arguably a good idea
(IMO, 0.9 wink), but might make some GTK/GNOME folks sad.

> Simplicity won, in part because of existing practice in EMACS
> scripting, particularly with some Asian languages.

Interesting. I maintained a couple of Emacs libraries (dictionaries
and input methods) for Japanese in XEmacs, and while hyphen-separated
mixtures of ASCII and Japanese are common, I don't recall ever seeing
an identifier with ASCII and Japanese glommed together without a
separator. It was almost always of the form "English verb - Japanese
lexical component". Or do you consider that "relatively complicated"?

> It might be time for the documentation to mention a specific
> linter/configuration that does this. It also might be reasonable
> to do by default in IDLE or even the interactive shell.

It would have to be easy to turn off, perhaps even with instructions
provided in the messages. I would guess that code which uses non-ASCII
identifiers at all tends to use them heavily, so the warnings would
likely make those tools somewhere between really annoying and unusable.

_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/FPO3EJISKDZUVMC3RMJJQZIKGCOG35CX/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
02.11.21 18:49, Jim J. Jewett wrote:
> If escape sequences were also allowed in comments (or at least in strings within comments), this would make sense. I don't like banning them otherwise, since odd characters are often a good reason to need a comment, but it is definitely a "mention, not use" situation.

If you mean backslash-escape sequences like \uXXXX, there is no reason
to ban them in comments. Unlike in Java, they have no special meaning
outside of string literals. But if you mean terminal control sequences
(which change color or move the cursor), those should not be allowed in
comments.

> At the time, we considered it, and we also considered a narrower restriction on using multiple scripts in the same identifier, or at least the same identifier portion (so it was OK if separated by _).

I implemented these restrictions in one of my projects. The character
set was limited, and even this did not solve all issues with homoglyphs.

I think that we should not introduce such arbitrary limitations at the
parser level and should leave this to linters.

_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/2TT3VX4D4FMRSEOV4O2ICYTC7VC5M2J4/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Tue, Nov 02, 2021 at 05:55:55PM +0200, Serhiy Storchaka wrote:

> All control characters except CR, LF, TAB and FF are banned outside
> comments and string literals. I think it is worth banning them in
> comments and string literals too. In string literals you can use
> backslash-escape sequences, and comments should be human-readable; there
> is no reason to include control characters in them. There is precedent
> in the warnings already emitted for some invalid escape sequences in
> strings.

Agreed. I don't think there is any good reason for including control
characters (apart from whitespace) in comments.

In strings, I would consider allowing VT (vertical tab) as well, since
it is whitespace:

>>> '\v'.isspace()
True

But I don't have a strong opinion on that.


[Petr]
> > For homoglyphs/confusables, should there be a SyntaxWarning when an
> > identifier looks like ASCII but isn't?

Let's not enshrine as a language "feature" that non Western European
languages are dangerous second-class citizens.


> It would virtually ban Cyrillic. There are a lot of Cyrillic letters
> which look like Latin letters, and there are complete words written in
> Cyrillic which by accident look like other words written in Latin.

Agreed.


> It is a job for linters, which can have many options for configuring
> acceptable scripts, use spelling dictionaries and dictionaries of
> homoglyphs, etc.

Linters and editors. I have no objection to people using editors that
highlight non-ASCII characters in blinking red letters, so long as I can
turn that option off :-)



--
Steve
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/RWE5FIWHUM5PSOJ6BI2PAO5TDE3KLC5D/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
We seem to agree that this is work for linters. That's reasonable; I'd
generalize it to "tools and policies". But even so, discussing what we'd
expect linters to do is on topic here.
Perhaps we can even find ways for the language to support linters --
type checking is also for external tools, but has language support.

For example: should the parser emit a lightweight audit event if it
finds a non-ASCII identifier? (See below for why ASCII is special.)
Or for encoding declarations?

On 03. 11. 21 6:26, Stephen J. Turnbull wrote:
> Serhiy Storchaka writes:
>
> > All control characters except CR, LF, TAB and FF are banned outside
> > comments and string literals. I think it is worth banning them in
> > comments and string literals too.
>
> +1
>
> > > For homoglyphs/confusables, should there be a SyntaxWarning when an
> > > identifier looks like ASCII but isn't?
> >
> > It would virtually ban Cyrillic.
>
> +1 (for the comment and for the implied -1 on SyntaxWarning, let's
> keep the Cyrillic repertoire in Python!)

I don't think this would actually ban Cyrillic/Greek.
(My suggestion is not vanilla confusables detection; it might require
careful reading: "should there be a [linter] warning when an identifier
looks like ASCII but isn't?")

I am not a native speaker, but I did try a bit to find an actual
ASCII-like word in a language that uses Cyrillic. I didn't succeed; I
think they might be very rare.
Even if there was such a word -- or a one-letter abbreviation used as a
variable name -- it would be confusing to use. Removing the possibility
of confusion could *help* Cyrillic users. (I can't speak for them; this
is just a brainstorming idea.)

Steven adds:
> Let's not enshrine as a language "feature" that non Western European
> languages are dangerous second-class citizens.

That would be going too far, yes, but the fact is that non-English
languages *are* second-class citizens. Code that uses Python keywords
and stdlib must use English, and possibly another language. It is the
mixing of languages that can be dangerous/confusing, not the languages
themselves.


>
> > It is a job for linters,
>
> +1
>
> Aside from the reasons Serhiy presents, I'd rather not tie
> this kind of rather ambiguous improvement in Unicode handling to the
> release cycle.
>
> It might be worth having a pep9999 module/script in Python (perhaps
> more likely, PyPI but maintained by whoever does the work to make
> these improvements + Petr or somebody Petr trusts to do it), that
> lints scripts specifically for confusables and other issues.

If I have any say in it, the name definitely won't include a PEP number ;)
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/LB4O3YVDNVVNLYPMNH236QXGGUYG4BVI/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
03.11.21 14:31, Petr Viktorin wrote:
> For example: should the parser emit a lightweight audit event if it
> finds a non-ASCII identifier? (See below for why ASCII is special.)
> Or for encoding declarations?

There are audit events for import and compile. You can also register
import hooks if you want fancier preprocessing than just Unicode
decoding. I do not think we need to add more specific audit events; they
were not designed for this.

And I think it is too late to detect suspicious code at the time of its
execution. It should be detected before that code is added to the code
base (review tools, pre-commit hooks).

> I don't think this would actually ban Cyrillic/Greek.
> (My suggestion is not vanilla confusables detection; it might require
> careful reading: "should there be a [linter] warning when an identifier
> looks like ASCII but isn't?")

Yes, but it should be optional and configurable, and not be part of the
Python compiler. This is not our business as Python core developers.

> I am not a native speaker, but I did try a bit to find an actual
> ASCII-like word in a language that uses Cyrillic. I didn't succeed; I
> think they might be very rare.

With a simple script I found 62 words common to English and Ukrainian:
????/racy, ????/rope, ????/puma, ???/mix, etc. And there are many more
English and Ukrainian words which contain only letters that can be
confused with letters from the other script. Identifiers can also
contain abbreviations and shortenings, which cannot all be found in
dictionaries.
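The kind of script described above can be sketched with a tiny hand-made confusables table (the real data in Unicode TS #39 is far larger):

```python
# Lowercase Cyrillic letters with near-identical Latin glyphs.
CONFUSABLE = {"а": "a", "е": "e", "о": "o", "р": "p", "с": "c",
              "у": "y", "х": "x", "і": "i", "ѕ": "s", "ј": "j"}

def latin_lookalike(word):
    """Return the Latin rendering if every letter has a Latin twin, else None."""
    try:
        return "".join(CONFUSABLE[ch] for ch in word)
    except KeyError:
        return None
```

For example, the Cyrillic string "рор" renders exactly like Latin "pop", while a word containing a letter with no Latin twin returns None.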

> Even if there was such a word -- or a one-letter abbreviation used as a
> variable name -- it would be confusing to use. Removing the possibility
> of confusion could *help* Cyrillic users. (I can't speak for them; this
> is just a brainstorming idea.)

I have never used non-Latin identifiers in Python, but I guess that
where they are used (in schools?) there is a mix of English and
non-English identifiers, and identifiers consisting of parts of English
and non-English words without even an underscore between them. I know,
because in other languages people just use inconsistent transliteration.
Emitting any warning by default would discriminate against non-English
users. It would have been better not to add support for non-ASCII
identifiers in the first place.

_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/XHHXRWGKTDTZIYGS6AB3DKEVFH5D6BHV/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Tue, Nov 2, 2021 at 7:21 AM Petr Viktorin <encukou@gmail.com> wrote:

> That brings us to possible changes in Python in this area, which is an
> interesting topic.


Is there a use case or need for allowing the comment-starting character “#”
to occur when text is still in the right-to-left direction? Disallowing
that would prevent Petr’s examples in which active code is displayed after
the comment mark, which to me seems to be one of the more egregious
examples. Or maybe this case is no worse than others and isn’t worth
singling out.

—Chris




Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Stephen J. Turnbull wrote:
> Jim J. Jewett writes:
> > At the time, we considered it, and we also considered a narrower
> > restriction on using multiple scripts in the same identifier, or at
> > least the same identifier portion (so it was OK if separated by
> > _).

> This would ban "????", aka "pango". That's arguably a good idea
> (IMO, 0.9 wink), but might make some GTK/GNOME folks sad.

I am not quite motivated enough to search the archives, but I'm pretty sure the examples actually found were less prominent than that. There seemed to be at least one or two fora where it was something of a local idiom.

>... I don't recall ever seeing
> an identifier with ASCII and Japanese glommed together without a
> separator. It was almost always of the form "English verb - Japanese
> lexical component".

The problem was that some were written without a "-" or "_" to separate the halves. It looked fine -- the script change was obvious even to someone who didn't speak the non-English language. But having to support that meant any remaining restriction on mixed scripts would be either too weak to be worthwhile, or too complicated to write into the Python language specification.

-jJ
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/CUTMZG55WY3CLNNKB6VTPCOUXJ22EEZY/
Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
I’ve not been following the thread, but Steve Holden forwarded me the email from Petr Viktorin, suggesting that I might share some of the info I found while recently diving into this topic.



As part of working on the next edition of “Python in a Nutshell” with Steve, Alex Martelli, and Anna Ravenscroft, Alex suggested that I add a cautionary section on homoglyphs, specifically citing “A” (LATIN CAPITAL LETTER A) and “Α” (GREEK CAPITAL LETTER ALPHA) as an example problem pair. I wanted to look a little further at the use of characters in identifiers beyond the standard 7-bit ASCII, and so I found some of these same issues dealing with Unicode NFKC normalization. The first discovery was the overlapping normalization of “ªº” with “ao”. This was quite a shock to me, since I assumed that the inclusion of Unicode for identifier characters would preserve the uniqueness of the different code points. Even ligatures can be used, and will overlap with their multi-character ASCII forms. So we have added a second note in the upcoming edition on the risks of using these “homonorms” (which is a word I just made up for the occasion).
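The overlaps described above can be verified directly; PEP 3131 is why the parser applies NFKC to identifiers:

```python
import unicodedata

assert unicodedata.normalize("NFKC", "ªº") == "ao"
assert unicodedata.normalize("NFKC", "ﬁ") == "fi"   # U+FB01, the fi ligature

# Because identifiers are NFKC-normalized by the parser, “ªº” and “ao”
# name the same variable:
ns = {}
exec("ªº = 1; collision = (ao == 1)", ns)
assert ns["ao"] == 1 and ns["collision"] is True
```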



To explore the extreme case, I wrote a pyparsing transformer to convert identifiers in a body of Python source to mixed font, equivalent to the original source after NFKC normalization. Here are hello.py, and a snippet from unittest/util.py:



[The mixed-font examples used mathematical-alphanumeric characters that
did not survive archiving; the original rendering is in the GitHub gist
linked below. After NFKC normalization, the util.py snippet is
equivalent to:]

# snippet from unittest/util.py

_PLACEHOLDER_LEN = 12

def _shorten(s, prefixlen, suffixlen):
    skip = len(s) - prefixlen - suffixlen
    if skip > _PLACEHOLDER_LEN:
        s = '%s[%d chars]%s' % (s[:prefixlen], skip, s[len(s) - suffixlen:])
    return s





You should be able to paste these into your local UTF-8-aware editor or IDE and execute them as-is.



(If this doesn’t come through, you can also see it as a GitHub gist, “Hello, World rendered in a variety of Unicode characters”: https://gist.github.com/ptmcg/bf35d5ada416080d481d789988b6b466 . I have a second gist containing the transformer, but it is still private atm.)





Some other discoveries:

“·” (U+00B7, MIDDLE DOT; code point 183, so not actually ASCII) is a valid identifier body character, making “_···” a valid Python identifier. This could actually be another security attack point, in which “s·join(‘x’)” could easily be misread as “s.join(‘x’)”, but would actually be a call to a potentially malicious method “s·join”.

“_” seems to be a special case for normalization. Only the ASCII “_” character is valid as a leading identifier character; the Unicode characters that normalize to “_” (any of the characters in “︳︴﹍﹎﹏＿”) can only be used as identifier body characters. “︳” (U+FE33) especially could be misread as “|” followed by a space, when it actually normalizes to “_”.
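That special case is observable with compile() — “＿” below is U+FF3F (FULLWIDTH LOW LINE), one of the characters that normalize to “_”:

```python
# Accepted: a normalizing underscore in body position...
compile("x＿ = 1", "<demo>", "exec")

# ...but rejected in leading position, where only ASCII "_" may appear.
try:
    compile("＿x = 1", "<demo>", "exec")
    leading_allowed = True
except SyntaxError:
    leading_allowed = False

assert not leading_allowed
```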





Potential beneficial uses:

I am considering taking my transformer code and experimenting with an orthogonal approach to syntax highlighting, using Unicode groups instead of colors. Module names using characters from one group, builtins from another, program variables from another, maybe distinguish local from global variables. Colorizing has always been an obvious syntax highlight feature, but is an accessibility issue for those with difficulty distinguishing colors. Unlike the “ransom note” code above, code highlighted in this way might even be quite pleasing to the eye.
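A minimal sketch of that idea, re-rendering one hypothetical "highlight group" in the MATHEMATICAL BOLD block (NFKC folds it back to ASCII, so the code would still run):

```python
import unicodedata

def to_math_bold(name):
    """Re-render ASCII letters in the MATHEMATICAL BOLD block (U+1D400...)."""
    out = []
    for ch in name:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))
        else:
            out.append(ch)  # digits, underscores, etc. pass through
    return "".join(out)

bold = to_math_bold("main")
assert bold != "main"                                     # different code points
assert unicodedata.normalize("NFKC", bold) == "main"      # same identifier
```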





-- Paul McGuire
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
This is my favourite version of the issue:

? = lambda ?, e: ? if ? > e else e
print(?(2, 1), ?(1, 2)) # python 3 outputs: 2 2

https://twitter.com/stestagg/status/685239650064162820?s=21

Steve

On Sat, 13 Nov 2021 at 22:05, <ptmcg@austin.rr.com> wrote:

> I’ve not been following the thread, but Steve Holden forwarded me the
> email from Petr Viktorin, that I might share some of the info I found
> while recently diving into this topic. [rest of the quoted message
> trimmed]
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On 11/13/2021 4:35 PM, ptmcg@austin.rr.com wrote:
> I’ve not been following the thread, but Steve Holden forwarded me the

> To explore the extreme case, I wrote a pyparsing transformer to convert
> identifiers in a body of Python source to mixed font, equivalent to the
> original source after NFKC normalization. Here are hello.py, and a
> snippet from unittest/util.py:
>
> [mixed-font code examples trimmed]
>
> You should be able to paste these into your local UTF-8-aware editor or
> IDE and execute them as-is.

Wow. After pasting the util.py snippet into current IDLE, which on my
Windows machine* displays the complete text:

>>> dir()
['_PLACEHOLDER_LEN', '__annotations__', '__builtins__', '__doc__',
'__loader__', '__name__', '__package__', '__spec__', '_shorten']
>>> _shorten('abc', 1, 1)
'abc'
>>> _shorten('abcdefghijklmnopqrw', 2, 2)
'ab[15 chars]rw'

* Does not at all work in CommandPrompt, even after supposedly changing
to a utf-8 codepage with 'chcp 65000'.

--
Terry Jan Reedy
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/NSGBCZQ2R6G2HGPAID4ZI35YCRMF7ERC/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Sat, Nov 13, 2021 at 2:03 PM <ptmcg@austin.rr.com> wrote:

> def 𝚑𝚎𝚕𝚕𝚘_𝚠𝚘𝚛𝚕𝚍():
>
>     try:
>
>         𝗵e𝗹𝗹𝗼 = "Hello"
>
>         𝘄𝗼r𝗹𝗱 = "World"
>
>         𝕡𝕣𝕚𝕟𝕥(f"{𝒽ℯ𝓁𝓁º}, {𝔀𝓸𝓻l𝓭}!")
>
>     except 𝙴𝚡𝚌𝚎𝚙𝚝𝚒𝚘𝚗 as 𝖾𝗑c:
>
>         𝓅r𝒾𝓃𝓉("failed: {}".𝖿𝗈𝗋𝗆ª𝗍(𝖾𝗑𝖼))
>

Wow. Just Wow.

So why does Python apply NFKC normalization to variable names?? I can't
for the life of me figure out why that would be helpful at all.

The string methods, sure, but names?

And, in fact, the normalization is not used for string comparisons or
hashes as far as I can tell.

In [36]: weird
Out[36]: '𝕡𝕣𝕚𝕟𝕥'

In [37]: normal
Out[37]: 'print'

In [38]: eval(weird + "('yup, that worked')")
yup, that worked

In [39]: weird == normal
Out[39]: False

In [40]: weird[0] in normal
Out[40]: False

This seems very odd (and dangerous) to me.

Is there a good reason? and is it too late to change it?

-CHB









>
>
> if _﹏𝚗𝚊𝚖𝚎__ == "__main__":
>
>     𝚑e𝚕𝚕𝚘_𝚠𝚘𝚛𝚕𝚍()
>
>
>
>
>
> # snippet from unittest/util.py
>
> _𝙿L𝙰𝙲𝙴𝙷𝙾𝙻𝙳𝙴𝚁﹍𝙻𝙴𝙽 = 12
>
> def _𝚜𝚑𝚘𝚛𝚝𝚎𝚗(𝚜, p𝚛𝚎𝚏𝚒𝚡𝚕𝚎𝚗, 𝚜𝚞𝚏𝚏𝚒𝚡𝚕𝚎𝚗):
>
>     𝚜𝚔𝚒𝚙 = 𝚕𝚎𝚗(𝚜) - 𝚙r𝚎𝚏𝚒x𝚕𝚎𝚗 - 𝚜𝚞𝚏𝚏𝚒𝚡𝚕𝚎𝚗
>
>     if s𝚔i𝚙 > _𝙿𝙻𝙰𝙲𝙴H𝙾𝙻𝙳𝙴𝚁﹍L𝙴𝙽:
>
>         𝚜 = '%s[%d chars]%s' % (𝚜[:𝚙𝚛𝚎𝚏𝚒𝚡𝚕𝚎𝚗], 𝚜𝚔𝚒p, 𝚜[𝚕𝚎𝚗(𝚜) - 𝚜𝚞𝚏𝚏𝚒x𝚕𝚎𝚗:])
>
>     return 𝚜
>
>
>
>
>
> You should be able to paste these into your local UTF-8-aware editor or IDE
> and execute them as-is.
>
>
>
> (If this doesn’t come through, you can also see this as a GitHub gist at Hello,
> World rendered in a variety of Unicode characters (github.com)
> <https://gist.github.com/ptmcg/bf35d5ada416080d481d789988b6b466>. I have
> a second gist containing the transformer, but it is still a private gist
> atm.)
>
>
>
>
>
> Some other discoveries:
>
> “·” (U+00B7, MIDDLE DOT) is a valid identifier body character, making “_···” a
> valid Python identifier. This could actually be another security attack
> point, in which “s·join(‘x’)” could be easily misread as “s.join(‘x’)”, but
> would actually be a call to a potentially malicious method “s·join”.
>
> “_” seems to be a special case for normalization. Only the ASCII “_”
> character is valid as a leading identifier character; the Unicode
> characters that normalize to “_” (any of the characters in “︳︴﹍﹎﹏＿”) can
> only be used as identifier body characters. “︳” especially could be
> misread as “|” followed by a space, when it actually normalizes to “_”.
>
>
>
>
>
> Potential beneficial uses:
>
> I am considering taking my transformer code and experimenting with an
> orthogonal approach to syntax highlighting, using Unicode groups instead of
> colors. Module names using characters from one group, builtins from
> another, program variables from another, maybe distinguish local from
> global variables. Colorizing has always been an obvious syntax highlight
> feature, but is an accessibility issue for those with difficulty
> distinguishing colors. Unlike the “ransom note” code above, code
> highlighted in this way might even be quite pleasing to the eye.
>
>
>
>
>
> -- Paul McGuire
>
>
>
>
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-leave@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/GBLXJ2ZTIMLBD2MJQ4VDNUKFFTPPIIMO/
> Code of Conduct: http://python.org/psf/codeofconduct/
>


--
Christopher Barker, PhD (Chris)

Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
ptmcg@austin.rr.com wrote:

> ... add a cautionary section on homoglyphs, specifically citing
> “A” (LATIN CAPITAL LETTER A) and “Α” (GREEK CAPITAL LETTER ALPHA)
> as an example problem pair.

There is a unicode tech report about confusables, but it is never clear where to stop. Are I (upper case I), l (lower case l) and 1 (numeric 1) from ASCII already a problem? And if we do it at all, is there any way to avoid making Cyrillic languages second-class?

I'm not quickly finding the contemporary report, but these should be helpful if you want to go deeper:

http://www.unicode.org/reports/tr36/
https://www.unicode.org/Public/security/latest/confusables.txt
https://util.unicode.org/UnicodeJsps/confusables.jsp


> I wanted to look a little further at the use of characters in identifiers
> beyond the standard 7-bit ASCII, and so I found some of these same
> issues dealing with Unicode NFKC normalization. The first discovery was
> the overlapping normalization of “ªº” with “ao”.

Here I don't see the problem. Things that look slightly different are really the same, and you can write it either way. So you can use what looks like a funny font, but the closest it comes to a security risk is that maybe you could access something without a casual reader realizing that you are doing so. They would know that you *could* access it, just not that you *did*.

> Some other discoveries:
> “·” (U+00B7, MIDDLE DOT) is a valid identifier body character, making “_···” a valid
> Python identifier.

That and the apostrophe are Unicode consortium regrets, because they are normally punctuation, but there are also languages that use them as letters.
The apostrophe is (supposedly only) used by Afrikaans, I asked a native speaker about where/how often it was used, and the similarity to Dutch was enough that Guido felt comfortable excluding it. (It *may* have been similar to using the apostrophe for a contraction in English, and saying it therefore represents a letter, but the scope was clearly smaller.) But the dot is used in Catalan, and ... we didn't find anyone ready to say it wouldn't be needed for sensible identifiers. It is worth listing as a warning, and linters should probably complain.
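As a quick sanity check (a sketch added here, not part of the original mail), the middle dot really is accepted in identifier-body position, and NFKC leaves it unchanged, so the "·" survives into the runtime name:

```python
# U+00B7 MIDDLE DOT is in Other_ID_Continue, so it is valid *inside* a
# Python identifier (though not as its first character), and it has no
# compatibility decomposition, so NFKC does not fold it away.
ns = {}
exec("_\u00b7\u00b7\u00b7 = 42", ns)   # defines the identifier _···
assert "_\u00b7\u00b7\u00b7" in ns     # stored under the dotted name
print(ns["_\u00b7\u00b7\u00b7"])       # 42
```

This is exactly why a linter warning for U+00B7 in identifiers seems reasonable.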

> “_” seems to be a special case for normalization. Only the ASCII “_”
> character is valid as a leading identifier character; the Unicode
> characters that normalize to “_” (any of the characters in “︳︴﹍﹎﹏＿”)
> can only be used as identifier body characters. “︳” especially could be
> misread as “|” followed by a space, when it actually normalizes to “_”.

So go ahead and warn, but it isn't clear how that could be abused to look like something other than a syntax error, except maybe through soft keywords. (Ha! I snuck in a call to async︳def that had been imported with *, and you didn't worry about the import *, or the apparently wild cursor position marker, or the strange async definition that was never used! No way I could have just issued a call to _flush and done the same thing!)

> Potential beneficial uses:
> I am considering taking my transformer code and experimenting with an
> orthogonal approach to syntax highlighting, using Unicode groups
> instead of colors. Module names using characters from one group,
> builtins from another, program variables from another, maybe
> distinguish local from global variables. Colorizing has always been an
> obvious syntax highlight feature, but is an accessibility issue for those
> with difficulty distinguishing colors.

I kind of like the idea, but ... if you're doing it on-the-fly in the editor, you could just use different fonts. If you're actually saving those changes, it seems likely to lead to a lot of spurious diffs if anyone uses a different editor.

-jJ
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/NPTL43EVT2FF76LXIBBWVHDU6NXH3HF5/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On 2021-11-14 17:17, Christopher Barker wrote:
> On Sat, Nov 13, 2021 at 2:03 PM <ptmcg@austin.rr.com
> <mailto:ptmcg@austin.rr.com>> wrote:
>
> def 𝚑𝚎𝚕𝚕𝚘_𝚠𝚘𝚛𝚕𝚍():
>
>     try:
>
>         𝗵e𝗹𝗹𝗼 = "Hello"
>
>         𝘄𝗼r𝗹𝗱 = "World"
>
>         𝕡𝕣𝕚𝕟𝕥(f"{𝒽ℯ𝓁𝓁º}, {𝔀𝓸𝓻l𝓭}!")
>
>     except 𝙴𝚡𝚌𝚎𝚙𝚝𝚒𝚘𝚗 as 𝖾𝗑c:
>
>         𝓅r𝒾𝓃𝓉("failed: {}".𝖿𝗈𝗋𝗆ª𝗍(𝖾𝗑𝖼))
>
>
> Wow. Just Wow.
>
> So why does Python apply  NFKC normalization to variable names?? I can't
> for the life of me figure out why that would be helpful at all.
>
> The string methods, sure, but names?
>
> And, in fact, the normalization is not used for string comparisons or
> hashes as far as I can tell.
>
[snip]

It's probably to deal with "é" vs "é", i.e. "\N{LATIN SMALL LETTER
E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH ACUTE}",
which are different ways of writing the same thing.

Unfortunately, it goes too far, because it's unlikely that we want "ᵖ"
("\N{MODIFIER LETTER SMALL P}") to be equivalent to "p" ("\N{LATIN
SMALL LETTER P}").
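The difference between the two normalization forms is easy to see with the unicodedata module (an illustrative sketch, not from the original mail):

```python
import unicodedata

# NFC composes combining sequences into precomposed characters...
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
assert unicodedata.normalize("NFC", decomposed) == composed

# ...while NFKC additionally folds "compatibility" characters.
modifier_p = "\u1d56"    # MODIFIER LETTER SMALL P
assert unicodedata.normalize("NFKC", modifier_p) == "p"        # folded
assert unicodedata.normalize("NFC", modifier_p) == modifier_p  # untouched
```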
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/PNZICEQGVEAQH7KNBCBSS4LPAO25JBF3/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Indeed, normative annex https://www.unicode.org/reports/tr31/tr31-35.html
section 5 says: "if the programming language has case-sensitive
identifiers, then Normalization Form C is appropriate" (vs NFKC for a
language with case-insensitive identifiers) so to follow the standard we
should have used NFC rather than NFKC. Not sure if it's too late to fix
this "oops" in future Python versions.
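That the parser applies NFKC rather than NFC can be confirmed directly (a quick sketch, assuming CPython's PEP 3131 behavior):

```python
# U+FB01 LATIN SMALL LIGATURE FI has only a *compatibility* decomposition
# to "fi": NFC leaves it alone, NFKC folds it. The parser folds it, so it
# must be using NFKC.
ns = {}
exec("\ufb01x = 1", ns)      # source spells the identifier "ﬁx"
assert "fix" in ns           # stored under the NFKC-normalized name
assert "\ufb01x" not in ns   # the un-normalized spelling is gone
```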

Alex

On Sun, Nov 14, 2021 at 9:17 AM Christopher Barker <pythonchb@gmail.com>
wrote:

> On Sat, Nov 13, 2021 at 2:03 PM <ptmcg@austin.rr.com> wrote:
>
>> def 𝚑𝚎𝚕𝚕𝚘_𝚠𝚘𝚛𝚕𝚍():
>>
>>     try:
>>
>>         𝗵e𝗹𝗹𝗼 = "Hello"
>>
>>         𝘄𝗼r𝗹𝗱 = "World"
>>
>>         𝕡𝕣𝕚𝕟𝕥(f"{𝒽ℯ𝓁𝓁º}, {𝔀𝓸𝓻l𝓭}!")
>>
>>     except 𝙴𝚡𝚌𝚎𝚙𝚝𝚒𝚘𝚗 as 𝖾𝗑c:
>>
>>         𝓅r𝒾𝓃𝓉("failed: {}".𝖿𝗈𝗋𝗆ª𝗍(𝖾𝗑𝖼))
>>
>
> Wow. Just Wow.
>
> So why does Python apply NFKC normalization to variable names?? I can't
> for the life of me figure out why that would be helpful at all.
>
> The string methods, sure, but names?
>
> And, in fact, the normalization is not used for string comparisons or
> hashes as far as I can tell.
>
> In [36]: weird
> Out[36]: '𝕡𝕣𝕚𝕟𝕥'
>
> In [37]: normal
> Out[37]: 'print'
>
> In [38]: eval(weird + "('yup, that worked')")
> yup, that worked
>
> In [39]: weird == normal
> Out[39]: False
>
> In [40]: weird[0] in normal
> Out[40]: False
>
> This seems very odd (and dangerous) to me.
>
> Is there a good reason? and is it too late to change it?
>
> -CHB
>
>
>
>
>
>
>
>
>
>>
>>
>> if _﹏𝚗𝚊𝚖𝚎__ == "__main__":
>>
>>     𝚑e𝚕𝚕𝚘_𝚠𝚘𝚛𝚕𝚍()
>>
>>
>>
>>
>>
>> # snippet from unittest/util.py
>>
>> _𝙿L𝙰𝙲𝙴𝙷𝙾𝙻𝙳𝙴𝚁﹍𝙻𝙴𝙽 = 12
>>
>> def _𝚜𝚑𝚘𝚛𝚝𝚎𝚗(𝚜, p𝚛𝚎𝚏𝚒𝚡𝚕𝚎𝚗, 𝚜𝚞𝚏𝚏𝚒𝚡𝚕𝚎𝚗):
>>
>>     𝚜𝚔𝚒𝚙 = 𝚕𝚎𝚗(𝚜) - 𝚙r𝚎𝚏𝚒x𝚕𝚎𝚗 - 𝚜𝚞𝚏𝚏𝚒𝚡𝚕𝚎𝚗
>>
>>     if s𝚔i𝚙 > _𝙿𝙻𝙰𝙲𝙴H𝙾𝙻𝙳𝙴𝚁﹍L𝙴𝙽:
>>
>>         𝚜 = '%s[%d chars]%s' % (𝚜[:𝚙𝚛𝚎𝚏𝚒𝚡𝚕𝚎𝚗], 𝚜𝚔𝚒p, 𝚜[𝚕𝚎𝚗(𝚜) - 𝚜𝚞𝚏𝚏𝚒x𝚕𝚎𝚗:])
>>
>>     return 𝚜
>>
>>
>>
>>
>>
>> You should be able to paste these into your local UTF-8-aware editor or IDE
>> and execute them as-is.
>>
>>
>>
>> (If this doesn’t come through, you can also see this as a GitHub gist at Hello,
>> World rendered in a variety of Unicode characters (github.com)
>> <https://gist.github.com/ptmcg/bf35d5ada416080d481d789988b6b466>. I have
>> a second gist containing the transformer, but it is still a private gist
>> atm.)
>>
>>
>>
>>
>>
>> Some other discoveries:
>>
>> “·” (U+00B7, MIDDLE DOT) is a valid identifier body character, making “_···” a
>> valid Python identifier. This could actually be another security attack
>> point, in which “s·join(‘x’)” could be easily misread as “s.join(‘x’)”, but
>> would actually be a call to a potentially malicious method “s·join”.
>>
>> “_” seems to be a special case for normalization. Only the ASCII “_”
>> character is valid as a leading identifier character; the Unicode
>> characters that normalize to “_” (any of the characters in “︳︴﹍﹎﹏＿”) can
>> only be used as identifier body characters. “︳” especially could be
>> misread as “|” followed by a space, when it actually normalizes to “_”.
>>
>>
>>
>>
>>
>> Potential beneficial uses:
>>
>> I am considering taking my transformer code and experimenting with an
>> orthogonal approach to syntax highlighting, using Unicode groups instead of
>> colors. Module names using characters from one group, builtins from
>> another, program variables from another, maybe distinguish local from
>> global variables. Colorizing has always been an obvious syntax highlight
>> feature, but is an accessibility issue for those with difficulty
>> distinguishing colors. Unlike the “ransom note” code above, code
>> highlighted in this way might even be quite pleasing to the eye.
>>
>>
>>
>>
>>
>> -- Paul McGuire
>>
>>
>>
>>
>> _______________________________________________
>> Python-Dev mailing list -- python-dev@python.org
>> To unsubscribe send an email to python-dev-leave@python.org
>> https://mail.python.org/mailman3/lists/python-dev.python.org/
>> Message archived at
>> https://mail.python.org/archives/list/python-dev@python.org/message/GBLXJ2ZTIMLBD2MJQ4VDNUKFFTPPIIMO/
>> Code of Conduct: http://python.org/psf/codeofconduct/
>>
>
>
> --
> Christopher Barker, PhD (Chris)
>
> Python Language Consulting
> - Teaching
> - Scientific Software Development
> - Desktop GUI and Web Development
> - wxPython, numpy, scipy, Cython
>
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Sun, Nov 14, 2021 at 10:27 AM MRAB <python@mrabarnett.plus.com> wrote:

> > So why does Python apply NFKC normalization to variable names??



> It's probably to deal with "é" vs "é", i.e. "\N{LATIN SMALL LETTER
> E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH ACUTE}",
> which are different ways of writing the same thing.
>

sure, but this is code, written by humans (or meta-programming). Maybe I'm
showing my english bias, but would it be that limiting to have identifiers
be based on codepoints, period?

Why does someone who wants to use, e.g., "é" in an identifier have to be
able to represent it two different ways in a code file?

But if so ...


> Unfortunately, it goes too far, because it's unlikely that we want "ᵖ"
> ("\N{MODIFIER LETTER SMALL P}") to be equivalent to "p" ("\N{LATIN
> SMALL LETTER P}").
>

Is it possible to only capture things like the combining characters and not
the "equivalent" ones like the above?

-CHB

--
Christopher Barker, PhD (Chris)

Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Sun, 14 Nov 2021, 19:07 Christopher Barker, <pythonchb@gmail.com> wrote:

> On Sun, Nov 14, 2021 at 10:27 AM MRAB <python@mrabarnett.plus.com> wrote:
>
>> Unfortunately, it goes too far, because it's unlikely that we want "ᵖ"
>> ("\N{MODIFIER LETTER SMALL P}") to be equivalent to "p" ("\N{LATIN
>> SMALL LETTER P}").
>>
>
> Is it possible to only capture things like the combining characters and
> not the "equivalent" ones like the above?
>

Yes, that is NFC. NFKC converts to equivalent characters and also composes;
NFC just composes.

>
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On 11/14/21 2:07 PM, Christopher Barker wrote:
> Why does someone who wants to use, e.g., "é" in an identifier have
> to be able to represent it two different ways in a code file?
>
The issue here is that fundamentally, some editors will produce composed
characters and some decomposed characters to represent the same actual
'character'

These two methods are defined by Unicode to really represent the same
'character', it is just that some defined sequences of combining
codepoints just happen to have a composed 'abbreviation' defined also.

Having to exactly match the byte sequence means that some people will
have a VERY hard time entering usable code if their tools support
Unicode, but use the other convention.

--
Richard Damon

_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/WXGHMDIAY2M77MUMBM4NU7LZTIQTEBNP/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Sun, Nov 14, 2021, 2:14 PM Christopher Barker

> It's probably to deal with "é" vs "é", i.e. "\N{LATIN SMALL LETTER
>> E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH ACUTE}",
>> which are different ways of writing the same thing.
>>
>
> Why does someone who wants to use, e.g., "é" in an identifier have to be
> able to represent it two different ways in a code file?
>

Imagine that two different programmers work with the same code base, and
their text editors or keystrokes enter "é" in different ways.

Or imagine just one programmer doing so on two different
machines/environments.

As an example, I wrote this reply on my Android tablet (with such-and-such
OS version). I have no idea what actual codepoint(s) are entered when I
press and hold the "e" key for a couple seconds to pop up character
variations.

If I wrote it on OSX, I'd probably press "alt-e e" on my US International
key layout. Again, no idea what codepoints actually are entered. If I did
it on Linux, I'd use "ctrl-shift u 00e9". In that case, I actually know the
codepoint.
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On 11/14/21 2:36 PM, David Mertz, Ph.D. wrote:
> On Sun, Nov 14, 2021, 2:14 PM Christopher Barker
>
> It's probably to deal with "é" vs "é", i.e. "\N{LATIN SMALL
> LETTER
> E}\N{COMBINING ACUTE ACCENT}" vs "\N{LATIN SMALL LETTER E WITH
> ACUTE}",
> which are different ways of writing the same thing.
>
>
> Why does someone who wants to use, e.g., "é" in an identifier
> have to be able to represent it two different ways in a code file?
>
>
> Imagine that two different programmers work with the same code base,
> and their text editors or keystrokes enter "é" in different ways.
>
> Or imagine just one programmer doing so on two different
> machines/environments.
>
> As an example, I wrote this reply on my Android tablet (with
> such-and-such OS version). I have no idea what actual codepoint(s) are
> entered when I press and hold the "e" key for a couple seconds to pop
> up character variations.
>
> If I wrote it on OSX, I'd probably press "alt-e e" on my US
> International key layout. Again, no idea what codepoints actually are
> entered. If I did it on Linux, I'd use "ctrl-shift u 00e9". In that
> case, I actually know the codepoint.

But you would have to look up the actual number to enter them.

Imagine if ALL your source code had to be entered via code-point numbers.

BTW, you should be able to enable 'composing' under Linux too, just like
under OSX with the right input driver loaded.

--
Richard Damon

_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/N76K3RML5QIFW56BRRVUOW5HGKSJAIVA/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Out of all the approximately thousand bazillion ways to write obfuscated
Python code, which may or may not be malicious, why are Unicode
confusables worth this level of angst and concern?

I looked up "Unicode homoglyph" on CVE, and found a grand total of seven
hits:

https://www.cvedetails.com/google-search-results.php?q=unicode+homoglyph

all of which appear to be related to impersonation of account names. I
daresay if I expanded my search terms, I would probably find some more,
but it is clear that Unicode homoglyphs are not exactly a major threat.

In my opinion, the other Steve's (Stestagg) example of obfuscated code
with homoglyphs for e (as well as a few similar cases, such as
homoglyphs for A) mostly makes for an amusing curiosity, perhaps worth a
plugin for Pylint and other static checkers, but not much more. I'm not
entirely sure what Paul's more lurid examples are supposed to indicate.
If your threat relies on a malicious coder smuggling in identifiers like
"𝚑𝚎𝚕𝚕𝚘_𝚠𝚘𝚛𝚕𝚍" or "ªº" and having the reader not notice, then I'm not going to
lose much sleep over it.

Confusable account names and URL spoofing are proven, genuine threats.
Beyond that, IMO the actual threat window from confusables is pretty
small. Yes, you can write obfuscated code, and smuggle in calls to
unexpected functions:

result = lеn(sequence)  # Cyrillic letter small Ie

but you still have to smuggle in a function to make it work:

def lеn(obj):
# something malicious

And if you can do that, the Unicode letter is redundant. I'm not sure
why any attacker would bother.
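A short sketch (added here for illustration) shows that the Cyrillic spoof really is a distinct name, and that no normalization form rescues the reader:

```python
import unicodedata

# U+0435 CYRILLIC SMALL LETTER IE renders like Latin "e", but it is a
# different character; neither NFC nor NFKC folds Cyrillic to Latin,
# so "lеn" and "len" never collide as identifiers.
cyrillic_len = "l\u0435n"
assert cyrillic_len != "len"
assert unicodedata.normalize("NFKC", cyrillic_len) != "len"
```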


--
Steve
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/XNRW6JSFGO4DQOGVNY2FEZAUBN6P2HRR/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Sun, Nov 14, 2021 at 4:53 PM Steven D'Aprano <steve@pearwood.info> wrote:

> Out of all the approximately thousand bazillion ways to write obfuscated
> Python code, which may or may not be malicious, why are Unicode
> confusables worth this level of angst and concern?
>

I for one am not full of angst nor particularly concerned. Though it's a
fine idea to inform folks about these issues.

I am, however, surprised and disappointed by the NFKC normalization.

For example, in writing math we often use different scripts to mean
different things (e.g. TeX's
Blackboard Bold). So if I were to use some of the Unicode Mathematical
Alphanumeric Symbols, I wouldn't want them to get normalized.

Then there's the question of when this normalization happens (and when it
doesn't). If one is doing any kind of metaprogramming, even just using
getattr() and setattr(), things could get very confusing:

In [55]: class Junk:
    ...:     𝒽ℯ𝓁𝓁º = "hello"
    ...:

In [56]: setattr(Junk, "𝕡𝕣𝕚𝕟𝕥", "print")

In [57]: dir(Junk)
Out[57]:
'__weakref__',
<snip>
'hello',
'𝕡𝕣𝕚𝕟𝕥']

In [58]: Junk.hello
Out[58]: 'hello'

In [59]: Junk.𝒽ℯ𝓁𝓁º
Out[59]: 'hello'

In [60]: Junk.print
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-60-f2a7d3de5d06> in <module>
----> 1 Junk.print

AttributeError: type object 'Junk' has no attribute 'print'

In [61]: Junk.𝕡𝕣𝕚𝕟𝕥
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-61-004f4c8b2f07> in <module>
----> 1 Junk.𝕡𝕣𝕚𝕟𝕥

AttributeError: type object 'Junk' has no attribute 'print'

In [62]: getattr(Junk, "𝕡𝕣𝕚𝕟𝕥")
Out[62]: 'print'

Would a proposal to switch the normalization to NFC only have any hope of
being accepted?

and/or adding normalization to setattr() and maybe other places where names
are set in code?
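The asymmetry can be reproduced in a few lines (a sketch assuming CPython's PEP 3131 behavior; ᵖ is U+1D56, which NFKC folds to "p"):

```python
weird = "\u1d56rint"                  # "ᵖrint"; NFKC("ᵖrint") == "print"

class Junk:
    pass

setattr(Junk, weird, "hello")          # setattr stores the name as given...
assert not hasattr(Junk, "print")      # ...so the normalized name is absent
assert getattr(Junk, weird) == "hello"

ns = {"Junk": Junk}
exec("Junk.\u1d56rint2 = 1", ns)       # ...but the *parser* normalizes,
assert getattr(Junk, "print2") == 1    # so source-level access uses "print2"
```

In other words, normalization happens once at compile time; the runtime attribute machinery never sees it.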

-CHB

--
Christopher Barker, PhD (Chris)

Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Christopher Barker writes:

> Would a proposal to switch the normalization to NFC only have any hope of
> being accepted?

Hope, yes. Counting you, it's been proposed twice. :-) I don't know
whether it would get through. We know this won't affect the stdlib,
since that's restricted to ASCII. I suppose we could trawl PyPI and
GitHub for "compatibles" (the Unicode term for "K" normalizations).

> For example, in writing math we often use different scripts to mean
> different things (e.g. TeX's Blackboard Bold). So if I were to use
> some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't
> want them to get normalized.

Independent of the question of the normalization of Python
identifiers, I think using those characters this way is a bad idea.
In fact, I think adding these symbols to Unicode was a bad idea; they
should be handled at a higher level in the linguistic stack (by
semantic markup).

You're confusing two things here. In Unicode, a script is a
collection of characters used for a specific language, typically a set
of Unicode blocks of characters (more or less; there are a lot of Han
ideographs that are recognizable as such to Japanese but are not part
of the repertoire of the Japanese script). That is, these characters
are *different* from others that look like them.

Blackboard Bold is more what we would usually call a "font": the
(math) italic "x" and the (math) bold italic "x" are the same "x", but
one denotes a scalar and the other a vector in many math books. A
roman "R" probably denotes the statistical application, an italic "R"
the reaction function in game theory model, and a Blackboard Bold "R"
the set of real numbers. But these are all the same character.

It's a bad idea to rely on different (Unicode) scripts that use the
same glyphs for different characters to look different from each
other, unless you "own" the fonts to be used. As far as I know
there's no way for a Python program to specify the font to be used to
display itself though. :-)

It's also a UX problem. At a slightly higher layer in the stack, I'm
used to using Japanese input methods to input sigma and pi which
produce characters in the Greek block, and at least the upper case
forms that denote sum and product have separate characters in the math
operators block. I understand why people who literally write
mathematics in Greek might want those not normalized, but I sure am
going to keep using "Greek sigma", not "math sigma"! The probability
that I'm going to have a Greek uppercase sigma in my papers is nil,
the probability of a summation symbol near unity. But the summation
symbol is not easily available, I have to scroll through all the
preceding Unicode blocks to find Mathematical Operators. So I am
perfectly happy with uppercase Greek sigma for that role (as is
XeTeX!!)

And the thing is, of course those Greek letters really are Greek
letters: they were chosen because pi is the homophone of p which is
the first letter of "product", and sigma is the homophone of s which
is the first letter of "sum". Å for Ångström is similar, it's
initial letter of a Swedish name.

Sure, we could fix the input methods (and search methods!! -- people
are going to input the character they know that corresponds to the
glyph *they* see, not the bit pattern the *CPU* sees). But that's as
bad as trying to fix mail clients. Not worth the effort because I'm
pretty sure you're gonna fail -- it's one of those "you'll have to pry
this crappy software that annoys admins around the world from my cold
dead fingers" issues, which is why their devs refuse to fix them.

Steve
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/5GHPVNJLLOKBYPE7FSU5766XYP6IJPEK/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Well,

Yet another issue is adding vulnerabilities in plain sight.

Human code reviewers will see this:

if user.admin == "something":

Static analysers will see

if user.admin == "something<hidden chars>":

but will not flag it as it's up to the user to verify the logic of things

and as such, software authors can plant backdoors in plain sight.
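A minimal illustration of the trick (added here; using a zero-width space as the hypothetical hidden character):

```python
# "something\u200b" renders identically to "something" in many editors
# and diff viewers, yet the two strings compare unequal -- so an equality
# check against the padded literal silently never matches the real value.
visible = "something"
sneaky = "something\u200b"   # trailing ZERO WIDTH SPACE
assert visible != sneaky
assert len(sneaky) == len(visible) + 1
```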

Kind Regards,

Abdur-Rahmaan Janhangeer
about <https://compileralchemy.github.io/> | blog
<https://www.pythonkitchen.com>
github <https://github.com/Abdur-RahmaanJ>
Mauritius
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Sat, Nov 13, 2021 at 5:04 PM <ptmcg@austin.rr.com> wrote:

>
>
> def 𝚑𝚎𝚕𝚕𝚘_𝚠𝚘𝚛𝚕𝚍():
>
>     try:
>
>         𝗵e𝗹𝗹𝗼 = "Hello"
>
>         𝘄𝗼r𝗹𝗱 = "World"
>
>         𝕡𝕣𝕚𝕟𝕥(f"{𝒽ℯ𝓁𝓁º}, {𝔀𝓸𝓻l𝓭}!")
>
>     except 𝙴𝚡𝚌𝚎𝚙𝚝𝚒𝚘𝚗 as 𝖾𝗑c:
>
>         𝓅r𝒾𝓃𝓉("failed: {}".𝖿𝗈𝗋𝗆ª𝗍(𝖾𝗑𝖼))
>
>
>
> if _﹏𝚗𝚊𝚖𝚎__ == "__main__":
>
>     𝚑e𝚕𝚕𝚘_𝚠𝚘𝚛𝚕𝚍()
>
>
>
>
>
> # snippet from unittest/util.py
>
> _𝙿L𝙰𝙲𝙴𝙷𝙾𝙻𝙳𝙴𝚁﹍𝙻𝙴𝙽 = 12
>
> def _𝚜𝚑𝚘𝚛𝚝𝚎𝚗(𝚜, p𝚛𝚎𝚏𝚒𝚡𝚕𝚎𝚗, 𝚜𝚞𝚏𝚏𝚒𝚡𝚕𝚎𝚗):
>
>     𝚜𝚔𝚒𝚙 = 𝚕𝚎𝚗(𝚜) - 𝚙r𝚎𝚏𝚒x𝚕𝚎𝚗 - 𝚜𝚞𝚏𝚏𝚒𝚡𝚕𝚎𝚗
>
>     if s𝚔i𝚙 > _𝙿𝙻𝙰𝙲𝙴H𝙾𝙻𝙳𝙴𝚁﹍L𝙴𝙽:
>
>         𝚜 = '%s[%d chars]%s' % (𝚜[:𝚙𝚛𝚎𝚏𝚒𝚡𝚕𝚎𝚗], 𝚜𝚔𝚒p, 𝚜[𝚕𝚎𝚗(𝚜) - 𝚜𝚞𝚏𝚏𝚒x𝚕𝚎𝚗:])
>
>     return 𝚜
>

0_o color me impressed, I did not think that would be legal syntax. Would
be interesting to include in a textbook, if for nothing else other than to
academically demonstrate that it is possible, as I suspect many are not
aware.

--
--Kyle R. Stanley, Python Core Developer (what is a core dev?
<https://devguide.python.org/coredev/>)
*Pronouns: they/them **(why is my pronoun here?*
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
)
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On 15. 11. 21 9:25, Stephen J. Turnbull wrote:
> Christopher Barker writes:
>
> > Would a proposal to switch the normalization to NFC only have any hope of
> > being accepted?
>
> Hope, yes. Counting you, it's been proposed twice. :-) I don't know
> whether it would get through. We know this won't affect the stdlib,
> since that's restricted to ASCII. I suppose we could trawl PyPI and
> GitHub for "compatibles" (the Unicode term for "K" normalizations).

I don't think PyPI/GitHub are good resources to trawl.

Non-ASCII identifiers were added for the benefit of people who use
non-English languages. But both PyPI and GitHub overwhelmingly host
projects written in English -- especially if you look at the more
popular projects.
It would be interesting to reach out to the target audience here... but
they're not on this list, either. Do we actually know anyone using this?


I do teach beginners in a non-English language, but tell them that they
need to learn English if they want to do any serious programming. Any
code that's to be shared more widely than a country effectively has to
be in English. It seems to me that at the level where you worry about
supply chain attacks and you're doing code audits, something like
CPython's policy (ASCII only except proper names and Unicode-related
tests) is a good idea.
Or not? I don't know anyone who actually uses non-ASCII identifiers for
a serious project.
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/AVCLMBIXWPNIIKRFMGTS5SETUCGAONLK/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Mon, Nov 15, 2021 at 12:33:54PM +0400, Abdur-Rahmaan Janhangeer wrote:

> Yet another issue is adding vulnerabilities in plain sight.
> Human code reviewers will see this:
>
> if user.admin == "something":
>
> Static analysers will see
>
> if user.admin == "something<hidden chars>":

Okay, you have a string literal with hidden characters. Assuming that
your editor actually renders them as invisible, rather than as
"something" followed by boxes, replacement characters, or similar
placeholder glyphs.

Now what happens? Where do you go from there to a vulnerability or
backdoor? I think it might be a bit obvious that there is something
funny going on if I see:

if (user.admin == "root" and check_password_securely()
        or user.admin == "root"
        # Second string has hidden characters, do not remove it.
        ):
    elevate_privileges()

even without the comment :-)

In another thread, Serhiy already suggested we ban invisible control
characters (other than whitespace) in comments and strings.

https://mail.python.org/archives/list/python-dev@python.org/message/DN24FK3A2DSO4HBGEDGJXERSAUYK6VK6/

I think that is a good idea.

But beyond the C0 and C1 control characters, we should be conservative
about banning "hidden characters" without a *concrete* threat. For
example, variation selectors are "hidden", but they change the visual
look of emoji and other characters. Even if you think that being able to
set the skin tone of your emoji or choose different national flags using
variation selectors is pure frippery, they are also necessary for
Mongolian and some CJK ideographs.

http://unicode.org/reports/tr28/tr28-3.html#13_7_variation_selectors

I'm not sure about bidirectional controls; I have to leave that to
people with more experience in bidirectional text than I have. I think
that many editors in common use don't support bidirectional text, or at
least the ones I use don't seem to support it fully or correctly. But
for what little it is worth, my feeling is that people who use RTL or
bidirectional strings and have editors that support them will be annoyed
if we ban them from strings for the comfort of people who may never in
their life come across a string containing such bidirectional text.

But, if there is a concrete threat beyond "it looks weird", that is
another issue.


> but will not flag it as it's up to the user to verify the logic of
> things

There is no reason why linters and code checkers shouldn't check for
invisible characters, Unicode confusables or mixed script identifiers
and flag them. The interpreter shouldn't concern itself with such purely
stylistic issues unless there is a concrete threat that can only be
handled by the interpreter itself.
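
A check along those lines is easy to sketch with the stdlib tokenizer. This is
a hypothetical linter pass, not an existing tool; it reports raw C0/C1 control
characters (other than tab) embedded in string tokens:

```python
import io
import tokenize

def check_controls(source: str):
    """Report raw C0/C1 controls (except tab) inside string tokens."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING:
            for ch in tok.string:
                if ch != "\t" and (ord(ch) < 0x20 or 0x7F <= ord(ch) <= 0x9F):
                    hits.append((tok.start, hex(ord(ch))))
    return hits

# A BEL character hiding in a string literal:
print(check_controls('x = "ab\x07cd"\n'))  # [((1, 4), '0x7')]
```

A real linter would report visible escape sequences to use instead, but the
detection itself is this small.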


--
Steve
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/KSIBL3KMONIETBKXSBPPMA27MACWIH33/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Greetings,


> Now what happens? Where do you go from there to a vulnerability or
backdoor? I think it might be a bit obvious that there is something
funny going on if I see:

if (user.admin == "root" and check_password_securely()
or user.admin == "root"
# Second string has hidden characters, do not remove it.
):
elevate_privileges()


Well, it's not so obvious. From Ross Anderson and Nicholas Boucher
src: https://trojansource.codes/trojan-source.pdf

See appendix H. for Python.

with implementations:

https://github.com/nickboucher/trojan-source/tree/main/Python

These rely precisely on bidirectional control characters and/or
look-alike replacements.

> There is no reason why linters and code checkers shouldn't check for
invisible characters, Unicode confusables or mixed script identifiers
and flag them. The interpreter shouldn't concern itself with such purely
stylistic issues unless there is a concrete threat that can only be
handled by the interpreter itself.


I mean current linters. But it will be good to check those, for sure.
As a programmer, I don't want a language which bans unicode stuffs.
If there's something that should be fixed, it's the Unicode standard,
maybe defining a sane mode where weird unicode stuffs are not allowed.
It can also be done from the language side in the event that it's not
considered in the standard itself.

I don't see it as a language fault nor as a client fault, as they are
following the Unicode docs. But the response was mixed, with some
languages deciding to patch it from their side, some linters
implementing detection for it, and some editors flagging it or
rendering it as the exploit intended.

Kind Regards,

Abdur-Rahmaan Janhangeer
Mauritius
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Sun, Nov 14, 2021 at 10:12:39PM -0800, Christopher Barker wrote:

> I am, however, surprised and disappointed by the NFKC normalization.
>
> For example, in writing math we often use different scripts to mean
> different things (e.g. TeX's Blackboard Bold). So if I were to use
> some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't want
> them to get normalized.

Hmmm... would you really want these to all be different identifiers?

𝐁 𝐵 𝑩 𝔹 B

You're assuming the reader of the code has the right typeface to view
them (rather than as mere boxes), and that their eyesight is good enough
to distinguish the variations even if their editor applies bold or
italic as part of syntax highlighting. That's very bold of you :-)

In any case, the question of NFKC versus NFC was certainly considered,
but unfortunately PEP 3131 doesn't document why NFKC was chosen.

https://www.python.org/dev/peps/pep-3131/

Before we change the normalisation rules, it would probably be a good
idea to trawl through the archives of the mailing list and work out why
NFKC was chosen in the first place, or contact Martin von Löwis and see
if he remembers.


> Then there's the question of when this normalization happens (and when it
> doesn't). If one is doing any kind of metaprogramming, even just using
> getattr() and setattr(), things could get very confusing:

For ordinary identifiers, they are normalised at some point during
compilation or interpretation. It probably doesn't matter exactly when.

Strings should *not* be normalised when using subscripting on a dict,
not even on globals():

https://bugs.python.org/issue42680

I'm not sure about setattr and getattr. I think that they should be
normalised. But apparently they aren't:

>>> from types import SimpleNamespace
>>> obj = SimpleNamespace(B=1)
>>> setattr(obj, '𝐁', 2)
>>> obj
namespace(B=1, 𝐁=2)
>>> obj.B
1
>>> obj.𝐁
1

See also here:

https://bugs.python.org/issue35105
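
(For reference, the two normalization forms can be compared directly with
`unicodedata`; using MATHEMATICAL BOLD CAPITAL B as the example:)

```python
import unicodedata

bold_b = "\U0001D401"  # MATHEMATICAL BOLD CAPITAL B

# NFKC folds compatibility characters down to their plain forms...
print(unicodedata.normalize("NFKC", bold_b))  # 'B'

# ...while NFC leaves them distinct:
print(unicodedata.normalize("NFC", bold_b) == bold_b)  # True
```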



--
Steve
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/7XZJPFED3YJSJ73YSPWCQPN6NLTNEMBI/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Mon, Nov 15, 2021 at 10:22 PM Abdur-Rahmaan Janhangeer
<arj.python@gmail.com> wrote:
>
> Greetings,
>
>
> > Now what happens? Where do you go from there to a vulnerability or
> backdoor? I think it might be a bit obvious that there is something
> funny going on if I see:
>
> if (user.admin == "root" and check_password_securely()
> or user.admin == "root"
> # Second string has hidden characters, do not remove it.
> ):
> elevate_privileges()
>
>
> Well, it's not so obvious. From Ross Anderson and Nicholas Boucher
> src: https://trojansource.codes/trojan-source.pdf
>
> See appendix H. for Python.
>
> with implementations:
>
> https://github.com/nickboucher/trojan-source/tree/main/Python
>
> Rely precisely on bidirectional control chars and/or replacing look alikes

The point of those kinds of attacks is that syntax highlighters and
related code review tools would misinterpret them. So I pulled them
all up in both GitHub's view and the editor I personally use (SciTE,
albeit a fairly old version now). GitHub specifically flags it as a
possible exploit in a couple of cases, but also syntax highlights the
return keyword appropriately. SciTE doesn't give any sort of warnings,
but again, correctly highlights the code - early-return shows "return"
as a keyword, invisible-function shows the name "is_" as the function
name and the rest not, homoglyph-function shows a quite
distinct-looking letter that definitely isn't an H.

The problems here are not Python's, they are code reviewers', and that
means they're really attacks against the code review tools. It's no
different from using the variable m in one place and rn in another,
and hoping that code review uses a proportionally-spaced font that
makes those look similar. So to count as a viable attack, there needs
to be at least one tool that misparses these; so far, I haven't found
one, but if I do, wouldn't it be more appropriate to raise the bug
report against the tool?

> > There is no reason why linters and code checkers shouldn't check for
> invisible characters, Unicode confusables or mixed script identifiers
> and flag them. The interpreter shouldn't concern itself with such purely
> stylistic issues unless there is a concrete threat that can only be
> handled by the interpreter itself.
>
>
> I mean current linters. But it will be good to check those for sure.
> As a programmer, i don't want a language which bans unicode stuffs.
> If there's something that should be fixed, it's the unicode standard, maybe
> defining a sane mode where weird unicode stuffs are not allowed. Can also
> be from language side in the event where it's not being considered in the standard
> itself.

Uhhm..... "weird unicode stuffs"? Please clarify.

> I don't see it as a language fault nor as a client fault as they are considering
> the unicode docs but the response was mixed with some languages decided to patch it
> from their side, some linters implementing detection for it as well as some editors flagging
> it and rendering it as the exploit intended.

I see it as an editor issue (or code review tool, as the case may be).
You'd be hard-pressed to get something past code review if it looks to
everyone else like you slipped a "return" statement at the end of a
docstring.

So far, I've seen fewer problems from "weird unicode stuffs" than from
the quoted-printable encoding, and that's an attack that involves
nothing but ASCII text. It's also an attack that far more code review
tools seem to be vulnerable to.

ChrisA
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/OUPC6LGFXIILBTNEC4FYTERBX7VKQHDX/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On 15.11.2021 12:36, Steven D'Aprano wrote:
> On Sun, Nov 14, 2021 at 10:12:39PM -0800, Christopher Barker wrote:
>
>> I am, however, surprised and disappointed by the NFKC normalization.
>>
>> For example, in writing math we often use different scripts to mean
>> different things (e.g. TeX's Blackboard Bold). So if I were to use
>> some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't want
>> them to get normalized.
>
> Hmmm... would you really want these to all be different identifiers?
>
> 𝐁 𝐵 𝑩 𝔹 B
>
> You're assuming the reader of the code has the right typeface to view
> them (rather than as mere boxes), and that their eyesight is good enough
> to distinguish the variations even if their editor applies bold or
> italic as part of syntax highlighting. That's very bold of you :-)
>
> In any case, the question of NFKC versus NFC was certainly considered,
> but unfortunately PEP 3131 doesn't document why NFKC was chosen.
>
> https://www.python.org/dev/peps/pep-3131/
>
> Before we change the normalisation rules, it would probably be a good
> idea to trawl through the archives of the mailing list and work out why
> NFKC was chosen in the first place, or contact Martin von Löwis and see
> if he remembers.

This was raised in the discussion, but never conclusively answered:

https://mail.python.org/pipermail/python-3000/2007-May/007995.html

NFKC is the standard normalization form when you want remove any
typography related variants/hints from the text before comparing
strings. See http://www.unicode.org/reports/tr15/

I guess that's why Martin chose this form, since the point
was to maintain readability, even if different variants of a
character are used in the source code. A "B" in the source code
should be interpreted as an ASCII B, even when written
as 𝐁, 𝐵, 𝑩 or 𝔹.

This simplifies writing code and does away with many of the
security issues you could otherwise run into (where e.g. the
absence of an identifier causes the application flow to
be different).

>> Then there's the question of when this normalization happens (and when it
>> doesn't).

It happens in the parser when reading a non-ASCII identifier
(see Parser/pegen.c), so only applies to source code, not attributes
you dynamically add to e.g. class or module namespaces.
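
A quick sketch of that asymmetry (identifiers in compiled source are
normalized; dynamically added names are not):

```python
# Identifiers are NFKC-normalized by the parser, so source spelled with
# MATHEMATICAL BOLD CAPITAL B compiles to the plain name "B":
ns = {}
exec(compile("\U0001D401 = 1", "<demo>", "exec"), ns)
print("B" in ns)            # True
print("\U0001D401" in ns)   # False

# A name added dynamically is stored verbatim, un-normalized:
ns["\U0001D401"] = 2
print("\U0001D401" in ns)   # True
```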

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Nov 15 2021)
>>> Python Projects, Coaching and Support ... https://www.egenix.com/
>>> Python Product Development ... https://consulting.egenix.com/
________________________________________________________________________

::: We implement business ideas - efficiently in both time and costs :::

eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
https://www.egenix.com/company/contact/
https://www.malemburg.com/

Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/SNN2WZ3MOH5IACSZVHGS6DKTNMKO5JBV/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Abdur-Rahmaan Janhangeer writes:

> As a programmer, i don't want a language which bans unicode stuffs.

But that's what Unicode says should be done (see below).

> If there's something that should be fixed, it's the unicode standard,

Unicode is not going to get "fixed". Most features are important for
some natural language or other. One could argue that (for example)
math symbols that are adopted directly from some character repertoire
should not have been -- I did so elsewhere, although not terribly
seriously.

> maybe defining a sane mode where weird unicode stuffs are not
> allowed.

Unicode denies responsibility for that by permitting arbitrary
subsetting. It does have a couple of (very broad) subsets predefined,
ie, the normalization formats.


Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/63FDIQQNJKCH7C3NMEN3ECRHTA7JHJ2W/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On 11/15/2021 5:45 AM, Steven D'Aprano wrote:

> In another thread, Serhiy already suggested we ban invisible control
> characters (other than whitespace) in comments and strings.

He said in string *literals*. One would put them in strings by using
visible escape sequences.

>>> '\033' == '\x1b' == '\u001b'
True

> https://mail.python.org/archives/list/python-dev@python.org/message/DN24FK3A2DSO4HBGEDGJXERSAUYK6VK6/
>
> I think that is a good idea.

If one is outputting terminal control sequences, making the escape char
visible is a good idea anyway. It would be easier if '\e' worked. (But
see below.)
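
(For the record, ESC can already be spelled several visible ways, even
without '\e':)

```python
# '\e' is not a recognized Python escape, but the same byte can be
# written as any of these:
ESC = "\x1b"
assert ESC == "\033" == "\u001b"

# Visible-escape style for terminal control sequences:
RED, RESET = ESC + "[31m", ESC + "[0m"
print(f"{RED}warning{RESET}")  # renders in red on a VT-compatible terminal
```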

> But beyond the C0 and C1 control characters, we should be conservative
> about banning "hidden characters" without a *concrete* threat. For
> example, variation selectors are "hidden", but they change the visual
> look of emoji and other characters.
I can imagine that a complete emoji point and click input method might
have one select the emoji and the variation and output the pair
together. An option to output the selection character as the
appropriate python-specific '\unnnn' is unlikely, and even if there
were, who would know what it meant? Users would want the selected
variation visible if the editor supported such.

If terminal escape sequences were also selected by point and click, my
comment above would change.

--
Terry Jan Reedy

Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/4IMXVQFZI3VDHA4D2YZD4KTBU7GSEFPW/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
> GitHub specifically flags it as a
possible exploit in a couple of cases, but also syntax highlights the
return keyword appropriately.

My guess is that GitHub patched it afterwards, as the paper does list
GitHub as vulnerable.

> Uhhm..... "weird unicode stuffs"? Please clarify.

Wriggly texts, just because they appear different.

Well, it's tool based, but maybe compiler checks, i.e. checks from the
language side, are something that should be insisted upon too, to patch
over inconsistent checks across editors.

The reason I was saying it's related to encodings is that when languages
are impacted en masse, maybe it hints at a revision of the Unicode
standard, or at the very least warnings. As Steven was hinting towards
the vulnerability even before I posted the paper, maybe those in charge
of the Unicode standard should study and predict angles of attack.

Kind Regards,

Abdur-Rahmaan Janhangeer
Mauritius
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Mon, Nov 15, 2021 at 12:28:01PM -0500, Terry Reedy wrote:
> On 11/15/2021 5:45 AM, Steven D'Aprano wrote:
>
> >In another thread, Serhiy already suggested we ban invisible control
> >characters (other than whitespace) in comments and strings.
>
> He said in string *literals*. One would put them in strings by using
> visible escape sequences.

Thanks Terry for the clarification, of course I didn't mean to imply
that we should ban control characters in strings completely. Only actual
control characters embedded in string literals in the source, just as we
already currently ban them outside of comments and strings.


--
Steve
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/XCPSQYKOX4YXDIAACDLL3I5OYWFGFLD7/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Mon, Nov 15, 2021 at 03:20:26PM +0400, Abdur-Rahmaan Janhangeer wrote:

> Well, it's not so obvious. From Ross Anderson and Nicholas Boucher
> src: https://trojansource.codes/trojan-source.pdf

Thanks for the link. But it discusses a whole range of Unicode attacks,
and the specific attack you mentioned (Invisible Character Attacks) is
described in section D page 7 as "unlikely to work in practice".

As they say, compilers and interpreters in general already display
errors, or at least a warning, for invisible characters in code.

In addition, there is the difficulty that it's not enough just to use
invisible characters to call a different function; you also have to
smuggle in the hostile function that you actually want to call.

It does seem that the Trojan-Source attack listed in the paper is new,
but others (such as the homoglyph attacks that get most people's
attention) are neither new nor especially easy to actually exploit:
Unicode has been warning about them for many years, and we discussed
them in PEP 3131.

Perhaps that's why there are no, or very few, actual exploits of this in
the wild. Homoglyph attacks against user-names and URLs, absolutely, but
homoglyph attacks against source code are a different story.

Yes, you can cunningly have two classes like Α and A and the Python
interpreter will treat them as distinct, but you still have to smuggle
your hostile code into Α (Greek Alpha) without anyone noticing, and you
have to avoid anyone asking why you have two classes with the same name.
And that's the hard part.

We don't need Unicode for homoglyph attacks. func0 and funcO may look
identical, or nearly identical, but you still have to smuggle your
hostile code into funcO without anyone noticing, and that's why there
are so few real-world homoglyph attacks.
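
(Note that NFKC normalization does not unify cross-script homoglyphs;
these three remain distinct identifiers:)

```python
import unicodedata

latin, greek, cyrillic = "A", "\u0391", "\u0410"
for ch in (latin, greek, cyrillic):
    print(unicodedata.name(ch))
# LATIN CAPITAL LETTER A
# GREEK CAPITAL LETTER ALPHA
# CYRILLIC CAPITAL LETTER A

# All three survive NFKC unchanged, so they stay three separate names:
print(len({unicodedata.normalize("NFKC", ch) for ch in (latin, greek, cyrillic)}))  # 3
```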

Whereas the Trojan Source attacks using BIDI controls do seem to be
genuinely exploitable.


--
Steve
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/FSHGS4AOAGTWKSWAADZWH5L2GGBWHHXE/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Mon, Nov 15, 2021 at 10:43:12PM +1100, Chris Angelico wrote:

> The problems here are not Python's, they are code reviewers', and that
> means they're really attacks against the code review tools.

I think that's a bit strong. Boucher and Anderson's paper describes
multiple kinds of vulnerabilities. At a fairly quick glance, the BIDI
attack does seem to be novel, and probably exploitable.

But unfortunately it is the Unicode confusables or homoglyph attack
that seems to be getting most of the attention, and that's not new: it
is as old as ASCII, and not so easily exploitable. Being able to have
А (Cyrillic), Α (Greek Alpha) and A (Latin) in the same code base makes
for a nice way to write obfuscated code, but it's *obviously* obfuscated
and not so easy to use to smuggle in hostile code.

Whereas the BIDI attacks do (apparently) make it easy to smuggle in
code: using invisible BIDI control codes, you can introduce source code
where the way the editor renders the code, and the way the coder reads
it, is different from the way the interpreter or compiler runs it.

That is, I think, new and exploitable: something that looks like a
comment is actually code that the interpreter runs, and something that
looks like code is actually a string or comment which is not executed,
but editors may syntax-colour it as if it were code.

Obviously we can mitigate against this by improving the editors (at the
very least, all editors should have a Show Invisible Characters option).
Linters and code checks should also flag problematic code containing
BIDI codes, or attacks against docstrings.

Beyond that, it is not clear to me what, if anything, we should do in
response to this new class of Trojan Source attacks, beyond documenting
it.
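
Such a flagging pass is a one-liner over the explicit BIDI controls (a
sketch of what a linter could do, not an existing tool):

```python
# The BIDI embedding/override/isolate controls (LRE, RLE, PDF, LRO, RLO,
# LRI, RLI, FSI, PDI):
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")

def flag_bidi(source: str):
    """Return (index, codepoint) pairs for BIDI controls in the text."""
    return [(i, hex(ord(ch))) for i, ch in enumerate(source) if ch in BIDI_CONTROLS]

# An RLO character hiding inside a string literal:
print(flag_bidi('x = "abc\u202edef"'))  # [(8, '0x202e')]
```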

--
Steve
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/SXF2BG47UZTI7QM7GB3XCTGEV576UZOE/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Tue, Nov 16, 2021 at 12:13 PM Steven D'Aprano <steve@pearwood.info> wrote:
>
> On Mon, Nov 15, 2021 at 10:43:12PM +1100, Chris Angelico wrote:
>
> > The problems here are not Python's, they are code reviewers', and that
> > means they're really attacks against the code review tools.
>
> I think that's a bit strong. Boucher and Anderson's paper describes
> multiple kinds of vulnerabilities. At a fairly quick glance, the BIDI
> attack does seem to be novel, and probably exploitable.

The BIDI attacks basically amount to making this:

def func():
    """This is a docstring"""; return

look like this:

def func():
    """This is a docstring; return"""

If you see something that looks like the second, but the word "return"
is syntax-highlighted as a keyword instead of part of the string, the
attack has failed. (Or if you ignore that, then your code review is
flawed, and you're letting malicious code in.) The attack depends for
its success on some human approving some piece of code that doesn't do
what they think it does, and that means it has to look like what it
doesn't do - which is an attack against what the code looks like,
since what it does is very well defined.

> Whereas the BIDI attacks do (apparently) make it easy to smuggle in
> code: using invisible BIDI control codes, you can introduce source code
> where the way the editor renders the code, and the way the coder reads
> it, is different from the way the interpreter or compiler runs it.

Right: the way the editor renders the code, that's the essential part.
That's why I consider this an attack against some editor (or set of
editors). When you find an editor that is vulnerable to this, file a
bug report against that editor.

The way the coder reads it will be heavily based upon the way the
editor colours it.

> That is, I think, new and exploitable: something that looks like a
> comment is actually code that the interpreter runs, and something that
> looks like code is actually a string or comment which is not executed,
> but editors may syntax-colour it as if it were code.

Right. Exactly my point: editors may syntax-colour it incorrectly.

That's why I consider this not an attack on the language, but on the
editor. As long as the editor parses it the exact same way that the
interpreter does, there isn't a problem.

ChrisA
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/3X6K5YYBRATECDRTN57XNT3QNP2J6ZBG/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Compatibility variants can look different, but they can also look identical. Allowing any non-ASCII characters was worrisome because of the security implications of confusables. Squashing compatibility characters seemed the more conservative choice at the time. Stestagg's example:
ℯ = lambda ℯ, e: ℯ if ℯ > e else e
shows it wasn't perfect, but adding more invisible differences does have risks, even beyond the backwards incompatibility and the problem with (hopefully rare, but are we sure?) editors that don't distinguish between them in the way a programming language would prefer.
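
(Assuming the non-ASCII identifier in that example is U+212F SCRIPT SMALL E,
the collision is observable directly, since the parser normalizes it to a
plain "e" before checking the parameter list:)

```python
import unicodedata

assert unicodedata.normalize("NFKC", "\u212f") == "e"  # SCRIPT SMALL E -> e

# After normalization the lambda has two parameters spelled "e":
try:
    compile("f = lambda \u212f, e: \u212f if \u212f > e else e", "<demo>", "exec")
except SyntaxError as exc:
    print(exc)  # e.g. duplicate argument 'e' in function definition
```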

I think (but won't swear) that there were also several problematic characters that really should have been treated as (at most) glyph variants, but ... weren't. If I Recall Correctly, the largest number were Arabic presentation forms, but there were also a few characters that were in Unicode only to support round-trip conversion with a legacy charset, even if that charset had been declared buggy. In at least a few of these cases, it seemed likely that a beginning user would expect them to be equivalent.

-jJ
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/GNT3AG2SCVLMCJAZXSTIWFKKAYG25E7O/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Stephen J. Turnbull wrote:
> Christopher Barker writes:

> > For example, in writing math we often use different scripts to mean
> > different things (e.g. TeX's Blackboard Bold). So if I were to use
> > some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't
> > want them to get normalized.

Agreed, for careful writers. But Stephen's answer about people using the wrong one and expecting it to work means that normalization is probably the lesser of evils for most people, and the ones who don't want it normalized are more likely to be able to specify custom processing when it is important enough. (The compatibility characters aren't normalized in strings, largely because that should still be possible.)
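The last point (strings are left alone) can be checked directly; anyone who wants the compatibility characters normalized in string data has to opt in with unicodedata:

```python
import unicodedata

# NFKC folding applies to identifiers, not to string literals: the literal
# below keeps its compatibility character at runtime.
s = "\u210C"   # U+210C BLACK-LETTER CAPITAL H
assert s != "H"

# Custom processing remains possible as an explicit opt-in:
assert unicodedata.normalize("NFKC", s) == "H"
```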

> In fact, I think adding these symbols to Unicode was a bad idea; they
> should be handled at a higher level in the linguistic stack (by
> semantic markup).

When I was a math student, these were clearly different symbols, with much less relation to each other than a mere case difference.
So by the Unicode consortium's goals, they are independent characters that should each be defined. I admit that isn't ideal for most use cases outside of math, but ... supporting those other cases is what compatibility normalization is for.

> It's also a UX problem. At slightly higher layer in the stack, I'm
> used to using Japanese input methods to input sigma and pi which
> produce characters in the Greek block, and at least the upper case
> forms that denote sum and product have separate characters in the math
> operators block. I understand why people who literally write
> mathematics in Greek might want those not normalized, but I sure am
> going to keep using "Greek sigma", not "math sigma"! The probability
> that I'm going to have a Greek uppercase sigma in my papers is nil,
> the probability of a summation symbol near unity. But the summation
> symbol is not easily available, I have to scroll through all the
> preceding Unicode blocks to find Mathematical Operators. So I am
> perfectly happy with uppercase Greek sigma for that role (as is
> XeTeX!!)

I think that is mostly a backwards compatibility problem; XeTeX itself had to worry about compatibility with TeX (which preceded Unicode) and with the fonts actually available and then with earlier versions of XeTeX.

-jJ
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/JNFLAQUKNCWCJSMBNJZGHVD5ZELOTU6G/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Steven D'Aprano wrote:
> I think
> that many editors in common use don't support bidirectional text, or at
> least the ones I use don't seem to support it fully or correctly. ...
> But, if there is a concrete threat beyond "it looks weird", that it
> another issue.

Based on the original post (and how it looked in my web browser, after various automated reformattings), it seems that one of the failure modes of buggy editors is that

stuff can be part of the code, even though it looks like part of a comment, or vice versa

This problem might be limited to only some of the bidi controls, and there might even be a workaround specific to # ... but it is an issue. I do not currently have an opinion on how important an issue it is, or how adequate the workarounds are.
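One possible workaround, sketched here as a minimal illustration rather than a complete defense: scan source text for the explicit directional formatting characters that make this "code renders as comment" trick possible. The character set below is the nine explicit bidi controls from the Unicode bidirectional algorithm.

```python
# Explicit directional formatting characters (Unicode bidi controls):
BIDI_CONTROLS = {
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",   # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",             # LRI, RLI, FSI, PDI
}

def find_bidi_controls(source: str):
    """Return (offset, codepoint) pairs for each bidi control in the text."""
    return [(i, f"U+{ord(c):04X}") for i, c in enumerate(source)
            if c in BIDI_CONTROLS]

# A line that *renders* as an innocuous comment can still carry a control
# that visually reorders what follows it:
sneaky = "x = 5 # \u202Etnemmoc a tsuj"
assert find_bidi_controls(sneaky) == [(8, "U+202E")]
assert find_bidi_controls("x = 5  # just a comment") == []
```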

-jJ
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ECO4R655UGPCVFFVAOQZ3DUZVHQY75BX/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Executive summary:

I guess the bottom line is that I'm sympathetic to both the NFC and
NFKC positions.

I think that wetware is such that people will go to the trouble of
picking out a letter-like symbol from a palette rarely, and in my
environment that's not going to happen at all because I use Japanese
phonetic input to get most symbols ("sekibun" = integral, "siguma" =
sigma), and I don't use calligraphic R for the real line, I use
\newcommand{\R}{{\cal R}}, except on a physical whiteboard, where I
use blackboard bold (go figure that one out!) So to my mind the
letter-like block in Unicode is a failed experiment.

Jim J. Jewett writes:

> When I was a math student, these were clearly different symbols,
> with much less relation to each other than a mere case difference.

Arguable. The letter-like symbols block has script (cursive),
blackboard bold, and Fraktur versions of R. I've seen all of them as
well as plain Roman, bold, italic and bold italic faces used to denote
the real line, and I've personally used most of them for that purpose
depending on availability of fonts and input methods and medium (ie,
computer text vs. hand-written). I've also seen several of them used
for reaction functions or spaces thereof in game theory (although
blackboard bold and Fraktur seem to be used uniquely for the real
line). Clearly the common denominator is the uppercase latin letter
"R", and the glyph being recognizably "R" is necessary and sufficient
to each of those purposes. The story for uppercase sigma as sum is
somewhat similar: sum is by far not the only use of that letter,
although I don't know of any other operator symbol for sum over a set
or series (outside of programming languages, which I think we can
discount).

I agree that we should consider math to be a separate language, but it
doesn't have a consistent script independent of the origins of the
symbols. Even today none of my engineering and economics students can
type any symbols except those in the JIS repertoire, which they type
by original name ("siguma", "ramuda", "arefu", "yajirushi" == arrow,
etc, "sekibun" == integration does bring up the integral sign in at
least some modern input methods, but it doesn't have a script name,
while "kasann" == addition does not bring up sigma, although "siguma"
does, and "essu" brings up sigma -- but only in "ASCII emoji" strings,
go figure). I have seen students use fullwidth R for the real line,
though, but distinguishing that is a deprecated compatibility feature
of Unicode (and of Japanese practice -- even in very formal university
documents such as grade reports for a final doctoral examination I've
seen numbers and names containing mixed half-width and full-width
ASCII).

So I think "letter-like" was a reasonable idea (I'm pretty sure this
block goes back to the '90s but I'm too lazy to check), but it hasn't
turned out well, and I doubt it ever will.

> So by the Unicode consortium's goals, they are independent
> characters that should each be defined. I admit that isn't ideal
> for most use cases outside of math,

I don't think it even makes sense *inside* of math for the letter-like
symbols. The nature of math means that any "R" will be grabbed for
something whose name starts with "r" as soon as that's convenient.
Something like the integral sign (which is a stretched "S" for "sum"),
OK -- although category theory uses that for "ends" which still don't
look anything like integrals even if you turn them inside out, rotate
90 degrees, and paint them blue.

> > It's also a UX problem. At slightly higher layer in the stack, I'm
> > used to using Japanese input methods to input sigma and pi which
> > produce characters in the Greek block, and at least the upper case
> > forms that denote sum and product have separate characters in the math
> > operators block.
>
> I think that is mostly a backwards compatibility problem; XeTeX
> itself had to worry about compatibility with TeX (which preceded
> Unicode) and with the fonts actually available and then with
> earlier versions of XeTeX.

IMO, the analogy fails because the backward compatibility issue for
Unicode is in the wetware, not in the software.

Steve

_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/YTIIFIF75RMWP5J3GCSXWVXSUP5SX7AA/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On 11/13/21, Terry Reedy <tjreedy@udel.edu> wrote:
> On 11/13/2021 4:35 PM, ptmcg@austin.rr.com wrote:
>>
>> _??????????????????????L?????????????????????? = 12
>>
>> def _??????????????????????(????, p?????????????????????????, ???????????????????????????):
>>
>> ?????????? = ?????????(????) - ?r????????????x????????? - ???????????????????????
>>
>> if s?i???? > _????????????????????H??????????????????L????????:
>>
>> ???? = '%s[%d chars]%s' % (????[:??????????????????????????????], ?????????p, ????[????????????(????) -
>> ?????????????x?????????:])
>>
>> return ?
>>
> * Does not at all work in CommandPrompt

It works for me when pasted into the REPL using the console in Windows
10. I pasted the code into a raw multiline string assignment and then
executed the string with exec(). The only issue is that most of the
pasted characters are displayed using the font's default glyph since
the console host doesn't have font fallback support. Even Windows
Terminal doesn't have font fallback support yet in the command-line
editing mode that Python's REPL uses. But Windows Terminal does
implement font fallback for normal output rendering, so if you assign
the pasted text to string `s`, then print(s) should display properly.

> even after supposedly changing to a utf-8 codepage with 'chcp 65000'.

Changing the console code page is unnecessary with Python 3.6+, which
uses the console's wide-character API. Also, even though it's
irrelevant for the REPL, UTF-8 is code page 65001. Code page 65000 is
UTF-7.
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/7FGNJ7TMASDOMQAS2LSSQAD2PPURT5W6/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Mon, Nov 15, 2021 at 8:42 AM Kyle Stanley <aeros167@gmail.com> wrote:

> On Sat, Nov 13, 2021 at 5:04 PM <ptmcg@austin.rr.com> wrote:
>
>>
>>
>> def ????????????????????():
>>
> [... Python code it's easy to believe isn't grammatical ...]

> return ?
>>
>
> 0_o color me impressed, I did not think that would be legal syntax. Would
> be interesting to include in a textbook, if for nothing else other than to
> academically demonstrate that it is possible, as I suspect many are not
> aware.
>

I'm afraid the best Paul, Alex, Anna and I can hope to do is bring it to
the attention of readers of Python in a Nutshell's fourth edition (on
current plans, hitting the shelves about the same time as 3.11, please tell
your friends ;-) ). Sadly, I'm not aware of any academic classes that use
the Nutshell as a course text, so it seems unlikely to gain the
attention of academic communities.

Given the wider reach of this list, however, one might hope that by the
time the next edition comes out this will be old news due to the
publication of blogs and the like. With luck, a small fraction of the
programming community will become better-informed about Unicode and the
design or programming languages. It's interesting that the egalitarian wish
to allow use of native "alphabetics" has turned out to be such a viper's
nest.

Particular thanks to Stephen J. Turnbull for his thoughtful and
well-informed contribution above.

Kind regards,
Steve
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Mon, Nov 29, 2021 at 1:21 AM Steve Holden <steve@holdenweb.com> wrote:

> It's interesting that the egalitarian wish to allow use of native
> "alphabetics" has turned out to be such a viper's nest.
>

Indeed.

However, is there no way to restrict identifiers at least to the alphabets
of natural languages? Maybe it wouldn’t help much, but does anyone need to
use letter-like symbols designed for math expressions? I would say maybe,
but they certainly should not be auto-converted to the “normal” letter.

For that matter, why have any auto-conversion at all?

The answer may be that it’s too late to change now, but I don’t think I’ve
seen a compelling (or any?) use case for that conversion.
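To make the conversion in question concrete (CPython's NFKC folding of identifiers per PEP 3131), here is the behavior Chris is asking about, observed directly: a letter-like math symbol such as DOUBLE-STRUCK CAPITAL R (U+211D) is silently folded to plain "R" when used as a name.

```python
# U+211D DOUBLE-STRUCK CAPITAL R, used as an identifier, is NFKC-folded:
ns = {}
exec("\u211D = 'reals'", ns)
assert "\u211D" not in ns     # the double-struck spelling does not survive...
assert ns["R"] == "reals"     # ...only the folded name exists
```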

-CHB


Particular thanks to Stephen J. Turnbull for his thoughtful and
> well-informed contribution above.
>
> Kind regards,
> Steve
>
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-leave@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/FNSI6EXCWMMCXEJNYWVVR5LMFOM6M5ZB/
> Code of Conduct: http://python.org/psf/codeofconduct/
>
--
Christopher Barker, PhD (Chris)

Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython