Mailing List Archive

Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Christopher Barker writes:

> Would a proposal to switch the normalization to NFC only have any hope of
> being accepted?

Hope, yes. Counting you, it's been proposed twice. :-) I don't know
whether it would get through. We know this won't affect the stdlib,
since that's restricted to ASCII. I suppose we could trawl PyPI and
GitHub for "compatibles" (the Unicode term for "K" normalizations).

> For example, in writing math we often use different scripts to mean
> different things (e.g. TeX's Blackboard Bold). So if I were to use
> some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't
> want them to get normalized.

Independent of the question of the normalization of Python
identifiers, I think using those characters this way is a bad idea.
In fact, I think adding these symbols to Unicode was a bad idea; they
should be handled at a higher level in the linguistic stack (by
semantic markup).

You're confusing two things here. In Unicode, a script is a
collection of characters used for a specific language, typically a set
of Unicode blocks of characters (more or less; there are a lot of Han
ideographs that are recognizable as such to Japanese but are not part
of the repertoire of the Japanese script). That is, these characters
are *different* from others that look like them.

Blackboard Bold is more what we would usually call a "font": the
(math) italic "x" and the (math) bold italic "x" are the same "x", but
one denotes a scalar and the other a vector in many math books. A
roman "R" probably denotes the statistical application, an italic "R"
the reaction function in a game theory model, and a Blackboard Bold "R"
the set of real numbers. But these are all the same character.
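
To make the "same character" point concrete, here is a minimal sketch using
only the standard unicodedata module. The SAMPLES table and the
normalized_forms helper are illustrative names, not anything from the thread.
It compares what NFC and NFKC do to a roman, a math italic, and a Blackboard
Bold "R".

```python
import unicodedata

# Three "R"s, written with \N escapes so the example does not depend on
# the reader's font or editor.
SAMPLES = {
    "roman": "R",
    "math italic": "\N{MATHEMATICAL ITALIC CAPITAL R}",   # U+1D445
    "blackboard bold": "\N{DOUBLE-STRUCK CAPITAL R}",     # U+211D
}


def normalized_forms(text):
    """Return the (NFC, NFKC) normalizations of text."""
    return (
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFKC", text),
    )


for label, ch in SAMPLES.items():
    nfc, nfkc = normalized_forms(ch)
    # NFC leaves the styled letters alone; NFKC folds them to plain "R".
    print(f"{label:16} NFC unchanged: {nfc == ch}  NFKC: {nfkc!r}")
```

Under NFC the three stay distinct; under NFKC, which Python applies to
identifiers, all three become the ASCII "R".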

It's a bad idea to rely on different (Unicode) scripts that use the
same glyphs for different characters to look different from each
other, unless you "own" the fonts to be used. As far as I know
there's no way for a Python program to specify the font to be used to
display itself though. :-)

It's also a UX problem. At a slightly higher layer in the stack, I'm
used to using Japanese input methods to input sigma and pi which
produce characters in the Greek block, and at least the upper case
forms that denote sum and product have separate characters in the math
operators block. I understand why people who literally write
mathematics in Greek might want those not normalized, but I sure am
going to keep using "Greek sigma", not "math sigma"! The probability
that I'm going to have a Greek uppercase sigma in my papers is nil,
the probability of a summation symbol near unity. But the summation
symbol is not easily available: I have to scroll through all the
preceding Unicode blocks to find Mathematical Operators. So I am
perfectly happy with uppercase Greek sigma for that role (as is
XeTeX!!)

And the thing is, of course those Greek letters really are Greek
letters: they were chosen because pi is the homophone of p which is
the first letter of "product", and sigma is the homophone of s which
is the first letter of "sum". Å for Ångström is similar, it's the
initial letter of a Swedish name.
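
A small sketch of how these look-alikes behave, assuming only the standard
unicodedata module (the constant names and the nfkc helper are made up for
illustration). The Greek capital sigma and the math summation sign are not
unified by normalization, while the Ångström sign is folded into the letter.

```python
import unicodedata

GREEK_SIGMA = "\N{GREEK CAPITAL LETTER SIGMA}"               # U+03A3
MATH_SUM = "\N{N-ARY SUMMATION}"                             # U+2211
ANGSTROM_SIGN = "\N{ANGSTROM SIGN}"                          # U+212B
A_WITH_RING = "\N{LATIN CAPITAL LETTER A WITH RING ABOVE}"   # U+00C5


def nfkc(text):
    """NFKC-normalize text, as Python does for identifiers."""
    return unicodedata.normalize("NFKC", text)


# The letter and the operator remain different characters after NFKC.
print("sigma == summation after NFKC:", nfkc(GREEK_SIGMA) == nfkc(MATH_SUM))

# The Angstrom sign has a canonical mapping, so even NFC folds it.
print("Angstrom sign -> A-ring under NFC:",
      unicodedata.normalize("NFC", ANGSTROM_SIGN) == A_WITH_RING)

# Only the Greek letter can appear in a Python identifier; the summation
# sign is a math symbol, not a letter.
print("identifier?", GREEK_SIGMA.isidentifier(), MATH_SUM.isidentifier())
```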

Sure, we could fix the input methods (and search methods!! -- people
are going to input the character they know that corresponds to the
glyph *they* see, not the bit pattern the *CPU* sees). But that's as
bad as trying to fix mail clients. Not worth the effort because I'm
pretty sure you're gonna fail -- it's one of those "you'll have to pry
this crappy software that annoys admins around the world from my cold
dead fingers" issues, which is why their devs refuse to fix them.

Steve
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/5GHPVNJLLOKBYPE7FSU5766XYP6IJPEK/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Well,

Yet another issue is adding vulnerabilities in plain sight.

Human code reviewers will see this:

if user.admin == "something":

Static analysers will see

if user.admin == "something<hidden chars>":

but will not flag it as it's up to the user to verify the logic of things

and as such software authors can plant backdoors in plain sight
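
A rough sketch of the scenario, with a hypothetical tampered literal (the
zero-width space is my choice of hidden character; the message does not say
which one is meant). The hidden_characters helper is illustrative only and
flags every character in categories Cc and Cf, which also includes ordinary
tabs and newlines.

```python
import unicodedata

visible = "something"
# Hypothetical tampered literal: same letters plus a ZERO WIDTH SPACE.
tampered = "something\N{ZERO WIDTH SPACE}"


def hidden_characters(text):
    """Return (index, code point, name) for control/format characters."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) in {"Cc", "Cf"}
    ]


print(visible == tampered)           # the comparison silently fails
print(hidden_characters(tampered))   # where the extra character sits
print(ascii(tampered))               # escaped form makes it visible
```

Printing with ascii() or repr() in a review tool is one cheap way to make
such a literal stand out.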

Kind Regards,

Abdur-Rahmaan Janhangeer
about <https://compileralchemy.github.io/> | blog
<https://www.pythonkitchen.com>
github <https://github.com/Abdur-RahmaanJ>
Mauritius
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Sat, Nov 13, 2021 at 5:04 PM <ptmcg@austin.rr.com> wrote:

>
>
> def ????????????????????():
>
> try:
>
> ????e????????????? = "Hello"
>
> ????????r?????? = "World"
>
> ?????????????????(f"{?????????????º_}, {?????????l??}!")
>
> except ??????????????????????????? as ?????c:
>
> ????r???("failed: {}".??????????ª?(?????????))
>
>
>
> if _??????????????__ == "__main__":
>
> ????e??????()
>
>
>
>
>
> # snippet from unittest/util.py
>
> _??????????????????????L?????????????????????? = 12
>
> def _??????????????????????(????, p?????????????????????????, ???????????????????????????):
>
> ?????????? = ?????????(????) - ?r????????????x????????? - ???????????????????????
>
> if s?i???? > _????????????????????H??????????????????L????????:
>
> ???? = '%s[%d chars]%s' % (????[:??????????????????????????????], ?????????p, ????[????????????(
> ????) - ?????????????x?????????:])
>
> return ?
>

0_o color me impressed, I did not think that would be legal syntax. Would
be interesting to include in a textbook, if for nothing else other than to
academically demonstrate that it is possible, as I suspect many are not
aware.
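
For readers whose mail client or archive cannot render the quoted snippet,
here is a minimal sketch of the same effect. The stylize helper is an
illustration of mine, not the code from the original post: it rewrites ASCII
lowercase letters as MATHEMATICAL BOLD SMALL letters, and the compiler then
normalizes the resulting identifier back to ASCII.

```python
import unicodedata

BOLD_SMALL_A = 0x1D41A  # MATHEMATICAL BOLD SMALL A; a..z are contiguous


def stylize(name):
    """Replace ASCII lowercase letters with mathematical bold letters."""
    return "".join(
        chr(BOLD_SMALL_A + ord(ch) - ord("a")) if "a" <= ch <= "z" else ch
        for ch in name
    )


fancy = stylize("hello")
source = f"def {fancy}():\n    return 'Hello, World!'\n"

namespace = {}
exec(source, namespace)  # compiles fine: the fancy name is a legal identifier

print(fancy.isidentifier())                           # legal as written
print(unicodedata.normalize("NFKC", fancy))           # what Python stores
print("hello" in namespace, fancy in namespace)       # only the ASCII key
print(namespace["hello"]())
```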

--
--Kyle R. Stanley, Python Core Developer (what is a core dev?
<https://devguide.python.org/coredev/>)
*Pronouns: they/them **(why is my pronoun here?*
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
)
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On 15. 11. 21 9:25, Stephen J. Turnbull wrote:
> Christopher Barker writes:
>
> > Would a proposal to switch the normalization to NFC only have any hope of
> > being accepted?
>
> Hope, yes. Counting you, it's been proposed twice. :-) I don't know
> whether it would get through. We know this won't affect the stdlib,
> since that's restricted to ASCII. I suppose we could trawl PyPI and
> GitHub for "compatibles" (the Unicode term for "K" normalizations).

I don't think PyPI/GitHub are good resources to trawl.

Non-ASCII identifiers were added for the benefit of people who use
non-English languages. But the projects on both PyPI and GitHub are
overwhelmingly written in English -- especially if you look at the more
popular projects.
It would be interesting to reach out to the target audience here... but
they're not on this list, either. Do we actually know anyone using this?


I do teach beginners in a non-English language, but tell them that they
need to learn English if they want to do any serious programming. Any
code that's to be shared more widely than a country effectively has to
be in English. It seems to me that at the level where you worry about
supply chain attacks and you're doing code audits, something like
CPython's policy (ASCII only except proper names and Unicode-related
tests) is a good idea.
Or not? I don't know anyone who actually uses non-ASCII identifiers for
a serious project.
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/AVCLMBIXWPNIIKRFMGTS5SETUCGAONLK/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Mon, Nov 15, 2021 at 12:33:54PM +0400, Abdur-Rahmaan Janhangeer wrote:

> Yet another issue is adding vulnerabilities in plain sight.
> Human code reviewers will see this:
>
> if user.admin == "something":
>
> Static analysers will see
>
> if user.admin == "something<hidden chars>":

Okay, you have a string literal with hidden characters. Assuming that
your editor actually renders them as invisible characters, rather than
"something???" or "something???" or "something???" or equivalent.

Now what happens? Where do you go from there to a vulnerability or
backdoor? I think it might be a bit obvious that there is something
funny going on if I see:

    if (user.admin == "root" and check_password_securely()
            or user.admin == "root"
            # Second string has hidden characters, do not remove it.
            ):
        elevate_privileges()

even without the comment :-)

In another thread, Serhiy already suggested we ban invisible control
characters (other than whitespace) in comments and strings.

https://mail.python.org/archives/list/python-dev@python.org/message/DN24FK3A2DSO4HBGEDGJXERSAUYK6VK6/

I think that is a good idea.

But beyond the C0 and C1 control characters, we should be conservative
about banning "hidden characters" without a *concrete* threat. For
example, variation selectors are "hidden", but they change the visual
look of emoji and other characters. Even if you think that being able to
set the skin tone of your emoji or choose different national flags using
variation selectors is pure frippery, they are also necessary for
Mongolian and some CJK ideographs.

http://unicode.org/reports/tr28/tr28-3.html#13_7_variation_selectors
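
A brief sketch of why a ban limited to C0 and C1 controls would leave
variation selectors alone, using the standard unicodedata module. The heart
emoji is just an example pairing, and is_c0_or_c1_control is an illustrative
helper that relies on general category Cc (which covers the C0 and C1 ranges
plus DEL).

```python
import unicodedata

HEART = "\N{HEAVY BLACK HEART}"       # U+2764
VS16 = "\N{VARIATION SELECTOR-16}"    # U+FE0F, requests emoji presentation
emoji_heart = HEART + VS16


def is_c0_or_c1_control(ch):
    """True for characters in general category Cc."""
    return unicodedata.category(ch) == "Cc"


print(len(emoji_heart))                       # two code points, one glyph
print(unicodedata.category(VS16))             # a nonspacing mark, not Cc
print(any(is_c0_or_c1_control(ch) for ch in emoji_heart))
```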

I'm not sure about bidirectional controls; I have to leave that to
people with more experience in bidirectional text than I do. I think
that many editors in common use don't support bidirectional text, or at
least the ones I use don't seem to support it fully or correctly. But
for what little it is worth, my feeling is that people who use RTL or
bidirectional strings and have editors that support them will be annoyed
if we ban them from strings for the comfort of people who may never in
their life come across a string containing such bidirectional text.

But, if there is a concrete threat beyond "it looks weird", that is
another issue.


> but will not flag it as it's up to the user to verify the logic of
> things

There is no reason why linters and code checkers shouldn't check for
invisible characters, Unicode confusables or mixed script identifiers
and flag them. The interpreter shouldn't concern itself with such purely
stylistic issues unless there is a concrete threat that can only be
handled by the interpreter itself.
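
As a sketch of what such a linter check could look like, assuming nothing
beyond the standard library: the functions below are hypothetical, and the
"script" is only guessed from the first word of each character's Unicode
name, because the stdlib does not expose the Unicode Script property. A real
tool would more likely follow UTS #39.

```python
import io
import tokenize
import unicodedata


def scripts_in(identifier):
    """Rough script guess: first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in identifier
        if ch.isalpha()
    }


def mixed_script_identifiers(source):
    """Yield (line, identifier, scripts) for names that mix scripts."""
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type == tokenize.NAME:
            scripts = scripts_in(tok.string)
            if len(scripts) > 1:
                yield tok.start[0], tok.string, sorted(scripts)


# Hypothetical sample: the second name starts with CYRILLIC SMALL LETTER A.
sample = "admin = 1\n\N{CYRILLIC SMALL LETTER A}dmin = 2\n"

for line, name, scripts in mixed_script_identifiers(sample):
    print(line, ascii(name), scripts)
```

The heuristic is deliberately crude (mathematical letters, for instance,
report as "MATHEMATICAL"), but it shows that the check fits comfortably in a
tool rather than in the interpreter.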


--
Steve
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/KSIBL3KMONIETBKXSBPPMA27MACWIH33/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Greetings,


> Now what happens? where do you go from there to a vulnerability or
> backdoor? I think it might be a bit obvious that there is something
> funny going on if I see:
>
>     if (user.admin == "root" and check_password_securely()
>             or user.admin == "root"
>             # Second string has hidden characters, do not remove it.
>             ):
>         elevate_privileges()


Well, it's not so obvious. From Ross Anderson and Nicholas Boucher
src: https://trojansource.codes/trojan-source.pdf

See appendix H. for Python.

with implementations:

https://github.com/nickboucher/trojan-source/tree/main/Python

These rely precisely on bidirectional control chars and/or replacing look-alikes.

> There is no reason why linters and code checkers shouldn't check for
> invisible characters, Unicode confusables or mixed script identifiers
> and flag them. The interpreter shouldn't concern itself with such purely
> stylistic issues unless there is a concrete threat that can only be
> handled by the interpreter itself.


I mean current linters. But it will be good to check those for sure.
As a programmer, i don't want a language which bans unicode stuffs.
If there's something that should be fixed, it's the unicode standard, maybe
defining a sane mode where weird unicode stuffs are not allowed. Can also
be from language side in the event where it's not being considered in the
standard itself.

I don't see it as a language fault nor as a client fault as they are
considering the unicode docs but the response was mixed with some languages
decided to patch it from their side, some linters implementing detection for
it as well as some editors flagging it and rendering it as the exploit
intended.

Kind Regards,

Abdur-Rahmaan Janhangeer
about <https://compileralchemy.github.io/> | blog
<https://www.pythonkitchen.com>
github <https://github.com/Abdur-RahmaanJ>
Mauritius
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Sun, Nov 14, 2021 at 10:12:39PM -0800, Christopher Barker wrote:

> I am, however, surprised and disappointed by the NFKC normalization.
>
> For example, in writing math we often use different scripts to mean
> different things (e.g. TeX's Blackboard Bold). So if I were to use
> some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't want
> them to get normalized.

Hmmm... would you really want these to all be different identifiers?

𝐁 𝐵 𝑩 𝗕 B

You're assuming the reader of the code has the right typeface to view
them (rather than as mere boxes), and that their eyesight is good enough
to distinguish the variations even if their editor applies bold or
italic as part of syntax highlighting. That's very bold of you :-)

In any case, the question of NFKC versus NFC was certainly considered,
but unfortunately PEP 3131 doesn't document why NFKC was chosen.

https://www.python.org/dev/peps/pep-3131/

Before we change the normalisation rules, it would probably be a good
idea to trawl through the archives of the mailing list and work out why
NFKC was chosen in the first place, or contact Martin von Löwis and see
if he remembers.


> Then there's the question of when this normalization happens (and when it
> doesn't). If one is doing any kind of metaprogramming, even just using
> getattr() and setattr(), things could get very confusing:

For ordinary identifiers, they are normalised at some point during
compilation or interpretation. It probably doesn't matter exactly when.

Strings should *not* be normalised when using subscripting on a dict,
not even on globals():

https://bugs.python.org/issue42680

I'm not sure about setattr and getattr. I think that they should be
normalised. But apparently they aren't:

>>> from types import SimpleNamespace
>>> obj = SimpleNamespace(B=1)
>>> setattr(obj, '𝐁', 2)
>>> obj
namespace(B=1, 𝐁=2)
>>> obj.B
1
>>> obj.𝐁
1

See also here:

https://bugs.python.org/issue35105
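
If one wanted setattr and getattr to agree with the compiler, a thin wrapper
would be enough. The sketch below is only an illustration of that idea (the
helper names are invented, and nothing like them exists in the stdlib); it
also reproduces the mismatch using an escaped MATHEMATICAL BOLD CAPITAL B as
a stand-in for any stylized B.

```python
import unicodedata
from types import SimpleNamespace


def nfkc(name):
    """Normalize a name the way the compiler normalizes identifiers."""
    return unicodedata.normalize("NFKC", name)


def setattr_nfkc(obj, name, value):
    """setattr() that normalizes the attribute name first."""
    setattr(obj, nfkc(name), value)


def getattr_nfkc(obj, name):
    """getattr() that normalizes the attribute name first."""
    return getattr(obj, nfkc(name))


BOLD_B = "\N{MATHEMATICAL BOLD CAPITAL B}"

obj = SimpleNamespace(B=1)
setattr(obj, BOLD_B, 2)          # plain setattr: creates a second attribute
print(len(vars(obj)))            # two attributes now
print(obj.B, getattr(obj, BOLD_B))

setattr_nfkc(obj, BOLD_B, 3)     # normalized: updates the ASCII attribute
print(obj.B, getattr_nfkc(obj, BOLD_B))
```

The same caveat applies to subscripting globals() or any other dict: the key
is used exactly as given.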



--
Steve
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/7XZJPFED3YJSJ73YSPWCQPN6NLTNEMBI/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Mon, Nov 15, 2021 at 10:22 PM Abdur-Rahmaan Janhangeer
<arj.python@gmail.com> wrote:
>
> Greetings,
>
>
> > Now what happens? where do you go from there to a vulnerability or
> backdoor? I think it might be a bit obvious that there is something
> funny going on if I see:
>
> if (user.admin == "root" and check_password_securely()
> or user.admin == "root"
> # Second string has hidden characters, do not remove it.
> ):
> elevate_privileges()
>
>
> Well, it's not so obvious. From Ross Anderson and Nicholas Boucher
> src: https://trojansource.codes/trojan-source.pdf
>
> See appendix H. for Python.
>
> with implementations:
>
> https://github.com/nickboucher/trojan-source/tree/main/Python
>
> Rely precisely on bidirectional control chars and/or replacing look alikes

The point of those kinds of attacks is that syntax highlighters and
related code review tools would misinterpret them. So I pulled them
all up in both GitHub's view and the editor I personally use (SciTE,
albeit a fairly old version now). GitHub specifically flags it as a
possible exploit in a couple of cases, but also syntax highlights the
return keyword appropriately. SciTE doesn't give any sort of warnings,
but again, correctly highlights the code - early-return shows "return"
as a keyword, invisible-function shows the name "is_" as the function
name and the rest not, homoglyph-function shows a quite
distinct-looking letter that definitely isn't an H.

The problems here are not Python's, they are code reviewers', and that
means they're really attacks against the code review tools. It's no
different from using the variable m in one place and rn in another,
and hoping that code review uses a proportionally-spaced font that
makes those look similar. So to count as a viable attack, there needs
to be at least one tool that misparses these; so far, I haven't found
one, but if I do, wouldn't it be more appropriate to raise the bug
report against the tool?

> > There is no reason why linters and code checkers shouldn't check for
> invisible characters, Unicode confusables or mixed script identifiers
> and flag them. The interpreter shouldn't concern itself with such purely
> stylistic issues unless there is a concrete threat that can only be
> handled by the interpreter itself.
>
>
> I mean current linters. But it will be good to check those for sure.
> As a programmer, i don't want a language which bans unicode stuffs.
> If there's something that should be fixed, it's the unicode standard, maybe
> defining a sane mode where weird unicode stuffs are not allowed. Can also
> be from language side in the event where it's not being considered in the standard
> itself.

Uhhm..... "weird unicode stuffs"? Please clarify.

> I don't see it as a language fault nor as a client fault as they are considering
> the unicode docs but the response was mixed with some languages decided to patch it
> from their side, some linters implementing detection for it as well as some editors flagging
> it and rendering it as the exploit intended.

I see it as an editor issue (or code review tool, as the case may be).
You'd be hard-pressed to get something past code review if it looks to
everyone else like you slipped a "return" statement at the end of a
docstring.

So far, I've seen fewer problems from "weird unicode stuffs" than from
the quoted-printable encoding, and that's an attack that involves
nothing but ASCII text. It's also an attack that far more code review
tools seem to be vulnerable to.

ChrisA
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/OUPC6LGFXIILBTNEC4FYTERBX7VKQHDX/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On 15.11.2021 12:36, Steven D'Aprano wrote:
> On Sun, Nov 14, 2021 at 10:12:39PM -0800, Christopher Barker wrote:
>
>> I am, however, surprised and disappointed by the NFKC normalization.
>>
>> For example, in writing math we often use different scripts to mean
>> different things (e.g. TeX's Blackboard Bold). So if I were to use
>> some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't want
>> them to get normalized.
>
> Hmmm... would you really want these to all be different identifiers?
>
> 𝐁 𝐵 𝑩 𝗕 B
>
> You're assuming the reader of the code has the right typeface to view
> them (rather than as mere boxes), and that their eyesight is good enough
> to distinguish the variations even if their editor applies bold or
> italic as part of syntax highlighting. That's very bold of you :-)
>
> In any case, the question of NFKC versus NFC was certainly considered,
> but unfortunately PEP 3131 doesn't document why NFKC was chosen.
>
> https://www.python.org/dev/peps/pep-3131/
>
> Before we change the normalisation rules, it would probably be a good
> idea to trawl through the archives of the mailing list and work out why
> NFKC was chosen in the first place, or contact Martin von Löwis and see
> if he remembers.

This was raised in the discussion, but never conclusively answered:

https://mail.python.org/pipermail/python-3000/2007-May/007995.html

NFKC is the standard normalization form when you want to remove any
typography-related variants/hints from the text before comparing
strings. See http://www.unicode.org/reports/tr15/

I guess that's why Martin chose this form, since the point
was to maintain readability, even if different variants of a
character are used in the source code. A "B" in the source code
should be interpreted as an ASCII B, even when written
as 𝐁 𝐵 𝑩 or 𝗕.

This simplifies writing code and does away with many of the
security issues you could otherwise run into (where e.g. the
absence of an identifier causes the application flow to
be different).

>> Then there's the question of when this normalization happens (and when it
>> doesn't).

It happens in the parser when reading a non-ASCII identifier
(see Parser/pegen.c), so only applies to source code, not attributes
you dynamically add to e.g. class or module namespaces.
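
A minimal way to observe this from Python itself, assuming only the ast
module: the name stored in the syntax tree is already normalized, while a key
added to the namespace at run time is kept exactly as given. The escaped bold
B is just an example of a stylized letter.

```python
import ast

BOLD_B = "\N{MATHEMATICAL BOLD CAPITAL B}"

# The parser normalizes the identifier before building the AST.
tree = ast.parse(f"{BOLD_B} = 1")
target = tree.body[0].targets[0]
print(repr(target.id))

namespace = {}
exec(compile(tree, "<sketch>", "exec"), namespace)
print("B" in namespace, BOLD_B in namespace)

# A name added dynamically never passes through the parser, so it is
# stored exactly as written.
namespace[BOLD_B] = 2
print(namespace["B"], namespace[BOLD_B])
```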

--
Marc-Andre Lemburg
eGenix.com


_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/SNN2WZ3MOH5IACSZVHGS6DKTNMKO5JBV/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Abdur-Rahmaan Janhangeer writes:

> As a programmer, i don't want a language which bans unicode stuffs.

But that's what Unicode says should be done (see below).

> If there's something that should be fixed, it's the unicode standard,

Unicode is not going to get "fixed". Most features are important for
some natural language or other. One could argue that (for example)
math symbols that are adopted directly from some character repertoire
should not have been -- I did so elsewhere, although not terribly
seriously.

> maybe defining a sane mode where weird unicode stuffs are not
> allowed.

Unicode denies responsibility for that by permitting arbitrary
subsetting. It does have a couple of (very broad) subsets predefined,
ie, the normalization formats.


_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/63FDIQQNJKCH7C3NMEN3ECRHTA7JHJ2W/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On 11/15/2021 5:45 AM, Steven D'Aprano wrote:

> In another thread, Serhiy already suggested we ban invisible control
> characters (other than whitespace) in comments and strings.

He said in string *literals*. One would put them in strings by using
visible escape sequences.

>>> '\033' is '\x1b' is '\u001b'
True

> https://mail.python.org/archives/list/python-dev@python.org/message/DN24FK3A2DSO4HBGEDGJXERSAUYK6VK6/
>
> I think that is a good idea.

If one is outputting terminal control sequences, making the escape char
visible is a good idea anyway. It would be easier if '\e' worked. (But
see below.)
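
A short sketch of keeping the escape character visible when emitting terminal
sequences. The sgr helper is hypothetical; the only assumption about the
terminal is the usual ANSI convention that ESC [ <code> m selects a graphic
rendition and code 0 resets it.

```python
# One visible spelling of the escape character; "\033" and "\u001b"
# denote the same character.
ESC = "\x1b"


def sgr(text, code):
    """Wrap text in an ANSI SGR sequence (for example, code 1 is bold)."""
    return f"{ESC}[{code}m{text}{ESC}[0m"


print("\033" == "\x1b" == "\u001b" == chr(27))
# ascii() shows the control character in escaped form instead of sending
# it to the terminal.
print(ascii(sgr("warning", 1)))
```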

> But beyond the C0 and C1 control characters, we should be conservative
> about banning "hidden characters" without a *concrete* threat. For
> example, variation selectors are "hidden", but they change the visual
> look of emoji and other characters.
I can imagine that a complete emoji point and click input method might
have one select the emoji and the variation and output the pair
together. An option to output the selection character as the
appropriate python-specific '\unnnn' is unlikely, and even if there
were, who would know what it meant? Users would want the selected
variation visible if the editor supported such.

If terminal escape sequences were also selected by point and click, my
comment above would change.

--
Terry Jan Reedy

_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/4IMXVQFZI3VDHA4D2YZD4KTBU7GSEFPW/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
> GitHub specifically flags it as a
> possible exploit in a couple of cases, but also syntax highlights the
> return keyword appropriately.

My guess is that GitHub did patch it afterwards, as the paper does list
GitHub as vulnerable.

> Uhhm..... "weird unicode stuffs"? Please clarify.

Wriggly texts just because they appear different

Well, it's tool based but maybe compiler checks aka checks from
the language side is something that should be insisted upon too to
patch inconsistent checks across editors.

The reason I was saying it's related to encodings is that when languages
are impacted en masse, maybe it hints at a revision of the Unicode standard,
or at the very least warnings. Steven above, even before I posted the paper,
was hinting towards the vulnerability, so maybe those in charge of the
Unicode standard should study and predict angles of attack.

Kind Regards,

Abdur-Rahmaan Janhangeer
about <https://compileralchemy.github.io/> | blog
<https://www.pythonkitchen.com>
github <https://github.com/Abdur-RahmaanJ>
Mauritius
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Mon, Nov 15, 2021 at 12:28:01PM -0500, Terry Reedy wrote:
> On 11/15/2021 5:45 AM, Steven D'Aprano wrote:
>
> >In another thread, Serhiy already suggested we ban invisible control
> >characters (other than whitespace) in comments and strings.
>
> He said in string *literals*. One would put them in strings by using
> visible escape sequences.

Thanks Terry for the clarification, of course I didn't mean to imply
that we should ban control characters in strings completely. Only actual
control characters embedded in string literals in the source, just as we
already currently ban them outside of comments and strings.


--
Steve
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/XCPSQYKOX4YXDIAACDLL3I5OYWFGFLD7/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Mon, Nov 15, 2021 at 03:20:26PM +0400, Abdur-Rahmaan Janhangeer wrote:

> Well, it's not so obvious. From Ross Anderson and Nicholas Boucher
> src: https://trojansource.codes/trojan-source.pdf

Thanks for the link. But it discusses a whole range of Unicode attacks,
and the specific attack you mentioned (Invisible Character Attacks) is
described in section D page 7 as "unlikely to work in practice".

As they say, compilers and interpreters in general already display
errors, or at least a warning, for invisible characters in code.

In addition, there is the difficulty that it's not enough just to use
invisible characters to call a different function; you have to smuggle
in the hostile function that you actually want to call.

It does seem that the Trojan-Source attack listed in the paper is new,
but others (such as the homoglyph attacks that get most people's
attention) are neither new nor especially easy to actually exploit.
Unicode has been warning about it for many years. We discussed it in PEP
3131. This is not new, and not easy to exploit.

Perhaps that's why there are no, or very few, actual exploits of this in
the wild. Homoglyph attacks against user-names and URLs, absolutely, but
homoglyph attacks against source code are a different story.

Yes, you can cunningly have two classes like Α and A and the Python
interpreter will treat them as distinct, but you still have to smuggle
in your hostile code in Α (greek Alpha) without anyone noticing, and you
have to avoid anyone asking why you have two classes with the same name.
And that's the hard part.

We don't need Unicode for homoglyph attacks. func0 and funcO may look
identical, or nearly identical, but you still have to smuggle in your
hostile code into funcO without anyone noticing, and that's why there
are so few real-world homoglyph attacks.
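
To illustrate the two-class case without relying on fonts, here is a small
sketch that builds the source with an escaped GREEK CAPITAL LETTER ALPHA. The
class bodies are invented for the example. NFKC does not map the Greek letter
to the Latin one, so the interpreter ends up with two separate classes.

```python
import unicodedata

GREEK_ALPHA = "\N{GREEK CAPITAL LETTER ALPHA}"   # U+0391, looks like "A"

source = (
    "class A:\n"
    "    origin = 'latin'\n"
    f"class {GREEK_ALPHA}:\n"
    "    origin = 'greek'\n"
)

namespace = {}
exec(source, namespace)

# Collect the classes that the snippet defined.
classes = {
    name: obj.origin
    for name, obj in namespace.items()
    if isinstance(obj, type)
}

for name, origin in classes.items():
    # ascii() shows which "A" is which, regardless of the font in use.
    print(ascii(name), origin)

print(unicodedata.normalize("NFKC", GREEK_ALPHA) == "A")
```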

Whereas the Trojan Source attacks using BIDI controls do seem to be
genuinely exploitable.


--
Steve
_______________________________________________
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/FSHGS4AOAGTWKSWAADZWH5L2GGBWHHXE/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Mon, Nov 15, 2021 at 10:43:12PM +1100, Chris Angelico wrote:

> The problems here are not Python's, they are code reviewers', and that
> means they're really attacks against the code review tools.

I think that's a bit strong. Boucher and Anderson's paper describes
multiple kinds of vulnerabilities. At a fairly quick glance, the BIDI
attack does seem to be a novel attack, and probably exploitable.

But unfortunately it seems to be the Unicode confusables or homoglyph
attack that seems to be getting most of the attention, and that's not
new, it is as old as ASCII, and not so easily exploitable. Being able to
have А (Cyrillic) Α (Greek alpha) and A (Latin) in the same code base
makes for a nice way to write obfuscated code, but it's *obviously*
obfuscated and not so easy to smuggle in hostile code.

Whereas the BIDI attacks do (apparently) make it easy to smuggle in
code: using invisible BIDI control codes, you can introduce source code
where the way the editor renders the code, and the way the coder reads
it, is different from the way the interpreter or compiler runs it.

That is, I think, new and exploitable: something that looks like a
comment is actually code that the interpreter runs, and something that
looks like code is actually a string or comment which is not executed,
but editors may syntax-colour it as if it were code.

Obviously we can mitigate this by improving the editors (at the
very least, all editors should have a Show Invisible Characters option).
Linters and code checks should also flag problematic code containing
BIDI codes, or attacks against docstrings.
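
As a rough illustration of what such a linter check might look like, here is a sketch that reports explicit bidirectional formatting characters in a piece of source text. The function name `find_bidi_controls` and the choice of which code points to flag are assumptions made for this example, not an existing tool:

```python
import unicodedata

# Explicit bidirectional formatting characters (embeddings, overrides, isolates).
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def find_bidi_controls(source):
    """Return (line, column, character name) for each bidi control found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((lineno, col, unicodedata.name(ch)))
    return hits


sample = 'x = 1\ny = "a\u202eb"  # hidden override inside the string\n'
for hit in find_bidi_controls(sample):
    print(hit)
```

A real check would probably also have to decide whether such characters are acceptable inside string literals or comments written in right-to-left scripts; this sketch simply reports every occurrence.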

Beyond that, it is not clear to me what, if anything, we should do in
response to this new class of Trojan Source attacks, beyond documenting
it.

--
Steve
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/SXF2BG47UZTI7QM7GB3XCTGEV576UZOE/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Tue, Nov 16, 2021 at 12:13 PM Steven D'Aprano <steve@pearwood.info> wrote:
>
> On Mon, Nov 15, 2021 at 10:43:12PM +1100, Chris Angelico wrote:
>
> > The problems here are not Python's, they are code reviewers', and that
> > means they're really attacks against the code review tools.
>
> I think that's a bit strong. Boucher and Anderson's paper describes
> multiple kinds of vulnerabilities. At a fairly quick glance, the BIDI
> attacks do seem to be a novel attack, and probably exploitable.

The BIDI attacks basically amount to making this:

def func():
    """This is a docstring"""; return

look like this:

def func():
    """This is a docstring; return"""

If you see something that looks like the second, but the word "return"
is syntax-highlighted as a keyword instead of part of the string, the
attack has failed. (Or if you ignore that, then your code review is
flawed, and you're letting malicious code in.) The attack depends for
its success on some human approving some piece of code that doesn't do
what they think it does, and that means it has to look like what it
doesn't do - which is an attack against what the code looks like,
since what it does is very well defined.
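
To make "what it does is very well defined" concrete, here is a small sketch that uses the standard `tokenize` module as a stand-in for the interpreter's tokenizer. It works on the first form above, without any BIDI characters, and only shows how the line is split into tokens regardless of how an editor might display it:

```python
import io
import tokenize

source = 'def func():\n    """This is a docstring"""; return\n'

# Split the source into tokens; display order plays no part in this.
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]
for kind, text in tokens:
    if kind in ("STRING", "NAME", "OP"):
        print(kind, repr(text))
```

The string token ends at the closing quotes and `return` is a separate name token after the semicolon, which is the structure a highlighter needs to reproduce.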

> Whereas the BIDI attacks do (apparently) make it easy to smuggle in
> code: using invisible BIDI control codes, you can introduce source code
> where the way the editor renders the code, and the way the coder reads
> it, is different from the way the interpreter or compiler runs it.

Right: the way the editor renders the code, that's the essential part.
That's why I consider this an attack against some editor (or set of
editors). When you find an editor that is vulnerable to this, file a
bug report against that editor.

The way the coder reads it will be heavily based upon the way the
editor colours it.

> That is, I think, new and exploitable: something that looks like a
> comment is actually code that the interpreter runs, and something that
> looks like code is actually a string or comment which is not executed,
> but editors may syntax-colour it as if it were code.

Right. Exactly my point: editors may syntax-colour it incorrectly.

That's why I consider this not an attack on the language, but on the
editor. As long as the editor parses it the exact same way that the
interpreter does, there isn't a problem.

ChrisA
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/3X6K5YYBRATECDRTN57XNT3QNP2J6ZBG/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Compatibility variants can look different, but they can also look identical. Allowing any non-ASCII characters was worrisome because of the security implications of confusables. Squashing compatibility characters seemed the more conservative choice at the time. Stestagg's example:
? = lambda ?, e: ? if ? > e else e
shows it wasn't perfect, but adding more invisible differences does have risks, even beyond the backwards incompatibility and the problem with (hopefully rare, but are we sure?) editors that don't distinguish between them in the way a programming language would prefer.
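
A minimal sketch of the squashing described here, assuming a current CPython 3. One mathematical letter is used as the example; the variable names and the scratch `namespace` dict are only for illustration:

```python
import unicodedata

bold_e = "\U0001d41e"  # MATHEMATICAL BOLD SMALL E, a compatibility variant of "e"

print(unicodedata.name(bold_e))
print(unicodedata.normalize("NFKC", bold_e) == "e")

# Assign to the bold name, then look at what the namespace actually contains.
namespace = {}
exec(f"{bold_e} = 42", namespace)
names = [key for key in namespace if not key.startswith("__")]
print(names)
```

The name that ends up in the namespace is the plain ASCII one, so the bold and plain spellings refer to the same variable.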

I think (but won't swear) that there were also several problematic characters that really should have been treated as (at most) glyph variants, but ... weren't. If I Recall Correctly, the largest number were Arabic presentation forms, but there were also a few characters that were in Unicode only to support round-trip conversion with a legacy charset, even if that charset had been declared buggy. In at least a few of these cases, it seemed likely that a beginning user would expect them to be equivalent.

-jJ
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/GNT3AG2SCVLMCJAZXSTIWFKKAYG25E7O/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Stephen J. Turnbull wrote:
> Christopher Barker writes:

> > For example, in writing math we often use different scripts to mean
> > different things (e.g. TeX's Blackboard Bold). So if I were to use
> > some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't
> > want them to get normalized.

Agreed, for careful writers. But Stephen's answer about people using the wrong one and expecting it to work means that normalization is probably the lesser of evils for most people, and the ones who don't want it normalized are more likely to be able to specify custom processing when it is important enough. (The compatibility characters aren't normalized in strings, largely because that should still be possible.)
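
To illustrate the parenthetical about strings, here is a short sketch (variable names are mine) showing that a string keeps a compatibility character as written, and how the two normalization forms under discussion treat it when applied explicitly:

```python
import unicodedata

double_struck_r = "\u211d"  # DOUBLE-STRUCK CAPITAL R ("blackboard bold" R)

# A string keeps the character exactly as written.
text = f"x in {double_struck_r}"
print([f"U+{ord(ch):04X}" for ch in text if ord(ch) > 127])

# NFC leaves it alone; NFKC folds it to a plain Latin R.
nfc = unicodedata.normalize("NFC", text)
nfkc = unicodedata.normalize("NFKC", text)
print(nfc == text, nfkc)
```

So code that cares about the distinction can keep it in data and choose its own processing, even though identifiers are folded.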

> In fact, I think adding these symbols to Unicode was a bad idea; they
> should be handled at a higher level in the linguistic stack (by
> semantic markup).

When I was a math student, these were clearly different symbols, with much less relation to each other than a mere case difference.
So by the Unicode consortium's goals, they are independent characters that should each be defined. I admit that isn't ideal for most use cases outside of math, but ... supporting those other cases is what compatibility normalization is for.

> It's also a UX problem. At slightly higher layer in the stack, I'm
> used to using Japanese input methods to input sigma and pi which
> produce characters in the Greek block, and at least the upper case
> forms that denote sum and product have separate characters in the math
> operators block. I understand why people who literally write
> mathematics in Greek might want those not normalized, but I sure am
> going to keep using "Greek sigma", not "math sigma"! The probability
> that I'm going to have a Greek uppercase sigma in my papers is nil,
> the probability of a summation symbol near unity. But the summation
> symbol is not easily available, I have to scroll through all the
> preceding Unicode blocks to find Mathematical Operators. So I am
> perfectly happy with uppercase Greek sigma for that role (as is
> XeTeX!!)

I think that is mostly a backwards compatibility problem; XeTeX itself had to worry about compatibility with TeX (which preceded Unicode) and with the fonts actually available and then with earlier versions of XeTeX.
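
For the sigma example quoted above, a quick sketch of how the two characters differ as far as Python is concerned. This only inspects the Unicode data that ships with the interpreter; the variable names are for illustration:

```python
import unicodedata

greek_sigma = "\u03a3"   # GREEK CAPITAL LETTER SIGMA
summation = "\u2211"     # N-ARY SUMMATION (Mathematical Operators block)

for ch in (greek_sigma, summation):
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        "identifier" if ch.isidentifier() else "not an identifier",
    )

# NFKC does not map the operator onto the letter.
unified = unicodedata.normalize("NFKC", summation) == greek_sigma
print("unified by NFKC:", unified)
```

The Greek letter is usable as an identifier while the operator is a math symbol and is not, so typing "Greek sigma" is the only one of the two that works as a name.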

-jJ
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/JNFLAQUKNCWCJSMBNJZGHVD5ZELOTU6G/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Steven D'Aprano wrote:
> I think
> that many editors in common use don't support bidirectional text, or at
> least the ones I use don't seem to support it fully or correctly. ...
> But, if there is a concrete threat beyond "it looks weird", that is
> another issue.

Based on the original post (and how it looked in my web browser, after various automated reformattings), it seems that one of the failure modes that buggy editors have is that

stuff can be part of the code, even though it looks like part of a comment, or vice versa

This problem might be limited to only some of the bidi controls, and there might even be a workaround specific to # ... but it is an issue. I do not currently have an opinion on how important of an issue it is, or how adequate the workarounds are.

-jJ
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ECO4R655UGPCVFFVAOQZ3DUZVHQY75BX/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
Executive summary:

I guess the bottom line is that I'm sympathetic to both the NFC and
NFKC positions.

I think that wetware is such that people will go to the trouble of
picking out a letter-like symbol from a palette rarely, and in my
environment that's not going to happen at all because I use Japanese
phonetic input to get most symbols ("sekibun" = integral, "siguma" =
sigma), and I don't use calligraphic R for the real line, I use
\newcommand{\R}{{\cal R}}, except on a physical whiteboard, where I
use blackboard bold (go figure that one out!). So to my mind the
letter-like block in Unicode is a failed experiment.

Jim J. Jewett writes:

> When I was a math student, these were clearly different symbols,
> with much less relation to each other than a mere case difference.

Arguable. The letter-like symbols block has script (cursive),
blackboard bold, and Fraktur versions of R. I've seen all of them as
well as plain Roman, bold, italic and bold italic faces used to denote
the real line, and I've personally used most of them for that purpose
depending on availability of fonts and input methods and medium (ie,
computer text vs. hand-written). I've also seen several of them used
for reaction functions or spaces thereof in game theory (although
blackboard bold and Fraktur seem to be used uniquely for the real
line). Clearly the common denominator is the uppercase latin letter
"R", and the glyph being recognizably "R" is necessary and sufficient
to each of those purposes. The story for uppercase sigma as sum is
somewhat similar: sum is by far not the only use of that letter,
although I don't know of any other operator symbol for sum over a set
or series (outside of programming languages, which I think we can
discount).

I agree that we should consider math to be a separate language, but it
doesn't have a consistent script independent of the origins of the
symbols. Even today none of my engineering and economics students can
type any symbols except those in the JIS repertoire, which they type
by original name ("siguma", "ramuda", "arefu", "yajirushi" == arrow,
etc.). "sekibun" == integration does bring up the integral sign in at
least some modern input methods, but it doesn't have a script name,
while "kasann" == addition does not bring up sigma, although "siguma"
does, and "essu" brings up sigma -- but only in "ASCII emoji" strings,
go figure. I have seen students use fullwidth R for the real line,
though, but distinguishing that is a deprecated compatibility feature
of Unicode (and of Japanese practice -- even in very formal university
documents such as grade reports for a final doctoral examination I've
seen numbers and names containing mixed half-width and full-width
ASCII).

So I think "letter-like" was a reasonable idea (I'm pretty sure this
block goes back to the '90s but I'm too lazy to check), but it hasn't
turned out well, and I doubt it ever will.

> So by the Unicode consortium's goals, they are independent
> characters that should each be defined. I admit that isn't ideal
> for most use cases outside of math,

I don't think it even makes sense *inside* of math for the letter-like
symbols. The nature of math means that any "R" will be grabbed for
something whose name starts with "r" as soon as that's convenient.
Something like the integral sign (which is a stretched "S" for "sum"),
OK -- although category theory uses that for "ends" which still don't
look anything like integrals even if you turn them inside out, rotate
90 degrees, and paint them blue.

> > It's also a UX problem. At slightly higher layer in the stack, I'm
> > used to using Japanese input methods to input sigma and pi which
> > produce characters in the Greek block, and at least the upper case
> > forms that denote sum and product have separate characters in the math
> > operators block.
>
> I think that is mostly a backwards compatibility problem; XeTeX
> itself had to worry about compatibility with TeX (which preceded
> Unicode) and with the fonts actually available and then with
> earlier versions of XeTeX.

IMO, the analogy fails because the backward compatibility issue for
Unicode is in the wetware, not in the software.

Steve

Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/YTIIFIF75RMWP5J3GCSXWVXSUP5SX7AA/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On 11/13/21, Terry Reedy <tjreedy@udel.edu> wrote:
> On 11/13/2021 4:35 PM, ptmcg@austin.rr.com wrote:
>>
>> _??????????????????????L?????????????????????? = 12
>>
>> def _??????????????????????(????, p?????????????????????????, ???????????????????????????):
>>
>> ?????????? = ?????????(????) - ?r????????????x????????? - ???????????????????????
>>
>> if s?i???? > _????????????????????H??????????????????L????????:
>>
>> ???? = '%s[%d chars]%s' % (????[:??????????????????????????????], ?????????p, ????[????????????(????) -
>> ?????????????x?????????:])
>>
>> return ?
>>
> * Does not at all work in CommandPrompt

It works for me when pasted into the REPL using the console in Windows
10. I pasted the code into a raw multiline string assignment and then
executed the string with exec(). The only issue is that most of the
pasted characters are displayed using the font's default glyph since
the console host doesn't have font fallback support. Even Windows
Terminal doesn't have font fallback support yet in the command-line
editing mode that Python's REPL uses. But Windows Terminal does
implement font fallback for normal output rendering, so if you assign
the pasted text to string `s`, then print(s) should display properly.

> even after supposedly changing to a utf-8 codepage with 'chcp 65000'.

Changing the console code page is unnecessary with Python 3.6+, which
uses the console's wide-character API. Also, even though it's
irrelevant for the REPL, UTF-8 is code page 65001. Code page 65000 is
UTF-7.
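
As a small illustration of why the two code pages are not interchangeable, here is a sketch using Python's own "utf-8" and "utf-7" codecs, which implement the encodings that code pages 65001 and 65000 name. The sample text is arbitrary:

```python
text = "caf\u00e9"  # one non-ASCII character is enough to show the difference

utf8_bytes = text.encode("utf-8")   # the encoding code page 65001 names
utf7_bytes = text.encode("utf-7")   # the encoding code page 65000 names

print(utf8_bytes)
print(utf7_bytes)

# Reading UTF-8 bytes as UTF-7 does not give the original text back.
try:
    misread = utf8_bytes.decode("utf-7")
except UnicodeDecodeError as exc:
    misread = None
    print("decode failed:", exc)
```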
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/7FGNJ7TMASDOMQAS2LSSQAD2PPURT5W6/
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Mon, Nov 15, 2021 at 8:42 AM Kyle Stanley <aeros167@gmail.com> wrote:

> On Sat, Nov 13, 2021 at 5:04 PM <ptmcg@austin.rr.com> wrote:
>
>>
>>
>> def ????????????????????():
>>
> [... Python code it's easy to believe isn't grammatical ...]

> return ?
>>
>
> 0_o color me impressed, I did not think that would be legal syntax. Would
> be interesting to include in a textbook, if for nothing else other than to
> academically demonstrate that it is possible, as I suspect many are not
> aware.
>

I'm afraid the best Paul, Alex, Anna and I can hope to do is bring it to
the attention of readers of Python in a Nutshell's fourth edition (on
current plans, hitting the shelves about the same time as 3.11, please tell
your friends ;-) ). Sadly, I'm not aware of any academic classes that use
the Nutshell as a course text, so it seems unlikely to gain the
attention of academic communities.

Given the wider reach of this list, however, one might hope that by the
time the next edition comes out this will be old news due to the
publication of blogs and the like. With luck, a small fraction of the
programming community will become better-informed about Unicode and the
design of programming languages. It's interesting that the egalitarian wish
to allow use of native "alphabetics" has turned out to be such a viper's
nest.

Particular thanks to Stephen J. Turnbull for his thoughtful and
well-informed contribution above.

Kind regards,
Steve
Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python) [ In reply to ]
On Mon, Nov 29, 2021 at 1:21 AM Steve Holden <steve@holdenweb.com> wrote:

> It's interesting that the egalitarian wish to allow use of native
> "alphabetics" has turned out to be such a viper's nest.
>

Indeed.

However, is there no way to restrict identifiers at least to the alphabets
of natural languages? Maybe it wouldn’t help much, but does anyone need to
use letter-like symbols designed for math expressions? I would say maybe,
but then they certainly should not be auto-converted to the “normal” letter.

For that matter, why have any auto-conversion at all?

The answer may be that it’s too late to change now, but I don’t think I’ve
seen a compelling (or any?) use case for that conversion.
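
Short of changing the language, a project that wants to avoid the conversion could check for it. Here is a rough sketch of such a check, using the standard `tokenize` module to find names that NFKC would alter; the function name `names_changed_by_nfkc` is made up for this example, and it assumes `tokenize` reports names as written in the source:

```python
import io
import tokenize
import unicodedata


def names_changed_by_nfkc(source):
    """Return (line, name as written, NFKC form) for names that NFKC alters."""
    findings = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type != tokenize.NAME:
            continue
        normalized = unicodedata.normalize("NFKC", tok.string)
        if normalized != tok.string:
            findings.append((tok.start[0], tok.string, normalized))
    return findings


sample = "total = 0\n\U0001d41e = 1\n"  # second name is MATHEMATICAL BOLD SMALL E
for finding in names_changed_by_nfkc(sample):
    print(finding)
```

Names written in ordinary alphabets pass untouched; only names containing compatibility characters, such as the math letter-like symbols, are reported.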

-CHB


> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/FNSI6EXCWMMCXEJNYWVVR5LMFOM6M5ZB/
--
Christopher Barker, PhD (Chris)

Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
