Mailing List Archive: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

Nov 15, 2021, 12:25 AM

Post #26 of 48 (1605 views)

Christopher Barker writes:

> Would a proposal to switch the normalization to NFC only have any hope of
> being accepted?

Hope, yes. Counting you, it's been proposed twice. :-) I don't know
whether it would get through. We know this won't affect the stdlib,
since that's restricted to ASCII. I suppose we could trawl PyPI and
GitHub for "compatibles" (the Unicode term for "K" normalizations).
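
To make the "trawl" idea concrete, here is a rough sketch of flagging
identifiers that NFKC would change but NFC would leave alone. The
function name, the sample text, and the regex-based identifier
extraction are illustrative assumptions only; the regex is crude (it
also picks up keywords and words inside strings and comments), and a
real survey would want a proper tokenizer.

```python
import re
import unicodedata

# Crude identifier pattern: one non-digit word character, then word characters.
# Assumption: good enough for a first pass, not a substitute for a real tokenizer.
_IDENT = re.compile(r"[^\W\d]\w*")


def compat_identifiers(source):
    """Map each identifier-like word that NFKC changes (and NFC does not) to its NFKC form."""
    found = {}
    for name in _IDENT.findall(source):
        nfc = unicodedata.normalize("NFC", name)
        nfkc = unicodedata.normalize("NFKC", name)
        if nfkc != nfc:
            found[name] = nfkc
    return found


# Hypothetical sample: U+FB01 (fi ligature) and U+211D (double-struck R).
sample = "\ufb01le = 1\nplain = 2\n\u211d = 3\n"
print(compat_identifiers(sample))
```

Anything this reports is a name whose spelling in the source differs
from the name Python actually binds under the current NFKC rule.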

> For example, in writing math we often use different scripts to mean
> different things (e.g. TeX's Blackboard Bold). So if I were to use
> some of the Unicode Mathematical Alphanumeric Symbols, I wouldn't
> want them to get normalized.

Independent of the question of the normalization of Python
identifiers, I think using those characters this way is a bad idea.
In fact, I think adding these symbols to Unicode was a bad idea; they
should be handled at a higher level in the linguistic stack (by
semantic markup).

You're confusing two things here. In Unicode, a script is a
collection of characters used for a specific language, typically a set
of Unicode blocks of characters (more or less; there are a lot of Han
ideographs that are recognizable as such to Japanese but are not part
of the repertoire of the Japanese script). That is, these characters
are *different* from others that look like them.

Blackboard Bold is more what we would usually call a "font": the
(math) italic "x" and the (math) bold italic "x" are the same "x", but
one denotes a scalar and the other a vector in many math books. A
roman "R" probably denotes the statistical application, an italic "R"
the reaction function in a game theory model, and a Blackboard Bold "R"
the set of real numbers. But these are all the same character.
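
For what it's worth, the Unicode Character Database records the same
view: the math variants carry a compatibility decomposition tagged
<font> that points back at the plain letter. A small sketch using the
stdlib unicodedata module (the helper name is made up for
illustration):

```python
import unicodedata


def font_variant_report(name):
    """Look a character up by its Unicode name and report how it normalizes."""
    ch = unicodedata.lookup(name)
    return {
        "char": ch,
        "decomposition": unicodedata.decomposition(ch),
        "nfc": unicodedata.normalize("NFC", ch),
        "nfkc": unicodedata.normalize("NFKC", ch),
    }


for name in ("MATHEMATICAL ITALIC CAPITAL R", "DOUBLE-STRUCK CAPITAL R"):
    print(name, font_variant_report(name))
```

NFC leaves both characters untouched, while NFKC folds both to a plain
"R", which is what happens to them in Python identifiers today.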

It's a bad idea to rely on different (Unicode) scripts that use the
same glyphs for different characters to look different from each
other, unless you "own" the fonts to be used. As far as I know
there's no way for a Python program to specify the font to be used to
display itself though. :-)

It's also a UX problem. At a slightly higher layer in the stack, I'm
used to using Japanese input methods to input sigma and pi which
produce characters in the Greek block, and at least the upper case
forms that denote sum and product have separate characters in the math
operators block. I understand why people who literally write
mathematics in Greek might want those not normalized, but I sure am
going to keep using "Greek sigma", not "math sigma"! The probability
that I'm going to have a Greek uppercase sigma in my papers is nil,
the probability of a summation symbol near unity. But the summation
symbol is not easily available: I have to scroll through all the
preceding Unicode blocks to find Mathematical Operators. So I am
perfectly happy with uppercase Greek sigma for that role (as is
XeTeX!!).
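
As a side note on how Python sees these two characters, a quick sketch
with the stdlib (the helper name is only for illustration): the Greek
letter is an ordinary uppercase letter and a valid identifier, while
the summation operator is a math symbol, is not a valid identifier,
and is not folded to the Greek letter by NFKC.

```python
import unicodedata


def describe(name):
    """Return (char, general category, valid identifier?, NFKC form) for a named character."""
    ch = unicodedata.lookup(name)
    return ch, unicodedata.category(ch), ch.isidentifier(), unicodedata.normalize("NFKC", ch)


for name in ("GREEK CAPITAL LETTER SIGMA", "N-ARY SUMMATION"):
    print(name, describe(name))
```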

And the thing is, of course those Greek letters really are Greek
letters: they were chosen because pi is the homophone of p which is
the first letter of "product", and sigma is the homophone of s which
is the first letter of "sum". Å for Ångström is similar, it's the
initial letter of a Swedish name.
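
Unicode seems to agree about this one: the dedicated ANGSTROM SIGN
has a canonical (not merely compatibility) mapping to the ordinary
letter, so even NFC folds it. A minimal check with the stdlib, with
variable names chosen just for this sketch:

```python
import unicodedata

angstrom_sign = unicodedata.lookup("ANGSTROM SIGN")
a_ring = unicodedata.lookup("LATIN CAPITAL LETTER A WITH RING ABOVE")

# NFC alone is enough to turn the sign into the letter.
nfc_of_sign = unicodedata.normalize("NFC", angstrom_sign)
print(angstrom_sign != a_ring, nfc_of_sign == a_ring)
```

So a switch to NFC-only normalization would still treat those two as
the same identifier.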

Sure, we could fix the input methods (and search methods!! -- people
are going to input the character they know that corresponds to the
glyph *they* see, not the bit pattern the *CPU* sees). But that's as
bad as trying to fix mail clients. Not worth the effort because I'm
pretty sure you're gonna fail -- it's one of those "you'll have to pry
this crappy software that annoys admins around the world from my cold
dead fingers" issues, which is why their devs refuse to fix them.

Steve
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/5GHPVNJLLOKBYPE7FSU5766XYP6IJPEK/
Code of Conduct: http://python.org/psf/codeofconduct/

Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

arj.python at gmail

Nov 15, 2021, 12:33 AM

Post #27 of 48 (1605 views)