Boundaries between numbers and identifiers

Apr 26, 2018, 11:37 AM

Post #1 of 11 (177 views)

In Python 2.5 `0or[]` was accepted by the Python parser. It became an
error in 2.6 because "0o" started to be recognized as an incomplete
octal number. `1or[]` is still accepted.

On the other hand, `1if 2else 3` is accepted despite the fact that "2e"
can be recognized as an incomplete floating point number. In this case
the tokenizer pushes "e" back and returns "2".

Shouldn't it do the same with "0o"? It is possible to make `0or[]`
parseable again. The pure-Python implementation of the tokenizer is able
to tokenize this example:

$ echo '0or[]' | ./python -m tokenize
1,0-1,1: NUMBER '0'
1,1-1,3: NAME 'or'
1,3-1,4: OP '['
1,4-1,5: OP ']'
1,5-1,6: NEWLINE '\n'
2,0-2,0: ENDMARKER ''
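
The same kind of token stream can also be inspected from Python code. The
following is a minimal sketch using the standard library `tokenize` module;
the helper name `token_pairs` is made up for illustration, and whether a
given input tokenizes cleanly may depend on the Python version, since the
module's implementation has changed over time.

```python
import io
import tokenize


def token_pairs(source):
    """Return (token type name, token string) pairs for a source string."""
    readline = io.StringIO(source).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]


# Show how the stdlib tokenizer splits "2else" into a number and a keyword.
for name, text in token_pairs("1if 2else 3\n"):
    print(f"{name:10} {text!r}")
```

Note that `tokenize` reports keywords such as `if` and `else` as NAME
tokens, which is why the boundary question is phrased in terms of numbers
and identifiers.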

On the other hand, all these examples look weird. There is an asymmetry:
`1or 2` is valid syntax, but `1 or2` is not. It is hard to recognize
visually the boundary between a number and the following identifier or
keyword, especially since numbers can contain letters ("b", "e", "j",
"o", "x") and underscores, and identifiers can contain digits. Letters,
digits, and underscores can appear on both sides of the boundary.

I propose to change the Python syntax by adding a requirement that there
should be a whitespace or delimiter between a numeric literal and the
following keyword.

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/list-python-dev%40lists.gossamer-threads.com

Re: Boundaries between numbers and identifiers [ In reply to ]

lukasz at langa

Apr 26, 2018, 12:02 PM

Post #2 of 11 (177 views)

> On Apr 26, 2018, at 11:37 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
>
> I propose to change the Python syntax by adding a requirement that there should be a whitespace or delimiter between a numeric literal and the following keyword.

-1

This would make Python 3.8 reject code due to stylistic preference. Code that it actually can unambiguously parse today.

I agree that a formatting style that omits whitespace between numerals and other tokens is terrible. However, if you start downright rejecting it, you will likely punish the wrong people. Users of third-party libraries will be met with random parsing errors in files they have no control over. This is not helpful.

And given BPO-33338 the standard library tokenizer would have to keep parsing those things as is.

Making 0or[] work again is also not worth it since that's been broken since Python 2.6 and hopefully nobody is running Python 2.5-only code anymore.

What we should do instead is make the standard library tokenizer reflect the behavior of Python 2.6+.

-- Ł

Re: Boundaries between numbers and identifiers [ In reply to ]

storchaka at gmail

Apr 26, 2018, 12:53 PM

Post #3 of 11 (176 views)

26.04.18 22:02, Lukasz Langa wrote:
>> On Apr 26, 2018, at 11:37 AM, Serhiy Storchaka <storchaka@gmail.com> wrote:
>>
>> I propose to change the Python syntax by adding a requirement that there should be a whitespace or delimiter between a numeric literal and the following keyword.
> -1
>
> This would make Python 3.8 reject code due to stylistic preference. Code that it actually can unambiguously parse today.

Of course I don't propose to make it a syntax error in 3.8. It should
first emit a SyntaxWarning and be converted into an error only in 3.10.

Or maybe first add a rule for this in PEP 8 and make it a syntax error
in distant future, after all style checkers include it.

> I agree that a formatting style that omits whitespace between numerals and other tokens is terrible. However, if you start downright rejecting it, you will likely punish the wrong people. Users of third-party libraries will be met with random parsing errors in files they have no control over. This is not helpful.
>
> And given BPO-33338 the standard library tokenizer would have to keep parsing those things as is.
>
> Making 0or[] work again is also not worth it since that's been broken since Python 2.6 and hopefully nobody is running Python 2.5-only code anymore.
>
> What we should do instead is make the standard library tokenizer reflect the behavior of Python 2.6+.

The behavior of the standard library tokenizer doesn't contradict the
rules. It is the most natural behavior for a regex-based tokenizer.
Actually, it is the behavior of the builtin tokenizer that may be
incorrect. In any case, accepting `1if 2else 3` while rejecting `0or[]`
looks weird. They should use the same rule: "0or" and "2else" should be
considered ambiguous or unambiguous in the same way.

Re: Boundaries between numbers and identifiers [ In reply to ]

storchaka at gmail

Apr 13, 2021, 12:52 PM

Post #4 of 11 (107 views)

A new example was found recently (see https://bugs.python.org/issue43833).

>>> [0x1for x in (1,2)]
[31]

It is parsed as [0x1f or x in (1,2)] instead of [0x1 for x in (1,2)].
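
To see why the result is [31], the spaced-out form of what the tokenizer
sees can be inspected with the standard library `ast` module. This is only
an illustrative sketch: it parses the explicit `0x1f or x` spelling rather
than the glued one, because how the glued form is treated may differ
between Python versions.

```python
import ast

# The spaced-out form of what the tokenizer sees in `[0x1for x in (1,2)]`.
tree = ast.parse("[0x1f or x in (1, 2)]", mode="eval")
element = tree.body.elts[0]

print(type(tree.body).__name__)   # List (a plain list display, not ListComp)
print(type(element).__name__)     # BoolOp (the `or` expression)
print(element.values[0].value)    # 31, the value of the hex literal 0x1f

# `or` short-circuits on the truthy 31, so the undefined name `x`
# is never evaluated and no NameError is raised.
print(eval(compile(tree, "<example>", "eval")))  # [31]
```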

Since this code is clearly ambiguous, it makes more sense to emit a
SyntaxWarning if there is no space between a number and an identifier.

_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/XYAEYTON7Q2XNUYYQ65DHSHE2JRHNYYX/
Code of Conduct: http://python.org/psf/codeofconduct/

Re: Boundaries between numbers and identifiers [ In reply to ]

guido at python

Apr 13, 2021, 1:54 PM

Post #5 of 11 (107 views)

On Tue, Apr 13, 2021 at 12:55 PM Serhiy Storchaka <storchaka@gmail.com>
wrote:

> New example was found recently (see https://bugs.python.org/issue43833).
>
> >>> [0x1for x in (1,2)]
> [31]
>
> It is parsed as [0x1f or x in (1,2)] instead of [0x1 for x in (1,2)].
>
> Since this code is clearly ambiguous, it makes more sense to emit a
> SyntaxWarning if there is no space between number and identifier.
>

I would totally make that a SyntaxError, and backwards compatibility be
damned.

--
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)*
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>

Re: Boundaries between numbers and identifiers [ In reply to ]

barry at python

Apr 13, 2021, 3:24 PM

Post #6 of 11 (107 views)

On Apr 13, 2021, at 12:52, Serhiy Storchaka <storchaka@gmail.com> wrote:
>
> New example was found recently (see https://bugs.python.org/issue43833).
>
> >>> [0x1for x in (1,2)]
> [31]
>
> It is parsed as [0x1f or x in (1,2)] instead of [0x1 for x in (1,2)].

That’s a wonderfully terrible example! Who’s maintaining the list? :D

-Barry

Re: Boundaries between numbers and identifiers [ In reply to ]

greg.ewing at canterbury

Apr 13, 2021, 3:42 PM

Post #7 of 11 (107 views)

On 14/04/21 8:54 am, Guido van Rossum wrote:
> On Tue, Apr 13, 2021 at 12:55 PM Serhiy Storchaka <storchaka@gmail.com> wrote:
>
> >>> [0x1for x in (1,2)]
>
> I would totally make that a SyntaxError, and backwards compatibility be
> damned.

Indeed. Python is not Fortran!

--
Greg

Re: Boundaries between numbers and identifiers [ In reply to ]

vstinner at python

Apr 13, 2021, 4:01 PM

Post #8 of 11 (107 views)

It would be useful to first estimate how many projects would be broken
by such an incompatible change (stricter syntax).

Inada-san wrote
https://github.com/methane/notes/blob/master/2020/wchar-cache/download_sdist.py
to download source files using
https://hugovk.github.io/top-pypi-packages/ API (top 4000 PyPI
projects).
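
One possible way to do such an estimate, once the sources are downloaded,
is to look for NUMBER tokens immediately followed by a NAME token. The
sketch below is only an assumption about how that could be done with the
standard library `tokenize` module; `find_glued_numbers` is a hypothetical
helper, and whether the tokenizer accepts a given input may depend on the
Python version.

```python
import io
import tokenize


def find_glued_numbers(source):
    """Yield (row, col, number, name) for each NAME token that starts
    exactly where the preceding NUMBER token ends."""
    prev = None
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if (
            prev is not None
            and prev.type == tokenize.NUMBER
            and tok.type == tokenize.NAME
            and prev.end == tok.start  # no whitespace between the two tokens
        ):
            yield (tok.start[0], tok.start[1], prev.string, tok.string)
        prev = tok


for hit in find_glued_numbers("x = 1if 2else 3\n"):
    print(hit)
```

Working on tokens rather than raw text avoids false positives inside
string literals and comments. Running it over a whole project would need a
loop over files plus handling of files that fail to tokenize, which is left
out here.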

Victor

--
Night gathers, and now my watch begins. It shall not end until my death.

Re: Boundaries between numbers and identifiers [ In reply to ]

mertz at gnosis

Apr 13, 2021, 7:22 PM

Post #9 of 11 (107 views)

I feel like all of these examples, if found in the wild, are far more
likely to be uncaught bugs than programmer intent. Being strict about
spaces (or parens, brackets, etc. in other contexts) around numbers is
much more straightforward than a number of edge cases where it is not
obvious what will happen.

Re: Boundaries between numbers and identifiers [ In reply to ]

vstinner at python

Apr 14, 2021, 4:51 AM

Post #10 of 11 (107 views)

Also, would it be possible to enhance the tokenizer to report a
SyntaxWarning, rather than a SyntaxError?
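
For anyone who wants to check what a particular interpreter does today,
warnings raised during compilation can be recorded with the standard
library `warnings` module. This is a minimal sketch; `compile_warnings` is
a hypothetical helper name, and the result for glued literals is expected
to vary by Python version.

```python
import warnings


def compile_warnings(source):
    """Compile source and return the category names of any warnings raised."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including ones hidden by default filters.
        warnings.simplefilter("always")
        compile(source, "<example>", "exec")
    return [w.category.__name__ for w in caught]


print(compile_warnings("x = 1 if 2 else 3\n"))
```

Passing a glued spelling such as "x = 1if 2else 3\n" instead shows whether
the running interpreter stays silent or reports a warning for it.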

Victor

Re: Boundaries between numbers and identifiers [ In reply to ]

damian.peter.shaw at gmail

Apr 15, 2021, 6:20 AM

Post #11 of 11 (107 views)

This isn't a "professional" or probably even "valid" use case for Python,
but one area where this behavior is heavily used is code golf. For those
not familiar with it, code golf is a type of puzzle where the objective is
to meet a set of requirements in as few source code characters as
possible. Among mainstream languages, Python is surprisingly good at
code golf.

This is just for-fun puzzle solving and not a reason to keep or change
syntax in any particular way; in fact, succeeding at code golf may even be
loosely correlated with bad syntax rules, as puzzles tend to be completed
in one of the least readable ways a language can be written. But at least
be aware that if this becomes forbidden syntax, that's likely the most
affected area of Python usage.

But it also made me think it could affect code minifiers, which are
apparently a real use case in Python:
https://github.com/dflook/python-minifier (It seems this minifier doesn't
actually remove the spaces between numbers and keywords where it could,
but it is a fascinating niche of Python I did not know about.)

Regards
Damian
(he/him)
