Mailing List Archive

A proposal to modify `None` so that it hashes to a constant
I wrote a doc stating my case here:
https://docs.google.com/document/d/1et5x5HckTJhUQsz2lcC1avQrgDufXFnHMin7GlI5XPI/edit#

Briefly,

1. The main motivation for it is to allow users to get a predictable result
on a given input (for programs that are doing pure compute, in domains like
operations research / compilation), any time they run their program. Having
stable repro is important for debugging. Notebooks with statistical
analysis are another similar case where this is needed: you might want
other people to run your notebook and get the same result you did.

2. The reason the hash non-determinism of None matters in practice is that
it can infect commonly used mapping key types, such as frozen dataclasses
containing `Optional[int]` fields.

3. Non-determinism emerging from other value types like `str` can be
disabled by the user using `PYTHONHASHSEED`, but there's no such protection
against `None`.

All it takes is for your program to compute a set somewhere with affected
keys, and iterate on it - and determinism is lost.

The need to modify None itself is caused by two factors
- `Optional` being implemented effectively as `T | None` in Python as a
strongly established practice
- The fact that `__hash__` is an intrinsic property of a type in Python,
the hashing function cannot be externally supplied to its builtin container
types. So we have to modify the type None itself, rather than write some
alternative hasher that we could use if we care about deterministic
behavior across runs.

This was debated at length over the forum and in discord.
I also posted a PR for it, and it was closed, see:

https://github.com/python/cpython/issues/99540
https://github.com/python/cpython/pull/99541

Asking for opinions, and to re-open the PR, provided there is enough
support for such a change to take place.
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
On Mon, Nov 28, 2022 at 11:13:34PM +0000, Oscar Benjamin wrote:
> On Mon, 28 Nov 2022 at 22:56, Brett Cannon <brett@python.org> wrote:

> > That's actually by design. Sets are not meant to be deterministic
> > conceptually as they are essentially a bag of stuff. If you want
> > deterministic ordering you should convert it to a list and sort the
> > list.
>
> What does "sets are not meant to be deterministic" even mean?

I'm not Brett, so I'm not answering for him, but many people (sometimes
including me) misuse "not deterministic" to refer to set order and the
old dict order. Of course set order is deterministic, it is determined
by the combination of the set implementation, the hashing algorithm, and
the history of the set -- all the items that have ever appeared in the
set (both those removed and those that remain).

Set order is deterministic in the same way that roulette wheels are
deterministic.

We don't have a good term for this: the items don't appear in random
order. But it is not predictable either, except in very special cases.
"Arbitrary" is not right either, since that implies we can easily choose
whatever set order we want.

I think that physicists call this "deterministic chaos"

https://www.quora.com/Theoretical-Physics-What-is-deterministic-but-unpredictable

(sorry for the quora link) so I guess we might say that set iteration
order is "deterministically chaotic" if you want to be precise, but life
is too short for that level of pedantry (and that's coming from me, a
Grade A Pedant *wink*) so people often describe it as random, arbitrary,
or non-deterministic when it's none of those things :-).

Maybe we could call set order "pseudo-random".

Getting back to the design part, I think that what Brett is trying to
get across is not that the chaotic set order is in and of itself a
requirement, but that given the other requirements, chaotic set order is
currently considered a necessary condition.

As I understand it, we could make sets ordered, but only at the cost of
space (much more memory) or time (slower) or both.

I am sure that Guido is correct that **if** somebody comes up with a
fast, efficient ordered set implementation that doesn't perform worse
than the current implementation, we will happily swap to giving sets a
predictable order, as we did with dicts. (Practicality beats purity --
even if sets are *philosophically* unordered, preserving input order is
too useful to give up unless we gain something in return.)

But I don't think it is fair or kind to call Brett's argument FUD. At
the very least it is uncharitable interpretation of Brett's position.

> It would be useful to have a straight-forward way to sort a set into a
> deterministic ordering but no such feature exists after the Py3K
> changes (sorted used to do this in Python 2.x).

`sorted()` works fine on homogeneous sets. It is only heterogeneous sets
that are a problem, and in practice, that usually means None mixed in
with some other type.


--
Steve
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/UAVNSMS3XFAQQ6ZRD27IXPTWZFCHP6M4/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
On Tue, Nov 29, 2022 at 01:34:54PM +1300, Greg Ewing wrote:

> I got the impression that there were some internal language reasons
> to want stable dicts, e.g. so that the class dict passed to __prepare__
> preserves the order in which names are assigned in the class body. Are
> there any such use cases for stable sets?

Some people wanted order preserving kwargs, I think for web frameworks.
There was even a discussion for a while about using OrderedDict for
kwargs and leaving dicts unordered.

For me, the biggest advantage of preserving input order in dicts is that
it is now easier to write doctests using the dict's repr. It would be
nice to be able to do the same for sets, but not nice enough to justify
making them bigger or slower.


--
Steve
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/5HWYYKDJLGDZT5IZIXM42EWF2WPFXKBJ/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
On Tue, Nov 29, 2022 at 02:07:34AM +0000, Oscar Benjamin wrote:

> Let's split this into two separate questions:

Let's not. Your first question about non-deterministic set order being
"innately good" is a straw man: as we've already discussed, set order is
not non-deterministic (except in the informal sense of "hard to
predict") and I don't think anyone is arguing in favour of keeping set
order unpredictable even if there are faster, more compact, simpler
implementations which preserve order.

Talking about determinism is just muddying the waters. Sets are
deterministic: their order is determinied by the implementation, the set
history, and potentially any environmental sources of entropy used in
address randomisation. Deterministic does not mean predictable.

If we want to get this discussion onto a more useful path, we should
start with a security question:

The hash of None changes because of address randomisation. Address
randomisation is enabled as a security measure. Are there any security
consequences of giving None a constant hash?

I can't see any, but then I couldn't see the security consequences of
predictable string hashes until they were pointed out to me. So it would
be really good to have some security experts comment on whether this is
safe or not.


> why are we even asking about "set order" rather than the
> benefits of determinism in general?

Because this entire discussion is motivated by the OP who wants
consistent set order across multiple runs of his Python application.
That's what he needs; having None hash to a constant value is just a
means to that end.

His sets contain objects whose hashes depend on `Optional[int]`, which
means sometimes they include None, and when he runs under an interpreter
built with address randomisation, the hash of None can change.

Even if we grant None a constant hash, that still does not guarantee
consistent set order across runs. At best, we might get such consistent
order as an undocumented and changeable implementation detail, until we
change the implementation of hashing, or of sets, or of something
seemingly unrelated like address randomisation.


--
Steve
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/UL2RZQLAAWK6ZUQPHA3RRN22TPCE7PLL/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
On Sun, Nov 27, 2022 at 11:36 AM Yoni Lavi <yoni.lavi.p@gmail.com> wrote:

> I wrote a doc stating my case here:
>
> https://docs.google.com/document/d/1et5x5HckTJhUQsz2lcC1avQrgDufXFnHMin7GlI5XPI/edit#
>
> Briefly,
>
> 1. The main motivation for it is to allow users to get a predictable
> result on a given input (for programs that are doing pure compute, in
> domains like operations research / compilation), any time they run their
> program. Having stable repro is important for debugging. Notebooks with
> statistical analysis are another similar case where this is needed: you
> might want other people to run your notebook and get the same result you
> did.
>

But the hash of an object is not guaranteed to be stable by the language,
so I would argue someone expecting that is expected to convert
random-access data structures to ones that are consistent when necessary
(e.g. sorted lists).


>
> 2. The reason the hash non-determinism of None matters in practice is that
> it can infect commonly used mapping key types, such as frozen dataclasses
> containing `Optional[int]` fields.
>

I don't see why the hashing within a dict needs to be consistent as that's
not a guarantee we make with Python.


>
> 3. Non-determinism emerging from other value types like `str` can be
> disabled by the user using `PYTHONHASHSEED`, but there's no such protection
> against `None`.
>

If I remember correctly, PYTHONHASHSEED was added to help folks migrate
when we added randomness to hashing as they had accidentally come to expect
a consistent iteration order on dictionary keys. I wouldn't take its
existence to suggest that PYTHONHASHSEED is meant to make **all** hashing
consistent (e.g. people who implement their own __hash__ don't have to
follow that expectation).


>
> All it takes is for your program to compute a set somewhere with affected
> keys, and iterate on it - and determinism is lost.
>

That's actually by design. Sets are not meant to be deterministic
conceptually as they are essentially a bag of stuff. If you want
deterministic ordering you should convert it to a list and sort the list.


>
> The need to modify None itself is caused by two factors
> - `Optional` being implemented effectively as `T | None` in Python as a
> strongly established practice
> - The fact that `__hash__` is an intrinsic property of a type in Python,
> the hashing function cannot be externally supplied to its builtin container
> types. So we have to modify the type None itself, rather than write some
> alternative hasher that we could use if we care about deterministic
> behavior across runs.
>
> This was debated at length over the forum and in discord.
> I also posted a PR for it, and it was closed, see:
>
> https://github.com/python/cpython/issues/99540
> https://github.com/python/cpython/pull/99541
>
> Asking for opinions, and to re-open the PR, provided there is enough
> support for such a change to take place.
>

I personally agree with the arguments made in the issue, so I'm afraid I
don't' support making the change as we worked hard to stop people from
relying on consistent hashing/iteration from random-access data structures
like dict and set.

-Brett


>
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-leave@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/KUH4HZYKPBO57A73QKCGU4GD2JNY3VMH/
> Code of Conduct: http://python.org/psf/codeofconduct/
>
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
On Tue, 29 Nov 2022 at 09:51, Brett Cannon <brett@python.org> wrote:
> ... we worked hard to stop people from relying on consistent hashing/iteration from random-access data structures like dict and set.
>

Say what? Who's been working hard to stop people from relying on
consistent iteration order for a dict?

ChrisA
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/WE5D4ZVU2ZUBDW7VX27PJBPWJDWLONED/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
On Mon, 28 Nov 2022 at 22:56, Brett Cannon <brett@python.org> wrote:
>
> On Sun, Nov 27, 2022 at 11:36 AM Yoni Lavi <yoni.lavi.p@gmail.com> wrote:
>>
>> All it takes is for your program to compute a set somewhere with affected keys, and iterate on it - and determinism is lost.
>
> That's actually by design. Sets are not meant to be deterministic conceptually as they are essentially a bag of stuff. If you want deterministic ordering you should convert it to a list and sort the list.

What does "sets are not meant to be deterministic" even mean?

Mathematically speaking sets are not meant to be ordered in any
particular way but a computational implementation has to have some
order and there is no reason to prefer non-deterministic order in
general. Actually determinism in a computational context is usually a
very valuable feature. I find it hard to see why non-determinism is
"by design".

Also it isn't usually possible to sort a list containing None:

In [9]: sorted([None, 1, 2])
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-344383189210> in <module>
----> 1 sorted([None, 1, 2])

TypeError: '<' not supported between instances of 'int' and 'NoneType'

It would be useful to have a straight-forward way to sort a set into a
deterministic ordering but no such feature exists after the Py3K
changes (sorted used to do this in Python 2.x).

--
Oscar
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ILP2ZKVXQIF2ONOWRJCMLNHI3LFUFBD3/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
To stir up some more fire, I would personally be fine with sets having the
same ordering guarantees as dicts, *IF* it can be done without performance
degradations. So far nobody has come up with a way to ensure that. "Sets
weren't meant to be deterministic" sounds like a remnant of the old
philosophy, where we said the same about dicts -- until they became
deterministic without slowing down, and then everybody loved it.

Regarding hash(None), I think that there is something to be said for making
that stable, and the arguments against it feel like rationalizations for
FUD. We've survived way larger controversies. I also note that hash(()) is
apparently stable.

On Mon, Nov 28, 2022 at 3:16 PM Oscar Benjamin <oscar.j.benjamin@gmail.com>
wrote:

> On Mon, 28 Nov 2022 at 22:56, Brett Cannon <brett@python.org> wrote:
> >
> > On Sun, Nov 27, 2022 at 11:36 AM Yoni Lavi <yoni.lavi.p@gmail.com>
> wrote:
> >>
> >> All it takes is for your program to compute a set somewhere with
> affected keys, and iterate on it - and determinism is lost.
> >
> > That's actually by design. Sets are not meant to be deterministic
> conceptually as they are essentially a bag of stuff. If you want
> deterministic ordering you should convert it to a list and sort the list.
>
> What does "sets are not meant to be deterministic" even mean?
>
> Mathematically speaking sets are not meant to be ordered in any
> particular way but a computational implementation has to have some
> order and there is no reason to prefer non-deterministic order in
> general. Actually determinism in a computational context is usually a
> very valuable feature. I find it hard to see why non-determinism is
> "by design".
>
> Also it isn't usually possible to sort a list containing None:
>
> In [9]: sorted([None, 1, 2])
> ---------------------------------------------------------------------------
> TypeError Traceback (most recent call last)
> <ipython-input-9-344383189210> in <module>
> ----> 1 sorted([None, 1, 2])
>
> TypeError: '<' not supported between instances of 'int' and 'NoneType'
>
> It would be useful to have a straight-forward way to sort a set into a
> deterministic ordering but no such feature exists after the Py3K
> changes (sorted used to do this in Python 2.x).
>
> --
> Oscar
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-leave@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/ILP2ZKVXQIF2ONOWRJCMLNHI3LFUFBD3/
> Code of Conduct: http://python.org/psf/codeofconduct/
>


--
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)*
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
On 29/11/22 12:51 pm, Guido van Rossum wrote:
> "Sets weren't meant to be deterministic" sounds like a remnant of
> the old philosophy, where we said the same about dicts -- until they
> became deterministic without slowing down, and then everybody loved it.

I got the impression that there were some internal language reasons
to want stable dicts, e.g. so that the class dict passed to __prepare__
preserves the order in which names are assigned in the class body. Are
there any such use cases for stable sets?

--
Greg
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/3TEOQMV2UXOKWMHVYA63JLPLAZ2TNX55/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
Nah, `__prepare__` very much predates stable dicts and that problem was
solved differently.

On Mon, Nov 28, 2022 at 4:46 PM Greg Ewing <gcewing@snap.net.nz> wrote:

> On 29/11/22 12:51 pm, Guido van Rossum wrote:
> > "Sets weren't meant to be deterministic" sounds like a remnant of
> > the old philosophy, where we said the same about dicts -- until they
> > became deterministic without slowing down, and then everybody loved it.
>
> I got the impression that there were some internal language reasons
> to want stable dicts, e.g. so that the class dict passed to __prepare__
> preserves the order in which names are assigned in the class body. Are
> there any such use cases for stable sets?
>
> --
> Greg
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-leave@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/3TEOQMV2UXOKWMHVYA63JLPLAZ2TNX55/
> Code of Conduct: http://python.org/psf/codeofconduct/
>


--
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)*
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
For what it's worth, as a user of the language I would like sets to behave
as much as possible as-if they were basically dicts that map all elements
to `()`. That way I'd have to keep one less mental model in my head.

I deliberately say 'as-if' because when I'm a user of the language, I don't
care how it's implemented. (Just like I don't have to care as a user that
we have (at least) two different ways dicts are represented internally.)
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
On Tue, 29 Nov 2022 at 01:33, Steven D'Aprano <steve@pearwood.info> wrote:
>
> On Mon, Nov 28, 2022 at 11:13:34PM +0000, Oscar Benjamin wrote:
> > On Mon, 28 Nov 2022 at 22:56, Brett Cannon <brett@python.org> wrote:
>
> As I understand it, we could make sets ordered, but only at the cost of
> space (much more memory) or time (slower) or both.
>
> I am sure that Guido is correct that **if** somebody comes up with a
> fast, efficient ordered set implementation that doesn't perform worse
> than the current implementation, we will happily swap to giving sets a
> predictable order, as we did with dicts. (Practicality beats purity --
> even if sets are *philosophically* unordered, preserving input order is
> too useful to give up unless we gain something in return.)

Let's split this into two separate questions:

1. Is it *innately* good that set order is non-deterministic?
2. Are there some other reasons why it is good to choose a model that
implies *non-deterministic* set order?

The answer to 1. is emphatically NO. In fact the question itself is
badly posed: why are we even asking about "set order" rather than the
benefits of determinism in general? If I want my code to be
deterministic then that's just something that I want regardless of
whether sets, dicts, floats etc are involved.

As for point 2. the fact that sets are currently non-deterministic is
actually a relatively new thing in Python. Before hash-randomisation
set and dict order *was* deterministic but with an arbitrary order.
That was only changed because of a supposed security issue with hash
collisions. Prior to that it was well understood that determinism was
beneficial (honestly I don't understand why I have to state this point
explicitly: determinism is almost always best in our context).

Please everyone don't confuse arbitrary order, implementation defined
order and non-deterministic order. There is no reason why sets in
Python need to have a *non-deterministic* order or at least why there
shouldn't be a way to control that. There is no performance penalty in
making the order *deterministic*. (If you think that there might be a
performance penalty then you haven't understood the suggestion!)

> > It would be useful to have a straight-forward way to sort a set into a
> > deterministic ordering but no such feature exists after the Py3K
> > changes (sorted used to do this in Python 2.x).
>
> `sorted()` works fine on homogeneous sets. It is only heterogeneous sets
> that are a problem, and in practice, that usually means None mixed in
> with some other type.

That is of course precisely the context for this thread!

--
Oscar
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/YGXJBLQOQBZL5ZTEPJ2V2B3KUCXDXSK2/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
On Tue, 29 Nov 2022 at 13:12, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
> As for point 2. the fact that sets are currently non-deterministic is
> actually a relatively new thing in Python. Before hash-randomisation
> set and dict order *was* deterministic but with an arbitrary order.
> That was only changed because of a supposed security issue with hash
> collisions. Prior to that it was well understood that determinism was
> beneficial (honestly I don't understand why I have to state this point
> explicitly: determinism is almost always best in our context).

To clarify: The hash collision attack is a very real one, but specific
to dictionaries of string keys, since there are quite a few ways for
an attacker to send a string that gets automatically parsed into such
a dictionary (eg web app frameworks where the request parameters are
made available as a dictionary). But since that attack surface is *so*
specific, randomization of non-string hashes is unimportant.

ChrisA
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/PIUODXYX4ZYXHGKONYCRQKOGDYOAGDEE/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
On Tue, Nov 29, 2022 at 08:51:09PM -0000, Yoni Lavi wrote:

> It does make your argument invalid though, since it's based on this
> assumption that I was asking for a requirement on iteration order
> (e.g. like dict's iteration order = insertion order guarantee), which
> is not the case.

Yoni, I think this answer is disingenious. Over on the Discuss threads,
you have made it clear that the primary reason why you want hash(None)
to return a constant value is so that set iteration order will be
consistent from one run to another.

Sure, you may *also* have some philosophical preference for None having
a fixed hash, but you have barely mentioned that the same applies to
Ellipsis and NotImplemented (and then only in passing), and that's not
your driving motivation for this proposal.

Let's consider a thought-experiment: suppose we agree to your proposal
to make hash(None) return a constant, but at the same time modify the
set iteration algorithm so that it starts from a different position each
time you iterate, making set iteration order even more unpredictable and
deterministically chaotic than it is now.

Would that satisfy you? From your posts on Discuss, I don't expect that
you will be happy with this outcome.

Personally, and I don't expect that this will be a popular opinion, I
wish that data structures that don't guarantee their iteration order
would actively mix up their iteration order so that people wouldn't be
tempted to rely on whatever accidental order they happen to get.

Sure, it would cost a little bit of code to implement, but think of the
savings in unproductive arguments and discussions like this one :-(

It would also break the invariant that `repr(data) == repr(data)` but it
is times like this that I feel that it would be worth it.


--
Steve
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/JIR4WQ5B432BQ4R27EMWZP6HCNFLFPMU/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
[Oscar Benjamin]
> (If you think that there might be a
> performance penalty then you haven't understood the suggestion!)

Then I don't understand the question, and I will refrain from participating
further in this discussion.

--
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)*
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
> I can't see any, but then I couldn't see the security consequences of
> predictable string hashes until they were pointed out to me. So it would
> be really good to have some security experts comment on whether this is
> safe or not.

I can't either. I can point out that the complexity attack via hash collisions is not possible on None specifically because it is a singleton, so:
1. There may only be one None in a dict at most,
2. For composite types that depend on None and other things, like say a tuple of string and an optional int, they become as resistant to attack as the same types without None, in this case string and int.
3. Python only explicitly added this security (hash randomization) feature to string and bytes hashes, and if your composite key type depends on those types, it will still be protected regardless of what None does


> Because this entire discussion is motivated by the OP who wants
> consistent set order across multiple runs of his Python application.
> That's what he needs; having None hash to a constant value is just a
> means to that end.

Not entirely. I do explain in my doc why there is a foundational reason why, once the choice was made to have None represent the `None` of Optional[T], it entered the family of types that you would expect to have deterministic behavior. And as the hashing function is intrinsic to types in Python, it is included in that.


> Even if we grant None a constant hash, that still does not guarantee
> consistent set order across runs. At best, we might get such consistent
> order as an undocumented and changeable implementation detail, until we
> change the implementation of hashing, or of sets, or of something
> seemingly unrelated like address randomisation.

ASLR will not cause any trouble so long as I keep objects with identity based hashing out of my sets. Or at least, any sets I later iterate on and run imperative code with side-effects under the loop.
The possibility of disabling ASLR is not a reason to dismiss this change. No one guarantees a user of Python is in a position to make infosec related decisions on the computer system they are working on.

Regarding the possibilities of hashes changing for the worse (in this regard) - sure. Anything is possible.

Regarding sets - the set implementation is deterministic, and always has been (for decades).
Yes, in theory it is possible that a new version or implementation of Python will make sets non-deterministic - shuffle themselves in the background independently from their history of operations, or what have you, and then my change loses all of its value.
Like I said, anything is possible. But I think these last two points are essentially FUD.

I made my proposal because I believe the FUD scenarios are strongly unlikely. (and even then, at worst we end up with a "practically useless" behavior on None, that can also be freely reverted along with such other changes anyway)
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/GD3N4DZQXISYM6UQQLMKJC6Z52N54IO3/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
Looks like it's just miscommunication.

There is the original proposal I made, strictly about how None is ought to be hashed, and then there is the separate topic of changing the stability properties of iteration on sets, and whether that can be made with/without a performance regression...
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/MX32ODGSEGRB2FSJS3CCR55GHXNY644X/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
On Mon, Nov 28, 2022 at 5:38 PM Steven D'Aprano <steve@pearwood.info> wrote:

> On Mon, Nov 28, 2022 at 11:13:34PM +0000, Oscar Benjamin wrote:
> > On Mon, 28 Nov 2022 at 22:56, Brett Cannon <brett@python.org> wrote:
>
> > > That's actually by design. Sets are not meant to be deterministic
> > > conceptually as they are essentially a bag of stuff. If you want
> > > deterministic ordering you should convert it to a list and sort the
> > > list.
> >
> > What does "sets are not meant to be deterministic" even mean?
>
> I'm not Brett, so I'm not answering for him, but many people (sometimes
> including me) misuse "not deterministic" to refer to set order and the
> old dict order.


Yep, that's what I mean. And my comment about dicts earlier is old
knowledge from before we made dict iteration deterministic based on
insertion order since I'm been doing this for too long. ????

-Brett


> Of course set order is deterministic, it is determined
> by the combination of the set implementation, the hashing algorithm, and
> the history of the set -- all the items that have ever appeared in the
> set (both those removed and those that remain).
>
> Set order is deterministic in the same way that roulette wheels are
> deterministic.
>
> We don't have a good term for this: the items don't appear in random
> order. But it is not predictable either, except in very special cases.
> "Arbitrary" is not right either, since that implies we can easily choose
> whatever set order we want.
>
> I think that physicists call this "deterministic chaos"
>
>
> https://www.quora.com/Theoretical-Physics-What-is-deterministic-but-unpredictable
>
> (sorry for the quora link) so I guess we might say that set iteration
> order is "deterministically chaotic" if you want to be precise, but life
> is too short for that level of pedantry (and that's coming from me, a
> Grade A Pedant *wink*) so people often describe it as random, arbitrary,
> or non-deterministic when it's none of those things :-).
>
> Maybe we could call set order "pseudo-random".
>
> Getting back to the design part, I think that what Brett is trying to
> get across is not that the chaotic set order is in and of itself a
> requirement, but that given the other requirements, chaotic set order is
> currently considered a necessary condition.
>
> As I understand it, we could make sets ordered, but only at the cost of
> space (much more memory) or time (slower) or both.
>
> I am sure that Guido is correct that **if** somebody comes up with a
> fast, efficient ordered set implementation that doesn't perform worse
> than the current implementation, we will happily swap to giving sets a
> predictable order, as we did with dicts. (Practicality beats purity --
> even if sets are *philosophically* unordered, preserving input order is
> too useful to give up unless we gain something in return.)
>
> But I don't think it is fair or kind to call Brett's argument FUD. At
> the very least it is uncharitable interpretation of Brett's position.
>
> > It would be useful to have a straight-forward way to sort a set into a
> > deterministic ordering but no such feature exists after the Py3K
> > changes (sorted used to do this in Python 2.x).
>
> `sorted()` works fine on homogeneous sets. It is only heterogeneous sets
> that are a problem, and in practice, that usually means None mixed in
> with some other type.
>
>
> --
> Steve
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-leave@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/UAVNSMS3XFAQQ6ZRD27IXPTWZFCHP6M4/
> Code of Conduct: http://python.org/psf/codeofconduct/
>
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
It does make your argument invalid though, since it's based on this assumption that I was asking for a requirement on iteration order (e.g. like dict's iteration order = insertion order guarantee), which is not the case.

Again, determinism means that given all input data and commands fed to a data structure is the same, it will arrive at the same observable state, any time you start from scratch and replay this workload. In the context of sets, "all input data" includes the hashing function itself, and "observable state" also includes the order in which items will be returned if iterated. Note that there is NO requirement here on what that order might be.

Under this definition, sets in Python are deterministic, and _always_ have been. And even outside of Python, there are aren't many cases where people willingly want to use data structures with non deterministic behavior. It usually involves concurrency (in the form of multithreading) and extreme performance requirements. And it's never the "standard" choice even in languages that do offer this. Determinism is generally considered as a valuable property in computation, at least when it is feasible to maintain it.
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/5Z3SOH4JDHRGYF4NTLND4E2UFM7QIXTL/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
> it [randomising iteration order or sets] would also break the invariant
that `repr(data) == repr(data)` but it
is times like this that I feel that it would be worth it.

But it wouldn't -- equality of sets doesn't depend on iteration order. And
even if you are talking about other types (this *is* a thought experiment),
any type in which equality depends on order certainly shouldn't be jiggled
in this way.

And I think this is relevant -- I understand that it's handy that order be
deterministic -- but even if it is made so in the current implementation,
it still may not be consistent across different Python versions and
implementations. Which is why I would think that relying on determinism of
order is a bad idea anyway -- yes, the same computation should yield the
same result -- but what does "same" mean? As far as sets go -- "same" is
equality, and if you are checking anything else about them (such as
iteration order), I think it's a mistake.

Which I think is why Guido (or maybe someone else introduced the idea) was
talking about order-preserving sets -- THAT would provide a larger benefit,
and would change the specification of sets so that you *could* count on it
across versions and implementations, like we can for the modern dict.

So yes, the OP wasn't asking for that, but I think the point is that what
the OP is asking for is not useful enough to do, whereas a larger proposal
might be -- but only if it can be done without loss of performance, which,
alas, no one knows how to do.

Final confusion -- IIUC, even if the hash of None is made constant, you'd
still need to turn off string hash munging to get deterministic order for
sets with strings in them, yes? Which really seems to make this even less
useful.

-CHB


On Tue, Nov 29, 2022 at 3:48 PM Steven D'Aprano <steve@pearwood.info> wrote:

> On Tue, Nov 29, 2022 at 08:51:09PM -0000, Yoni Lavi wrote:
>
> > It does make your argument invalid though, since it's based on this
> > assumption that I was asking for a requirement on iteration order
> > (e.g. like dict's iteration order = insertion order guarantee), which
> > is not the case.
>
> Yoni, I think this answer is disingenious. Over on the Discuss threads,
> you have made it clear that the primary reason why you want hash(None)
> to return a constant value is so that set iteration order will be
> consistent from one run to another.
>
> Sure, you may *also* have some philosophical preference for None having
> a fixed hash, but you have barely mentioned that the same applies to
> Ellipsis and NotImplemented (and then only in passing), and that's not
> your driving motivation for this proposal.
>
> Let's consider a thought-experiment: suppose we agree to your proposal
> to make hash(None) return a constant, but at the same time modify the
> set iteration algorithm so that it starts from a different position each
> time you iterate, making set iteration order even more unpredictable and
> deterministically chaotic than it is now.
>
> Would that satisfy you? From your posts on Discuss, I don't expect that
> you will be happy with this outcome.
>
> Personally, and I don't expect that this will be a popular opinion, I
> wish that data structures that don't guarantee their iteration order
> would actively mix up their iteration order so that people wouldn't be
> tempted to rely on whatever accidental order they happen to get.
>
> Sure, it would cost a little bit of code to implement, but think of the
> savings in unproductive arguments and discussions like this one :-(
>
> It would also break the invariant that `repr(data) == repr(data)` but it
> is times like this that I feel that it would be worth it.
>
>
> --
> Steve
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-leave@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/JIR4WQ5B432BQ4R27EMWZP6HCNFLFPMU/
> Code of Conduct: http://python.org/psf/codeofconduct/
>


--
Christopher Barker, PhD (Chris)

Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
On Wed, 30 Nov 2022 at 10:48, Steven D'Aprano <steve@pearwood.info> wrote:
> Let's consider a thought-experiment: suppose we agree to your proposal
> to make hash(None) return a constant, but at the same time modify the
> set iteration algorithm so that it starts from a different position each
> time you iterate, making set iteration order even more unpredictable and
> deterministically chaotic than it is now.

Let's consider an equally-likely thought experiment: suppose that
every call to set.__iter__() causes a short busy-wait, spinning the
CPU for a little while for absolutely no purpose. There's nothing in
the docs that mandates performance, so it would be perfectly
reasonable for a future version of Python to do this, right? But it's
not something that I would consider *likely*.

And, in fact, modifying the iteration algorithm to be completely
unpredictable would have the same cost: an RNG lookup for no value
whatsoever.

It's notable that set() is about the only data type where this is
possible. You can't randomize dict iteration order (couldn't do that
even before it was locked in, because iterating over keys and values
was always required to return them in the same order), and a lot of
arbitrary-order collections are built on top of dictionaries. So your
thought experiment would be special-casing sets and making them
unnecessarily difficult to work with, just so you can point to that
arbitrariness and say "hah! GOTCHA! You silly fools shouldn't have
been depending on iteration order!!".

> Personally, and I don't expect that this will be a popular opinion, I
> wish that data structures that don't guarantee their iteration order
> would actively mix up their iteration order so that people wouldn't be
> tempted to rely on whatever accidental order they happen to get.

Which data structures are those? Please enumerate.

> Sure, it would cost a little bit of code to implement, but think of the
> savings in unproductive arguments and discussions like this one :-(

So basically, it's a bit of useless code, a bit of useless CPU usage,
all just so you can win an argument. Makes sense.

ChrisA
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/E4EVMJRFTQIKZIQMEHEB2SF7FORMZUO5/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
I stand by what I said, there is absolutely nothing disingenious about it.

> Over on the Discuss threads,
> you have made it clear that the primary reason why you want hash(None)
> to return a constant value is so that set iteration order will be
> consistent from one run to another.
No, it's so the entire program behaves the same the next time I run it on the same data.
There is nothing wrong with what sets are doing, it's the hashing function that injects non-determinism, in this case for no good reason at all.

Again, for the n'th time, read what I wrote above:
Under this definition, sets in Python are deterministic, and _always_ have been.

I was just stating a fact, same as it is in all other languages I know. You are welcome to verify it by the way, if you can.
That's why I don't request any change from set. It works fine and always has.

> you have barely mentioned that the same applies to
> Ellipsis and NotImplemented
That's wrong, I did address that many times. What makes None special is that Optional as it is implemented in Python (`T | None`) relies on `None` for one of its possible states, and `Optional` fields are in common use on key types that perform structural hashing

I'd also change other sentinels to hash to a constant if it were up to me to decide, but there is no practical significance to that so I don't bother dying on that hill (especially now, given the mob reaction). That doesn't make my argument inconsistent in any way. Sorry.

> Let's consider a thought-experiment: suppose we agree to your proposal
> to make hash(None) return a constant, but at the same time modify the
> set iteration algorithm so that it starts from a different position each
> time you iterate, making set iteration order even more unpredictable and
> deterministically chaotic than it is now.

I address this already. I said it is a strongly unlikely FUD scenario. Sets have been behaving deterministically for decades now, across all languages. There's a pretty good reason for that, even if this reason doesn't happen to be codified into the requirements of Python. So as I said, I will take my chances.

Again it's something I already addressed here:
> Yes, in theory it is possible that a new version or implementation of Python will make sets non-deterministic - shuffle
> themselves in the background independently from their history of operations, or what have you
> and then my change loses all > of its value
> Like I said, anything is possible. But I think these last two points are essentially FUD.

_Please_ stop reiterating the same argument over and over and pretending I didn't already make a case against it. It doesn't help, just turns this discussion into a shouting match.
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/BWJSBBQDN3C72OXO5XDD2PIJRCRJ4CG6/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
On Tue, 29 Nov 2022 at 23:46, Steven D'Aprano <steve@pearwood.info> wrote:
>
> On Tue, Nov 29, 2022 at 08:51:09PM -0000, Yoni Lavi wrote:
>
> > It does make your argument invalid though, since it's based on this
> > assumption that I was asking for a requirement on iteration order
> > (e.g. like dict's iteration order = insertion order guarantee), which
> > is not the case.
>
> Yoni, I think this answer is disingenious.

I don't think it is disingenuous. There are just a lot of people
talking past each other and not quite understanding what each person
means because there is confusion about even the intended meaning of
terms like "deterministic". I will expand here with enough detail that
we should hopefully be able to avoid misunderstanding each other.

There are probably other places where you could find mentions of this
in the docs but I just took a quick look in the Python 3.5 docs
(before hash randomisation) to find this mention of dictionary
iteration order:
https://docs.python.org/3.5/library/stdtypes.html#dictionary-view-objects

What it says is
"""
Keys and values are iterated over in an arbitrary order which is
non-random, varies across Python implementations, and depends on the
dictionary’s history of insertions and deletions.
"""
The key point is the use of the term "non-random" which here is
intended to mean that although no particular ordering is guaranteed
you can expect to rerun the same program and get the same result
deterministically. A different version or implementation of Python
might give a different order but rerunning the same program twice
without changing anything should give the same result even if that
result depends in some way on the iteration order of some
dictionaries. I can't immediately find a similar statement about sets
but in practice the same behaviour applied to sets as well. Note
carefully that it is this *narrow* form of determinism that Yoni is
interested in.

Of course there are some caveats to this and the obvious one is that
this statement does not apply if there are some objects that use
identity based hashing so this is not deterministic:

class A:
def __init__(self, data):
self.data = data
def __repr__(self):
return 'A(%s)' % self.data

a1 = A(1)
a2 = A(2)

for a in {a1, a2}:
print(a)

Running this gives:

$ python3.5 t.py
A(2)
A(1)
$ python3.5 t.py
A(1)
A(2)

On the other hand if all of the hashes themselves are deterministic
then the program as a whole will be as well so this is deterministic:

class A:
def __init__(self, data):
self.data = data
def __repr__(self):
return 'A(%s)' % self.data
def __hash__(self):
return hash(self.data)
def __eq__(self):
return self.data == other.data

a1 = A(1)
a2 = A(2)

for a in {a1, a2}:
print(a)

$ python3.5 t.py
A(1)
A(2)
$ python3.5 t.py
A(1)
A(2)

So we have two classes of hashable objects:

1. Those with deterministic hash
2. Those with non-deterministic hash

A program that avoids depending on the iteration order of sets or
dicts containing objects with non-deterministic hash could be
deterministic. It is not the case that the program would depend on the
iteration order for its *correctness* but just that the behaviour of
the program is *reproducible* which is useful in various ways e.g.:

- You could say to someone else "run this code with CPython 3.5 and
you should be able to reproduce exactly what I see when I run the
program". It is common practice e.g. in scientific programming to
record things like random seeds so that someone else can precisely
reproduce the results shown in a paper or some other work and this in
general requires that it is at least possible to make everything
deterministic.

- When debugging it is useful to be able to reproduce an error
condition precisely. Debugging non-deterministic failures can be
extremely difficult. In the same way that you might want to reproduce
correctly functioning code it is also very useful to be able to
reproduce bugs.

I can list more examples but really it shouldn't be necessary to
justify from first principles why determinism in programming is
usually a good thing. There can be reasons sometimes why determinism
is undesired or cannot or should not be guaranteed. It should not be
controversial though to say that all things being equal determinism is
usually a desirable feature and should be preferred by default. I
don't think that the 3.5 docs I quoted above used the words
"non-random" casually: it was an intended feature and people were
aware that breaking that behaviour would be problematic in many
situations.

Of course in Python 3.6 this determinism was broken with the
introduction of hash randomisation for strings. It was considered that
for security purposes it would be better to have some internal
non-deterministic behaviour to guard against attackers. Specifically
the hashes of three types (str, bytes and datetime) were made
non-deterministic between subsequent CPython processes. The effect was
not only to change purely internal state though but also the
observable iteration order of dicts and sets which became
non-deterministic from one run of CPython to another. It was
anticipated at the time that this might be problematic in some
situations (it certainly was!) and so an environment variable
PYTHONHASHSEED was introduced in order to restore determinism for
cases that needed it.

So now if you want to have reproducible behaviour with Python 3.6+ you
also need to fix PYTHONHASHSEED as well as avoiding the use of other
types of non-deterministically hashable objects in sets and dicts.
This is something that I have personally used mainly for the
reproducibility of rare bugs e.g. "to reproduce this you should run
the following Python code using commit abc123 under CPython 3.6 and
with PYTHONHASHSEED=1234".

Subsequently in Python 3.7 dict iteration order was changed to make it
always deterministic by having it not depend on the hash values at
all, with the order depending on the order of insertions into the dict
instead. This introduced a *stronger* guarantee of determinism: now
the ordering could be expected to be reproducible even with different
versions and implementations of Python. For many this seemed to have
resolved the problems of undefined, implementation-defined etc
ordering. However this only applied to dicts and not sets and as of
Python 3.12 any issues about deterministic ordering still remain
wherever sets are used.

The introduction of hash randomisation means that since Python 3.6
there are now three classes of hashable objects:

1. Objects with deterministic hash (int etc)
2. Objects with non-deterministic hash that can be controlled with
PYTHONHASHSEED (bytes, str and datetime).
3. Objects with non-deterministic hash that cannot be controlled
(id-based hashing).

The question in this thread and others is which of these three classes
None should belong to. Although None is just one particular value it
is a very commonly used value and its non-deterministic hash can creep
through to affect the hash of larger data structures that use
recursive hash calls (e.g. a tuple of tuples of ... that somewhere
contains None). Also certain situations such as a type hint like
Optional[T] as referred to by the OP necessarily use None:

$ python -c 'from typing import Optional; print(hash(Optional[int]))'
-7631587549837930667
$ python -c 'from typing import Optional; print(hash(Optional[int]))'
-6488475102642892491

Somehow this affects frozen dataclasses but I haven't used those
myself so I won't demonstrate how to reproduce the problem with them.

Here is a survey of types from builtins:

NoneType bool bytearray bytes complex dict ellipsis enumerate filter
float frozenset int list map memoryview object range reversed set
slice str tuple type zip

We can divide these into the three classes described above plus the
non-hashable types (I haven't checked this in the code, but just
experimenting with calling hash):

1. Deterministic hash:
bool, complex, float, frozenset, int, range, tuple

2. Hash depends on hash seed:
str, bytes

3. Hash depends on id:
NoneType, ellipsis, enumerate, filter, memoryview, object, reversed, zip

4. Non-hashable:
bytearray, dict, list, set, slice

The question here is whether None belongs in class 3 or class 1. To me
it seems clear that there is no advantage in having None be in class 3
except perhaps to save a few simple lines of code:
https://github.com/python/cpython/pull/99541/files

There is however a worthwhile advantage in having None be in class 1.
If None had a deterministic hash then tuples, frozensets etc
consisting of objects with None as well as other objects with
deterministic hash could all have deterministic hash. The behaviour of
iteration order for sets would then be deterministic in the *narrow*
sense that is referred to by the Python 3.5 docs above.

Some have argued that the fact that some types have a seed dependent
hash implies that None should not have a deterministic hash but this
does not follow. It was known at the time that str and bytes were
moved from class 1 to class 2 that it would be problematic to do so
which is precisely why PYTHONHASHSEED was introduced. However
PYTHONHASHSEED does not help here because None is not even in class 2
but rather class 3.

If I was going to label any claim made by anyone in these threads as
disingenuous then it would be the claim that None is somehow not
"special". Firstly many types have deterministic hash so it isn't
really that much of a special property. Secondly None is *clearly* a
special value that is used everywhere! When I open the CPython
interpreter there are already thousands of references to None before I
have even done anything:

>>> import sys
>>> sys.getrefcount(None)
4429

The motivation for this and other threads is to bring determinism in
the *narrow* sense. Others (including me) have made references to
other kinds of determinism that have derailed the threads by
misunderstanding exactly what Yoni is referring to. The *stronger*
sense of determinism would be useful if possible but it is not the
intended topic of these threads.

--
Oscar
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/73KMUPBTS4MIOMPRT3PBQ36HREQFXUUN/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
On Tue, Nov 29, 2022 at 12:58 PM Yoni Lavi <yoni.lavi.p@gmail.com> wrote:

> It does make your argument invalid though,


It makes that single sentence invalid, but the rest of my points still
hold, e.g. the language makes no guarantee about hash consistency between
executions, set order is not guaranteed, etc. are all still valid points.
And as I said earlier, I also agree with the points made in the issue you
linked to, so I'm still -0 on this.

-Brett


> since it's based on this assumption that I was asking for a requirement on
> iteration order (e.g. like dict's iteration order = insertion order
> guarantee), which is not the case.
>
> Again, determinism means that given all input data and commands fed to a
> data structure is the same, it will arrive at the same observable state,
> any time you start from scratch and replay this workload. In the context of
> sets, "all input data" includes the hashing function itself, and
> "observable state" also includes the order in which items will be returned
> if iterated. Note that there is NO requirement here on what that order
> might be.
>
> Under this definition, sets in Python are deterministic, and _always_ have
> been. And even outside of Python, there are aren't many cases where people
> willingly want to use data structures with non deterministic behavior. It
> usually involves concurrency (in the form of multithreading) and extreme
> performance requirements. And it's never the "standard" choice even in
> languages that do offer this. Determinism is generally considered as a
> valuable property in computation, at least when it is feasible to maintain
> it.
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-leave@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/5Z3SOH4JDHRGYF4NTLND4E2UFM7QIXTL/
> Code of Conduct: http://python.org/psf/codeofconduct/
>
Re: A proposal to modify `None` so that it hashes to a constant [ In reply to ]
Thank you for this very clear analysis, Oscar.
It seems to me that this strengthens the OP's case.  I am curious as to
whether others agree.
Best wishes
Rob Cliffe

On 30/11/2022 13:35, Oscar Benjamin wrote:
> On Tue, 29 Nov 2022 at 23:46, Steven D'Aprano <steve@pearwood.info> wrote:
>> On Tue, Nov 29, 2022 at 08:51:09PM -0000, Yoni Lavi wrote:
>>
>>> It does make your argument invalid though, since it's based on this
>>> assumption that I was asking for a requirement on iteration order
>>> (e.g. like dict's iteration order = insertion order guarantee), which
>>> is not the case.
>> Yoni, I think this answer is disingenious.
> I don't think it is disingenuous. There are just a lot of people
> talking past each other and not quite understanding what each person
> means because there is confusion about even the intended meaning of
> terms like "deterministic". I will expand here with enough detail that
> we should hopefully be able to avoid misunderstanding each other.
>
> There are probably other places where you could find mentions of this
> in the docs but I just took a quick look in the Python 3.5 docs
> (before hash randomisation) to find this mention of dictionary
> iteration order:
> https://docs.python.org/3.5/library/stdtypes.html#dictionary-view-objects
>
> What it says is
> """
> Keys and values are iterated over in an arbitrary order which is
> non-random, varies across Python implementations, and depends on the
> dictionary’s history of insertions and deletions.
> """
> The key point is the use of the term "non-random" which here is
> intended to mean that although no particular ordering is guaranteed
> you can expect to rerun the same program and get the same result
> deterministically. A different version or implementation of Python
> might give a different order but rerunning the same program twice
> without changing anything should give the same result even if that
> result depends in some way on the iteration order of some
> dictionaries. I can't immediately find a similar statement about sets
> but in practice the same behaviour applied to sets as well. Note
> carefully that it is this *narrow* form of determinism that Yoni is
> interested in.
>
> Of course there are some caveats to this and the obvious one is that
> this statement does not apply if there are some objects that use
> identity based hashing so this is not deterministic:
>
> class A:
> def __init__(self, data):
> self.data = data
> def __repr__(self):
> return 'A(%s)' % self.data
>
> a1 = A(1)
> a2 = A(2)
>
> for a in {a1, a2}:
> print(a)
>
> Running this gives:
>
> $ python3.5 t.py
> A(2)
> A(1)
> $ python3.5 t.py
> A(1)
> A(2)
>
> On the other hand if all of the hashes themselves are deterministic
> then the program as a whole will be as well so this is deterministic:
>
> class A:
> def __init__(self, data):
> self.data = data
> def __repr__(self):
> return 'A(%s)' % self.data
> def __hash__(self):
> return hash(self.data)
> def __eq__(self):
> return self.data == other.data
>
> a1 = A(1)
> a2 = A(2)
>
> for a in {a1, a2}:
> print(a)
>
> $ python3.5 t.py
> A(1)
> A(2)
> $ python3.5 t.py
> A(1)
> A(2)
>
> So we have two classes of hashable objects:
>
> 1. Those with deterministic hash
> 2. Those with non-deterministic hash
>
> A program that avoids depending on the iteration order of sets or
> dicts containing objects with non-deterministic hash could be
> deterministic. It is not the case that the program would depend on the
> iteration order for its *correctness* but just that the behaviour of
> the program is *reproducible* which is useful in various ways e.g.:
>
> - You could say to someone else "run this code with CPython 3.5 and
> you should be able to reproduce exactly what I see when I run the
> program". It is common practice e.g. in scientific programming to
> record things like random seeds so that someone else can precisely
> reproduce the results shown in a paper or some other work and this in
> general requires that it is at least possible to make everything
> deterministic.
>
> - When debugging it is useful to be able to reproduce an error
> condition precisely. Debugging non-deterministic failures can be
> extremely difficult. In the same way that you might want to reproduce
> correctly functioning code it is also very useful to be able to
> reproduce bugs.
>
> I can list more examples but really it shouldn't be necessary to
> justify from first principles why determinism in programming is
> usually a good thing. There can be reasons sometimes why determinism
> is undesired or cannot or should not be guaranteed. It should not be
> controversial though to say that all things being equal determinism is
> usually a desirable feature and should be preferred by default. I
> don't think that the 3.5 docs I quoted above used the words
> "non-random" casually: it was an intended feature and people were
> aware that breaking that behaviour would be problematic in many
> situations.
>
> Of course in Python 3.6 this determinism was broken with the
> introduction of hash randomisation for strings. It was considered that
> for security purposes it would be better to have some internal
> non-deterministic behaviour to guard against attackers. Specifically
> the hashes of three types (str, bytes and datetime) were made
> non-deterministic between subsequent CPython processes. The effect was
> not only to change purely internal state though but also the
> observable iteration order of dicts and sets which became
> non-deterministic from one run of CPython to another. It was
> anticipated at the time that this might be problematic in some
> situations (it certainly was!) and so an environment variable
> PYTHONHASHSEED was introduced in order to restore determinism for
> cases that needed it.
>
> So now if you want to have reproducible behaviour with Python 3.6+ you
> also need to fix PYTHONHASHSEED as well as avoiding the use of other
> types of non-deterministically hashable objects in sets and dicts.
> This is something that I have personally used mainly for the
> reproducibility of rare bugs e.g. "to reproduce this you should run
> the following Python code using commit abc123 under CPython 3.6 and
> with PYTHONHASHSEED=1234".
>
> Subsequently in Python 3.7 dict iteration order was changed to make it
> always deterministic by having it not depend on the hash values at
> all, with the order depending on the order of insertions into the dict
> instead. This introduced a *stronger* guarantee of determinism: now
> the ordering could be expected to be reproducible even with different
> versions and implementations of Python. For many this seemed to have
> resolved the problems of undefined, implementation-defined etc
> ordering. However this only applied to dicts and not sets and as of
> Python 3.12 any issues about deterministic ordering still remain
> wherever sets are used.
>
> The introduction of hash randomisation means that since Python 3.6
> there are now three classes of hashable objects:
>
> 1. Objects with deterministic hash (int etc)
> 2. Objects with non-deterministic hash that can be controlled with
> PYTHONHASHSEED (bytes, str and datetime).
> 3. Objects with non-deterministic hash that cannot be controlled
> (id-based hashing).
>
> The question in this thread and others is which of these three classes
> None should belong to. Although None is just one particular value it
> is a very commonly used value and its non-deterministic hash can creep
> through to affect the hash of larger data structures that use
> recursive hash calls (e.g. a tuple of tuples of ... that somewhere
> contains None). Also certain situations such as a type hint like
> Optional[T] as referred to by the OP necessarily use None:
>
> $ python -c 'from typing import Optional; print(hash(Optional[int]))'
> -7631587549837930667
> $ python -c 'from typing import Optional; print(hash(Optional[int]))'
> -6488475102642892491
>
> Somehow this affects frozen dataclasses but I haven't used those
> myself so I won't demonstrate how to reproduce the problem with them.
>
> Here is a survey of types from builtins:
>
> NoneType bool bytearray bytes complex dict ellipsis enumerate filter
> float frozenset int list map memoryview object range reversed set
> slice str tuple type zip
>
> We can divide these into the three classes described above plus the
> non-hashable types (I haven't checked this in the code, but just
> experimenting with calling hash):
>
> 1. Deterministic hash:
> bool, complex, float, frozenset, int, range, tuple
>
> 2. Hash depends on hash seed:
> str, bytes
>
> 3. Hash depends on id:
> NoneType, ellipsis, enumerate, filter, memoryview, object, reversed, zip
>
> 4. Non-hashable:
> bytearray, dict, list, set, slice
>
> The question here is whether None belongs in class 3 or class 1. To me
> it seems clear that there is no advantage in having None be in class 3
> except perhaps to save a few simple lines of code:
> https://github.com/python/cpython/pull/99541/files
>
> There is however a worthwhile advantage in having None be in class 1.
> If None had a deterministic hash then tuples, frozensets etc
> consisting of objects with None as well as other objects with
> deterministic hash could all have deterministic hash. The behaviour of
> iteration order for sets would then be deterministic in the *narrow*
> sense that is referred to by the Python 3.5 docs above.
>
> Some have argued that the fact that some types have a seed dependent
> hash implies that None should not have a deterministic hash but this
> does not follow. It was known at the time that str and bytes were
> moved from class 1 to class 2 that it would be problematic to do so
> which is precisely why PYTHONHASHSEED was introduced. However
> PYTHONHASHSEED does not help here because None is not even in class 2
> but rather class 3.
>
> If I was going to label any claim made by anyone in these threads as
> disingenuous then it would be the claim that None is somehow not
> "special". Firstly many types have deterministic hash so it isn't
> really that much of a special property. Secondly None is *clearly* a
> special value that is used everywhere! When I open the CPython
> interpreter there are already thousands of references to None before I
> have even done anything:
>
> >>> import sys
> >>> sys.getrefcount(None)
> 4429
>
> The motivation for this and other threads is to bring determinism in
> the *narrow* sense. Others (including me) have made references to
> other kinds of determinism that have derailed the threads by
> misunderstanding exactly what Yoni is referring to. The *stronger*
> sense of determinism would be useful if possible but it is not the
> intended topic of these threads.
>
> --
> Oscar
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-leave@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/73KMUPBTS4MIOMPRT3PBQ36HREQFXUUN/
> Code of Conduct: http://python.org/psf/codeofconduct/

_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/KVY3SRLMEFTAY7CBJHYDXQ4HCHK6P2O4/
Code of Conduct: http://python.org/psf/codeofconduct/

1 2  View All