On Fri, Aug 06, 2021 at 11:26:49AM +0200, demerphq wrote:
> On Tue, 3 Aug 2021 at 16:28, Nicholas Clark <nick@ccl4.org> wrote:
>
> > Sorry that this message is about 4 months late for April Fool's Day.
> >
> > So, this doesn't really fit the "elevator pitch" shape because I don't
> > think
> > that proposal would really fit the RFC shape either.
> >
> >
> > I'd like to suggest that we switch the core's hashtable implementation from
> > the current array of linked-lists (which is not friendly with CPU caches)
> > to
> > "open addressing" hashing (which is). More specifically, to "Robin Hood
> > Hashing", and most specifically to the implementation that I wrote for
> > MoarVM (and Perl 5) - "A Better Hash".
> >
>
> Wow. I had looked into this once, and got discouraged.
>
> I am super happy to hear that you did this and I am in favour.
Well, "did" might be an exaggeration. I make a mess, that works. But that
is only the first 80%...
> NIcholas if there is anything I can do to help please reach out to me at my
> work email.
The two things that I know I'm not confident about that I think you (maybe
with $ork colleagues) knew about
1) benchmarking (so load factors, and trade offs there)
In that I see that the current hash table is "split if we get an
insertion collision once the load factor is over 0.66", which isn't
exactly an *obvious* choice... :-)
2) security
the way we split the linked list hash table is a function of 1 bit of the
hash value. If we didn't put effort into perturbing the iteration order to
conceal the bucket order, an attacker who can control the hash table's
keys can infer that single bit, work backwards from there to expose the
internal random state, and then with that pre-compute keys that collide
I *think* that perturbing the iteration order (by traversing the buckets
in an order determined by XOR-ing a per-hash-table random integer) is
exactly equivalent to *storing* the hash table with the buckets in that
perturbed order. Hence (for a single iteration) there's no difference in
exposure.
(Clearly for repeated iterations, a new random integer each time doesn't
expose anything, whereas the same bucket order exposes the same integer)
(Yes, we now have to store that random integer salt for every hash table,
not just hash tables we iterate)
But *if we properly mix* the per-hash-table 64 bit salt to get the bucket
order, am I right in thinking that, an attacker who can control the keys
and observe the iteration order, needs to *simultaneously* solve for 64
bits of random state? And never less.
(Contrast with the linked list split, where it's 1 bit)
hence, is the approach of salting the insertion order no less secure
than what we have currently?
> One additional downside of open hashing is that it does not allow
> constructing a hash that can have more than 100% load factor. i dont think
> we care, but its an important point. With chained hashing you can have a
> load factor higher than 100% if you want, and i believe that is what
> happens when the existing bucket array gets oversized. I dont think it
> matters, in practice such a hash would be insanely large in memory.
Technically you can kind-of-sort-of just, in that one of the standard-ish
tricks for open addressing generally was to "unwrap" the hash table to avoid
a modulo instruction when walking along looking for an empty bucket. Given
you know the maximum probe distance, you allocate that many "extra" buckets
at the end, and just increment your pointer when searching, instead of
increment and then maybe "wrap" it back to the start.
Hence you could make a hash with "4" buckets, and a load factor of 1.25,
relying on the extra buckets to provide enough space. Yes, you'd mostly
be doing a linear search...
But, this *might* be worth doing for us, if there are lots of hashes with
1 to 4 keys. (version objects seem to be 4 keys, MRO lookup tables are 1
key, at worst 2) - "official" size 4 buckets with a load factor of 1.0,
meaning actually allocating 7 buckets and storing up to 4 keys.
> I also believe that the penalty of a poor hash function is higher, but I
> dont know that I can prove it. (Eg, a poor hash function that produces too
> many collisiions causes more damage to open addressing than to
> chain-addressed) but with SipHash as the standard hash function this isnt a
> problem.
I get this impression too. The folks blogging about this stress that "you
need a good hash function" but never go into much more detail than that.
> > I *did* need to patch Cpanel::JSON::XS and Sereal to remove
> > -Werror-no-declaration-after-statement from their C compiler flags.
> >
>
> Why? Doesn't this just make it unprotable to AIX and old MS builds?
No, not AIX :-)
https://www.nntp.perl.org/group/perl.perl5.porters/2021/06/msg260331.html The PSC proposes that we no longer support VC11.0 and earlier.
...
VC12.0 is capable of *targeting* XP and Windows Server 2003
(build host requirement is Windows 7/Windows Server 2012)
VC12.0 supports C99 (well, the bits that C11 didn't make optional)
All the *nix vendor compilers are fine. Amusingly it's actually VMS that was
late to the party, but now HP have got enough of its act together to be
useful.
> 2) The type of PL_strtab
> >
>
> I am not sure I get this one, can you explain more.
Currently PL_strtab *is* (almost) a regular SV of type SVt_PVHV.
It's marked as having "unshared" hash keys, but you can (in theory) do that
to any other hash.
The only crazy thing about it is that the "value" is a reference count
rather than a pointer.
All the current hash code assumes a fixed HE * structure, hence a fixed size.
Whereas the ABH internals were written to be flexible and handle hashes of
"any size you like, as long as they start with a pointer to the key"
This means that the shared string table entries can be just a pointer to
a key - ie sizeof(void *)
With the reference count moved *into* the HEK.
1) This makes some of the other code simpler - sharing a HEK is cheaper
2) The open addressing array wastes less space for each empty "slot"
but this means that PL_strtab can no longer be passed into regular HV*
functions to get stats from it.
> > Why am I suggesting "A Better Hash" and not $other...
> >
> > 1) C implementation exists
> > 2) It's portable
> > 3) It's suitably licensed
> > 4) It supports hash randomisation
> > 5) You can delete during hash iteration without having "problems"
> >
>
> It can? Interesting. I need to read up on that, I didnt think that was true.
Well, I thought that the "rule" was that it was OK to delete at the position
of the current iterator...
But, actually the current code takes a lot of care to ensure that you can
delete *any* HE from a hash without corrupting the pointers. ie every
delete checks
1) are they deleting the current iterator?
2) are they deleting the target of the next pointer in the current iterator?
and fixes everything up accordingly
> > 0) The "end of hash" iterator value is the same for hashes of any size
> > 1) It's cheap to test a CPU register against a constant.
> > 2) It's the reverse direction from internal movements caused by insertion
> > and deletion
> >
> > That last one is key - it means that it's possible to delete the hash entry
> > at the current iterator position without perturbing the iteration.
> >
> > Perl documents this as supported behaviour. We need to support it.
> >
>
> Yep. If we can insert safely during implementation then that would be very
> cool.
Well, "safe" is possible as "it won't crash" but I think that it's not
viable to also try to make guarantees like "you'll see existing key exactly
once, even if you insert during iteration"
(which technically we can't currently, as a hash split happens while you
are iterating, keys you already saw can get moved into a bucket that has
just been created, which has to be at a future index)
in that, any Robin Hood hash
1) might need to move entries in the array to make space during insert
2) and might need to "split", which is a completely new order
deletion (at the iterator) can be made "safe" because
1) you never need to split
2) if you iterate the array in the reverse direction from the direction that
entries "move" to "close" up a hole, then deletion (at the iterator) can't
move entries across the iterator (no repeats, and no skips)
if someone else has figured this out, I've not found it, or seen it written
down anywhere (clearly).
> If speed is the objective AFAUI the underlying math for the birthday
> paradox says what this should be, IIRC basically something in the %60 to
> ~%66. After that the collision probability starts to approach 1
> relatively quickly. As far as I know there is no difference on this for
> open hashing versus chain, the underlying costs are related to the
> probability of collision which is purely determined by how many elements
> are filled, if anything open addressing benefits more from a lower load
> factor than chain addressing I would think.
Interesting, 60%-66%. I'd read this mail then forgotten the details when
exploring - of 0.5, 0.625 and 0.75, 0.625 is the fastest
(I typed 0.675 in my other messages. That was a typo. My fault. It was
always 0.625 - 5/8)
I'd read that *generally* open addressing benefits from lower load factors,
for most collision strategies, it's better to finish soon.
(Of those that I can remember, linear probing has to keep going until it
finds an empty slot, so higher load factors mean that slot will be further,
and both insert and lookup pay the penalty of that long probe distance.
I think that all the others are similar - you can't know that you have a
hash miss until you reach an empty slot, or hit your max probe distance)
Whereas Robin Hood hashing's probe distance invariants mean that a lookup
can stop *before* it finds an empty slot. Folks blogging about their
exploits in writing hash tables seemed to suggest that it could support
a higher load factor.
But 0.625 seemed good.
> I need to read up on the implementation to say anything about that part.
> Looking forward to it. :-)
It's a long messy branch at
https://github.com/nwc10/perl5/tree/smoke-me/nicholas/ABH-POC (which was somewhere earlier in this thread.
but has also been rebased a couple of times, and I'm trying to extract
runs of commits from it that stand on their own.)
And I can paste this faster than linking to it:
There are a lot of comments explaining how it works at
https://github.com/MoarVM/MoarVM/blob/master/src/core/str_hash_table.h which I didn't duplicate for this proof of concept/proof of insanity.
Nicholas Clark