Mailing List Archive: Robin Hood Hashing for the perl core

Re: Robin Hood Hashing for the perl core [ In reply to ]

Aug 30, 2021, 5:51 AM

Post #26 of 39 (1327 views)

On Wed, Aug 18, 2021 at 01:09:47PM +0000, Ed Avis wrote:
> I have asked before whether references could always be preserved in hash
> keys

Just a general observation here about proposals to change or improve hash
semantics. I've been meaning for many years (but lacked tuits) to introduce
hash vtables, which would allow for differing hash implementations to be
called on a per-hash basis (so you could declare a hash as being of a
key-order-preserving or ref-as-key-preserving type for example).

Implementing this would then allow proposals such as yours to implemented
as optional extras without risking breaking backwards compatibility or
internals efficiency.

I even reserved a flag bit for it:

#define SVf_FAKE 0x01000000 /* ...
3: HV: informally reserved by DAPM
for vtables

Having such a mechanism would also allow us to remove all the
special-cased code in hv.c for %ENV, tied etc.

--
"There's something wrong with our bloody ships today, Chatfield."
-- Admiral Beatty at the Battle of Jutland, 31st May 1916.

Re: Robin Hood Hashing for the perl core [ In reply to ]

davem at iabyn

Aug 30, 2021, 5:53 AM

Post #27 of 39 (1327 views)

Permalink

On Wed, Aug 18, 2021 at 10:15:48AM +0000, Nicholas Clark wrote:
> > How will this affect the hv_fetch_ent() and similar functions, which
> > return a pointer to a HE? In particular what happens to such a pointer if
> > the HvARRAY() gets reallocated in the meantime. E.g.
> >
> > HE *he = hv_fetch_ent(hv, ...);
> > hv_store(hv,...); /* XXX triggers HvARRAY realloc */
> > do_something_with(he->val); /* using freed memory! */
>
> It will go wrong the way that you figured out.

Perhaps we should update the docs for hv_*ent() now to state that the
returned value has a limited lifetime, and mustn't be used after any
further 'write' to the hash?

--
My Dad used to say 'always fight fire with fire', which is probably why
he got thrown out of the fire brigade.

Re: Robin Hood Hashing for the perl core [ In reply to ]

nick at ccl4

Sep 1, 2021, 6:50 AM

Post #28 of 39 (1326 views)

Permalink

On Mon, Aug 30, 2021 at 01:53:20PM +0100, Dave Mitchell wrote:
> On Wed, Aug 18, 2021 at 10:15:48AM +0000, Nicholas Clark wrote:
> > > How will this affect the hv_fetch_ent() and similar functions, which
> > > return a pointer to a HE? In particular what happens to such a pointer if
> > > the HvARRAY() gets reallocated in the meantime. E.g.
> > >
> > > HE *he = hv_fetch_ent(hv, ...);
> > > hv_store(hv,...); /* XXX triggers HvARRAY realloc */
> > > do_something_with(he->val); /* using freed memory! */
> >
> > It will go wrong the way that you figured out.
>
> Perhaps we should update the docs for hv_*ent() now to state that the
> returned value has a limited lifetime, and mustn't be used after any
> further 'write' to the hash?

Whatever else happens, yes I think that this is a good idea.

But more generally, if you're totally paranoid you can't assume anything
about it if you let the hash out of your sight, as anything you call might
cause the hash entry to be deleted, and hence the HE to be freed.

Your code doesn't own a reference to it.

I doubt that there's anything that evil out there though.

On Mon, Aug 30, 2021 at 01:51:50PM +0100, Dave Mitchell wrote:
> On Wed, Aug 18, 2021 at 01:09:47PM +0000, Ed Avis wrote:
> > I have asked before whether references could always be preserved in hash
> > keys
>
> Just a general observation here about proposals to change or improve hash
> semantics. I've been meaning for many years (but lacked tuits) to introduce
> hash vtables, which would allow for differing hash implementations to be
> called on a per-hash basis (so you could declare a hash as being of a
> key-order-preserving or ref-as-key-preserving type for example).

Yes. At least, "yes" is what I thought at first.

But I think for this case, it won't help most of the time.

"I don't want hash entries to move underneath me" is a constraint wanted by
the call site (the third party XS code)

"This hash can move about"/"This hash does not" would be a property of the
vtable, and you couldn't "turn it on/off" for any hash passed into the XS
code.

> Implementing this would then allow proposals such as yours to implemented
> as optional extras without risking breaking backwards compatibility or
> internals efficiency.

So I don't think that it can implement this, at least not "risk free"

I'm also not convinced about the efficiency. At the lowest level, it's a call
through a function pointer, which I thought that CPUs didn't totally enjoy.

But one level up, I can see (and in that branch) have made code and structure
assumptions based on how the parts of the hash fit together.

Trying to be general, particularly with iterators, is more complex.

I'm also sort of starting to think that there are two "levels" of hash
we're dealing with (anyway)

1) the low(er) level "dumb" hash that stores things, and can be iterated
- a data structure
2) SVt_PVHV, which can trigger arbitrary code on actions (such as tie)
- an object

and the VTABLEs for the two somewhat differ.

For the lower level, I think the minimal efficient API we get away with is

* build
* demolish
* fetch
* LVALUE fetch
* delete returning

and ideally

* pre-grow this hash

store is implemented as LVALUE fetch.
or LVALUE fetch is emulated with fetch and then maybe store
delete returning would be emulated with fetch and then delete

and then we also want iteration, which I think is:

* build
* demolish
* current key
* current value
* advance
* finished?

but really we'd like to have first class iterator (ie 2+ read-only
can exist for the same hash at the same time)

and that API I've sketched out above

1) doesn't map 1 to 1 with the current tied hash API
2) doesn't consider whether we need to have "bless" at the API level

and the current hash API is nuts with tied hashes.

"you call store, and then if it returns 0 you actually need to call set magic
on your value", IIRC

so it's not obvious to me how "tied hashes" fit into what the core needs to
call down to. Or what shape a sane API should be for the core to expose,
such that you can actually pass tied hashes into unsuspecting XS code and,
well, it actually works.

Nicholas Clark

Re: Robin Hood Hashing for the perl core [ In reply to ]

davem at iabyn

Sep 1, 2021, 7:10 AM

Post #29 of 39 (1326 views)

Permalink

On Wed, Sep 01, 2021 at 01:50:29PM +0000, Nicholas Clark wrote:
> On Mon, Aug 30, 2021 at 01:53:20PM +0100, Dave Mitchell wrote:
> > Just a general observation here about proposals to change or improve hash
> > semantics. I've been meaning for many years (but lacked tuits) to introduce
> > hash vtables, which would allow for differing hash implementations to be
> > called on a per-hash basis (so you could declare a hash as being of a
> > key-order-preserving or ref-as-key-preserving type for example).
>
> Yes. At least, "yes" is what I thought at first.
>
> But I think for this case, it won't help most of the time.
>
> "I don't want hash entries to move underneath me" is a constraint wanted by
> the call site (the third party XS code)
>
> "This hash can move about"/"This hash does not" would be a property of the
> vtable, and you couldn't "turn it on/off" for any hash passed into the XS
> code.

Er, I was suggesting vtables would help with ref-preserving key semantics,
not that they would help with HE lifetimes!

> I'm also not convinced about the efficiency. At the lowest level, it's a call
> through a function pointer, which I thought that CPUs didn't totally enjoy.

My "vision" such as it was, is that vtable functions are only called for
the special cases. I.e. whereas hv_common() etc currently have a whole
bunch of code along the lines of:

if (has %ENV magic)
do something special;
if (has tie magic)
do something special;
if (is stash)
do something special;
...etc...

It would now look like:

if (hv->sv_flags & SVf_HV_VTABLE) {
...call the relevant chain of vtable fn pointers ...
return if vtable function indicated a return was appropriate;
}
... handle a plain perl hash ...

So most special-case code is now hidden behind a single conditional.

> But one level up, I can see (and in that branch) have made code and structure
> assumptions based on how the parts of the hash fit together.

I confess I haven't given much thought to the API, especially in terms of
what should be the collection of vtable methods, and especially as to
whether they should be a small collection of low-level functions, mimicing
hv_common() etc, or lots of higher-level ones, mimicing all the
hv_fetch(), hv_fetch_ent() etc functions.

--
In England there is a special word which means the last sunshine
of the summer. That word is "spring".

Re: Robin Hood Hashing for the perl core [ In reply to ]

nick at ccl4

Sep 10, 2021, 1:50 AM

Post #30 of 39 (1325 views)

Permalink

On Wed, Aug 04, 2021 at 02:45:30PM +0100, Dave Mitchell wrote:

> A good start would be:
>
> $ ./perl -Ilib Porting/bench.pl --tests='/^expr::hash/' -j 8 -w /tmp/results -v perlbinary1 perlbinary2 ...
>
> which runs cachegrind on various empty and full loops containing the
> snippets of code matched in t/perf/benchmarks and writes the results (with
> startup and loop overhead factored out) into the file /tmp/results.
> The -j 8 says run 8 benchmarks in parallel. The -v says show what's going on.

It takes more CPU cycles for the hash tests, and about the same everywhere
else.

The tests are all for hashes with 1 or 2 keys? If so, this isn't at all
surprising as I know that it's trading instructions for memory locality,
but also that hashes with few keys are on average slightly larger.

> :The downsides are
> :
> :1) more CPU work in maintaining that flat array in order
>
> Can you quantify that, or describe in more detail what is involved?

quantify is hard, as every benchmark will produce something different.

But for the existing hash, insertion or deletion are operations on a linked
list, so are O(1), and are just pointer updates.

(A linked list with potentially no cache locality, so O(1) isn't sure fire
cheap)

For any Robin Hood hash, insertion or deletion can involve moving other keys
around, so I guess that's internally O(m) where `m` is how many keys you
need to move. So that's memmove of a block rather than pointer updates, *but*
there's good cache locality, so it might be close to free.

("free" for the first order approximation/assumption that cache misses
dominate CPU throughput)

> Some of my maths code involves hashes with order of 10^8 keys, typically
> "seen" hashes (that let me do X if I haven't seen this key before).
> Typically for those the only interaction with the hash is a single
> line like:
> do_X($obj) unless $seen{$obj->serialize}++;
>
> In particular, it sounds like ABH is optimized for cases where searches
> are much more common than inserts/deletes, so I think "seen" hashes
> in workloads where duplication is rare would be close to worst-case.

Yes. I think your reasoning is spot on.

So, after some experimenting, here is a benchmark tuned to flatter it:

#!perl -w
use strict;

my $size = shift;
$size ||= 1e5;

my %hash;

if ($size < 0) {
$size = -$size;
keys %hash = $size;
}

# Avoid a lot of scope entries/exits so that we're mostly timing what we care
# about

++$hash{$_ * 3}
for 1 .. $size;

$hash{$_ * 5} || 0
for 1 .. $size;

$hash{$_ * 5} || 0
for 1 .. $size;

$hash{$_ * 5} || 0
for 1 .. $size;

__END__

Running that with -1e7 (1 million keys, pre-grow hash)

blead at 75595dd289

==8755== I refs: 36,259,338,443
==8755== I1 misses: 34,395
==8755== LLi misses: 7,111
==8755== I1 miss rate: 0.00%
==8755== LLi miss rate: 0.00%
==8755==
==8755== D refs: 16,767,897,454 (11,109,054,413 rd + 5,658,843,041 wr)
==8755== D1 misses: 215,912,461 ( 190,745,837 rd + 25,166,624 wr)
==8755== LLd misses: 206,452,293 ( 183,566,167 rd + 22,886,126 wr)
==8755== D1 miss rate: 1.3% ( 1.7% + 0.4% )
==8755== LLd miss rate: 1.2% ( 1.7% + 0.4% )
==8755==
==8755== LL refs: 215,946,856 ( 190,780,232 rd + 25,166,624 wr)
==8755== LL misses: 206,459,404 ( 183,573,278 rd + 22,886,126 wr)
==8755== LL miss rate: 0.4% ( 0.4% + 0.4% )

ABH with a load factor of 0.675 (effectively what the branch now has as HEAD):

==10595== I refs: 41,940,009,601
==10595== I1 misses: 33,078
==10595== LLi misses: 6,927
==10595== I1 miss rate: 0.00%
==10595== LLi miss rate: 0.00%
==10595==
==10595== D refs: 18,636,752,812 (12,272,777,479 rd + 6,363,975,333 wr)
==10595== D1 misses: 222,023,137 ( 182,141,483 rd + 39,881,654 wr)
==10595== LLd misses: 181,489,351 ( 145,139,062 rd + 36,350,289 wr)
==10595== D1 miss rate: 1.2% ( 1.5% + 0.6% )
==10595== LLd miss rate: 1.0% ( 1.2% + 0.6% )
==10595==
==10595== LL refs: 222,056,215 ( 182,174,561 rd + 39,881,654 wr)
==10595== LL misses: 181,496,278 ( 145,145,989 rd + 36,350,289 wr)
==10595== LL miss rate: 0.3% ( 0.3% + 0.6% )

Lots more CPU instructions.
More CPU cache references, but far fewer last level read misses.

Deliberate choices that favour it here

1) pre-grow the hash
2) more reads than writes
3) lexical hash, so we're timing hash free as well

Note that there's quite a lot more data writing - that's all the memmove()

> Having read through the comments in MoarVM, I find some of it confusing,
> but it sounds like the trade-off could be tweaked by controlling the
> number of extra hash bits stored. Is that something we could consider
> making controllable on a per-hash basis?

Yes. Things that can be tuned

1) load factor
2) extra hash bits store (if any)
3) strategy for "do i need to grow" - try to avoid it by reducing the extra
hash bits, or just bite the bullet and allocate more memory

and

1) load factor could change as a function of size
2) as could extra hash bits

in playing with that benchmark, I found that a fixed load factor of 0.675
seemed to be better than either 0.75 (what I started with) or 0.5

Also, 5 extra hash bits *seemed* to be optimal. 6, 4, and 0 all produce
more cache misses.

(0 would mean that the extra hash bits aren't used, hence the code that
supports them could be deleted. That would mean fewer CPU instructions
generally, but the testing suggested that doing this hurts the data cache
more)

Nicholas Clark

Re: Robin Hood Hashing for the perl core [ In reply to ]

nick at ccl4

Sep 10, 2021, 2:27 AM

Post #31 of 39 (1325 views)

Permalink

On Tue, Aug 03, 2021 at 04:54:56PM +0100, hv@crypt.org wrote:
> Nicholas Clark <nick@ccl4.org> wrote:
> :I'd like to suggest that we switch the core's hashtable implementation from
> :the current array of linked-lists (which is not friendly with CPU caches) to
> :"open addressing" hashing (which is). More specifically, to "Robin Hood
> :Hashing", and most specifically to the implementation that I wrote for
> :MoarVM (and Perl 5) - "A Better Hash".
>
> How separable is it, would it be possible to add the new hash algorithm
> separately from the switch from linked lists? The latter sounds like it
> necessarily needs a big-bang approach, but we have prior experience in
> supporting alternate hash algorithms.

I think there's some confusion here. It doesn't help that "hash" can be
used to mean several things, and I guess "hash algorithm" too.

1) the function that maps arbitrary length keys to fixed size integers
(hash function, generating a hash value)
2) an associative array
(hash table - a data structure)

and I might have missed another one

We also have "seed", for the private hopefully-random integer which is used
to perturb the hash function so that external attackers can't pre-compute
the mapping from key to hash value.

Probably we should actually call that "salt", as that's the term for one
way hashes for passwords.

There isn't a new hash function. But currently where we're using 64 bit
hash function we truncate the output to 32 bits.

I "need" (I think the need is genuine) 64 bit hash values to make the
security work, so I changed the code to stop truncating them.

weaker) hash bucket order is exposed on iteration (it can't be hidden)
stronger) mapping back from bucket order to get the salt requires figuring
out 128 bits of secret random state

but it's only 96 bits if we stick to 32 bit hash values.

(Also, with 32 bit hash values we can't go beyond 2**32 keys in a hash,
whereas I wanted to make this thing future proof.)

And given that 32 -> 64 bit hash values needs some API "fun and games" it
seemed a good time to also move key lengths to size_t (so 64 bit where it
matters)

The API "fun and games" seemed to work just fine - I could build and test
all of CPAN that I tried. So I'm no longer worried about this part.

We could actually just do the 64 bit "API" with the existing linked list
hash structure, changing the relevant 2 members to 64 bits and size_t.

Nicholas Clark

Re: Robin Hood Hashing for the perl core [ In reply to ]

nick at ccl4

Sep 10, 2021, 2:33 AM

Post #32 of 39 (1325 views)

Permalink

On Thu, Aug 05, 2021 at 10:56:24AM +0300, Sergey Aleynikov wrote:
> ??, 3 ???. 2021 ?. ? 17:28, Nicholas Clark <nick@ccl4.org>:
> >
> >HvARRAY can't be replaced or emulated. I think that 14 distros use it:
>
> You've missed XS::Framework and everything that depends on it. Maybe
> you've searched only in .xs files? You should add at least .h, but on
> a full grep I see more - for example threads:tbb in .cc for a total of
> 24 distributions.

I did. The trick seems to be to skip the web interface and go direct to the
command line. This makes it easy to filter out ppport.h, and bulk extract
disto names without worrying about pagers.

Cloning https://github.com/metacpan/metacpan-cpan-extracted "only" takes 15G.

> For our (mine/syber) modules, HvARRAY usage is to avoid
> tiedness/boundary checks for internal hashes that we iterate over, so
> if a replacement API does them - it's useless for this particular
> purpose. We don't mind adopting our code for new internals, they just
> shouldn't be "too closed" behind static functions.

Yes, thanks for the heads up. I looked at XS::Framework, and I think the
existing inline functions exposed in the proof of concept are enough to
cover it, with possibly one more needed. It probably helps that my iterator
design started out with looking at C++ STL code :-)

Nicholas Clark

Re: Robin Hood Hashing for the perl core [ In reply to ]

davem at iabyn

Sep 10, 2021, 2:47 AM

Post #33 of 39 (1325 views)

Permalink

On Fri, Sep 10, 2021 at 08:50:42AM +0000, Nicholas Clark wrote:
> Lots more CPU instructions.
> More CPU cache references, but far fewer last level read misses.

I'd be interested in seeing the changes in number of branches and numbers
of failed branch predictions. My unquantified intuition is that that is
one of the most important performance factors in modern CPUs.

--
Monto Blanco... scorchio!

Re: Robin Hood Hashing for the perl core [ In reply to ]

nick at ccl4

Sep 10, 2021, 4:47 AM

Post #34 of 39 (1325 views)

Permalink

On Fri, Aug 06, 2021 at 11:26:49AM +0200, demerphq wrote:
> On Tue, 3 Aug 2021 at 16:28, Nicholas Clark <nick@ccl4.org> wrote:
>
> > Sorry that this message is about 4 months late for April Fool's Day.
> >
> > So, this doesn't really fit the "elevator pitch" shape because I don't
> > think
> > that proposal would really fit the RFC shape either.
> >
> >
> > I'd like to suggest that we switch the core's hashtable implementation from
> > the current array of linked-lists (which is not friendly with CPU caches)
> > to
> > "open addressing" hashing (which is). More specifically, to "Robin Hood
> > Hashing", and most specifically to the implementation that I wrote for
> > MoarVM (and Perl 5) - "A Better Hash".
> >
>
> Wow. I had looked into this once, and got discouraged.
>
> I am super happy to hear that you did this and I am in favour.

Well, "did" might be an exaggeration. I make a mess, that works. But that
is only the first 80%...

> NIcholas if there is anything I can do to help please reach out to me at my
> work email.

The two things that I know I'm not confident about that I think you (maybe
with $ork colleagues) knew about

1) benchmarking (so load factors, and trade offs there)

In that I see that the current hash table is "split if we get an
insertion collision once the load factor is over 0.66", which isn't
exactly an *obvious* choice... :-)

2) security

the way we split the linked list hash table is a function of 1 bit of the
hash value. If we didn't put effort into perturbing the iteration order to
conceal the bucket order, an attacker who can control the hash table's
keys can infer that single bit, work backwards from there to expose the
internal random state, and then with that pre-compute keys that collide

I *think* that perturbing the iteration order (by traversing the buckets
in an order determined by XOR-ing a per-hash-table random integer) is
exactly equivalent to *storing* the hash table with the buckets in that
perturbed order. Hence (for a single iteration) there's no difference in
exposure.

(Clearly for repeated iterations, a new random integer each time doesn't
expose anything, whereas the same bucket order exposes the same integer)

(Yes, we now have to store that random integer salt for every hash table,
not just hash tables we iterate)

But *if we properly mix* the per-hash-table 64 bit salt to get the bucket
order, am I right in thinking that, an attacker who can control the keys
and observe the iteration order, needs to *simultaneously* solve for 64
bits of random state? And never less.

(Contrast with the linked list split, where it's 1 bit)

hence, is the approach of salting the insertion order no less secure
than what we have currently?

> One additional downside of open hashing is that it does not allow
> constructing a hash that can have more than 100% load factor. i dont think
> we care, but its an important point. With chained hashing you can have a
> load factor higher than 100% if you want, and i believe that is what
> happens when the existing bucket array gets oversized. I dont think it
> matters, in practice such a hash would be insanely large in memory.

Technically you can kind-of-sort-of just, in that one of the standard-ish
tricks for open addressing generally was to "unwrap" the hash table to avoid
a modulo instruction when walking along looking for an empty bucket. Given
you know the maximum probe distance, you allocate that many "extra" buckets
at the end, and just increment your pointer when searching, instead of
increment and then maybe "wrap" it back to the start.

Hence you could make a hash with "4" buckets, and a load factor of 1.25,
relying on the extra buckets to provide enough space. Yes, you'd mostly
be doing a linear search...

But, this *might* be worth doing for us, if there are lots of hashes with
1 to 4 keys. (version objects seem to be 4 keys, MRO lookup tables are 1
key, at worst 2) - "official" size 4 buckets with a load factor of 1.0,
meaning actually allocating 7 buckets and storing up to 4 keys.

> I also believe that the penalty of a poor hash function is higher, but I
> dont know that I can prove it. (Eg, a poor hash function that produces too
> many collisiions causes more damage to open addressing than to
> chain-addressed) but with SipHash as the standard hash function this isnt a
> problem.

I get this impression too. The folks blogging about this stress that "you
need a good hash function" but never go into much more detail than that.

> > I *did* need to patch Cpanel::JSON::XS and Sereal to remove
> > -Werror-no-declaration-after-statement from their C compiler flags.
> >
>
> Why? Doesn't this just make it unprotable to AIX and old MS builds?

No, not AIX :-)

https://www.nntp.perl.org/group/perl.perl5.porters/2021/06/msg260331.html

The PSC proposes that we no longer support VC11.0 and earlier.

...

VC12.0 is capable of *targeting* XP and Windows Server 2003
(build host requirement is Windows 7/Windows Server 2012)

VC12.0 supports C99 (well, the bits that C11 didn't make optional)

All the *nix vendor compilers are fine. Amusingly it's actually VMS that was
late to the party, but now HP have got enough of its act together to be
useful.

> 2) The type of PL_strtab
> >
>
> I am not sure I get this one, can you explain more.

Currently PL_strtab *is* (almost) a regular SV of type SVt_PVHV.
It's marked as having "unshared" hash keys, but you can (in theory) do that
to any other hash.

The only crazy thing about it is that the "value" is a reference count
rather than a pointer.

All the current hash code assumes a fixed HE * structure, hence a fixed size.

Whereas the ABH internals were written to be flexible and handle hashes of
"any size you like, as long as they start with a pointer to the key"

This means that the shared string table entries can be just a pointer to
a key - ie sizeof(void *)

With the reference count moved *into* the HEK.

1) This makes some of the other code simpler - sharing a HEK is cheaper
2) The open addressing array wastes less space for each empty "slot"

but this means that PL_strtab can no longer be passed into regular HV*
functions to get stats from it.

> > Why am I suggesting "A Better Hash" and not $other...
> >
> > 1) C implementation exists
> > 2) It's portable
> > 3) It's suitably licensed
> > 4) It supports hash randomisation
> > 5) You can delete during hash iteration without having "problems"
> >
>
> It can? Interesting. I need to read up on that, I didnt think that was true.

Well, I thought that the "rule" was that it was OK to delete at the position
of the current iterator...

But, actually the current code takes a lot of care to ensure that you can
delete *any* HE from a hash without corrupting the pointers. ie every
delete checks

1) are they deleting the current iterator?
2) are they deleting the target of the next pointer in the current iterator?

and fixes everything up accordingly

> > 0) The "end of hash" iterator value is the same for hashes of any size
> > 1) It's cheap to test a CPU register against a constant.
> > 2) It's the reverse direction from internal movements caused by insertion
> > and deletion
> >
> > That last one is key - it means that it's possible to delete the hash entry
> > at the current iterator position without perturbing the iteration.
> >
> > Perl documents this as supported behaviour. We need to support it.
> >
>
> Yep. If we can insert safely during implementation then that would be very
> cool.

Well, "safe" is possible as "it won't crash" but I think that it's not
viable to also try to make guarantees like "you'll see existing key exactly
once, even if you insert during iteration"

(which technically we can't currently, as a hash split happens while you
are iterating, keys you already saw can get moved into a bucket that has
just been created, which has to be at a future index)

in that, any Robin Hood hash

1) might need to move entries in the array to make space during insert
2) and might need to "split", which is a completely new order

deletion (at the iterator) can be made "safe" because

1) you never need to split
2) if you iterate the array in the reverse direction from the direction that
entries "move" to "close" up a hole, then deletion (at the iterator) can't
move entries across the iterator (no repeats, and no skips)

if someone else has figured this out, I've not found it, or seen it written
down anywhere (clearly).

> If speed is the objective AFAUI the underlying math for the birthday
> paradox says what this should be, IIRC basically something in the %60 to
> ~%66. After that the collision probability starts to approach 1
> relatively quickly. As far as I know there is no difference on this for
> open hashing versus chain, the underlying costs are related to the
> probability of collision which is purely determined by how many elements
> are filled, if anything open addressing benefits more from a lower load
> factor than chain addressing I would think.

Interesting, 60%-66%. I'd read this mail then forgotten the details when
exploring - of 0.5, 0.625 and 0.75, 0.625 is the fastest

(I typed 0.675 in my other messages. That was a typo. My fault. It was
always 0.625 - 5/8)

I'd read that *generally* open addressing benefits from lower load factors,
for most collision strategies, it's better to finish soon.

(Of those that I can remember, linear probing has to keep going until it
finds an empty slot, so higher load factors mean that slot will be further,
and both insert and lookup pay the penalty of that long probe distance.
I think that all the others are similar - you can't know that you have a
hash miss until you reach an empty slot, or hit your max probe distance)

Whereas Robin Hood hashing's probe distance invariants mean that a lookup
can stop *before* it finds an empty slot. Folks blogging about their
exploits in writing hash tables seemed to suggest that it could support
a higher load factor.

But 0.625 seemed good.

> I need to read up on the implementation to say anything about that part.
> Looking forward to it. :-)

It's a long messy branch at

https://github.com/nwc10/perl5/tree/smoke-me/nicholas/ABH-POC

(which was somewhere earlier in this thread.

but has also been rebased a couple of times, and I'm trying to extract
runs of commits from it that stand on their own.)

And I can paste this faster than linking to it:

There are a lot of comments explaining how it works at
https://github.com/MoarVM/MoarVM/blob/master/src/core/str_hash_table.h

which I didn't duplicate for this proof of concept/proof of insanity.

Nicholas Clark

Re: Robin Hood Hashing for the perl core [ In reply to ]

nick at ccl4

Sep 10, 2021, 5:21 AM

Post #35 of 39 (1325 views)

Permalink

On Fri, Sep 10, 2021 at 10:47:48AM +0100, Dave Mitchell wrote:
> On Fri, Sep 10, 2021 at 08:50:42AM +0000, Nicholas Clark wrote:
> > Lots more CPU instructions.
> > More CPU cache references, but far fewer last level read misses.
>
> I'd be interested in seeing the changes in number of branches and numbers
> of failed branch predictions. My unquantified intuition is that that is
> one of the most important performance factors in modern CPUs.

Aha, --branch-sim=yes

blead at 75595dd289

==9493== I refs: 36,259,777,136
==9493== I1 misses: 34,365
==9493== LLi misses: 7,108
==9493== I1 miss rate: 0.00%
==9493== LLi miss rate: 0.00%
==9493==
==9493== D refs: 16,768,164,534 (11,109,175,017 rd + 5,658,989,517 wr)
==9493== D1 misses: 215,977,059 ( 190,801,625 rd + 25,175,434 wr)
==9493== LLd misses: 206,542,187 ( 183,664,107 rd + 22,878,080 wr)
==9493== D1 miss rate: 1.3% ( 1.7% + 0.4% )
==9493== LLd miss rate: 1.2% ( 1.7% + 0.4% )
==9493==
==9493== LL refs: 216,011,424 ( 190,835,990 rd + 25,175,434 wr)
==9493== LL misses: 206,549,295 ( 183,671,215 rd + 22,878,080 wr)
==9493== LL miss rate: 0.4% ( 0.4% + 0.4% )
==9493==
==9493== Branches: 5,829,099,182 ( 5,326,755,821 cond + 502,343,361 ind)
==9493== Mispredicts: 429,396,239 ( 89,391,173 cond + 340,005,066 ind)
==9493== Mispred rate: 7.4% ( 1.7% + 67.7% )

ABH, load factor 0.625 (*not* 0.675 as I mis-reported earlier. Sorry)

==27112== I refs: 41,942,836,004
==27112== I1 misses: 33,102
==27112== LLi misses: 6,921
==27112== I1 miss rate: 0.00%
==27112== LLi miss rate: 0.00%
==27112==
==27112== D refs: 18,637,514,170 (12,273,187,422 rd + 6,364,326,748 wr)
==27112== D1 misses: 222,136,637 ( 182,252,838 rd + 39,883,799 wr)
==27112== LLd misses: 181,543,299 ( 145,196,987 rd + 36,346,312 wr)
==27112== D1 miss rate: 1.2% ( 1.5% + 0.6% )
==27112== LLd miss rate: 1.0% ( 1.2% + 0.6% )
==27112==
==27112== LL refs: 222,169,739 ( 182,285,940 rd + 39,883,799 wr)
==27112== LL misses: 181,550,220 ( 145,203,908 rd + 36,346,312 wr)
==27112== LL miss rate: 0.3% ( 0.3% + 0.6% )
==27112==
==27112== Branches: 6,629,846,443 ( 6,119,713,828 cond + 510,132,615 ind)
==27112== Mispredicts: 447,055,595 ( 127,049,280 cond + 320,006,315 ind)
==27112== Mispred rate: 6.7% ( 2.1% + 62.7% )

If I read that right

More branches.

In absolute numbers, more mispredicts for conditional branches, fewer for
indirect branches. (Where *are* the indirect branches we're hitting so much?),

In absolute numbers slightly more mispredicts over all, but proportionally
fewer.

(Burn more CPU work with the intent to reduce cache misses)

I'm not sure what can be tweaked to be better.
Or what actually matters, and what's fluff.
What is the benchmark?

Nicholas Clark

Re: Robin Hood Hashing for the perl core [ In reply to ]

davem at iabyn

Sep 10, 2021, 5:29 AM

Post #36 of 39 (1325 views)

Permalink

On Fri, Sep 10, 2021 at 12:21:06PM +0000, Nicholas Clark wrote:
> If I read that right
>
> More branches.
>
> In absolute numbers, more mispredicts for conditional branches, fewer for
> indirect branches. (Where *are* the indirect branches we're hitting so much?),

In perl, the biggest indirects are usually the per-op op->op_ppaddr(aTHX)
calls in runops() and any switch() statements.

> In absolute numbers slightly more mispredicts over all, but proportionally
> fewer.
>
> (Burn more CPU work with the intent to reduce cache misses)
>
> I'm not sure what can be tweaked to be better.
> Or what actually matters, and what's fluff.
> What is the benchmark?

Your guess is as good as mine!

--
"But Sidley Park is already a picture, and a most amiable picture too.
The slopes are green and gentle. The trees are companionably grouped at
intervals that show them to advantage. The rill is a serpentine ribbon
unwound from the lake peaceably contained by meadows on which the right
amount of sheep are tastefully arranged." -- Lady Croom, "Arcadia"

Re: Robin Hood Hashing for the perl core [ In reply to ]

me at xenu

Sep 10, 2021, 7:31 AM

Post #37 of 39 (1325 views)

Permalink

On Fri, 10 Sep 2021 12:21:06 +0000
Nicholas Clark <nick@ccl4.org> wrote:

> Aha, --branch-sim=yes
>
> blead at 75595dd289
>
> ==9493== I refs: 36,259,777,136
> ==9493== I1 misses: 34,365
> ==9493== LLi misses: 7,108
> ==9493== I1 miss rate: 0.00%
> ==9493== LLi miss rate: 0.00%
> ==9493==
> ==9493== D refs: 16,768,164,534 (11,109,175,017 rd + 5,658,989,517 wr)
> ==9493== D1 misses: 215,977,059 ( 190,801,625 rd + 25,175,434 wr)
> ==9493== LLd misses: 206,542,187 ( 183,664,107 rd + 22,878,080 wr)
> ==9493== D1 miss rate: 1.3% ( 1.7% + 0.4% )
> ==9493== LLd miss rate: 1.2% ( 1.7% + 0.4% )
> ==9493==
> ==9493== LL refs: 216,011,424 ( 190,835,990 rd + 25,175,434 wr)
> ==9493== LL misses: 206,549,295 ( 183,671,215 rd + 22,878,080 wr)
> ==9493== LL miss rate: 0.4% ( 0.4% + 0.4% )
> ==9493==
> ==9493== Branches: 5,829,099,182 ( 5,326,755,821 cond + 502,343,361 ind)
> ==9493== Mispredicts: 429,396,239 ( 89,391,173 cond + 340,005,066 ind)
> ==9493== Mispred rate: 7.4% ( 1.7% + 67.7% )
>
>
> ABH, load factor 0.625 (*not* 0.675 as I mis-reported earlier. Sorry)
>
> ==27112== I refs: 41,942,836,004
> ==27112== I1 misses: 33,102
> ==27112== LLi misses: 6,921
> ==27112== I1 miss rate: 0.00%
> ==27112== LLi miss rate: 0.00%
> ==27112==
> ==27112== D refs: 18,637,514,170 (12,273,187,422 rd + 6,364,326,748 wr)
> ==27112== D1 misses: 222,136,637 ( 182,252,838 rd + 39,883,799 wr)
> ==27112== LLd misses: 181,543,299 ( 145,196,987 rd + 36,346,312 wr)
> ==27112== D1 miss rate: 1.2% ( 1.5% + 0.6% )
> ==27112== LLd miss rate: 1.0% ( 1.2% + 0.6% )
> ==27112==
> ==27112== LL refs: 222,169,739 ( 182,285,940 rd + 39,883,799 wr)
> ==27112== LL misses: 181,550,220 ( 145,203,908 rd + 36,346,312 wr)
> ==27112== LL miss rate: 0.3% ( 0.3% + 0.6% )
> ==27112==
> ==27112== Branches: 6,629,846,443 ( 6,119,713,828 cond + 510,132,615 ind)
> ==27112== Mispredicts: 447,055,595 ( 127,049,280 cond + 320,006,315 ind)
> ==27112== Mispred rate: 6.7% ( 2.1% + 62.7% )
>
>
> If I read that right
>
> More branches.
>
> In absolute numbers, more mispredicts for conditional branches, fewer for
> indirect branches. (Where *are* the indirect branches we're hitting so much?),
>
> In absolute numbers slightly more mispredicts over all, but proportionally
> fewer.
>
> (Burn more CPU work with the intent to reduce cache misses)
>
> I'm not sure what can be tweaked to be better.
> Or what actually matters, and what's fluff.
> What is the benchmark?
>
>
> Nicholas Clark

Valgrind's branch prediction numbers are almost completely made up and
shouldn't be trusted. According to their documentation, valgrind's
simulated branch predictor is based on early Pentium 4. Modern CPUs work
completely differently!

https://valgrind.org/docs/manual/cg-manual.html#branch-sim

I'm also sceptical about its cache simulation. The only metrics I trust
are instruction and branch counts.

perf and AMD uProf are better tools for measuring those things.
Unfortunately, they aren't nearly as pleasant to use as valgrind.

BTW, while branch prediction numbers are very useful for figuring out
what the CPU is doing, it's important to remember that, in the end,
wall-clock time is the only metric that actually matters.

Re: Robin Hood Hashing for the perl core [ In reply to ]

nick at ccl4

Sep 12, 2021, 1:43 AM

Post #38 of 39 (1325 views)

Permalink

On Fri, Sep 10, 2021 at 04:31:41PM +0200, Tomasz Konojacki wrote:
> On Fri, 10 Sep 2021 12:21:06 +0000
> Nicholas Clark <nick@ccl4.org> wrote:
>
> > Aha, --branch-sim=yes

> Valgrind's branch prediction numbers are almost completely made up and
> shouldn't be trusted. According to their documentation, valgrind's

The cynic in me wants to say:
So that fits quite nicely with benchmarks generally, doesn't it?

> simulated branch predictor is based on early Pentium 4. Modern CPUs work
> completely differently!
>
> https://valgrind.org/docs/manual/cg-manual.html#branch-sim

Thanks for the reminder about how ancient this is, and hence whether it's
really even useful.

> I'm also sceptical about its cache simulation. The only metrics I trust
> are instruction and branch counts.
>
> perf and AMD uProf are better tools for measuring those things.
> Unfortunately, they aren't nearly as pleasant to use as valgrind.

I've not met uProf, but if it's like perf, one needs root access to run it?

Right now I have user-level access to powerful hardware owned by (nice)
other people, but root access only to a couple of small things at home.

> BTW, while branch prediction numbers are very useful for figuring out
> what the CPU is doing, it's important to remember that, in the end,
> wall-clock time is the only metric that actually matters.

Yes, totally. In this case the "problem" being that tuning things (the code
or the benchmark) pushes work between one end member of "more instruction
effort, but stress the data cache less" and the other of "less instruction
effort, but stress the data cache more"

Well, not even two - I guess that can also be broken down into
"do more instruction dispatch but less branching" vs "less/more"
and the data side also looks like "trading more reads for fewer writes"
and even "more L1 work for fewer L3 misses".

So yes, wallclock is the only metric. And that needs benchmarks of the
appropriate shape and scale for the real workloads on the CPUs.

Dammit, which really means "benchmark something close to your production
code".

Nicholas Clark

Re: Robin Hood Hashing for the perl core [ In reply to ]

demerphq at gmail

Sep 23, 2021, 6:07 AM

Post #39 of 39 (1321 views)

Permalink

On Fri, 10 Sept 2021 at 13:48, Nicholas Clark <nick@ccl4.org> wrote:

> 2) security
>

> the way we split the linked list hash table is a function of 1 bit of
> the
> hash value. If we didn't put effort into perturbing the iteration order
> to
> conceal the bucket order, an attacker who can control the hash table's
> keys can infer that single bit, work backwards from there to expose the
> internal random state, and then with that pre-compute keys that collide
>
...

> hence, is the approach of salting the insertion order no less secure
> than what we have currently?
>

When bucket perturbing was introduced we were using a weak hash function.
I added the perturbing to make attacks on the seed harder. Now that we use
Siphash I think bucket perturbing can be ditched as an
unnecessary complication.

> But, this *might* be worth doing for us, if there are lots of hashes with
> 1 to 4 keys. (version objects seem to be 4 keys, MRO lookup tables are 1
> key, at worst 2) - "official" size 4 buckets with a load factor of 1.0,
> meaning actually allocating 7 buckets and storing up to 4 keys.
>

Doesn't sound like a good idea to me. But you know best. :-)

>
> > I also believe that the penalty of a poor hash function is higher, but I
> > dont know that I can prove it. (Eg, a poor hash function that produces
> too
> > many collisiions causes more damage to open addressing than to
> > chain-addressed) but with SipHash as the standard hash function this
> isnt a
> > problem.
>
> I get this impression too. The folks blogging about this stress that "you
> need a good hash function" but never go into much more detail than that.
>
>
Yeah, so don't worry about it. Siphash is a good hash function. The only
problem is that it is not 32 bit. Do we care anymore?

>
> > > I *did* need to patch Cpanel::JSON::XS and Sereal to remove
> > > -Werror-no-declaration-after-statement from their C compiler flags.
> > >
> >
> > Why? Doesn't this just make it unprotable to AIX and old MS builds?
>
> No, not AIX :-)
>
> https://www.nntp.perl.org/group/perl.perl5.porters/2021/06/msg260331.html
>
> The PSC proposes that we no longer support VC11.0 and earlier.

> ...
>
> VC12.0 is capable of *targeting* XP and Windows Server 2003
> (build host requirement is Windows 7/Windows Server 2012)
>

We only introduced the flag so we would stop breaking our windows builds.
IF that isnt a problem anymore then yay.

>
> > 2) The type of PL_strtab
> > >
> >
> > I am not sure I get this one, can you explain more.
>
> Currently PL_strtab *is* (almost) a regular SV of type SVt_PVHV.
> It's marked as having "unshared" hash keys, but you can (in theory) do that
> to any other hash.
>

Yes, that is on my todo list to play with. For certain types of hashes IMO
the PL_strtab is a massive pessimization. We have to do a double store and
a lot of complex crap and we end up having one hidden hash that scales to
the size of the set of unique keys in all hashes, which can chew up a lot
of memory in the background and can play havok with preload loading of
data.

Eg, consider I want to build 10 hashes with 10 million keys each, im
actually building 11 hashes, one with 100M keys, and ten with 10M. AND if I
build of them prefork, and then build a totally different hash with similar
keys, I start COW'ing the hidden hash.

I have dreamed that we would/could change how hashes work, and only use
PL_strtab when they are blessed. So when a hash starts of it would be
unshared. When it gets blessed it gets converted into a shared hash. Eg,
maybe introduce a "no shared_keys;" pragma, or something like that.

The idea of PL_strtab as far as I know was to dedupe keys used in objects
where there is a high level of repetition of the keys. When you are build
large hashes of arbitrary strings, or aggregations in memory with lots of
breakouts that are unlikely to overlap, or are building hashes prefork, the
value is really questionable.

> The only crazy thing about it is that the "value" is a reference count
> rather than a pointer.
>
> All the current hash code assumes a fixed HE * structure, hence a fixed
> size.
>
> Whereas the ABH internals were written to be flexible and handle hashes of
> "any size you like, as long as they start with a pointer to the key"
>
> This means that the shared string table entries can be just a pointer to
> a key - ie sizeof(void *)
>
> With the reference count moved *into* the HEK.
>
> 1) This makes some of the other code simpler - sharing a HEK is cheaper
> 2) The open addressing array wastes less space for each empty "slot"
>
>
> but this means that PL_strtab can no longer be passed into regular HV*
> functions to get stats from it.
>

That probably only breaks some (inconsequential) code I wrote. AFAIK almost
nobody knows about PL_strtab. I wouldnt worry about it.

>
> > If speed is the objective AFAUI the underlying math for the birthday
> > paradox says what this should be, IIRC basically something in the %60 to
> > ~%66. After that the collision probability starts to approach 1
> > relatively quickly. As far as I know there is no difference on this for
> > open hashing versus chain, the underlying costs are related to the
> > probability of collision which is purely determined by how many elements
> > are filled, if anything open addressing benefits more from a lower load
> > factor than chain addressing I would think.
>
> Interesting, 60%-66%. I'd read this mail then forgotten the details when
> exploring - of 0.5, 0.625 and 0.75, 0.625 is the fastest

> (I typed 0.675 in my other messages. That was a typo. My fault. It was
> always 0.625 - 5/8)
>
>
> I'd read that *generally* open addressing benefits from lower load factors,
> for most collision strategies, it's better to finish soon.
>
> (Of those that I can remember, linear probing has to keep going until it
> finds an empty slot, so higher load factors mean that slot will be further,
> and both insert and lookup pay the penalty of that long probe distance.
> I think that all the others are similar - you can't know that you have a
> hash miss until you reach an empty slot, or hit your max probe distance)
>
> Whereas Robin Hood hashing's probe distance invariants mean that a lookup
> can stop *before* it finds an empty slot. Folks blogging about their
> exploits in writing hash tables seemed to suggest that it could support
> a higher load factor.
>
> But 0.625 seemed good.
>

I am pretty sure that you always want a lower load factor with open
addressing than you do with chain addressing, and that for any given load
factor open addressing will be worse in terms of collisions (which is fine,
open addressing is fast because of memory localization and more compact
memory utilization) so probably you should use a lower load factor.

Consider the probability of collision when you insert a new unique key into
a hash that contains 2 keys, both of which collide. With chain addressing
the chance of collision will be 1/k as we will have 2 items in one bucket,
with open addressing it will be 2/k as we will have two buckets occupied.
We know that with only very few entries in the hash table we will start
seeing collisions, so we can assume that most of the time a chain
addressing scheme will use less buckets than an open addressing scheme.

So I would argue you want to tune open addressing to a lower load factor
than chain addressing to get the best insert performance from it.

I am guessing that %50 for open addressing would give the same profile as
2/3rds. (I was going to say you should do the math, but then i did it
myself):

The number of used buckets in a chain addressed hash with 'd' buckets that
contains 'n' items is expected to be: d - (d * ((d-1)/d)**n). (when d=365
this is the same as saying how many distinct birthdays should be with 'n'
people in the room). The probability of a collsion is then (d - (d *
((d-1)/d)**n))/d.

Plugging some numbers into that, it turns out that when n is about 2/3rds
of d you get a collision rate of ~50%: (Which explains why 2/3rds is
favoured).

perl -le'for my $bits (3..24) { my $d= 1<<$bits; my $n= 2*$d/3; my $dm1od=
($d-1)/$d; my $r= ($d - ($d * ($dm1od**$n))); printf "n=%11d d=%11d
chain:%.4f\n", $n, $d, $r/$d;}'
n= 5 d= 8 chain:0.5094
n= 10 d= 16 chain:0.4976
n= 21 d= 32 chain:0.4920
n= 42 d= 64 chain:0.4893
n= 85 d= 128 chain:0.4879
n= 170 d= 256 chain:0.4873
n= 341 d= 512 chain:0.4869
n= 682 d= 1024 chain:0.4868
n= 1365 d= 2048 chain:0.4867
n= 2730 d= 4096 chain:0.4866
n= 5461 d= 8192 chain:0.4866
n= 10922 d= 16384 chain:0.4866
n= 21845 d= 32768 chain:0.4866
n= 43690 d= 65536 chain:0.4866
n= 87381 d= 131072 chain:0.4866
n= 174762 d= 262144 chain:0.4866
n= 349525 d= 524288 chain:0.4866
n= 699050 d= 1048576 chain:0.4866
n= 1398101 d= 2097152 chain:0.4866
n= 2796202 d= 4194304 chain:0.4866
n= 5592405 d= 8388608 chain:0.4866
n= 11184810 d= 16777216 chain:0.4866

Thus to have the same collisiion properties you want to set the max load
factor for the open addressed scheme to 50%. Note that the approximation of
2/3rd I chose produces a rough collision probability of 45%.

Cheers,
Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"