Mailing List Archive

Robin Hood Hashing for the perl core
Sorry that this message is about 4 months late for April Fool's Day.

So, this doesn't really fit the "elevator pitch" shape because I don't think
that proposal would really fit the RFC shape either.


I'd like to suggest that we switch the core's hashtable implementation from
the current array of linked-lists (which is not friendly with CPU caches) to
"open addressing" hashing (which is). More specifically, to "Robin Hood
Hashing", and most specifically to the implementation that I wrote for
MoarVM (and Perl 5) - "A Better Hash".

Why...

For the simple explanation, you need to know about "load factor".

The top level of all hashtable implementations is an array of something
(pointers or structure) - an array that gets resized "as needed", but when
that happens is a trade off.

To make things work, you can't use all the entries in this array. Some have
to be empty. The proportion that are full is your load factor. It can't go
above 100%. If your load factor gets too high, you resize the hash.
(Typically you double its size. As there are the same number of keys, your
load factor halves)
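
In rough C terms, the resize rule is nothing more than this (a sketch, with
a maximum load factor of 50% picked purely for illustration, not the core's
actual code or thresholds):

    #include <stddef.h>

    /* Sketch only: grow the top level array when the load factor is too high. */
    static size_t maybe_grow(size_t n_keys, size_t n_slots)
    {
        if (n_keys * 2 > n_slots)   /* load factor above 50%? */
            return n_slots * 2;     /* double: same keys, so half the load factor */
        return n_slots;             /* otherwise keep the current size */
    }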

Assume a typical load factor of 50%. This means that half your top level
array elements are empty. Imagine that each used element is next to an empty
element - it makes these diagrams easier:

The current hash implementation is an array of pointers to linked lists of
HE structs, which in turn point to the key and value:

   |      |
   +------+      +------+       +------+
   | HE * |  --> | Key  | --->  | HEK  |
   +------+      +-    -+       +------+
   | NULL |      | Val  |  -+
   +------+      +-    -+   |   +------+
   |      |      | NULL |   +-> | SV   |
   +------+      +------+       +------+

                   ^^^^  points to next entry in the list if needed.

The proposed implementation "open addressing" eliminates the linked lists by
storing HE structs in a flat array:

   |      |
   +------+       +------+
   | Key  | --->  | HEK  |
   +-    -+       +------+
   | Val  |  -+
   +------+   |   +------+
   | NULL |   +-> | SV   |
   +-    -+       +------+
   | NULL |
   +------+
   |      |

This

1) reduces the overhead by (on average) 1 pointer per hash entry,
2) eliminates one pointer dereference (and potential CPU cache miss)
3) keeps all the hash close in memory (likely more CPU cache hits)

The downsides are

1) more CPU work in maintaining that flat array in order
2) slightly more memory allocation for hashes under about 50 entries
(but usually that memory is allocated but not needed)
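
In struct terms the change is roughly this (a sketch - the field names are
illustrative, not the real HE/HEK definitions):

    /* Sketch only. */
    struct hek;                        /* key storage */
    struct sv;                         /* the value */

    struct chained_entry {             /* current scheme */
        struct chained_entry *next;    /* the extra pointer per entry */
        struct hek *key;
        struct sv  *val;
    };
    /* top level: struct chained_entry **buckets;  (array of list heads) */

    struct flat_entry {                /* proposed scheme */
        struct hek *key;               /* NULL marks an empty slot */
        struct sv  *val;
    };
    /* top level: struct flat_entry *slots;  (entries stored in place) */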



I propose that we use the hash that MoarVM switched to last year.
It's my independent C implementation of the design of Martin Ankerl's hash
at https://github.com/martinus/robin-hood-hashing

I named mine "A Better Hash"

There are a lot of comments explaining how it works at
https://github.com/MoarVM/MoarVM/blob/master/src/core/str_hash_table.h

The MoarVM version has 64 bit hash values, but 32 bit sizes for hashtables
(so at most 4G entries)

Perl 5 currently has 32 bit, 32 bit, and a 31 bit length limit for keys.

I propose that while we're disrupting everything else, we switch to U64
hashes, SSize_t entries and SSize_t keys.

Because this all sounds crazy and impossible and you'd never believe me, I
have a working proof of concept at

https://github.com/nwc10/perl5/tree/smoke-me/nicholas/ABH-POC

It builds DBI, Moose, Moo, JSON::XS, YAML::XS just fine.

As in, all dependencies, all tests pass, vanilla, no changes.

(it's definitely a Proof Of Concept. The code is a mess of 3 styles.
Some dump.c functionality and a few regression tests are disabled.
keys %hash = $value; is not implemented yet. ithreads cloning doesn't
clone the hash iterator. The commit messages could be better.
Apart from that it works. At least on Linux, AIX and HP/UX)

Internally the core is ABH with 64 bit hash values and 64 bit sizes in the C
APIs, but it presents a legacy 32 bit interface to XS, and XS works.

It seems that you only need 32/64 bit compatibility wrappers around 9 core
API functions.
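
For illustration only (these names are invented, not the real entry points),
such a wrapper is little more than:

    #include <stdint.h>

    /* Hypothetical sketch of one 32 bit compatibility wrapper. */
    void *abh_fetch_u64(void *hv, const char *key, int64_t klen, uint64_t hash);

    static inline void *
    hv_fetch_compat32(void *hv, const char *key, int32_t klen, uint32_t hash)
    {
        /* widen the legacy 32 bit arguments and call the 64 bit internals */
        return abh_fetch_u64(hv, key, (int64_t)klen, (uint64_t)hash);
    }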

I *did* need to patch Cpanel::JSON::XS and Sereal to remove
-Werror=declaration-after-statement from their C compiler flags.

Until "this week" that flag was an excellent idea, to avoid accidentally
writing non-C89 code. However, I want to use C99 in my inline wrappers, and
hence I've just discovered one of our future C99 pain points....


Sereal needed t/020_sort_keys.t temporarily disabled, and these changes:

+++ srl_encoder.c 2021-08-02 21:39:24.461838518 +0000
@@ -1203,6 +1203,23 @@
 {
     HE *he;
     const int do_share_keys = HvSHAREKEYS((SV *)src);
+
+#ifdef HvABH
+    Perl_ABH_Table *table= HvABH(src);
+    Perl_ABH_Iterator iterator= Perl_ABH_first(table);
+    while (!Perl_ABH_at_end(table, iterator)) {
+        HE *he= (HE *) Perl_ABH_current(table, iterator);
+        SV *v= HeVAL(he);
+        if (v != &PL_sv_placeholder) {
+            srl_dump_hk(aTHX_ enc, he, do_share_keys);
+            CALL_SRL_DUMP_SV(enc, v);
+            if (--n == 0) {
+                break;
+            }
+        }
+        iterator= Perl_ABH_next(table, iterator);
+    }
+#else
     HE **he_ptr= HvARRAY(src);
     HE **he_end= he_ptr + HvMAX(src) + 1;

@@ -1219,6 +1236,7 @@
             }
         }
     } while ( he_ptr < he_end );
+#endif
 }

 SRL_STATIC_INLINE void
@@ -1508,7 +1526,12 @@
     if ( share_keys && SRL_ENC_HAVE_OPTION(enc, SRL_F_SHARED_HASHKEYS) /* only enter branch if shared hk's enabled */
 #if PERL_VERSION >= 10
          && (!DO_SHARED_HASH_ENTRY_REFCOUNT_CHECK
-             || src->he_valu.hent_refcount > 1)
+#ifdef HvABH
+             || src->hent_hek->hek_refcount > 1
+#else
+             || src->he_valu.hent_refcount > 1
+#endif
+            )
 #endif
         )
     {



I think that we can fake everything in the existing (macro soup) API *except*

1) HvARRAY
2) The type of PL_strtab


HvARRAY can't be replaced or emulated. I think that 14 distros use it:

https://metacpan.org/release/Devel-Arena
https://metacpan.org/release/Devel-MAT-Dumper
https://metacpan.org/release/Devel-SizeMe
https://metacpan.org/release/Glib
https://metacpan.org/release/Hash-Spy
https://metacpan.org/release/Internals-DumpArenas
https://metacpan.org/release/JSPL
https://metacpan.org/release/JavaBin
https://metacpan.org/release/Sereal-Encoder
https://metacpan.org/release/Sub-Disable
https://metacpan.org/release/autovivification
https://metacpan.org/release/mod_perl
https://metacpan.org/release/namespace-clean-xs
https://metacpan.org/release/perlec

grep.metacpan.org can't filter out results from ppport.h (or embed.fnc)
so I might have missed some
(this is a feature of grep.cpan.me that I miss)

I think we can patch them all. Several are just truth tests - I propose
a new macro HvIS_EMPTY(hv) to replace them, which we also add to ppport.h

Others just iterate over all the hash buckets - I propose we add
hv_foreach() to core and ppport.h to handle these cleanly.
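
To give a feel for it, an XS module that currently reads HvARRAY directly
could instead write something like this (a sketch - the names and signatures
are part of the proposal, not yet settled):

    /* Sketch of intended usage; my_callback and my_callback_data are
       placeholders for whatever the module wants to do per entry. */
    if (!HvIS_EMPTY(hv)) {
        /* visit every HE* without knowing anything about bucket layout */
        hv_foreach(aTHX_ hv, my_callback, my_callback_data);
    }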

PL_strtab is mentioned in

https://metacpan.org/release/Devel-Arena
https://metacpan.org/release/Devel-GC-Helper
https://metacpan.org/release/Devel-MAT-Dumper

that I can find, and as it turns out at least in xsh/threads.h, which a
routine search can't find.

The "XS exposure" to these two is much smaller than you might imagine.
As long as the iterator APIs keep working and returning HE*s and HEK*s,
most code is happy.
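
For example, the common fragment below (using the existing public iterator
API) keeps working as-is:

    HE *entry;
    hv_iterinit(hv);
    while ((entry = hv_iternext(hv))) {
        HEK *hek = HeKEY_hek(entry);   /* still a HEK* */
        SV  *val = HeVAL(entry);       /* still an SV* */
        /* ... */
    }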


Why am I suggesting "A Better Hash" and not $other...

1) C implementation exists
2) It's portable
3) It's suitably licensed
4) It supports hash randomisation
5) You can delete during hash iteration without having "problems"


Before suggesting other hashes, please could you check that they meet all 5
of the above criteria. I'm suggesting here something that is known to work
and can be polished and delivered within a month or two part time. It builds
on approximately every CPU and OS known to humanity (including sparc64 and
ia64, so any endianness, and the alignment is 64 bit clean. *Not* tested on
VMS, zOS, the CPU that zOS runs on, hppa or RiscV. Yet.)

We can in turn replace this with something even better once code for that
exists. But until then, don't let the perfect be the enemy of the good.


Randomisation and Iteration

Randomisation is implemented by each hash having a private random 64 bit
"salt", and getting a new salt for each hashtable size doubling. Instead of
perturbing the iteration order, the insert position is perturbed.

For the current Linked List Hashes, the exploitable weakness was that on
hash "grow" linked lists chains are *split* (not completely re-ordered),
meaning that exactly 1 bit of the *key*'s hash value determines the split,
and hence *that* bit is exposed by the iteration order. Without all the
current mitigations to perturb iteration order it is demonstrably possible
to infer the value of the bit in enough keys to work out the interpreter's
global hash seed.

For ABH, all 64 bits of salt (and all 32 or 64 bits of the key's hash value)
are mixed to get the insert position. On hash resize a completely new salt
is used, and hence totally new order is generated. I believe this means that
anyone attempting to attack needs to figure out 96 (ideally 128) bits of
hidden state simultaneously, not just the 1 bit for the Linked List splits.
Hence I believe that it's secure. I'd appreciate someone competent proving
me wrong on this.
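
Schematically the idea is this (a sketch, not the actual ABH mixing code):

    #include <stdint.h>

    /* Combine the per-table salt with the key's hash value, so that the
       insert position reveals neither value on its own. */
    static uint64_t insert_position(uint64_t key_hash, uint64_t salt,
                                    uint64_t n_slots)
    {
        uint64_t mixed = key_hash ^ salt;
        mixed *= UINT64_C(0x9E3779B97F4A7C15);  /* any reasonable 64 bit mixer */
        mixed ^= mixed >> 32;
        return mixed % n_slots;                 /* n_slots assumed non-zero */
    }

On resize the salt changes, so every position is recomputed from scratch.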

This means that internally hash iterators are just an integer array index.
This avoids all sorts of complexity with dangling pointers, allocation, etc.

Iterators actually start at the highest index and count down to 0. This has
three benefits

0) The "end of hash" iterator value is the same for hashes of any size
1) It's cheap to test a CPU register against a constant.
2) It's the reverse direction from internal movements caused by insertion
and deletion

That last one is key - it means that it's possible to delete the hash entry
at the current iterator position without perturbing the iteration.

Perl documents this as supported behaviour. We need to support it.
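
As a sketch of that shape (reusing the flat_entry sketch from earlier, with
invented names):

    #include <stddef.h>

    /* Iterate from the highest index down to 0: the "end of hash" value is
       the same (0) whatever the size of the table. */
    static void iterate_all(struct flat_entry *slots, size_t n_slots,
                            void (*visit)(struct flat_entry *))
    {
        size_t it = n_slots;
        while (it-- > 0) {
            if (slots[it].key != NULL)   /* NULL key marks an empty slot */
                visit(&slots[it]);
        }
    }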


Performance

"The marvel is not that the bear dances well, but that the bear dances at
all."


It's actually a bit better than that. It's not noticeably slower on
anything. I'm not sure what makes a good benchmark. Code I've tried isn't
bottlenecked on hashing. I've tried using cachegrind - we're using more CPU
in the hash code, and more data references, but fewer last level misses.

But some of this could be down to tuning - open addressing hashes have to do
more work for insert and delete, moving entries around to maintain their
invariants. Robin Hood invariants are slightly more work on insert and
delete, but intended to reduce work on lookups. And the approach here of
storing more hash bits in the metadata means more ordering constraints (so
more moving), but fewer lookups outside of the metadata bytes on reading.

As I don't know what to benchmark, I've not tried tuning anything.
(To be fair, I'm exhausted after the big push to get this working at all.
It's why I've been quiet for a few weeks)

The load factor might be too high. There's actually quite a lot that can
be tuned:

* load factor (even make it dynamic as a function of hash size)
* thresholds for resizing
* thresholds for how many hash bits in the metadata
* smaller sizes (official "4", "2" or even "1", and with a load factor of 1.0)

but without meaningful benchmarks (which I don't have) it's not going to
answer any questions.


So for now I need to take a few days break from this, but everyone else is
welcome to play with the code, and figure out how to benchmark it.
(And break the randomisation)

Nicholas Clark
Re: Robin Hood Hashing for the perl core
On Tue, Aug 03, 2021 at 02:27:42PM +0000, Nicholas Clark wrote:

> (it's definitely a Proof Of Concept. The code is a mess of 3 styles.
> Some dump.c functionality and a few regression tests are disabled.
> keys %hash = $value; is not implemented yet. ithreads cloning doesn't
> clone the hash iterator. The commit messages could be better.

and I disabled the non-recursive freeing of hashes in sv_clear, and didn't
write new code to replace it.

> Apart from that it works. At least on Linux, AIX and HP/UX)

And dromedary.p5h.org, which has a gcc/glibc combination that is old
enough to be annoying. Which is a useful test.

Nicholas Clark
Re: Robin Hood Hashing for the perl core
Thank you for posting this detailed introduction to the new hashing implementation. I just wanted to remark on one thing. Randomization of the iteration order isn't a goal in itself, but it serves two purposes: to stop attackers being able to guess the hash seed by observing the iteration order, and to 'teach people a lesson' and make sure they don't inadvertently rely on a hash key ordering that's subject to change.

If the hash implementation fixed an iteration order that didn't give away anything about the hash seed or how to engineer collisions, then it wouldn't need to be randomized. Some hash implementations allow you to return the keys in the order they were originally inserted, without extra cost. I think that would be a very desirable feature in any replacement hash implementation. If the insertion order could be preserved with only a small constant penalty, that would speed up and simplify code where you currently use the much slower Tie::IxHash, and would also speed up code which currently needs 'sort keys', not because alphabetical ordering is particularly important, but just to get deterministic output.

If your new hash implementation doesn't provide an obvious way to preserve the insertion order of keys, and so randomization will continue to be necessary, then I apologize for letting out one of the bees I have in my bonnet.

--
Ed Avis <ed.avis@qmaw.com>
Please ignore confidentiality stuff below this point.

Re: Robin Hood Hashing for the perl core
On Tue, 3 Aug 2021 14:27:42 +0000
Nicholas Clark <nick@ccl4.org> wrote:

> I think we can patch them all. Several are just truth tests - I
> propose a new macro HvIS_EMPTY(hv) to replace them, which we also add
> to ppport.h

Regardless of anything else proposed, that sounds like a good first
step - and can probably bring some useful benefits independently of any
changes to actual hashing structure if it helps decouple modules from
core decisions.

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/
Re: Robin Hood Hashing for the perl core
Nicholas Clark <nick@ccl4.org> wrote:
:I'd like to suggest that we switch the core's hashtable implementation from
:the current array of linked-lists (which is not friendly with CPU caches) to
:"open addressing" hashing (which is). More specifically, to "Robin Hood
:Hashing", and most specifically to the implementation that I wrote for
:MoarVM (and Perl 5) - "A Better Hash".

How separable is it, would it be possible to add the new hash algorithm
separately from the switch from linked lists? The latter sounds like it
necessarily needs a big-bang approach, but we have prior experience in
supporting alternate hash algorithms.

:The downsides are
:
:1) more CPU work in maintaining that flat array in order

Can you quantify that, or describe in more detail what is involved?
Some of my maths code involves hashes with order of 10^8 keys, typically
"seen" hashes (that let me do X if I haven't seen this key before).
Typically for those the only interaction with the hash is a single
line like:
do_X($obj) unless $seen{$obj->serialize}++;

In particular, it sounds like ABH is optimized for cases where searches
are much more common than inserts/deletes, so I think "seen" hashes
in workloads where duplication is rare would be close to worst-case.

Having read through the comments in MoarVM, I find some of it confusing,
but it sounds like the trade-off could be tweaked by controlling the
number of extra hash bits stored. Is that something we could consider
making controllable on a per-hash basis?

:I think we can patch them all. Several are just truth tests - I propose
:a new macro HvIS_EMPTY(hv) to replace them, which we also add to ppport.h

I agree with leonerd that we should do this one right away.

:Others just iterate over all the hash buckets - I propose we add
:hv_foreach() to core and ppport.h to handle these cleanly.

Does that cover all remaining uses of HvARRAY that you found? Ie could
we provide HvIS_EMPTY and hv_foreach and then deprecate HvARRAY?

I'm nervous that there are enough uses of the various affected structures
on CPAN that it seems likely there'll be yet more in the DarkPAN; I think
we should aim to mitigate as much as possible in advance with a combination
of new interfaces and deprecations.

Additionally there is a fair amount of discussion in the notes about
randomization to avoid algorithmic complexity attacks, but I see none
about the converse - turning off the randomization to allow reproducible
behaviour, eg for debugging. I think that's also important, and also
merits some commentary.

Hugo
Re: Robin Hood Hashing for the perl core
If the proposed new hash function is "better" than the existing function
then the existing hash function is not "the best" which raises the
possibility that there might be other "better" hash functions to choose
from. One such possibility might use multi way trees:
https://metacpan.org/dist/Tree-Multi to replace the collision chain with a
more efficient implementation encoded using "Single Instruction, Multiple
Data": https://metacpan.org/pod/Nasm::X86#Avx512-instructions which would
be impervious to hashing attacks regardless of the seed chosen as in the
worst case the hash would degenerate to a single constant height tree of
broad degree.

May I therefore propose that if hash functions are in fact in play: there
should be a competition to find the best hash functions currently available
today and that the best of the best should be made available to all Perl
programmers via a "use" clause that gives each Perl programmer the
opportunity to choose the best function for their specific needs? This
would provide a powerful argument for using Perl - the only language that
puts the needs of its application programmers first and unambiguously ahead
of the needs of the programmers supporting the language.


On Tue, Aug 3, 2021 at 7:32 PM Ed Avis <ed.avis@qmaw.com> wrote:

> Thank you for posting this detailed introduction to the new hashing
> implementation. I just wanted to remark on one thing. Randomization of
> the iteration order isn't a goal in itself, but it serves two purposes: to
> stop attackers being able to guess the hash seed by observing the iteration
> order, and to 'teach people a lesson' and make sure they don't
> inadvertently rely on a hash key ordering that's subject to change.
>
> If the hash implementation fixed an iteration order that didn't give away
> anything about the hash seed or how to engineer collisions, then it
> wouldn't need to be randomized. Some hash implementations allow you to
> return the keys in the order they were originally inserted, without extra
> cost. I think that would be a very desirable feature in any replacement
> hash implementation. If the insertion order could be preserved with only a
> small constant penalty, that would speed up and simplify code where you
> currently use the much slower Tie::IxHash, and would also speed up code
> which currently needs 'sort keys', not because alphabetical ordering is
> particularly important, but just to get deterministic output.
>
> If your new hash implementation doesn't provide an obvious way to preserve
> the insertion order of keys, and so randomization will continue to be
> necessary, then I apologize for letting out one of the bees I have in my
> bonnet.
>
> --
> Ed Avis <ed.avis@qmaw.com>
> Please ignore confidentiality stuff below this point.
>
Re: Robin Hood Hashing for the perl core
Philip R Brenan writes:

> If the proposed new hash function is "better" than the existing function
> then the existing hash function is not "the best" which raises the
> possibility that there might be other "better" hash functions to choose
> from. One such possibility might use multi way trees:

It might. But then again, Nicholas has a proof-of-concept of his
specific proposal, which is different to the possibility of other
algorithms.

> May I therefore propose that if hash functions are in fact in play:

I don't think it works that certain categories of changes are “in play”
or not: more that any proposal is considered on its merits.

> there should be a competition to find the best hash functions
> currently available today and that the best of the best should be made
> available to all Perl programmers via a "use' clause that gives each
> Perl programmer the opportunity to choose the best function for their
> specific needs?

You're suggesting that even if the proposed new hashing algorithm (with
a working proof-of-concept, a list of the fall-out from switching to it
and what needs patching, and an experienced Perl 5 developer seemingly
offering to do the work involved) is agreed to be an improvement on the
current one, that we should decline this proposal for the aspiration of
a competition and switchable algorithms?

A competition that would take effort to run and oversee (and to come up
with entries for), and then further effort implementing each of the
chosen algorithms, and their switching method?

I'm not saying there wouldn't be any advantages in that happening. But
even if volunteers were found to do all that (and those were additional
volunteers, not taking time away from doing other things on Perl) and it
ran very smoothly, it would clearly take longer to do than making the
proposed hash algorithm switch.

And there's the chance that it wouldn't go well, or efforts would fizzle
away without reaching completion.

At which point it'd be several years later and we'd still be on the
current hashing algorithm.

Let's debate a specific proposal on its own merits, and not discard it,
or derail the discussion, in the hope of something bigger and vaguer
that possibly could happen in the future.

> This would provide a powerful argument for using Perl - the only
> language that puts the needs of its application programmers first and
> unambiguously ahead of the needs of the programmers supporting the
> language.

I can think of dozens of ways in which languages' designs prioritize
their users over their own developers. I'm not sure that switchable
hashing algorithms is one that would occur to most users, nor that it's
such a key feature that it would bring hordes running to Perl. One could
even argue the opposite: that users are best served by being freed of
such low-level considerations, trusting the implementation to do
something sensible without their having to even think about it.

But that's largely irrelevant anyway: even if I'm wrong and user-
switchable hashing algorithms is just what Perl needs, that still isn't
a reason to drop the specific proposal being made in this thread.

Smylers
Re: Robin Hood Hashing for the perl core
On Tue, Aug 03, 2021 at 02:27:42PM +0000, Nicholas Clark wrote:
> I'd like to suggest that we switch the core's hashtable implementation from
> the current array of linked-lists (which is not friendly with CPU caches) to
> "open addressing" hashing (which is). More specifically, to "Robin Hood
> Hashing", and most specifically to the implementation that I wrote for
> MoarVM (and Perl 5) - "A Better Hash".

[snip long description]

Maybe my eyes glazed over at the important bit, but I didn't see anything
explaining what the RHH does when there's a collision (i.e. the equivalent
of having more than one entry in the HE linked list in the old way).

> It's actually a bit better than that. It's not noticeably slower on
> anything. I'm not sure what makes a good benchmark. Code I've tried isn't
> bottlenecked on hashing. I've tried using cachegrind - we're using more CPU
> in the hash code, and more data references, but fewer last level misses.

A good start would be:

$ ./perl -Ilib Porting/bench.pl --tests='/^expr::hash/' -j 8 -w /tmp/results -v perlbinary1 perlbinary2 ...

which runs cachegrind on various empty and full loops containing the
snippets of code matched in t/perf/benchmarks and writes the results (with
startup and loop overhead factored out) into the file /tmp/results.
The -j 8 says run 8 benchmarks in parallel. The -v says show what's going on.

Then display the results in various formats, e.g. as percentages:

$ ./perl -Ilib Porting/bench.pl -r /tmp/results

or as raw numbers:

$ ./perl -Ilib Porting/bench.pl -r /tmp/results --raw

--
Overhead, without any fuss, the stars were going out.
-- Arthur C Clarke
Re: Robin Hood Hashing for the perl core
Dave Mitchell <davem@iabyn.com> wrote:
:Maybe my eyes glazed over at the important bit, but I didn't see anything
:explaining what the RHH does when there's a collision (i.e. the equivalent
:of having more than one entry in the HE linked list in the old way).

The crux of it is described by an example in
https://github.com/MoarVM/MoarVM/blob/master/src/core/str_hash_table.h
starting around line 319 (using the top 4 bits of the ascii value as the
hash key, and optionally the bottom 4 as the "extra" bits).
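
For the flavour of it, classic Robin Hood collision handling looks roughly
like this (a sketch, not the MoarVM code, and ignoring ABH's metadata bytes;
hash == 0 marks an empty slot and duplicate keys aren't handled):

    #include <stdint.h>
    #include <stddef.h>

    struct rh_slot { uint64_t hash; int value; };

    /* Assumes the table is never completely full. */
    static void rh_insert(struct rh_slot *slots, size_t n_slots,
                          uint64_t hash, int value)
    {
        size_t idx  = (size_t)(hash % n_slots);  /* ideal slot */
        size_t dist = 0;                         /* how far we have probed */

        for (;;) {
            if (slots[idx].hash == 0) {          /* empty: claim it */
                slots[idx].hash  = hash;
                slots[idx].value = value;
                return;
            }
            /* how far is the current occupant from *its* ideal slot? */
            size_t occupant_dist =
                (idx + n_slots - (size_t)(slots[idx].hash % n_slots)) % n_slots;

            if (occupant_dist < dist) {
                /* the occupant is "richer" (closer to home) than us, so it
                   gives up its slot and we carry on inserting it instead */
                struct rh_slot tmp = slots[idx];
                slots[idx].hash  = hash;
                slots[idx].value = value;
                hash  = tmp.hash;
                value = tmp.value;
                dist  = occupant_dist;
            }
            idx = (idx + 1) % n_slots;           /* linear probe onwards */
            dist++;
        }
    }

Lookups probe the same way, and can stop as soon as they reach a slot whose
occupant is closer to its own ideal position than the distance probed so far.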

Hugo
Re: Robin Hood Hashing for the perl core
On Tue, 3 Aug 2021 at 17:28, Nicholas Clark <nick@ccl4.org> wrote:
>
>HvARRAY can't be replaced or emulated. I think that 14 distros use it:

You've missed XS::Framework and everything that depends on it. Maybe
you've searched only in .xs files? You should add at least .h, but on
a full grep I see more - for example threads:tbb in .cc for a total of
24 distributions.

For our (mine/syber) modules, HvARRAY usage is to avoid
tiedness/boundary checks for internal hashes that we iterate over, so
if a replacement API does them - it's useless for this particular
purpose. We don't mind adopting our code for new internals, they just
shouldn't be "too closed" behind static functions.

Best regards,
Sergey Aleynikov
Re: Robin Hood Hashing for the perl core
On 2021-08-04 1:01 a.m., Smylers wrote:
> Let's debate a specific proposal on its own merits, and not discard it,
> or derail the discussion, in the hope of something bigger and vaguer
> that possibly could happen in the future.

I fully agree.

As the saying goes, the perfect is the enemy of the good.

The current discussion should just focus on whether the current proposed new
solution is significantly better than what Perl has now and thus whether Perl
switching to it would be an improvement.

Consideration of other algorithms can be their own separate discussions if or
when someone comes up with a specific example and makes a case it is better than
the winner of the current discussion, including that someone has already
committed to do the work etc.

-- Darren Duncan
Re: Robin Hood Hashing for the perl core
Op 06-08-2021 om 03:13 schreef Darren Duncan:
> On 2021-08-04 1:01 a.m., Smylers wrote:
>> Let's debate a specific proposal on its own merits, and not discard it,
>> or derail the discussion, in the hope of something bigger and vaguer
>> that possibly could happen in the future.
>
> I fully agree.
>
> As the saying goes, the perfect is the enemy of the good.
>
> The current discussion should just focus on whether the current
> proposed new solution is significantly better than what Perl has now
> and thus whether Perl switching to it would be an improvement.
>
> Consideration of other algorithms can be their own separate
> discussions if or when someone comes up with a specific example and
> makes a case it is better than the winner of the current discussion,
> including that someone has already committed to do the work etc.


Yes, but a question that may be asked is, can this be generalized to
support different algorithms easily, so programmers can choose the
hashing algorithm at runtime. And if it can, is it worth to take such a
broadening of the scope on at the same time as handling the current
proposal. If it is easy to implement, it should be looked at.


I suspect the answer to this is a firm no, but it never hurts to look at
this. Anyone who knows the internals who can say something about that?


M4
Re: Robin Hood Hashing for the perl core
On 2021-08-06 12:20 a.m., Martijn Lievaart wrote:
>
> Op 06-08-2021 om 03:13 schreef Darren Duncan:
>> On 2021-08-04 1:01 a.m., Smylers wrote:
>>> Let's debate a specific proposal on its own merits, and not discard it,
>>> or derail the discussion, in the hope of something bigger and vaguer
>>> that possibly could happen in the future.
>>
>> I fully agree.
>>
>> As the saying goes, the perfect is the enemy of the good.
>>
>> The current discussion should just focus on whether the current proposed new
>> solution is significantly better than what Perl has now and thus whether Perl
>> switching to it would be an improvement.
>>
>> Consideration of other algorithms can be their own separate discussions if or
>> when someone comes up with a specific example and makes a case it is better
>> than the winner of the current discussion, including that someone has already
>> committed to do the work etc.
>
> Yes, but a question that may be asked is, can this be generalized to support
> different algorithms easily, so programmers can choose the hashing algorithm at
> runtime. And if it can, is it worth to take such a broadening of the scope on at
> the same time as handling the current proposal. If it is easy to implement, it
> should be looked at.
>
> I suspect the answer to this is a firm no, but it never hurts to look at this.
> Anyone who knows the internals who can say something about that?

I would say that at this stage the best argument for having that feature of
hashing algorithm choice is:

1. If it turns out from measuring that some common use cases perform way better
with the current algorithm and others perform way better with the new one, and
users would then benefit from being able to choose to tune their scenarios.

2. Also implementing the ability to choose isn't too much effort or has too poor
trade-offs of its own.

3. If going forward the added maintenance cost of two algorithms rather than one
is worth it.

-- Darren Duncan
Re: Robin Hood Hashing for the perl core
On Fri, Aug 06, 2021 at 12:41:56AM -0700, Darren Duncan wrote:
> On 2021-08-06 12:20 a.m., Martijn Lievaart wrote:
> >
> > Op 06-08-2021 om 03:13 schreef Darren Duncan:
> > > On 2021-08-04 1:01 a.m., Smylers wrote:
> > > > Let's debate a specific proposal on its own merits, and not discard it,
> > > > or derail the discussion, in the hope of something bigger and vaguer
> > > > that possibly could happen in the future.
> > >
> > > I fully agree.
> > >
> > > As the saying goes, the perfect is the enemy of the good.

As I wrote in my original message:

We can in turn replace this with something even better once code for that
exists. But until then, don't let the perfect be the enemy of the good.

> > > The current discussion should just focus on whether the current
> > > proposed new solution is significantly better than what Perl has now
> > > and thus whether Perl switching to it would be an improvement.
> > >
> > > Consideration of other algorithms can be their own separate
> > > discussions if or when someone comes up with a specific example and
> > > makes a case it is better than the winner of the current discussion,
> > > including that someone has already committed to do the work etc.

Yes, exactly that.

> > Yes, but a question that may be asked is, can this be generalized to
> > support different algorithms easily, so programmers can choose the
> > hashing algorithm at runtime. And if it can, is it worth to take such a
> > broadening of the scope on at the same time as handling the current
> > proposal. If it is easy to implement, it should be looked at.
> >
> > I suspect the answer to this is a firm no, but it never hurts to look at
> > this. Anyone who knows the internals who can say something about that?

1) I am not confident that it's easy to implement
2) *I* am not in a position to attempt it
3) I'm not in a position to mentor someone else who might attempt it
4) I doubt I'd even be in a position to review the code if someone did
attempt it.
(I estimate that even if someone else started now, it would take them
a couple of months at least, and by then I won't have as much free time
as I currently have)

I'm not stopping anyone doing this, but based on previous experience of
how often large good quality surprise contributions arrive, I doubt that
it is going to happen.

> I would say that at this stage the best argument for having that feature of
> hashing algorithm choice is:
>
> 1. If it turns out from measuring that some common use cases perform way
> better with the current algorithm and others perform way better with the new
> one, and users would then benefit from being able to choose to tune their
> scenarios.
>
> 2. Also implementing the ability to choose isn't too much effort or has too
> poor trade-offs of is own.
>
> 3. If going forward the added maintenance cost of two algorithms rather than
> one is worth it.

That's a good summary.

We have finite programmer time.

(eg - I think all but two messages in this thread so far have been from
people who can consider designs and trade offs, but can't review the C code)

If we want to maintain two (or three) effectively we'd be stealing from
one to feed the other(s).

Much like the previous PSC opinion on sort, I'm going to extrapolate and
suggest that it's only better to have 2+, if it's clear that a significant
amount of the user base are actually going to actively *need* to use
different hashing algorithms.

But if all we really end up with is an "optional feature" that rarely
anyone uses (because it costs them more to learn about it and evaluate it
than they gain by choosing it), then we've wasted effort that could have
been spent better elsewhere.


So if someone else steps up and implements another hash, great.

If someone else steps up and figures out how to switch between 2 (or more),
great.

And most of what they need is the approach taken in this branch (cleaned up)
anyway, to get to the point where it was possible to switch at all.


But don't wait for this. Don't assume that it's going to happen.

But until then we should consider what we know is possible, rather than
distract ourselves with talk about what someone else might do.


Nicholas Clark

PS I doubt I'll reply to all the relevant messages in this thread (or others)
today.
Re: Robin Hood Hashing for the perl core
On Tue, Aug 03, 2021 at 02:36:33PM +0000, Ed Avis wrote:

> If the hash implementation fixed an iteration order that didn't give away anything about the hash seed or how to engineer collisions, then it wouldn't need to be randomized. Some hash implementations allow you to return the keys in the order they were originally inserted, without extra cost. I think that would be a very desirable feature in any replacement hash implementation. If the insertion order could be preserved with only a small constant penalty, that would speed up and simplify code where you currently use the much slower Tie::IxHash, and would also speed up code which currently needs 'sort keys', not because alphabetical ordering is particularly important, but just to get deterministic output.
>
> If your new hash implementation doesn't provide an obvious way to preserve the insertion order of keys, and so randomization will continue to be necessary, then I apologize for letting out one of the bees I have in my bonnet.

It's a good question.

No, it doesn't provide any way to preserve insertion order.

It's inherent in the Robin Hood algorithm that it controls the order within
the hash. It's actually *exploiting* the known order and ordering
constraints it adds to make searching terminate sooner.

So to maintain/remember insertion order would require storing more
information, which increases complexity, and likely size too.

(*likely* size, because by using a different data structure, you can
possibly make size savings elsewhere)

Nicholas Clark
Re: Robin Hood Hashing for the perl core
On Tue, 3 Aug 2021 at 16:28, Nicholas Clark <nick@ccl4.org> wrote:

> Sorry that this message is about 4 months late for April Fool's Day.
>
> So, this doesn't really fit the "elevator pitch" shape because I don't
> think
> that proposal would really fit the RFC shape either.
>
>
> I'd like to suggest that we switch the core's hashtable implementation from
> the current array of linked-lists (which is not friendly with CPU caches)
> to
> "open addressing" hashing (which is). More specifically, to "Robin Hood
> Hashing", and most specifically to the implementation that I wrote for
> MoarVM (and Perl 5) - "A Better Hash".
>

Wow. I had looked into this once, and got discouraged.

I am super happy to hear that you did this and I am in favour.

Nicholas, if there is anything I can do to help please reach out to me at my
work email.


> Why...
>
> For the simple explanation, you need to know about "load factor".
>

When you say this I assume you mean "to understand why open hashing is
preferable you need to know about load factor". That stray pesky comma...
:-)


>
> The top level of all hashtable implementations is an array of something
> (pointers or structure) - an array that gets resized "as needed", but when
> that happens is a trade off.
>
> To make things work, you can't use all the entries in this array. Some have
> to be empty. The proportion that are full is your load factor. It can't go
> above 100%. If you load factor gets too high, you resize the hash.
> (Typically you double its size. As there are the same number of keys, your
> load factor halves)
>
> Assume a typical load factor of 50%. This means that half your top level
> array elements are empty. Imagine that each used element is next to an
> empty
> element - it makes these diagrams easier:
>

Just for the record: in modern perls we hash split on collision only and
when the load factor is about 66%, prior to that change we used to split
when the load factor reached 100%. (100% is considered very high.)


>
> The current hash implementation is an array of pointers to linked lists of
> HE structs, which in turn point to the key and value:
>
> | |
> +------+ +------+ +------+
> | HE * | -> | Key | ---> | HEK |
> +------+ +- -+ +------+
> | NULL | | Val | -+
> +------+ +- -+ | +------+
> | | | NULL | +-> | SV |
> +------+ +------+
>
> ^^^^ points to next entry in the list if needed.
>
> The proposed implementation "open addressing" eliminates the linked lists
> by
> storing HE structs in a flat array:
>
> | |
> +------+ +------+
> | Key | ---> | HEK |
> +- -+ +------+
> | Val | -+
> +------+ | +------+
> | NULL | +-> | SV |
> +- -+ +------+
> | NULL |
> +------+
> | |
>
> This
>
> 1) reduces the overhead by (on average) 1 pointer per hash entry,
> 2) eliminates one pointer dereference (and potential CPU cache miss)
> 3) keeps all the hash close in memory (likely more CPU cache hits)
>
> The downsides are
>
> 1) more CPU work in maintaining that flat array in order
> 2) slightly more memory allocation for hashes under about 50 entries
> (but usually that memory is allocated but not needed)
>

One additional downside of open hashing is that it does not allow
constructing a hash that can have more than 100% load factor. i dont think
we care, but its an important point. With chained hashing you can have a
load factor higher than 100% if you want, and i believe that is what
happens when the existing bucket array gets oversized. I dont think it
matters, in practice such a hash would be insanely large in memory.

I also believe that the penalty of a poor hash function is higher, but I
dont know that I can prove it. (Eg, a poor hash function that produces too
many collisions causes more damage to open addressing than to
chain-addressed) but with SipHash as the standard hash function this isnt a
problem.

I propose that we use the hash that MoarVM switched to last year.
> It's my independent C implementation of the design of Martin Ankerl's hash
> at https://github.com/martinus/robin-hood-hashing
>
> I named mine "A Better Hash"
>
> There are a lot of comments explaining how it works at
> https://github.com/MoarVM/MoarVM/blob/master/src/core/str_hash_table.h
>
> The MoarVM version has 64 bit hash values, but 32 bit sizes for hashtables
> (so at most 4G entries)
>
> Perl 5 currently has 32 bit, 32 bit, and a 31 bit length limit for keys.
>
> I propose that while we're disrupting everything else, we switch to U64
> hashes, SSize_t entries and SSize_t keys.
>
> Because this all sounds crazy and impossible and you'd never believe me, I
> have a working proof of concept at
>
> https://github.com/nwc10/perl5/tree/smoke-me/nicholas/ABH-POC
>
> It builds DBI, Moose, Moo, JSON::XS, YAML::XS just fine.
>
> As in, all dependencies, all tests pass, vanilla, no changes.
>
> (it's definitely a Proof Of Concept. The code is a mess of 3 styles.
> Some dump.c functionality and a few regression tests are disabled.
> keys %hash = $value; is not implemented yet. ithreads cloning doesn't
> clone the hash iterator. The commit messages could be better.
> Apart from that it works. At least on Linux, AIX and HP/UX)
>
> Internally the core is ABH with 64 bit hash values and 64 bit sizes in the
> C
> APIs, but it presents a legacy 32 bit interface to XS, and XS works.
>
> It seems that you only need 32/64 bit compatibility wrappers around 9 core
> API functions.
>
> I *did* need to patch Cpanel::JSON::XS and Sereal to remove
> -Werror-no-declaration-after-statement from their C compiler flags.
>

Why? Doesn't this just make it unportable to AIX and old MS builds?


> Until "this week" that flag was an excellent idea, to avoid accidentally
> writing non-C89 code. However, I want to use C99 in my inline wrappers, and
> hence I've just discovered one of our future C99 pain points....
>
>
> Sereal needed t/020_sort_keys.t temporarily disabled, and these changes:
>

FWIW, I would be happy to take patches to Sereal to ensure that it compiles
properly with the old and new hash. And I would love to try it out and help
to the extent I can.


>
> +++ srl_encoder.c 2021-08-02 21:39:24.461838518 +0000
> @@ -1203,6 +1203,23 @@
> {
> HE *he;
> const int do_share_keys = HvSHAREKEYS((SV *)src);
> +
> +#ifdef HvABH
> + Perl_ABH_Table *table= HvABH(src);
> + Perl_ABH_Iterator iterator= Perl_ABH_first(table);
> + while (!Perl_ABH_at_end(table, iterator)) {
> + HE *he= (HE *) Perl_ABH_current(table, iterator);
> + SV *v= HeVAL(he);
> + if (v != &PL_sv_placeholder) {
> + srl_dump_hk(aTHX_ enc, he, do_share_keys);
> + CALL_SRL_DUMP_SV(enc, v);
> + if (--n == 0) {
> + break;
> + }
> + }
> + iterator= Perl_ABH_next(table, iterator);
> + }
> +#else
> HE **he_ptr= HvARRAY(src);
> HE **he_end= he_ptr + HvMAX(src) + 1;
>
> @@ -1219,6 +1236,7 @@
> }
> }
> } while ( he_ptr < he_end );
> +#endif
> }
>
> SRL_STATIC_INLINE void
> @@ -1508,7 +1526,12 @@
> if ( share_keys && SRL_ENC_HAVE_OPTION(enc,
> SRL_F_SHARED_HASHKEYS) /* only enter branch if shared hk's enabled */
> #if PERL_VERSION >= 10
> && (!DO_SHARED_HASH_ENTRY_REFCOUNT_CHECK
> - || src->he_valu.hent_refcount > 1)
> +#ifdef HvABH
> + || src->hent_hek->hek_refcount > 1
> +#else
> + || src->he_valu.hent_refcount > 1
> +#endif
> + )
> #endif
> )
> {
>
>
>
> I think that we can fake everything in the existing (macro soup) API
> *except*
>
> 1) HvARRAY
>
> 2) The type of PL_strtab
>

I am not sure I get this one, can you explain more.


>
>
> HvARRAY can't be replaced or emulated. I think that 14 distros use it:
>
> https://metacpan.org/release/Devel-Arena
> https://metacpan.org/release/Devel-MAT-Dumper
> https://metacpan.org/release/Devel-SizeMe
> https://metacpan.org/release/Glib
> https://metacpan.org/release/Hash-Spy
> https://metacpan.org/release/Internals-DumpArenas
> https://metacpan.org/release/JSPL
> https://metacpan.org/release/JavaBin
> https://metacpan.org/release/Sereal-Encoder
> https://metacpan.org/release/Sub-Disable
> https://metacpan.org/release/autovivification
> https://metacpan.org/release/mod_perl
> https://metacpan.org/release/namespace-clean-xs
> https://metacpan.org/release/perlec
>
> grep.metacpan.org can't filter out results from ppport.h (or embed.fnc)
> so I might have missed some
> (this is a feature of grep.cpan.me that I miss)
>
> I think we can patch them all. Several are just truth tests - I propose
> a new macro HvIS_EMPTY(hv) to replace them, which we also add to ppport.h
>

Make sense.


> Others just iterate over all the hash buckets - I propose we add
> hv_foreach() to core and ppport.h to handle these cleanly.
>

Ok, the above comment about HvARRAY is explained now I guess.


>
> PL_strtab is mentioned in
>
> https://metacpan.org/release/Devel-Arena
> https://metacpan.org/release/Devel-GC-Helper
> https://metacpan.org/release/Devel-MAT-Dumper
>
> that I can find, and as it turns out at least in xsh/threads.h, which a
> routine search can't find.
>

Its moral existence is pretty important to Sereal, even if we dont mention
it directly. I still dont get the issue with it.


>
> The "XS exposure" to these two is much smaller than you might imagine.
> As long as the iterator APIs keep working and returning HE*s and HEK*s,
> most code is happy.
>
>
> Why am I suggesting "A Better Hash" and not $other...
>
> 1) C implementation exists
> 2) It's portable
> 3) It's suitably licensed
> 4) It supports hash randomisation
> 5) You can delete during hash iteration without having "problems"
>

It can? Interesting. I need to read up on that, I didnt think that was true.


>
>
> Before suggesting other hashes, please could you check that they meet all 5
> of the above criteria. I'm suggesting here something that is known to work
> and can be polished and delivered within a month or two part time. It
> builds
> on approximately every CPU and OS known to humanity (including sparc64 and
> ia64, so any endianness, and the alignment is 64 bit clean. *Not* tested on
> VMS, zOS, the CPU that zOS runs on, hppa or RiscV. Yet.)
>
> We can in turn replace this with something even better once code for that
> exists. But until then, don't let the perfect be the enemy of the good.
>

Agreed. An open hashing algorithm that is plumbed into Perl already is
worth 5 that arent.


>
> Randomisation and Iteration
>
> Randomisation is implemented by each hash having a private random 64 bit
> "salt", and getting a new salt for each hashtable size doubling. Instead of
> perturbing the iteration order, the insert position is perturbed.
>

If you happened to have a link to this in github id be grateful.


>
> For the current Linked List Hashes, the exploitable weakness was that on
> hash "grow" linked lists chains are *split* (not completely re-ordered),
> meaning that exactly 1 bit of the *key*'s hash value determines the split,
> and hence *that* bit is exposed by the iteration order. Without all the
> current mitigations to perturb iteration order it is demonstrably possible
> to infer that value of the bit in enough keys to work out the interpreter's
> global hash seed.
>
> For ABH, all 64 bits of salt (and all 32 or 64 bits of the key's hash
> value)
> are mixed to get the insert position. On hash resize a completely new salt
> is used, and hence totally new order is generated. I believe this means
> that
> anyone attempting to attack needs to figure out 96 (ideally 128) bits of
> hidden state simultaneously, not just the 1 bit for the Linked List splits.
> Hence I believe that it's secure. I'd appreciate someone competent proving
> me wrong on this.
>

Sounds good. Also since the old one was only "a bit more secure than doing
nothing" I dont think you need to hit the target of "demostrably secure"
you just need to be more secure than what I did, and I think that is pretty
easy to do. If you can provide a code link to the relevant logic it would
be helpful.


>
> This means that internally hash iterators are just an integer array index.
> This avoids all sorts of complexity with dangling pointers, allocation,
> etc.
>
> Iterators actually start at the highest index and count down to 0. This has
> three benefits
>
> 0) The "end of hash" iterator value is the same for hashes of any size
> 1) It's cheap to test a CPU register against a constant.
> 2) It's the reverse direction from internal movements caused by insertion
> and deletion
>
> That last one is key - it means that it's possible to delete the hash entry
> at the current iterator position without perturbing the iteration.
>
> Perl documents this as supported behaviour. We need to support it.
>

Yep. If we can insert safely during iteration then that would be very
cool.


>
>
> Performance
>
> "The marvel is not that the bear dances well, but that the bear dances at
> all."
>
>
> It's actually a bit better than that. It's not noticeably slower on
> anything. I'm not sure what makes a good benchmark. Code I've tried isn't
> bottlenecked on hashing. I've tried using cachegrind - we're using more CPU
> in the hash code, and more data references, but fewer last level misses.
>
> But some of this could be down to tuning - open addressing hashes have do
> more work for insert and delete moving entries around to maintain their
> invariants. Robin Hood invariants are slightly more work on insert and
> delete, but intended to reduce work on lookups. And the approach here of
> storing more hash bits in the metadata means more ordering constraints (so
> move moving), but fewer lookups outside of the metadata bytes on reading.
>
> As I don't know what to benchmark, I've not tried tuning anything.
> (To be fair, I'm exhausted after the big push to get this working at all.
> It's why I've been quiet for a few weeks)
>

Test it with a broken hash function, or a precomputed set of keys that
collide in the right way. If it doesnt degrade terribly IMO you are good.


>
> The load factor might be too high. There's actually quite a lot that can
> be tuned:
>
> * load factor (even make it dynamic as a function of hash size)
>

If speed is the objective AFAUI the underlying math for the birthday
paradox says what this should be, IIRC basically something in the 60% to
~66% range. After that the collision probability starts to approach 1
relatively quickly. As far as I know there is no difference on this for
open hashing versus chain, the underlying costs are related to the
probability of collision which is purely determined by how many elements
are filled, if anything open addressing benefits more from a lower load
factor than chain addressing I would think.

* thresholds for resizing
> * thresholds for how many hash bits in the metadata
> * smaller sizes (official "4", "2" or even "1", and with a load factor of
> 1.0)
>
> but without meaningful benchmarks (which I don't have) it's not going to
> answer any questions.
>

I need to read up on the implementation to say anything about that part.
Looking forward to it. :-)


So for now I need to take a few days break from this, but everyone else is
> welcome to play with the code, and figure out how to benchmark it.
> (And break the randomisation)
>

This is super awesome, I will try to find some time to look into more
generally and I will definitely find some time to look at specific parts,
and if you ask me to do something to help ill put it on my priority list.
Thanks a lot, this is one of the more exciting changes for some time from
my POV.

Yves

--
perl -Mre=debug -e "/just|another|perl|hacker/"
Re: Robin Hood Hashing for the perl core
On Fri, 6 Aug 2021 at 10:07, Nicholas Clark <nick@ccl4.org> wrote:

> On Fri, Aug 06, 2021 at 12:41:56AM -0700, Darren Duncan wrote:
> > On 2021-08-06 12:20 a.m., Martijn Lievaart wrote:
> > >
> > > Op 06-08-2021 om 03:13 schreef Darren Duncan:
> > > > On 2021-08-04 1:01 a.m., Smylers wrote:
> > > > > Let's debate a specific proposal on its own merits, and not
> discard it,
> > > > > or derail the discussion, in the hope of something bigger and
> vaguer
> > > > > that possibly could happen in the future.
> > > >
> > > > I fully agree.
> > > >
> > > > As the saying goes, the perfect is the enemy of the good.
>
> As I wrote in my orginal message:
>
> We can in turn replace this with something even better once code for
> that
> exists. But until then, don't let the perfect be the enemy of the good.
>
> +1


> > > > The current discussion should just focus on whether the current
> > > > proposed new solution is significantly better than what Perl has now
> > > > and thus whether Perl switching to it would be an improvement.
> > > >
> > > > Consideration of other algorithms can be their own separate
> > > > discussions if or when someone comes up with a specific example and
> > > > makes a case it is better than the winner of the current discussion,
> > > > including that someone has already committed to do the work etc.
>
> Yes, exactly that.
>
> > > Yes, but a question that may be asked is, can this be generalized to
> > > support different algorithms easily, so programmers can choose the
> > > hashing algorithm at runtime. And if it can, is it worth to take such a
> > > broadening of the scope on at the same time as handling the current
> > > proposal. If it is easy to implement, it should be looked at.
> > >
> > > I suspect the answer to this is a firm no, but it never hurts to look at
> > > this. Anyone who knows the internals who can say something about that?
>
> 1) I am not confident that it's easy to implement
> 2) *I* am not in a position to attempt it
> 3) I'm not in a position to mentor someone else who might attempt it
> 4) I doubt I'd even be in a position to review the code if someone did
> attempt it.
> (I estimate that even if someone else started now, it would take them
> a couple of months at least, and by then I won't have as much free time
> as I currently have)
>
> I'm not stopping anyone doing this, but based on previous experience of
> how often large good quality surprise contributions arrive, I doubt that
> it is going to happen.
>
>
I don't see that we need it either. I can understand why someone might want
to swap out the hash function, but not the hash table.


> > I would say that at this stage the best argument for having that feature of
> > hashing algorithm choice is:
> >
> > 1. If it turns out from measuring that some common use cases perform way
> > better with the current algorithm and others perform way better with the new
> > one, and users would then benefit from being able to choose to tune their
> > scenarios.
> >
> > 2. Also implementing the ability to choose isn't too much effort or has too
> > poor trade-offs of its own.
> >
> > 3. If going forward the added maintenance cost of two algorithms rather than
> > one is worth it.
>
> That's a good summary.
>
> We have finite programmer time.
>
> (eg - I think all but two messages in this thread so far have been from
> people who can consider designs and trade offs, but can't review the C
> code)
>
> If we want to maintain two (or three) effectively we'd be stealing from
> one to feed the other(s).
>
>
As the Sereal patch shows, it also cascades into CPAN.


> Much like the previous PSC opinion on sort, I'm going to extrapolate and
> suggest that it's only better to have 2+, if it's clear that a significant
> amount of the user base are actually going to actively *need* to use
> different hashing algorithms.
>
> But if all we really end up with is an "optional feature" that rarely
> anyone uses (because it costs them more to learn about it and evaluate it
> than they gain by choosing it), then we've wasted effort that could have
> been spent better elsewhere.
>
>
> So if someone else steps up and implements another hash, great.
>

From the POV of Sereal and CPAN I would not say "great" unless it truly
could be swapped out without Sereal breaking.

cheers,
Yves
Re: Robin Hood Hashing for the perl core [ In reply to ]
On Tue, Aug 03, 2021 at 02:27:42PM +0000, Nicholas Clark wrote:
> The proposed implementation "open addressing" eliminates the linked lists by
> storing HE structs in a flat array:

So in effect HvARRAY() changes from being an array of SV pointers to an
array of HE structs?

How will this affect the hv_fetch_ent() and similar functions, which
return a pointer to a HE? In particular what happens to such a pointer if
the HvARRAY() gets reallocated in the meantime. E.g.

HE *he = hv_fetch_ent(hv, ...);
hv_store(hv,...); /* XXX triggers HvARRAY realloc */
do_something_with(he->val); /* using freed memory! */


--
"Emacs isn't a bad OS once you get used to it.
It just lacks a decent editor."
Re: Robin Hood Hashing for the perl core [ In reply to ]
On Sat, Aug 07, 2021 at 11:24:56AM +0100, Dave Mitchell wrote:

Sorry for the delay in replying. I've been on holiday. There were loads of
pumpkins there, but I think that they will require a lot of training up
before they become useful to us:

https://gist.github.com/nwc10/37a85e5272bcfc460b77809ec648d518

> On Tue, Aug 03, 2021 at 02:27:42PM +0000, Nicholas Clark wrote:
> > The proposed implementation "open addressing" eliminates the linked lists by
> > storing HE structs in a flat array:
>
> So in effect HvARRAY() changes from being an array of SV pointers to an
> array of HE structs?

Effectively yes (except that you meant HE struct pointers. It's AvARRAY
that is SV pointers. Well, SV**)

Which isn't the same thing - inasmuch as I can't see any way to fake up
an interface that offers an array of pointers-to (so it can be passed around
as a value of type HE **) when all that exists in memory is a flat array of
HE structs, short of

1) allocating memory for an entire array of HE *
2) initialising it correctly
3) "there you go"
4) freeing it ("later")
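
Concretely, any such fake would have to be roughly the following (a
hypothetical sketch with invented names - 'flat' and 'used' don't exist,
and nothing like this is in the POC):

#include <stdlib.h>   /* malloc */

/* Hypothetical shim: build a temporary HE ** view over the new flat
   array of HE structs, so old-style code could keep its HE ** type. */
HE **
fake_old_hvarray(HE *flat, size_t used)
{
    HE **view = malloc(used * sizeof(HE *));
    if (!view)
        return NULL;
    for (size_t i = 0; i < used; i++)
        view[i] = &flat[i];   /* each pointer aims into the flat array */
    return view;              /* ...and someone has to free() this, "later" */
}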

hence why I eliminated the macro HvARRAY and renamed the entry in the union
that it wrapped. "Break early, break often..." ish.

> How will this affect the hv_fetch_ent() and similar functions, which
> return a pointer to a HE? In particular what happens to such a pointer if
> the HvARRAY() gets reallocated in the meantime. E.g.
>
> HE *he = hv_fetch_ent(hv, ...);
> hv_store(hv,...); /* XXX triggers HvARRAY realloc */
> do_something_with(he->val); /* using freed memory! */

It will go wrong the way that you figured out.

We could work around *this* problem by deferring the memory free onto the
save stack - the entire hash structure is a single block of memory.

But array reallocation only happens on insert, and insert can move things
within the allocated block.
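
So the defensive pattern for XS code would be to copy out whatever is needed
from the HE before the next insert, rather than holding the HE pointer across
it. A sketch using the existing API (hv, keysv, otherkey, newsv and
do_something_with are placeholders, as in the example above):

/* Take what you need from the HE before the next insert, because an
   insert may reallocate, or move entries within, the flat array. */
HE *he = hv_fetch_ent(hv, keysv, 0, 0);
SV *val = he ? HeVAL(he) : NULL;             /* copy the SV* out now */
(void) hv_store_ent(hv, otherkey, newsv, 0); /* may move/realloc HEs */
if (val)
    do_something_with(val);                  /* safe: no stale HE dereference */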


Which got me thinking more generally...

https://grep.metacpan.org/search?q=hv_fetch_ent.%2A%5C%28

thinks that there are 128 distributions using hv_fetch_ent. That's small
enough to audit (but not small enough to do while writing an e-mail) to
see if any do this.

(and 85 using hv_store_ent, most of which use it in void context)


I'm going to guess (but haven't checked this yet) that few to none do this.
Any that do can be patched. (Like with HvARRAY)

At which point we can say all of

1) We have audited *all* of CPAN for expected problems
2) We are confident that the entire public ecosystem is good
3) Roughly 0.0something% of CPAN distributions were affected
4) If you have your own XS code, you should audit for these things [...]


where we have a good feel for how small 0.0something is.


But really this is the only general guarantee that we can make for any
stable release

1) We checked CPAN
2) We figured out the probability of breakage
3) If it's published, we have found it and fixed it already
4) If it's not published, we can't fix that for you


Nicholas Clark
Re: Robin Hood Hashing for the perl core [ In reply to ]
Could I mention one more topic related to hashing, which might turn out to be related to the proposed new implementation. When a reference is used as a hash key, I believe that currently it gets stringified to something like ARRAY(0xabc) and this string is then hashed. (Correct me if I'm wrong.)

For the programmer this is a bit awkward, since you can't get back the original objects by iterating over the keys. You have to use a wrapper like Tie::RefHash, which is slower, and awkward for multidimensional data structures. (There is Tie::RefHash::Nestable for hashes of hashes, but even that doesn't cover a hash of arrays of hashes, and so on.)

But I think it must also be slower. Surely if you have the reference you can treat it as an address (a 64-bit or 32-bit integer) and have a fast way to transform that to a position in the hash data structure. Even if the "address hashing" function ends up being fairly complex, it still has to be faster than converting the address to ASCII text and then hashing the string. Storing the string will also use more memory than storing a pointer plus a tag to say what type of object.

So I am wondering whether, if the hash internals get reworked for greater speed, the treatment of references can also change. Then using references as a hash key would be faster. Moreover, it could be fast to get back the original references rather than a stringified version -- perhaps with a new opcode or new builtin that is like 'keys' but doesn't stringify. Tie::RefHash could change to call that, for compatibility with existing programs, while new programs could use the builtin directly and be even quicker (no tying overhead).

I have asked before whether references could always be preserved in hash keys -- so "Tie::RefHash all of the time". There are some corner cases where this would change the semantics, such as adding to a hash both a string and a reference which stringifies to the same thing, even if you ignore the question of code that explicitly checks ref() on each hash key returned. So for now I'm not advocating this as a global change. However, I think if Perl could provide a fast builtin way to use references as hash keys and get them back unharmed, it would be very useful for new code. Is that something that can be fitted into the new hashing implementation?


Re: Robin Hood Hashing for the perl core [ In reply to ]
On Wed, Aug 18, 2021 at 01:09:47PM +0000, Ed Avis wrote:
> Could I mention one more topic related to hashing, which might turn out to be related to the proposed new implementation. When a reference is used as a hash key, I believe that currently it gets stringified to something like ARRAY(0xabc) and this string is then hashed. (Correct me if I'm wrong.)

Yes, but if it's blessed then the class name is part of that:

$ perl -lwe 'my $obj = bless {}; print $obj'
main=HASH(0x9a0fb0)

and if has overloaded stringification, it can do whatever it wants.

Which becomes relevant...

> For the programmer this is a bit awkward, since you can't get back the original objects by iterating over the keys. You have to use a wrapper like Tie::RefHash, which is slower, and awkward for multidimensional data structures. (There is Tie::RefHash::Nestable for hashes of hashes, but even that doesn't cover a hash of arrays of hashes, and so on.)
>
> But I think it must also be slower. Surely if you have the reference you can treat it as an address (a 64-bit or 32-bit integer) and have a fast way to transform that to a position in the hash data structure. Even if the "address hashing" function ends up being fairly complex, it still has to be faster than converting the address to ASCII text and then hashing the string. Storing the string will also use more memory than storing a pointer plus a tag to say what type of object.
>
> So I am wondering whether, if the hash internals get reworked for greater speed, the treatment of references can also change. Then using references as a hash key would be faster. Moreover, it could be fast to get back the original references rather than a stringified version -- perhaps with a new opcode or new builtin that is like 'keys' but doesn't stringify. Tie::RefHash could change to call that, for compatibility with existing programs, while new programs could use the builtin directly and be even quicker (no tying overhead).
>
> I have asked before whether references could always be preserved in hash keys -- so "Tie::RefHash all of the time". There are some corner cases where this would change the semantics, such as adding to a hash both a string and a reference which stringifies to the same thing, even if you ignore the question of code that explicitly checks ref() on each hash key returned. So for now I'm not advocating this as a global change. However, I think if Perl could provide a fast builtin way to use references as hash keys and get them back unharmed, it would be very useful for new code. Is that something that can be fitted into the new hashing implementation?

Even with the current hash implementation, I think that this would be simple
enough for the extremely constrained corner case of

* exclusively references
* "hashed"/equality determined by address


(and either with the current hash or ABH, it needs more C code, and more
if statements or function pointers or something else equivalent...)

because one would

1) just store the pointer to the SV
2) do the numeric hashing using the pointer address, and comparison using
pointer address


and that "hashing"/"comparison" step is *not* using the same approach as
Perl currently uses for (string) comparison between references.
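
As a sketch of what that numeric hashing step could look like (purely
illustrative - this is the generic 64-bit MurmurHash3 "fmix64" pointer-mixing
finalizer, not code from ABH or the POC):

#include <stdint.h>

/* Illustrative only: hash an SV's address directly instead of
   stringifying it first; equality is then just pointer identity. */
static uint64_t
hash_by_address(const void *sv)
{
    uint64_t h = (uint64_t)(uintptr_t)sv;
    h ^= h >> 33;
    h *= UINT64_C(0xff51afd7ed558ccd);
    h ^= h >> 33;
    h *= UINT64_C(0xc4ceb9fe1a85ec53);
    h ^= h >> 33;
    return h;
}

#define REFKEY_EQ(a, b) ((a) == (b))

Lookups and inserts for such keys never need to build or hash an
"ARRAY(0x...)" string at all, which is where the saving would come from.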


You can't (then) also store regular strings as keys in the same hash.

Unless you then change to using the current stringification of references
as your "key" for hashing (even if you calculate it as needed, or cache it),
as otherwise (as you observe) you could effectively store duplicate keys
(the true reference, and its stringification).


But then you hit another problem:

$ perl -lwe 'my $obj = bless {}; print $obj; bless $obj, "other"; print $obj'
main=HASH(0x1a56fb0)
other=HASH(0x1a56fb0)


"I" can change the string that a reference hashes to, after I insert it into
the hash.

So is that disallowed? If so how?


And this also can't support classes that want their objects to act as
"value" types. So these two are different:

$ perl -MMath::BigInt -MScalar::Util=refaddr -lwe 'my $x = Math::BigInt->new(42); my $y = Math::BigInt->new(42); print $x == $y ? "==" : "!="; print refaddr($x) == refaddr($y) ? "Same" : "different"'
==
different


but that means that

* either one has to use exactly the same object to look up in the hash
* or one has to then implement some form of overloaded comparison


the first makes it extremely inflexible and counterintuitive
the second is not going to be massively performant


I don't see a winning move here.

Nicholas Clark
Re: Robin Hood Hashing for the perl core [ In reply to ]
Thanks for your reply. I imagined that if you use the pointer address as the hash key, you'd need a separate field storing the type of object or the package name it's blessed into. That's needed to always return the same string as when it was first inserted. Having the separate lookup adds overhead, but I think it would still be a win compared to making, hashing, and storing the string each time.
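
Something like this hypothetical entry layout is what I have in mind (names
invented purely to illustrate the idea, not a proposal for the real
structure):

/* Key is the reference's address; the class name recorded at insert
   time lets the old "Foo=HASH(0x...)" string be reproduced if anything
   still wants it. */
struct refkey_entry {
    SV         *key_ref;     /* the reference itself; hashed/compared by address */
    const char *orig_class;  /* package name when inserted, or NULL if unblessed */
    SV         *value;
};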

You make a good point that the stringification of an object can change while its address does not. I would say that using such objects as hash keys is asking for trouble to start with. Of course, it has to be handled if the new implementation is to be 100% compatible. Perhaps one day, the semantics could change a little so that if an object's stringification changes, the stringified hash key also changes but the object retains its position in the hash, since it's hashed by identity. But that would be out of scope for the current work.

The question of "value" types like Math::BigInt is a bigger reason why it's hard to change the current behaviour. I think it's unfortunate; Perl has painted itself into a corner with forcing all hash keys to be strings and losing the original identity, and there's no clear path to change it without a compatibility break, when in my opinion using an object as a hash key and getting back the same object falls under "easy things should be easy".


Re: Robin Hood Hashing for the perl core [ In reply to ]
On Wed, Aug 18, 2021 at 01:09:47PM +0000, Ed Avis wrote:
> But I think it must also be slower. Surely if you have the reference you can treat it as an address (a 64-bit or 32-bit integer) and have a fast way to transform that to a position in the hash data structure. Even if the "address hashing" function ends up being fairly complex, it still has to be faster than converting the address to ASCII text and then hashing the string. Storing the string will also use more memory than storing a pointer plus a tag to say what type of object.

In addition to Nicholas's response, when objects are duplicated into a
new thread the duplicated object has a new address which breaks
the hash, and finding the object given the address.

$ perl -Mthreads -le 'my $x = bless {}; print $x; threads->create(sub { print $x })->join; print $x'
main=HASH(0x55ddc8938278)
main=HASH(0x55ddc8a7b1c0)
main=HASH(0x55ddc8938278)

Tony
Re: Robin Hood Hashing for the perl core [ In reply to ]
Tony Cook wrote:

>In addition to Nicholas's response, when objects are duplicated into a
>new thread the duplicated object has a new address which breaks
>the hash, and finding the object given the address.

But that's equally true with the way hashing works currently. The address changes, so the string representation changes, so you can no longer look up with the same key.

% perl -Mthreads -le 'my $x = bless {}; $h{$x} = "hello"; print $h{$x}; threads->create(sub { print $h{$x} // "undef" })->join'
hello
undef

So while it's a reminder to be careful about object identity when you are using threads, it's not a point of difference between stringifying hash keys and hashing by address.


Re: Robin Hood Hashing for the perl core [ In reply to ]
On Sun, 22 Aug 2021 at 05:40, Ed Avis <ed.avis@qmaw.com> wrote:

> Tony Cook wrote:
>
> >In addition to Nicholas's response, when objects are duplicated into a
> >new thread the duplicated object has a new address which breaks
> >the hash, and finding the object given the address.
>
> But that's equally true with the way hashing works currently. The address
> changes, so the string representation changes, so you can no longer look up
> with the same key.
>
> % perl -Mthreads -le 'my $x = bless {}; $h{$x} = "hello"; print $h{$x};
> threads->create(sub { print $h{$x} // "undef" })->join'
> hello
> undef
>
> So while it's a reminder to be careful about object identity when you are
> using threads, it's not a point of difference between stringifying hash
> keys and hashing by address.
>

It is a demonstration that we don't have a well-defined object identity
function and that having one would be useful. :-)

yves
