Sorry that this message is about 4 months late for April Fool's Day.
So, this doesn't really fit the "elevator pitch" shape because I don't think
that proposal would really fit the RFC shape either.
I'd like to suggest that we switch the core's hashtable implementation from
the current array of linked-lists (which is not friendly with CPU caches) to
"open addressing" hashing (which is). More specifically, to "Robin Hood
Hashing", and most specifically to the implementation that I wrote for
MoarVM (and Perl 5) - "A Better Hash".
Why...
For the simple explanation, you need to know about "load factor".
The top level of all hashtable implementations is an array of something
(pointers or structure) - an array that gets resized "as needed", but when
that happens is a trade off.
To make things work, you can't use all the entries in this array. Some have
to be empty. The proportion that are full is your load factor. It can't go
above 100%. If you load factor gets too high, you resize the hash.
(Typically you double its size. As there are the same number of keys, your
load factor halves)
Assume a typical load factor of 50%. This means that half your top level
array elements are empty. Imagine that each used element is next to an empty
element - it makes these diagrams easier:
The current hash implementation is an array of pointers to linked lists of
HE structs, which in turn point to the key and value:
| |
+------+ +------+ +------+
| HE * | -> | Key | ---> | HEK |
+------+ +- -+ +------+
| NULL | | Val | -+
+------+ +- -+ | +------+
| | | NULL | +-> | SV |
+------+ +------+
^^^^ points to next entry in the list if needed.
The proposed implementation "open addressing" eliminates the linked lists by
storing HE structs in a flat array:
| |
+------+ +------+
| Key | ---> | HEK |
+- -+ +------+
| Val | -+
+------+ | +------+
| NULL | +-> | SV |
+- -+ +------+
| NULL |
+------+
| |
This
1) reduces the overhead by (on average) 1 pointer per hash entry,
2) eliminates one pointer dereference (and potential CPU cache miss)
3) keeps all the hash close in memory (likely more CPU cache hits)
The downsides are
1) more CPU work in maintaining that flat array in order
2) slightly more memory allocation for hashes under about 50 entries
(but usually that memory is allocated but not needed)
I propose that we use the hash that MoarVM switched to last year.
It's my independent C implementation of the design of Martin Ankerl's hash
at https://github.com/martinus/robin-hood-hashing
I named mine "A Better Hash"
There are a lot of comments explaining how it works at
https://github.com/MoarVM/MoarVM/blob/master/src/core/str_hash_table.h
The MoarVM version has 64 bit hash values, but 32 bit sizes for hashtables
(so at most 4G entries)
Perl 5 currently has 32 bit, 32 bit, and a 31 bit length limit for keys.
I propose that while we're disrupting everything else, we switch to U64
hashes, SSize_t entries and SSize_t keys.
Because this all sounds crazy and impossible and you'd never believe me, I
have a working proof of concept at
https://github.com/nwc10/perl5/tree/smoke-me/nicholas/ABH-POC
It builds DBI, Moose, Moo, JSON::XS, YAML::XS just fine.
As in, all dependencies, all tests pass, vanilla, no changes.
(it's definitely a Proof Of Concept. The code is a mess of 3 styles.
Some dump.c functionality and a few regression tests are disabled.
keys %hash = $value; is not implemented yet. ithreads cloning doesn't
clone the hash iterator. The commit messages could be better.
Apart from that it works. At least on Linux, AIX and HP/UX)
Internally the core is ABH with 64 bit hash values and 64 bit sizes in the C
APIs, but it presents a legacy 32 bit interface to XS, and XS works.
It seems that you only need 32/64 bit compatibility wrappers around 9 core
API functions.
I *did* need to patch Cpanel::JSON::XS and Sereal to remove
-Werror-no-declaration-after-statement from their C compiler flags.
Until "this week" that flag was an excellent idea, to avoid accidentally
writing non-C89 code. However, I want to use C99 in my inline wrappers, and
hence I've just discovered one of our future C99 pain points....
Sereal needed t/020_sort_keys.t temporarily disabled, and these changes:
+++ srl_encoder.c 2021-08-02 21:39:24.461838518 +0000
@@ -1203,6 +1203,23 @@
{
HE *he;
const int do_share_keys = HvSHAREKEYS((SV *)src);
+
+#ifdef HvABH
+ Perl_ABH_Table *table= HvABH(src);
+ Perl_ABH_Iterator iterator= Perl_ABH_first(table);
+ while (!Perl_ABH_at_end(table, iterator)) {
+ HE *he= (HE *) Perl_ABH_current(table, iterator);
+ SV *v= HeVAL(he);
+ if (v != &PL_sv_placeholder) {
+ srl_dump_hk(aTHX_ enc, he, do_share_keys);
+ CALL_SRL_DUMP_SV(enc, v);
+ if (--n == 0) {
+ break;
+ }
+ }
+ iterator= Perl_ABH_next(table, iterator);
+ }
+#else
HE **he_ptr= HvARRAY(src);
HE **he_end= he_ptr + HvMAX(src) + 1;
@@ -1219,6 +1236,7 @@
}
}
} while ( he_ptr < he_end );
+#endif
}
SRL_STATIC_INLINE void
@@ -1508,7 +1526,12 @@
if ( share_keys && SRL_ENC_HAVE_OPTION(enc, SRL_F_SHARED_HASHKEYS) /* only enter branch if shared hk's enabled */
#if PERL_VERSION >= 10
&& (!DO_SHARED_HASH_ENTRY_REFCOUNT_CHECK
- || src->he_valu.hent_refcount > 1)
+#ifdef HvABH
+ || src->hent_hek->hek_refcount > 1
+#else
+ || src->he_valu.hent_refcount > 1
+#endif
+ )
#endif
)
{
I think that we can fake everything in the existing (macro soup) API *except*
1) HvARRAY
2) The type of PL_strtab
HvARRAY can't be replaced or emulated. I think that 14 distros use it:
https://metacpan.org/release/Devel-Arena
https://metacpan.org/release/Devel-MAT-Dumper
https://metacpan.org/release/Devel-SizeMe
https://metacpan.org/release/Glib
https://metacpan.org/release/Hash-Spy
https://metacpan.org/release/Internals-DumpArenas
https://metacpan.org/release/JSPL
https://metacpan.org/release/JavaBin
https://metacpan.org/release/Sereal-Encoder
https://metacpan.org/release/Sub-Disable
https://metacpan.org/release/autovivification
https://metacpan.org/release/mod_perl
https://metacpan.org/release/namespace-clean-xs
https://metacpan.org/release/perlec
grep.metacpan.org can't filter out results from ppport.h (or embed.fnc)
so I might have missed some
(this is a feature of grep.cpan.me that I miss)
I think we can patch them all. Several are just truth tests - I propose
a new macro HvIS_EMPTY(hv) to replace them, which we also add to ppport.h
Others just iterate over all the hash buckets - I propose we add
hv_foreach() to core and ppport.h to handle these cleanly.
PL_strtab is mentioned in
https://metacpan.org/release/Devel-Arena
https://metacpan.org/release/Devel-GC-Helper
https://metacpan.org/release/Devel-MAT-Dumper
that I can find, and as it turns out at least in xsh/threads.h, which a
routine search can't find.
The "XS exposure" to these two is much smaller than you might imagine.
As long as the iterator APIs keep working and returning HE*s and HEK*s,
most code is happy.
Why am I suggesting "A Better Hash" and not $other...
1) C implementation exists
2) It's portable
3) It's suitably licensed
4) It supports hash randomisation
5) You can delete during hash iteration without having "problems"
Before suggesting other hashes, please could you check that they meet all 5
of the above criteria. I'm suggesting here something that is known to work
and can be polished and delivered within a month or two part time. It builds
on approximately every CPU and OS known to humanity (including sparc64 and
ia64, so any endianness, and the alignment is 64 bit clean. *Not* tested on
VMS, zOS, the CPU that zOS runs on, hppa or RiscV. Yet.)
We can in turn replace this with something even better once code for that
exists. But until then, don't let the perfect be the enemy of the good.
Randomisation and Iteration
Randomisation is implemented by each hash having a private random 64 bit
"salt", and getting a new salt for each hashtable size doubling. Instead of
perturbing the iteration order, the insert position is perturbed.
For the current Linked List Hashes, the exploitable weakness was that on
hash "grow" linked lists chains are *split* (not completely re-ordered),
meaning that exactly 1 bit of the *key*'s hash value determines the split,
and hence *that* bit is exposed by the iteration order. Without all the
current mitigations to perturb iteration order it is demonstrably possible
to infer that value of the bit in enough keys to work out the interpreter's
global hash seed.
For ABH, all 64 bits of salt (and all 32 or 64 bits of the key's hash value)
are mixed to get the insert position. On hash resize a completely new salt
is used, and hence totally new order is generated. I believe this means that
anyone attempting to attack needs to figure out 96 (ideally 128) bits of
hidden state simultaneously, not just the 1 bit for the Linked List splits.
Hence I believe that it's secure. I'd appreciate someone competent proving
me wrong on this.
This means that internally hash iterators are just an integer array index.
This avoids all sorts of complexity with dangling pointers, allocation, etc.
Iterators actually start at the highest index and count down to 0. This has
three benefits
0) The "end of hash" iterator value is the same for hashes of any size
1) It's cheap to test a CPU register against a constant.
2) It's the reverse direction from internal movements caused by insertion
and deletion
That last one is key - it means that it's possible to delete the hash entry
at the current iterator position without perturbing the iteration.
Perl documents this as supported behaviour. We need to support it.
Performance
"The marvel is not that the bear dances well, but that the bear dances at
all."
It's actually a bit better than that. It's not noticeably slower on
anything. I'm not sure what makes a good benchmark. Code I've tried isn't
bottlenecked on hashing. I've tried using cachegrind - we're using more CPU
in the hash code, and more data references, but fewer last level misses.
But some of this could be down to tuning - open addressing hashes have do
more work for insert and delete moving entries around to maintain their
invariants. Robin Hood invariants are slightly more work on insert and
delete, but intended to reduce work on lookups. And the approach here of
storing more hash bits in the metadata means more ordering constraints (so
move moving), but fewer lookups outside of the metadata bytes on reading.
As I don't know what to benchmark, I've not tried tuning anything.
(To be fair, I'm exhausted after the big push to get this working at all.
It's why I've been quiet for a few weeks)
The load factor might be too high. There's actually quite a lot that can
be tuned:
* load factor (even make it dynamic as a function of hash size)
* thresholds for resizing
* thresholds for how many hash bits in the metadata
* smaller sizes (official "4", "2" or even "1", and with a load factor of 1.0)
but without meaningful benchmarks (which I don't have) it's not going to
answer any questions.
So for now I need to take a few days break from this, but everyone else is
welcome to play with the code, and figure out how to benchmark it.
(And break the randomisation)
Nicholas Clark
So, this doesn't really fit the "elevator pitch" shape because I don't think
that proposal would really fit the RFC shape either.
I'd like to suggest that we switch the core's hashtable implementation from
the current array of linked-lists (which is not friendly with CPU caches) to
"open addressing" hashing (which is). More specifically, to "Robin Hood
Hashing", and most specifically to the implementation that I wrote for
MoarVM (and Perl 5) - "A Better Hash".
Why...
For the simple explanation, you need to know about "load factor".
The top level of all hashtable implementations is an array of something
(pointers or structure) - an array that gets resized "as needed", but when
that happens is a trade off.
To make things work, you can't use all the entries in this array. Some have
to be empty. The proportion that are full is your load factor. It can't go
above 100%. If you load factor gets too high, you resize the hash.
(Typically you double its size. As there are the same number of keys, your
load factor halves)
Assume a typical load factor of 50%. This means that half your top level
array elements are empty. Imagine that each used element is next to an empty
element - it makes these diagrams easier:
The current hash implementation is an array of pointers to linked lists of
HE structs, which in turn point to the key and value:
| |
+------+ +------+ +------+
| HE * | -> | Key | ---> | HEK |
+------+ +- -+ +------+
| NULL | | Val | -+
+------+ +- -+ | +------+
| | | NULL | +-> | SV |
+------+ +------+
^^^^ points to next entry in the list if needed.
The proposed implementation "open addressing" eliminates the linked lists by
storing HE structs in a flat array:
| |
+------+ +------+
| Key | ---> | HEK |
+- -+ +------+
| Val | -+
+------+ | +------+
| NULL | +-> | SV |
+- -+ +------+
| NULL |
+------+
| |
This
1) reduces the overhead by (on average) 1 pointer per hash entry,
2) eliminates one pointer dereference (and potential CPU cache miss)
3) keeps all the hash close in memory (likely more CPU cache hits)
The downsides are
1) more CPU work in maintaining that flat array in order
2) slightly more memory allocation for hashes under about 50 entries
(but usually that memory is allocated but not needed)
I propose that we use the hash that MoarVM switched to last year.
It's my independent C implementation of the design of Martin Ankerl's hash
at https://github.com/martinus/robin-hood-hashing
I named mine "A Better Hash"
There are a lot of comments explaining how it works at
https://github.com/MoarVM/MoarVM/blob/master/src/core/str_hash_table.h
The MoarVM version has 64 bit hash values, but 32 bit sizes for hashtables
(so at most 4G entries)
Perl 5 currently has 32 bit, 32 bit, and a 31 bit length limit for keys.
I propose that while we're disrupting everything else, we switch to U64
hashes, SSize_t entries and SSize_t keys.
Because this all sounds crazy and impossible and you'd never believe me, I
have a working proof of concept at
https://github.com/nwc10/perl5/tree/smoke-me/nicholas/ABH-POC
It builds DBI, Moose, Moo, JSON::XS, YAML::XS just fine.
As in, all dependencies, all tests pass, vanilla, no changes.
(it's definitely a Proof Of Concept. The code is a mess of 3 styles.
Some dump.c functionality and a few regression tests are disabled.
keys %hash = $value; is not implemented yet. ithreads cloning doesn't
clone the hash iterator. The commit messages could be better.
Apart from that it works. At least on Linux, AIX and HP/UX)
Internally the core is ABH with 64 bit hash values and 64 bit sizes in the C
APIs, but it presents a legacy 32 bit interface to XS, and XS works.
It seems that you only need 32/64 bit compatibility wrappers around 9 core
API functions.
I *did* need to patch Cpanel::JSON::XS and Sereal to remove
-Werror-no-declaration-after-statement from their C compiler flags.
Until "this week" that flag was an excellent idea, to avoid accidentally
writing non-C89 code. However, I want to use C99 in my inline wrappers, and
hence I've just discovered one of our future C99 pain points....
Sereal needed t/020_sort_keys.t temporarily disabled, and these changes:
+++ srl_encoder.c 2021-08-02 21:39:24.461838518 +0000
@@ -1203,6 +1203,23 @@
{
HE *he;
const int do_share_keys = HvSHAREKEYS((SV *)src);
+
+#ifdef HvABH
+ Perl_ABH_Table *table= HvABH(src);
+ Perl_ABH_Iterator iterator= Perl_ABH_first(table);
+ while (!Perl_ABH_at_end(table, iterator)) {
+ HE *he= (HE *) Perl_ABH_current(table, iterator);
+ SV *v= HeVAL(he);
+ if (v != &PL_sv_placeholder) {
+ srl_dump_hk(aTHX_ enc, he, do_share_keys);
+ CALL_SRL_DUMP_SV(enc, v);
+ if (--n == 0) {
+ break;
+ }
+ }
+ iterator= Perl_ABH_next(table, iterator);
+ }
+#else
HE **he_ptr= HvARRAY(src);
HE **he_end= he_ptr + HvMAX(src) + 1;
@@ -1219,6 +1236,7 @@
}
}
} while ( he_ptr < he_end );
+#endif
}
SRL_STATIC_INLINE void
@@ -1508,7 +1526,12 @@
if ( share_keys && SRL_ENC_HAVE_OPTION(enc, SRL_F_SHARED_HASHKEYS) /* only enter branch if shared hk's enabled */
#if PERL_VERSION >= 10
&& (!DO_SHARED_HASH_ENTRY_REFCOUNT_CHECK
- || src->he_valu.hent_refcount > 1)
+#ifdef HvABH
+ || src->hent_hek->hek_refcount > 1
+#else
+ || src->he_valu.hent_refcount > 1
+#endif
+ )
#endif
)
{
I think that we can fake everything in the existing (macro soup) API *except*
1) HvARRAY
2) The type of PL_strtab
HvARRAY can't be replaced or emulated. I think that 14 distros use it:
https://metacpan.org/release/Devel-Arena
https://metacpan.org/release/Devel-MAT-Dumper
https://metacpan.org/release/Devel-SizeMe
https://metacpan.org/release/Glib
https://metacpan.org/release/Hash-Spy
https://metacpan.org/release/Internals-DumpArenas
https://metacpan.org/release/JSPL
https://metacpan.org/release/JavaBin
https://metacpan.org/release/Sereal-Encoder
https://metacpan.org/release/Sub-Disable
https://metacpan.org/release/autovivification
https://metacpan.org/release/mod_perl
https://metacpan.org/release/namespace-clean-xs
https://metacpan.org/release/perlec
grep.metacpan.org can't filter out results from ppport.h (or embed.fnc)
so I might have missed some
(this is a feature of grep.cpan.me that I miss)
I think we can patch them all. Several are just truth tests - I propose
a new macro HvIS_EMPTY(hv) to replace them, which we also add to ppport.h
Others just iterate over all the hash buckets - I propose we add
hv_foreach() to core and ppport.h to handle these cleanly.
PL_strtab is mentioned in
https://metacpan.org/release/Devel-Arena
https://metacpan.org/release/Devel-GC-Helper
https://metacpan.org/release/Devel-MAT-Dumper
that I can find, and as it turns out at least in xsh/threads.h, which a
routine search can't find.
The "XS exposure" to these two is much smaller than you might imagine.
As long as the iterator APIs keep working and returning HE*s and HEK*s,
most code is happy.
Why am I suggesting "A Better Hash" and not $other...
1) C implementation exists
2) It's portable
3) It's suitably licensed
4) It supports hash randomisation
5) You can delete during hash iteration without having "problems"
Before suggesting other hashes, please could you check that they meet all 5
of the above criteria. I'm suggesting here something that is known to work
and can be polished and delivered within a month or two part time. It builds
on approximately every CPU and OS known to humanity (including sparc64 and
ia64, so any endianness, and the alignment is 64 bit clean. *Not* tested on
VMS, zOS, the CPU that zOS runs on, hppa or RiscV. Yet.)
We can in turn replace this with something even better once code for that
exists. But until then, don't let the perfect be the enemy of the good.
Randomisation and Iteration
Randomisation is implemented by each hash having a private random 64 bit
"salt", and getting a new salt for each hashtable size doubling. Instead of
perturbing the iteration order, the insert position is perturbed.
For the current Linked List Hashes, the exploitable weakness was that on
hash "grow" linked lists chains are *split* (not completely re-ordered),
meaning that exactly 1 bit of the *key*'s hash value determines the split,
and hence *that* bit is exposed by the iteration order. Without all the
current mitigations to perturb iteration order it is demonstrably possible
to infer that value of the bit in enough keys to work out the interpreter's
global hash seed.
For ABH, all 64 bits of salt (and all 32 or 64 bits of the key's hash value)
are mixed to get the insert position. On hash resize a completely new salt
is used, and hence totally new order is generated. I believe this means that
anyone attempting to attack needs to figure out 96 (ideally 128) bits of
hidden state simultaneously, not just the 1 bit for the Linked List splits.
Hence I believe that it's secure. I'd appreciate someone competent proving
me wrong on this.
This means that internally hash iterators are just an integer array index.
This avoids all sorts of complexity with dangling pointers, allocation, etc.
Iterators actually start at the highest index and count down to 0. This has
three benefits
0) The "end of hash" iterator value is the same for hashes of any size
1) It's cheap to test a CPU register against a constant.
2) It's the reverse direction from internal movements caused by insertion
and deletion
That last one is key - it means that it's possible to delete the hash entry
at the current iterator position without perturbing the iteration.
Perl documents this as supported behaviour. We need to support it.
Performance
"The marvel is not that the bear dances well, but that the bear dances at
all."
It's actually a bit better than that. It's not noticeably slower on
anything. I'm not sure what makes a good benchmark. Code I've tried isn't
bottlenecked on hashing. I've tried using cachegrind - we're using more CPU
in the hash code, and more data references, but fewer last level misses.
But some of this could be down to tuning - open addressing hashes have do
more work for insert and delete moving entries around to maintain their
invariants. Robin Hood invariants are slightly more work on insert and
delete, but intended to reduce work on lookups. And the approach here of
storing more hash bits in the metadata means more ordering constraints (so
move moving), but fewer lookups outside of the metadata bytes on reading.
As I don't know what to benchmark, I've not tried tuning anything.
(To be fair, I'm exhausted after the big push to get this working at all.
It's why I've been quiet for a few weeks)
The load factor might be too high. There's actually quite a lot that can
be tuned:
* load factor (even make it dynamic as a function of hash size)
* thresholds for resizing
* thresholds for how many hash bits in the metadata
* smaller sizes (official "4", "2" or even "1", and with a load factor of 1.0)
but without meaningful benchmarks (which I don't have) it's not going to
answer any questions.
So for now I need to take a few days break from this, but everyone else is
welcome to play with the code, and figure out how to benchmark it.
(And break the randomisation)
Nicholas Clark