Mailing List Archive

non-shared hash keys and large hashes (was Re: Robin Hood Hashing for the perl core)
On Thu, Sep 23, 2021 at 03:07:14PM +0200, demerphq wrote:
> On Fri, 10 Sept 2021 at 13:48, Nicholas Clark <nick@ccl4.org> wrote:

> > But, this *might* be worth doing for us, if there are lots of hashes with
> > 1 to 4 keys. (version objects seem to be 4 keys, MRO lookup tables are 1
> > key, at worst 2) - "official" size 4 buckets with a load factor of 1.0,
> > meaning actually allocating 7 buckets and storing up to 4 keys.
> >
>
> Doesn't sound like a good idea to me. But you know best. :-)

No, I don't know best. In that, for some things I have a good feel for how
far the code can be bent in a new direction (without it breaking), but I
don't have a great feel for which directions it is actually important to
bend in. Or how to measure success.

> > VC12.0 is capable of *targeting* XP and Windows Server 2003
> > (build host requirement is Windows 7/Windows Server 2012)
> >
>
> We only introduced the flag so we would stop breaking our windows builds.
> If that isn't a problem anymore then yay.

I hope the explanation in the commit makes sense in
https://github.com/Sereal/Sereal/pull/265


Although *that* commit was the first that I wrote, and maybe the text in
our perldelta.pod is clearer:

=item *

The Perl C source code now uses some C99 features, which we have verified are
supported by all compilers we target. This means that Perl's headers now
contain some code that is legal in C99 but not C89.

This may cause problems for some XS modules that unconditionally add
C<-Werror=declaration-after-statement> to their C compiler flags if compiling
with gcc or clang. Earlier versions of Perl support long obsolete compilers
that are strict in rejecting certain C99 features, particularly mixed
declarations and code, and hence it makes sense for XS module authors to audit
that their code does not violate this. However, doing this is now only
possible on these earlier versions of Perl, hence these modules need to be
changed to only add this flag for C<<$] < 5.035005>>.

=cut
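
For illustration, the kind of guard a Makefile.PL might grow looks something
like this (the module name is hypothetical, and testing $Config{gccversion}
is just one way of limiting the flag to gcc-ish compilers):

use strict;
use warnings;
use Config;
use ExtUtils::MakeMaker;

# Start from perl's own ccflags - setting CCFLAGS replaces them outright.
my $ccflags = $Config{ccflags};

# Only request strict C89 diagnostics on perls whose headers are still C89.
# ($Config{gccversion} is also set for clang builds.)
$ccflags .= ' -Werror=declaration-after-statement'
    if $] < 5.035005 && $Config{gccversion};

WriteMakefile(
    NAME    => 'Some::XS::Module',   # hypothetical
    CCFLAGS => $ccflags,
);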

> Yes, that is on my todo list to play with. For certain types of hashes IMO
> the PL_strtab is a massive pessimization. We have to do a double store and
> a lot of complex crap, and we end up having one hidden hash that scales to
> the size of the set of unique keys across all hashes, which can chew up a
> lot of memory in the background and can play havoc with preloading of
> data.
>
> Eg, consider I want to build 10 hashes with 10 million keys each; I'm
> actually building 11 hashes, one with 100M keys and ten with 10M. AND if I
> build all of them prefork, and then build a totally different hash with
> similar keys, I start COW'ing the hidden hash.
>
> I have dreamed that we would/could change how hashes work, and only use
> PL_strtab when they are blessed. So when a hash starts off it would be
> unshared. When it gets blessed it gets converted into a shared hash. Eg,
> maybe introduce a "no shared_keys;" pragma, or something like that.
>
> The idea of PL_strtab as far as I know was to dedupe keys used in objects
> where there is a high level of repetition of the keys. When you are building
> large hashes of arbitrary strings, or aggregations in memory with lots of
> breakouts that are unlikely to overlap, or are building hashes prefork, the
> value is really questionable.

Sort of like this?

https://github.com/nwc10/perl5/tree/larger-hashes-without-shared-keys

It turns off shared keys once hashes get (roughly) larger than 42 keys.
(Except for objects and symbol tables)
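
(To check which side of the heuristic a particular hash ended up on, the
SHAREKEYS flag shows up in Devel::Peek output - a rough sketch, with sizes
picked to straddle the threshold:)

use Devel::Peek;

my %small = map { $_ => 1 } 1 .. 10;     # should keep shared keys
my %large = map { $_ => 1 } 1 .. 1_000;  # should have SHAREKEYS turned off

# Look for SHAREKEYS in the FLAGS line of each dumped HV (output is on stderr).
Dump(\%small);
Dump(\%large);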

As I wrote above:

No, I don't know best. In that, for some things I have a good feel for how
far the code can be bent in a new direction (without it breaking), but I
don't have a great feel for which directions it is actually important to
bend in. Or how to measure success.


in that, it all ought to be faster for the use cases described, but I fail
to reliably benchmark it.


Taking my previous (not great) benchmark that is roughly a "seen" hash:

$ cat ~/test/big-hash-r3-my.pl
#!perl -w
use strict;

my $size = shift;
$size ||= 1e5;

my %hash;

if ($size < 0) {
    $size = -$size;
    keys %hash = $size;
}

# Avoid a lot of scope entries/exits so that we're mostly timing what we care
# about

++$hash{$_ * 3}
    for 1 .. $size;

$hash{$_ * 5} || 0
    for 1 .. $size;

$hash{$_ * 5} || 0
    for 1 .. $size;

$hash{$_ * 5} || 0
    for 1 .. $size;

__END__


For blead, one run:

$ valgrind --tool=cachegrind --branch-sim=yes /home/nick/Sandpit/snap-v5.35.4-150-geb058c0793/bin/perl5.35.5 ~/test/big-hash-r3-my.pl 1e7
==2083422== Cachegrind, a cache and branch-prediction profiler
==2083422== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==2083422== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==2083422== Command: /home/nick/Sandpit/snap-v5.35.4-150-geb058c0793/bin/perl5.35.5 /home/nick/test/big-hash-r3-my.pl 1e7
==2083422==
--2083422-- warning: L3 cache found, using its data for the LL simulation.
==2083422== brk segment overflow in thread #1: can't grow to 0x4838000
==2083422== (see section Limitations in user manual)
==2083422== NOTE: further instances of this message will not be shown
==2083422==
==2083422== I refs: 37,648,207,135
==2083422== I1 misses: 31,488
==2083422== LLi misses: 6,735
==2083422== I1 miss rate: 0.00%
==2083422== LLi miss rate: 0.00%
==2083422==
==2083422== D refs: 17,888,205,650 (12,036,875,789 rd + 5,851,329,861 wr)
==2083422== D1 misses: 245,352,635 ( 220,149,381 rd + 25,203,254 wr)
==2083422== LLd misses: 232,768,464 ( 209,874,979 rd + 22,893,485 wr)
==2083422== D1 miss rate: 1.4% ( 1.8% + 0.4% )
==2083422== LLd miss rate: 1.3% ( 1.7% + 0.4% )
==2083422==
==2083422== LL refs: 245,384,123 ( 220,180,869 rd + 25,203,254 wr)
==2083422== LL misses: 232,775,199 ( 209,881,714 rd + 22,893,485 wr)
==2083422== LL miss rate: 0.4% ( 0.4% + 0.4% )
==2083422==
==2083422== Branches: 5,764,880,483 ( 5,252,453,290 cond + 512,427,193 ind)
==2083422== Mispredicts: 426,430,089 ( 106,405,461 cond + 320,024,628 ind)
==2083422== Mispred rate: 7.4% ( 2.0% + 62.5% )


for that branch (almost - before rebasing):

$ valgrind --tool=cachegrind --branch-sim=yes /home/nick/Sandpit/snap-v5.35.4-159-g82b0176e75/bin/perl5.35.5 ~/test/big-hash-r3-my.pl 1e7
==2082396== Cachegrind, a cache and branch-prediction profiler
==2082396== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==2082396== Using Valgrind-3.16.1 and LibVEX; rerun with -h for copyright info
==2082396== Command: /home/nick/Sandpit/snap-v5.35.4-159-g82b0176e75/bin/perl5.35.5 /home/nick/test/big-hash-r3-my.pl 1e7
==2082396==
--2082396-- warning: L3 cache found, using its data for the LL simulation.
==2082396== brk segment overflow in thread #1: can't grow to 0x4832000
==2082396== (see section Limitations in user manual)
==2082396== NOTE: further instances of this message will not be shown
==2082396==
==2082396== I refs: 36,103,067,458
==2082396== I1 misses: 31,762
==2082396== LLi misses: 6,739
==2082396== I1 miss rate: 0.00%
==2082396== LLi miss rate: 0.00%
==2082396==
==2082396== D refs: 17,049,471,109 (11,534,685,391 rd + 5,514,785,718 wr)
==2082396== D1 misses: 205,496,802 ( 188,346,327 rd + 17,150,475 wr)
==2082396== LLd misses: 194,991,458 ( 179,112,354 rd + 15,879,104 wr)
==2082396== D1 miss rate: 1.2% ( 1.6% + 0.3% )
==2082396== LLd miss rate: 1.1% ( 1.6% + 0.3% )
==2082396==
==2082396== LL refs: 205,528,564 ( 188,378,089 rd + 17,150,475 wr)
==2082396== LL misses: 194,998,197 ( 179,119,093 rd + 15,879,104 wr)
==2082396== LL miss rate: 0.4% ( 0.4% + 0.3% )
==2082396==
==2082396== Branches: 5,594,219,347 ( 5,081,890,292 cond + 512,329,055 ind)
==2082396== Mispredicts: 409,023,997 ( 89,004,563 cond + 320,019,434 ind)
==2082396== Mispred rate: 7.3% ( 1.8% + 62.5% )



so it looks sort-of better, but

1) not fantastically
2) I know that something (hash randomisation) moves those numbers about


So, assuming that you have some test cases that more easily benchmark things
like "what memory doesn't stay shared when you fork?" it would be useful if
you could try benchmarking that branch (vs blead) to see if it helps the
use cases that you (or your $work) are interested in.


I'm really not doing very well at benchmarking whether changes are useful.

I can make changes, but I seem to be shooting in the dark, which isn't
really working out.


Nicholas Clark
Re: non-shared hash keys and large hashes (was Re: Robin Hood Hashing for the perl core) [ In reply to ]
On Tue, 19 Oct 2021 at 14:28, Nicholas Clark <nick@ccl4.org> wrote:

> On Thu, Sep 23, 2021 at 03:07:14PM +0200, demerphq wrote:
> > On Fri, 10 Sept 2021 at 13:48, Nicholas Clark <nick@ccl4.org> wrote:
>
> > > But, this *might* be worth doing for us, if there are lots of hashes with
> > > 1 to 4 keys. (version objects seem to be 4 keys, MRO lookup tables are 1
> > > key, at worst 2) - "official" size 4 buckets with a load factor of 1.0,
> > > meaning actually allocating 7 buckets and storing up to 4 keys.
> > >
> >
> > Doesn't sound like a good idea to me. But you know best. :-)
>
> No, I don't know best. In that, for some things I have a good feel for how
> far the code can be bent in a new direction (without it breaking), but I
> don't have a great feel for which directions it is actually important to
> bend in. Or how to measure success.
>

I just meant that using an artifact of overallocating the bucket array
sounds fragile to me, but you know the algorithm best.


>
> > > VC12.0 is capable of *targeting* XP and Windows Server 2003
> > > (build host requirement is Windows 7/Windows Server 2012)
> > >
> >
> > We only introduced the flag so we would stop breaking our windows builds.
> > IF that isnt a problem anymore then yay.
>
> I hope the explanation in the commit make sense in
> https://github.com/Sereal/Sereal/pull/265
>
>
> Although *that* commit was the first that I wrote, and maybe the text in
> our perldelta.pod is clearer:
>
> =item *
>
> The Perl C source code now uses some C99 features, which we have verified
> are
> supported by all compilers we target. This means that Perl's headers now
> contain some code that is legal in C99 but not C89.
>
> This may cause problems for some XS modules that unconditionally add
> C<-Werror=declaration-after-statement> to their C compiler flags if
> compiling
> with gcc or clang. Earlier versions of Perl support long obsolete compilers
> that are strict in rejecting certain C99 features, particularly mixed
> declarations and code, and hence it makes sense for XS module authors to
> audit
> that their code does not violate this. However, doing this is now only
> possible on these earlier versions of Perl, hence these modules need to be
> changed to only add this flag for C<<$] < 5.035005>>.
>
> =cut
>

Ack.


>
> > Yes, that is on my todo list to play with. For certain types of hashes IMO
> > the PL_strtab is a massive pessimization. We have to do a double store and
> > a lot of complex crap and we end up having one hidden hash that scales to
> > the size of the set of unique keys in all hashes, which can chew up a lot
> > of memory in the background and can play havok with preload loading of
> > data.
> >
> > Eg, consider I want to build 10 hashes with 10 million keys each, im
> > actually building 11 hashes, one with 100M keys, and ten with 10M. AND if I
> > build of them prefork, and then build a totally different hash with similar
> > keys, I start COW'ing the hidden hash.
> >
> > I have dreamed that we would/could change how hashes work, and only use
> > PL_strtab when they are blessed. So when a hash starts of it would be
> > unshared. When it gets blessed it gets converted into a shared hash. Eg,
> > maybe introduce a "no shared_keys;" pragma, or something like that.
> >
> > The idea of PL_strtab as far as I know was to dedupe keys used in objects
> > where there is a high level of repetition of the keys. When you are build
> > large hashes of arbitrary strings, or aggregations in memory with lots of
> > breakouts that are unlikely to overlap, or are building hashes prefork, the
> > value is really questionable.
>
> Sort of like this?
>
> https://github.com/nwc10/perl5/tree/larger-hashes-without-shared-keys
>
> It turns off shared keys once hashes get (roughly) larger than 42 keys.
> (Except for objects and symbol tables)
>

Oooh. Neat.


>
> As I wrote above:
>
> No, I don't know best. In that, for some things I have a good feel for
> how
> far the code can be bent in a new direction (without it breaking), but I
> don't have a great feel for which directions it is actually important to
> bend in. Or how to measure success.
>
>
> in that, it all ought to be faster for the use cases described, but I fail
> to reliably benchmark it.
>

So I would do something along the following lines:

my %hash;
my $str= "aaaaaaaaaa";
$hash{$str++}++ for 1..(1<<23);
for (1..4) {
    my $pid= fork // die "couldn't fork";
    if (!$pid) {
        my %copy;
        $copy{$_}++ for keys %hash;
    } else {
        push @pids, $pid;
    }
}
wait($_) for @pids;

You should see a very different memory effect in the children. With our
current code the entirety of the mother's strtab refcounts will be
COWed into the child processes; with your patch I would expect them not to
be. So the net system usage should be something like 5 units (1+4) instead
of 9 units (1+4*2). Whether you notice it in actual performance I imagine
would depend on how much work you have to do to produce each new key.

You could also put some diagnostics in to see how big PL_strtab gets. I
expect with your patch that it doesn't get that large. Add some diagnostics
so that when perl destroys PL_strtab you find out how big its table was (not
how many items it contained, rather what the maximum size of the table was).

Also you need to test the effect of adding and removing the same keys in many
hashes, so you can see the effect of having to bump the refcounts in
PL_strtab versus not. Your benchmark only touched one hash. Try writing to
many hashes with at least some overlapping keys. Something like this:

my @hashes;
$hashes[rand 100]{int rand 1000}++ for 1..1_000_000;

I'd expect that to show up in a benchmark.
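
Or, slightly fleshed out into something self-contained (the counts and the
size of the key pool here are arbitrary - run it under both perls and compare
the /usr/bin/time figures):

#!perl -w
use strict;

my @hashes;

# 100 hashes, 10 million writes, keys drawn from a pool of 1000 strings, so
# most keys repeat across hashes and (on an unpatched perl) most stores just
# bump a PL_strtab refcount rather than adding a new string.
$hashes[rand 100]{"key" . int rand 1000}++ for 1 .. 10_000_000;

# Rough sanity check: stored key slots versus distinct key strings.
my %distinct;
my $slots = 0;
for my $h (grep defined, @hashes) {
    $slots += keys %$h;
    $distinct{$_}++ for keys %$h;
}
printf "%d hashes, %d stored keys, %d distinct key strings\n",
    scalar @hashes, $slots, scalar keys %distinct;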


>
>
> Taking my previous (not great) benchmark that is roughly a "seen" hash:
>
> $ cat ~/test/big-hash-r3-my.pl
> #!perl -w
> use strict;
>
> my $size = shift;
> $size ||= 1e5;
>
> my %hash;
>
> if ($size < 0) {
> $size = -$size;
> keys %hash = $size;
> }
>
> # Avoid a lot of scope entries/exits so that we're mostly timing what we
> care
> # about
>
> ++$hash{$_ * 3}
> for 1 .. $size;
>
>
None of the following is helpful for this. The overheads are purely on
write. Also you need to try modifying many different hashes with a medium
number of keys. Say 100 hashes with 10k keys each, then look at the size of
PL_strtab.
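
Something like this, say (sizes chosen only to match the 100 x 10k
suggestion):

my @hashes;
for my $i (0 .. 99) {
    # all 100 hashes store the same 10k key strings; change the key to
    # "key$i-$_" to make every hash's keys distinct and watch PL_strtab grow
    $hashes[$i]{"key$_"}++ for 1 .. 10_000;
}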


> $hash{$_ * 5} || 0
> for 1 .. $size;
>
> $hash{$_ * 5} || 0
> for 1 .. $size;
>
> $hash{$_ * 5} || 0
> for 1 .. $size;
>
> __END__
>
>
> For blead, one run:
>
> [cachegrind output for blead, identical to that shown earlier in the thread - snipped]
>
>
> for that branch (almost - before rebasing):
>
> [cachegrind output for the branch, identical to that shown earlier in the thread - snipped]
>
>
>
> so it looks sort-of better, but
>
> 1) not fantastically
> 2) I know that something (hash randomisation) moves those numbers about
>
>
> So, assuming that you have some test cases that more easily benchmark
> things
> like "what memory doesn't stay shared when you fork?" it would be useful if
> you could try benchmarking that branch (vs blead) to see if it helps the
> use cases that you (or your $work) are interested in.
>

It's not so much a benchmarking issue, more a memory management issue,
although I bet it will show up. I'll do what I can to test your branch.


>
>
> I'm really not doing very well at benchmarking whether changes are useful.
>

If the change's effect is neutral, is the underlying code change an
improvement in quality, or theoretically better? If so, then it's probably
good to keep.

cheers,
Yves
Re: non-shared hash keys and large hashes (was Re: Robin Hood Hashing for the perl core) [ In reply to ]
On Tue, Oct 19, 2021 at 03:17:48PM +0200, demerphq wrote:


Things I didn't spot until it didn't work :-)

> So I would do something along the following lines:
>
> my %hash;
> my $str= "aaaaaaaaaa";
> $hash{$str++}++ for 1..(1<<23);
> for (1..4) {
> my $pid= fork // die "couldn't fork";
> if (!$pid) {
> my %copy;
> $copy{$_}++ for keys %hash;

needs an exit here. Or _exit.

> } else {
> push @pids, $pid;
> }
> }
> wait($_) for @pids;

waitpid $_, 0
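
Assembled with those two fixes (plus declaring @pids), the sketch becomes
roughly:

my %hash;
my @pids;
my $str = "aaaaaaaaaa";
$hash{$str++}++ for 1 .. (1 << 23);
for (1 .. 4) {
    my $pid = fork // die "couldn't fork";
    if (!$pid) {
        my %copy;
        $copy{$_}++ for keys %hash;
        exit 0;    # without this the child falls through into the parent's code
    }
    push @pids, $pid;
}
waitpid $_, 0 for @pids;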


Anyway, fairly consistently, blead:

37.31user 6.49system 0:18.70elapsed 234%CPU (0avgtext+0avgdata 2177592maxresident)k
0inputs+0outputs (0major+2303622minor)pagefaults 0swaps

36.72user 6.44system 0:18.44elapsed 234%CPU (0avgtext+0avgdata 2177740maxresident)k
0inputs+0outputs (0major+2324684minor)pagefaults 0swaps

36.66user 6.29system 0:18.58elapsed 231%CPU (0avgtext+0avgdata 2177668maxresident)k
0inputs+0outputs (0major+2327457minor)pagefaults 0swaps


the branch:

26.07user 3.73system 0:13.50elapsed 220%CPU (0avgtext+0avgdata 2308668maxresident)k
0inputs+0outputs (0major+1650325minor)pagefaults 0swaps

25.92user 3.99system 0:13.14elapsed 227%CPU (0avgtext+0avgdata 2308888maxresident)k
0inputs+0outputs (0major+1650798minor)pagefaults 0swaps

26.77user 3.79system 0:13.83elapsed 220%CPU (0avgtext+0avgdata 2308744maxresident)k
0inputs+0outputs (0major+1646734minor)pagefaults 0swaps



I struggled to measure memory consumption, particularly shared memory, even
if I inserted a sleep before the exit in the children so that I had some time
to inspect top.

but, it does seem to be faster, which I assume is because it's spending
less time doing the "write" part of COW memory pages after fork.
If this is the cause, why is the "user" time also less?
"system" time (I assume) is the kernel stuff. Or is it all smushed together?

I don't have root access to anything big enough to measure performance
usefully with perf or similar.


Nicholas Clark
Re: [External] Re: non-shared hash keys and large hashes (was Re: Robin Hood Hashing for the perl core) [ In reply to ]
On Tue, Oct 19, 2021 at 9:30 PM Nicholas Clark <nick@ccl4.org> wrote:

> On Tue, Oct 19, 2021 at 03:17:48PM +0200, demerphq wrote:
>
>
> Things I didn't spot until it didn't work :-)
>
> > So I would do something along the following lines:
> >
> > my %hash;
> > my $str= "aaaaaaaaaa";
> > $hash{$str++}++ for 1..(1<<23);
> > for (1..4) {
> > my $pid= fork // die "couldn't fork";
> > if (!$pid) {
> > my %copy;
> > $copy{$_}++ for keys %hash;
>
> needs an exit here. Or _exit.
>
>
So it does. Sorry mate.


> > } else {
> > push @pids, $pid;
> > }
> > }
> > wait($_) for @pids;
>
> waitpid $_, 0
>

Yeah, my bad. I wrote that quickly in between meetings from memory, and tbh I
haven't used fork directly in ages. We have a wrapper around it that we use
here at Booking.com that makes it much less easy to blow your foot off. I
should release it; it makes writing multi-forking code trivial.
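
(Purely for illustration, a bare-bones version of that sort of helper might
look like the following; the real wrapper isn't public and certainly does
more - concurrency limits, error reporting, and so on.)

use strict;
use warnings;
use POSIX ();

# run_forked(\&child_body, $count): run $count children, wait for them all,
# and make sure a child can never fall through into the parent's code.
sub run_forked {
    my ($body, $count) = @_;
    my @pids;
    for my $i (1 .. $count) {
        my $pid = fork // die "couldn't fork: $!";
        if (!$pid) {
            my $ok = eval { $body->($i); 1 };
            warn $@ if !$ok;
            POSIX::_exit($ok ? 0 : 1);   # skip END blocks/destructors in children
        }
        push @pids, $pid;
    }
    waitpid $_, 0 for @pids;
}

# e.g. run_forked(sub { my ($n) = @_; print "child $n done\n" }, 4);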


>
>
> Anyway, fairly consistently, blead:
>
> 37.31user 6.49system 0:18.70elapsed 234%CPU (0avgtext+0avgdata 2177592maxresident)k
> 0inputs+0outputs (0major+2303622minor)pagefaults 0swaps
>
> 36.72user 6.44system 0:18.44elapsed 234%CPU (0avgtext+0avgdata 2177740maxresident)k
> 0inputs+0outputs (0major+2324684minor)pagefaults 0swaps
>
> 36.66user 6.29system 0:18.58elapsed 231%CPU (0avgtext+0avgdata 2177668maxresident)k
> 0inputs+0outputs (0major+2327457minor)pagefaults 0swaps
>
>
> the branch:
>
> 26.07user 3.73system 0:13.50elapsed 220%CPU (0avgtext+0avgdata 2308668maxresident)k
> 0inputs+0outputs (0major+1650325minor)pagefaults 0swaps
>
> 25.92user 3.99system 0:13.14elapsed 227%CPU (0avgtext+0avgdata 2308888maxresident)k
> 0inputs+0outputs (0major+1650798minor)pagefaults 0swaps
>
> 26.77user 3.79system 0:13.83elapsed 220%CPU (0avgtext+0avgdata 2308744maxresident)k
> 0inputs+0outputs (0major+1646734minor)pagefaults 0swaps
>

That is a huge win. More than 25% it would seem.


>
>
>
> I struggled to measure memory consumption, particularly shared memory, even
> if I inserted a sleep before the exit in the children so that I had some
> time
> to inspect top.
>

Try installing https://metacpan.org/pod/Linux::Smaps::Tiny and use it to
have each process report its stats before it exits.
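
If the module is awkward to install, a rough Linux-only stand-in is to sum
the relevant fields of /proc/$$/smaps by hand just before the child exits -
a sketch:

use strict;
use warnings;

# Sum the shared/private RSS fields for the current process (field names as
# they appear in the kernel's /proc/<pid>/smaps output).
sub smaps_summary {
    my %sum;
    open my $fh, '<', "/proc/$$/smaps" or die "can't read smaps: $!";
    while (<$fh>) {
        $sum{$1} += $2
            if /^(Rss|Pss|Shared_Clean|Shared_Dirty|Private_Clean|Private_Dirty):\s+(\d+) kB/;
    }
    return \%sum;
}

my $s = smaps_summary();
printf "pid %d: rss %d kB, shared %d kB, private %d kB\n", $$,
    $s->{Rss} // 0,
    ($s->{Shared_Clean} // 0) + ($s->{Shared_Dirty} // 0),
    ($s->{Private_Clean} // 0) + ($s->{Private_Dirty} // 0);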


>
> but, it does seem to be faster,


Yeah, ~25% faster seems pretty conclusive to me. I'd say that measurement
alone justifies all your time.

> which I assume is because it's spending
> less time doing the "write" part of COW memory pages after fork.
> If this is the cause, why is the "user" time also less?
>

No idea. Maybe the user processes have to block while the system does its
thing so it counts against both? (Does that make sense?)


> "system" time (I assume) is the kernel stuff. Or is it all smushed
> together?
>
> I don't have root access to anything big enough to measure performance
> usefully with perf or similar.
>

I'll see what I can do.

BTW, replying from my work account primarily because the mail never arrived
at my normal account.

cheers,
Yves

--
Yves Orton
Principal Developer & Fellow
Booking.com <https://www.booking.com/>
Making it easier for everyone
to experience the world.
Re: non-shared hash keys and large hashes (was Re: Robin Hood Hashing for the perl core) [ In reply to ]
Nicholas Clark <nick@ccl4.org> wrote:
:Sort of like this?
:
:https://github.com/nwc10/perl5/tree/larger-hashes-without-shared-keys
:
:It turns off shared keys once hashes get (roughly) larger than 42 keys.
:(Except for objects and symbol tables)

TLDR: yes please, my test case was 6.25% faster, and used 20% less memory.


It seems to make a big difference for me on a recent real (well, pure)
maths case I worked on (https://github.com/hvds/seq/tree/master/A051602);
the bulk of the work happens in the 'check' function at:
https://github.com/hvds/seq/blob/master/A051602/classify#L72

Methodology:
- fetch blead @ad41dd2f09, install under /opt/hash-blead
- rebase above nwc branch on that blead, install under /opt/hash-nwc
- in both cases configure with:
    ./Configure -des -Dcc=gcc -Dprefix=$PREFIX -Doptimize='-g -O6' \
        -Dusedevel -Uversiononly
- create 4.7GB input with `./try -v -q 15 >vq15`, representing approx
50M sets of points that (under symmetries) reduce to 10M distinct sets
- simultaneously run classification under each of those perls to find
those duplicates:

% /usr/bin/time /opt/hash-blead/bin/perl ./classify -u -q <vq15
3594.48user 1.38system 59:55.77elapsed 100%CPU (0avgtext+0avgdata 2378648maxresident)k
7016inputs+0outputs (23major+593602minor)pagefaults 0swaps

% /usr/bin/time /opt/hash-nwc/bin/perl ./classify -u -q <vq15
3369.20user 1.99system 56:11.16elapsed 100%CPU (0avgtext+0avgdata 1905472maxresident)k
9330096inputs+0outputs (25major+475349minor)pagefaults 0swaps

Overall hash-nwc was 1/16 quicker on this work (6.25%), which is an
impressive speedup - it does quite a bit of non-hash work as well.

I observed in top(1) that around the halfway point, hash-blead was
using 10% more memory than hash-nwc (1.1GB v. 1.0GB), but the final
figure was 25% more (2.26GB v. 1.81GB). That seemed odd to me, I had
expected the ratio to stay pretty constant.

The 7016 v. 9330096 "inputs" look screwy - this is "Number of file system
inputs by the process", which should be identical between the two runs.

The 25% more pagefaults under blead sensibly matches the memory figures.

I also notice that this commit:

Use each HEK's own flags to decide "shared or not", instead of the HV's

introduces a warning:

hv.c: In function 'S_hv_free_ent_ret':
hv.c:1767:29: warning: unused parameter 'hv' [-Wunused-parameter]
S_hv_free_ent_ret(pTHX_ HV *hv, HE *entry)

.. by removing the last use of 'hv'.

Hugo
Re: [External] Re: non-shared hash keys and large hashes (was Re: Robin Hood Hashing for the perl core) [ In reply to ]
On Wed, Oct 20, 2021 at 02:59:06PM +0100, hv@crypt.org wrote:

> TLDR: yes please, my test case was 6.25% faster, and used 20% less memory.

> Overall hash-nwc was 1/16 quicker on this work (6.25%), which is an
> impressive speedup - it does quite a bit of non-hash work as well.

And not a fork in sight (if I have it right)

In that this speedup is not the same use case as Yves was thinking would get
a meaningful speedup. This is just one process that uses less RAM and hence
(presumably) fewer cache misses etc.

> I also notice that this commit:
>
> Use each HEK's own flags to decide "shared or not", instead of the HV's
>
> introduces a warning:
>
> hv.c: In function 'S_hv_free_ent_ret':
> hv.c:1767:29: warning: unused parameter 'hv' [-Wunused-parameter]
> S_hv_free_ent_ret(pTHX_ HV *hv, HE *entry)
>
> .. by removing the last use of 'hv'.

Thanks. I didn't notice this because I was doing debugging builds, and they
cause macros such as PERL_ARGS_ASSERT_HV_FREE_ENT_RET to be non-empty, and
here that was asserting that hv wasn't NULL, so it "was" used.

I fixed that commit in the rebase so as not to warn. And then kept pulling
at the loose ends, which became this:

commit f892bc2ea53230c3397936db20b9e658950f924e
Author: Nicholas Clark <nick@ccl4.org>
Date: Thu Oct 21 18:53:01 2021 +0000

Drop the unused hv argument from S_hv_free_ent_ret()

In turn, this means that the hv argument to Perl_hv_free_ent() and
Perl_hv_delayfree_ent() is now clearly unused, to mark it as such. Both
functions are deemed to be API, so unlike the static function
S_hv_free_ent_ret we can't simply change their parameters.

However, change all the internal callers to pass NULL instead of the hv, as
this makes it obvious that the function does not read hv, and might cause
the compiler to generate better code.


On Wed, Oct 20, 2021 at 09:46:18AM +0200, Yves Orton via perl5-porters wrote:
> On Tue, Oct 19, 2021 at 9:30 PM Nicholas Clark <nick@ccl4.org> wrote:

> Yeah, my bad. I wrote that quickly in between meetings by memory, and tbh I
> havent used fork directly in ages. We have a wrapper around it that we use
> here at Booking.com that makes it much less easy to blow your foot off. I
> should release it, it makes writing multi forking code trivial.

Yes, seems that would be useful.

Also, fixing code you wrote from memory was much faster than me starting
from scratch.

Both your example and Hugo's example seem to demonstrate that I'm not very
good at translating other people's rough descriptions of "my code does this
hot thing, so write a benchmark that times that" into correct code that
really does the hot thing such that the benchmark shows changes.

> That is a huge win. More than 25% it would seem.

I thought it looked that good, but was rather surprised that it was that
obvious. I'm not trusting my measurements here. :-)


"No plan survives contact with the enemy", but the plan is that we're going
away for a bit, so it is possible that I'm "Away From Keyboard" for the next
week (or only intermittently near it). So I've not tried doing the other
things that you've suggested.

However based on running your benchmark and Hugo's different benchmark, I
figured that we want this change, and maybe it's now more about figuring
out whether the heuristic is correct. So I made a PR:

https://github.com/Perl/perl5/pull/19208

> I'll see what I can do.

I'm curious what some of your work-scale code (and data) makes of it. But I
realise that might not be easy to test.

> BTW, replying from my work account primarily because the mail never arrived
> at my normal account.

gmail hates me. Sometimes it thinks that I'm a spammer. Sometimes not.
Sometimes *for the same message* sent to the list - person A gets it,
person B does not. (This last one is particularly confusing)

I'm not sure what more I can do to tell the faceless gmail that I am
genuine long pig, and not some canned pink "meat" thing.

Nicholas Clark
Re: [External] Re: non-shared hash keys and large hashes (was Re: Robin Hood Hashing for the perl core) [ In reply to ]
On Fri, Oct 22, 2021 at 11:54 AM Nicholas Clark <nick@ccl4.org> wrote:

>
> On Wed, Oct 20, 2021 at 02:59:06PM +0100, hv@crypt.org wrote:
>
> > TLDR: yes please, my test case was 6.25% faster, and used 20% less memory.
>
> > Overall hash-nwc was 1/16 quicker on this work (6.25%), which is an
> > impressive speedup - it does quite a bit of non-hash work as well.
>
> And not a fork in sight (if I have it right)
>
> In that this speedup is not the same use case as Yves was thinking would get
> a meaningful speedup. This is just one process that uses less RAM and hence
> (presumably) fewer cache misses etc.
>

Yeah, removing the extra store/refcount juggling should be noticeable. I
have a problem similar to Hv's that I am waiting on being able to test. I'll
report back when I can.


>
> > I also notice that this commit:
> >
> > Use each HEK's own flags to decide "shared or not", instead of the HV's
> >
> > introduces a warning:
> >
> > hv.c: In function 'S_hv_free_ent_ret':
> > hv.c:1767:29: warning: unused parameter 'hv' [-Wunused-parameter]
> > S_hv_free_ent_ret(pTHX_ HV *hv, HE *entry)
> >
> > .. by removing the last use of 'hv'.
>
> Thanks. I didn't notice this because I was doing debugging builds, and they
> cause macros such as PERL_ARGS_ASSERT_HV_FREE_ENT_RET to be non-empty, and
> here that was asserting that hv wasn't NULL, so it "was" used.
>
> I fixed that commit in the rebase so as not to warn. And then kept pulling
> at the loose ends, which became this:
>
> commit f892bc2ea53230c3397936db20b9e658950f924e
> Author: Nicholas Clark <nick@ccl4.org>
> Date: Thu Oct 21 18:53:01 2021 +0000
>
> Drop the unused hv argument from S_hv_free_ent_ret()
>
> In turn, this means that the hv argument to Perl_hv_free_ent() and
> Perl_hv_delayfree_ent() is now clearly unused, to mark it as such. Both
> functions are deemed to be API, so unlike the static function
> S_hv_free_ent_ret we can't simply change their parameters.
>
> However, change all the internal callers to pass NULL instead of the
> hv, as
> this makes it obvious that the function does not read hv, and might
> cause
> the compiler to generate better code.
>
>
> On Wed, Oct 20, 2021 at 09:46:18AM +0200, Yves Orton via perl5-porters
> wrote:
> > On Tue, Oct 19, 2021 at 9:30 PM Nicholas Clark <nick@ccl4.org> wrote:
>
> > Yeah, my bad. I wrote that quickly in between meetings by memory, and tbh I
> > havent used fork directly in ages. We have a wrapper around it that we use
> > here at Booking.com that makes it much less easy to blow your foot off. I
> > should release it, it makes writing multi forking code trivial.
>
> Yes, seems that would useful.
>
>
Ack. Will do.


> Also, fixing code you wrote from memory was much faster than me starting
> from scratch.
>
> Both your example and Hugo's example seem to demonstrate that I'm not very
> good at translating other people's rough descriptions of "my code does this
> hot thing, so write a benchmark that times that" into correct code that
> really does the hot thing such that the benchmark shows changes.
>

Too close to the forest probably. Not to worry. You did the important bit.
We can work up the benchmarks.


>
> > That is a huge win. More than 25% it would seem.
>
> I thought it looked that good, but was rather surprised that it was that
> obvious. I'm not trusting my measurements here. :-)
>

It isn't inconceivable to me. We saw some really interesting effects using
perl to aggregate really large cardinality data sets; your data IMO
corroborates my suspicion that it was due to the PL_strtab. I am kinda kicking
myself I didn't release a newHV_shared() in one of the Hash::Util modules to
test this previously. Too many things to do. Note the latter would
actually allow us to benchmark the unshared aspect without your core
changes. It's a trivial patch too.

I still think it would be really helpful to have a pragma to turn off
shared hash creation in scope. Something like:

{
    no shared 'hashes';
    my %hash;
    for my $tuple (@$tuples) {
        my ($x,$y,$z) = @$tuple;
        $hash{$x}{$y}{$z}++;
    }
}

In such a case, having a "switch to unshared when keysize exceeds X" would
save a lot of conversions from shared to unshared.

Alternatively, we could switch the sharing on at bless time, and only
blessed hashes would be shared.


> "No plan survives contact with the enemy", but the plan is that we're going
> away for a bit, so it is possible that I'm "Away From Keyboard" for the
> next
> week (or only intermittently near it). So I've not tried doing the other
> things that you've suggested.
>
> However based on running your benchmark and Hugo's different benchmark, I
> figured that we want this change, and maybe it's now more about figuring
> out whether the heuristic is correct. So I made a PR:
>
>
> https://github.com/Perl/perl5/pull/19208
>
> > I'll see what I can do.
>
> I'm curious what some of your work-scale code (and data) makes of it. But I
> realise that might not be easy to test.
>
>
It's more about getting code built. I have asked our resident build expert
to help but he was/is off.


> > BTW, replying from my work account primarily because the mail never arrived
> > at my normal account.
>
> gmail hates me. Sometimes it thinks that I'm a spammer. Sometimes not.
> Sometimes *for the same message* sent to the list - person A gets it,
> person B does not. (This last one is particularly confusing)
>
> I'm not sure what more I can do to tell the faceless gmail that I am
> genuine long pig, and not some canned pink "meat" thing.
>
>
Nod, well I didn't see you in the spam folder either, but I'll check again in
the future.

Yves


--
Yves Orton
Principal Developer & Fellow
Booking.com <https://www.booking.com/>
Making it easier for everyone
to experience the world.