Mailing List Archive

Benchmarking a 'no-snails' world (was: Re: PSC #049 2022-01-07)
TL;DR: benchmarks demonstrate no performance gain is possible.


On Mon, 17 Jan 2022 09:55:34 -0500
Felipe Gasper <felipe@felipegasper.com> wrote:

> > 1) leaving @_ untouched when calling a signatured-sub (i.e. it is
> > still the @_ of the caller).
> >
> > This will have a significant performance boost, especially when
> > calling small stub functions like accessors. At the moment perl has
> > to do the equivalent of
...
> The points heretofore raised in response to this seem to be:
>
> 1) There is no viable branch currently that implements leaving @_
> untouched.
>
> 2) The performance gain has yet to be shown.
>
> I’d love to help in either of these regards, but I lack the knowledge
> to assist with #1, and #2 can’t happen without the former.

Taking a slightly-edited version of rjbs's example code from elsewhere
I get the following benchmarks on my machine. Each test was performed
several times and I tried to ignore ones that showed weird timing skew
(probably from background noise of my laptop doing other things at the
time), and have tried to select a "typical" example.

The three test functions are:

sub full { die "arity" unless @_ == 1; my ($x) = @_; return $x * $x }
sub bare { my ($x) = @_; return $x * $x }
sub sigs ($x) { return $x * $x }


First, a 5.34.0 release:

$ perl5.34.0 benchmark-entersub.pl
full: 1.6000s
bare: 1.3186s (speedup x1.21)
sigs: 1.4231s (speedup x1.12)

The signatured version is about 12% faster while performing the same
behaviour. The bare version is 21% faster than full, though lacks the
arity check.
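
The attached benchmark-entersub.pl is not reproduced in this archive. As a rough sketch only, a harness that prints numbers in the same shape might look like the following. The iteration count, the BENCH_COUNT environment variable and the helper names time_calls and speedup are assumptions made for illustration, not taken from the real script; the reported speedup is simply the time for full divided by the time for the variant.

```perl
use v5.26;
use warnings;
use feature 'signatures';
no warnings 'experimental::signatures';
use Time::HiRes qw(time);

# The three test functions, as given in the message.
sub full      { die "arity" unless @_ == 1; my ($x) = @_; return $x * $x }
sub bare      { my ($x) = @_; return $x * $x }
sub sigs ($x) { return $x * $x }

# Assumed iteration count (hypothetical); raise it to reduce timing noise.
my $COUNT = $ENV{BENCH_COUNT} // 1_000_000;

# Wall-clock seconds taken by $COUNT calls of the given code reference.
# Loop and code-ref dispatch overhead is included in every measurement.
sub time_calls ($code) {
    my $start = time;
    $code->(5) for 1 .. $COUNT;
    return time - $start;
}

# Speedup factor: baseline time divided by the variant's time.
sub speedup ($base, $t) { return $base / $t }

my %subs    = (full => \&full, bare => \&bare, sigs => \&sigs);
my %elapsed = map { $_ => time_calls($subs{$_}) } qw(full bare sigs);

printf "full: %.4fs\n", $elapsed{full};
printf "%s: %.4fs (speedup x%.2f)\n",
    $_, $elapsed{$_}, speedup($elapsed{full}, $elapsed{$_})
    for qw(bare sigs);
```

Absolute timings from a sketch like this will differ from the figures quoted here, since the loop overhead and the machine both differ; only the ratios are meaningful.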

Next up, a perl built from my discourage-defav-in-sigsub branch (this
is significantly slower than the release perl above in absolute terms,
because it's an unoptimised debug build; but ignore that):

$ ./perl -Ilib benchmark-entersub.pl
full: 9.0413s
bare: 7.0131s (speedup x1.29)
sigs: 6.4964s (speedup x1.39)

That's more in line with rjbs's original observations - bare is faster
than full (by about 29%) but signatures easily win out here, coming in
at 39% faster (and also being faster than the bare version).

Next up, an edit of a point partway on my "no-snails" branch. At this
point, I've edited the various pp_arg* functions to look in the AV
found in PAD_SVl(0) instead of GvAV(PL_defgv), and I skip the
assignment to GvAV(PL_defgv) in this case. The actual code being
skipped is tiny[1] - as far as I can tell basically a single pointer
assignment; since in order to make pp_arg* work at all we still have to
copy the args to the AV found in PAD_SVl(0). As perhaps expected, this
change makes no observable difference to timing:

$ ./perl -Ilib benchmark-entersub.pl
full: 8.7698s
bare: 6.9522s (speedup x1.26)
sigs: 6.3569s (speedup x1.38)

Finally, by noticing that the example code we're benchmarking doesn't
really depend on the values it returns, I decided to break perl by
doing *even less work* than would actually be required to make the args
give the right answers, just to get an upper bound on the highest
possible speedup that could be achieved. In this broken version, I don't
set up GvAV(PL_defgv), nor do I set up the AV in PAD_SVl(0). I don't
copy the arguments anywhere at all. OP_ARGELEM now can't find them and
will just return undef. I even had to stub out the contents of
pp_argcheck so it doesn't even perform an arity check. To be clear: this
version of perl is totally useless, but should be even faster than it
is possible to achieve for real, because any real perl would have to do
more work than this version:

$ ./perl -Ilib benchmark-entersub.pl
full: 8.7818s
bare: 6.9137s (speedup x1.27)
sigs: 6.7048s (speedup x1.31)

This is the result I find most surprising and hardest to explain.
I've made pp_entersub slower for everyone (I suspect now
because it has to make an extra conditional jump on CvSIGNATURE(cv))
but what's worse is that calling the signatured subs is only 31% faster
than the speed of the full ones (it used to be 38% faster; see above).
And all this for a broken implementation which doesn't even make the
arguments visible or do any arity checking. Adding those things back
would necessarily involve adding more code to what I currently have,
and thus slow it down further.


In conclusion:

As they stand in current bleadperl, signatured subs are already faster
to call (by a measurable > 30%) than pureperl code that performs the
same work by a snail-unpack - either with or without an additional
manually-coded arity check. This is true even considering that perl is
creating the snail (GvAV(PL_defgv)) and pad-zero (PAD_SVl(0)) AV and
copying the argument values into it. (The same AV is shared by both
places).

An edited version of perl that conditionally does not attempt to set up
the snail or pad-zero array for signatured subs does not perform any
faster than this (and indeed runs slower), even before one attempts to
add in any code that might implement passing the actual argument values
into a signatured sub.

I do not believe that it is possible to gain any performance benefit by
skipping the snail-array setup that is performed by non-signatured subs
in "legacy" perl mode.



In case folks want to attempt to replicate or extend these tests for
themselves, I have attached

benchmark-entersub.pl
- the script used to print these numbers

0001-No-setup-snail-array-or-PADSVlzero.diff
- the full set of changes from current blead, to the (broken) perl
that I used for the final benchmark


-----

Footnotes:

[1]: The code to skip assigning to GvAV(PL_defgv):

diff --git a/pp_hot.c b/pp_hot.c
index 477cdd48b8..e596615743 100644
--- a/pp_hot.c
+++ b/pp_hot.c
@@ -5246,7 +5246,10 @@ PP(pp_entersub)
 
             defavp = &GvAV(PL_defgv);
             cx->blk_sub.savearray = *defavp;
-            *defavp = MUTABLE_AV(SvREFCNT_inc_simple_NN(av));
+            if(!CvSIGNATURE(cv))
+                *defavp = MUTABLE_AV(SvREFCNT_inc_simple_NN(av));
+            else
+                SvREFCNT_inc_simple_void_NN(*defavp);
 
             /* it's the responsibility of whoever leaves a sub to ensure
              * that a clean, empty AV is left in pad[0]. This is normally


--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/
Re: Benchmarking a 'no-snails' world (was: Re: PSC #049 2022-01-07)
On Mon, Jan 17, 2022, at 1:28 PM, Paul "LeoNerd" Evans wrote:
> TL;DR: benchmarks demonstrate no performance gain is possible.

Thanks, Paul. I am unequipped to evaluate the changes here, but I reckon this is the time for somebody who is (Dave M., Nicholas, etc.) to say, "Oh well" or "good start but you forgot to delete the now-delete-able line that makes things go slow!"

Anybody?

--
rjbs
Re: Benchmarking a 'no-snails' world (was: Re: PSC #049 2022-01-07)
On Mon, 17 Jan 2022 18:28:19 +0000
"Paul \"LeoNerd\" Evans" <leonerd@leonerd.org.uk> wrote:

> Next up, a perl built from my discourage-defav-in-sigsub branch (this
> is significantly slower than the release perl above in absolute terms,
> because it's an unoptimised debug build; but ignore that):

ilmari points out that using an unoptimised debug build is unlikely to
give useful results.

I've rebuilt these tests using non-debug -O3, and also running the
actual code for 4 times longer to reduce timing jitter/noise:

> $ ./perl -Ilib benchmark-entersub.pl
> full: 9.0413s
> bare: 7.0131s (speedup x1.29)
> sigs: 6.4964s (speedup x1.39)

$ ./perl -Ilib benchmark-entersub.pl
full: 4.2597s
bare: 3.5885s (speedup x1.19)
sigs: 3.5397s (speedup x1.20)

$ ./perl -Ilib benchmark-entersub.pl
full: 4.3395s
bare: 3.6818s (speedup x1.18)
sigs: 3.5970s (speedup x1.21)

$ ./perl -Ilib benchmark-entersub.pl
full: 4.0384s
bare: 3.3404s (speedup x1.21)
sigs: 3.4798s (speedup x1.16)

So both bare and sigs are "about 20% faster" than full, with not much
difference between them. I seem to be getting noisier results this
time around though, so they may be harder to interpret. This is why I'm
posting three separate runs.

> Finally, by noticing that the example code we're benchmarking doesn't
> really depend on the values it returns, I decided to break perl by
> doing *even less work* than would actually be required to make the
> args give the right answers, just to get an upper bound on the highest
> possible speedup that could be achieved. In this broken version, I
> don't set up GvAV(PL_defav), nor do I set up the AV in PAD_SVl(0). I
> don't copy the arguments anywhere at all. OP_ARGELEM now can't find
> them and will just return undef. I even had to stub out the contents
> of pp_argcheck so it doesn't even perform an arity check. To be
> clear: this version of perl is totally useless, but should be even
> faster than it is possible to achieve for real, because any real perl
> would have to do more work than this version:
>
> $ ./perl -Ilib benchmark-entersub.pl
> full: 8.7818s
> bare: 6.9137s (speedup x1.27)
> sigs: 6.7048s (speedup x1.31)

Now (on three separate runs):

$ ./perl -Ilib benchmark-entersub.pl
full: 4.0394s
bare: 3.3806s (speedup x1.19)
sigs: 3.3622s (speedup x1.20)

$ ./perl -Ilib benchmark-entersub.pl
full: 3.9630s
bare: 3.2788s (speedup x1.21)
sigs: 3.2075s (speedup x1.24)

$ ./perl -Ilib benchmark-entersub.pl
full: 4.0202s
bare: 3.6163s (speedup x1.11)
sigs: 3.3566s (speedup x1.20)

Again fairly noisy (especially this last one), but still quite
indistinguishable from the first runs; or indeed from those of a plain
unpatched bleadperl:

$ ./perl -Ilib benchmark-entersub.pl
full: 3.3048s
bare: 2.6910s (speedup x1.23)
sigs: 2.5559s (speedup x1.29)

$ ./perl -Ilib benchmark-entersub.pl
full: 3.4209s
bare: 2.7970s (speedup x1.22)
sigs: 2.6608s (speedup x1.29)

$ ./perl -Ilib benchmark-entersub.pl
full: 3.2664s
bare: 2.6584s (speedup x1.23)
sigs: 2.5078s (speedup x1.30)

In fact if anything, I'd say these plain bleadperls are running even
more efficiently than the patched versions.
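
To make the comparison across the noisy runs easier to see, here is a small sketch that averages the three runs posted for each build. The timings are copied from the runs above; the build labels and the helper names mean and summarise are made up for illustration, and the comparison of absolute times assumes all three perls were built with comparable options, which is not stated for the plain bleadperl.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# [full, bare, sigs] timings in seconds, copied from the runs above.
# The keys are illustrative labels, not names used in the thread.
my %runs = (
    'defav branch' => [ [4.2597, 3.5885, 3.5397],
                        [4.3395, 3.6818, 3.5970],
                        [4.0384, 3.3404, 3.4798] ],
    'broken build' => [ [4.0394, 3.3806, 3.3622],
                        [3.9630, 3.2788, 3.2075],
                        [4.0202, 3.6163, 3.3566] ],
    'plain blead'  => [ [3.3048, 2.6910, 2.5559],
                        [3.4209, 2.7970, 2.6608],
                        [3.2664, 2.6584, 2.5078] ],
);

# Arithmetic mean of a list of numbers.
sub mean { return sum(@_) / @_ }

# For one build: mean time of sigs, and mean full/sigs speedup factor.
sub summarise {
    my ($rows) = @_;
    my $sigs_time    = mean(map { $_->[2] } @$rows);
    my $sigs_speedup = mean(map { $_->[0] / $_->[2] } @$rows);
    return ($sigs_time, $sigs_speedup);
}

for my $build ('defav branch', 'broken build', 'plain blead') {
    printf "%-12s sigs mean %.4fs, mean speedup x%.2f\n",
        $build, summarise($runs{$build});
}
```

Three runs per build is a very small sample, so the means are only a rough guide; they do not replace a proper statistical comparison.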

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/
Re: Benchmarking a 'no-snails' world (was: Re: PSC #049 2022-01-07)
On Mon, Jan 17, 2022 at 06:28:19PM +0000, Paul "LeoNerd" Evans wrote:
> since in order to make pp_arg* work at all we still have to
> copy the args to the AV found in PAD_SVl(0).

(Sorry I can't type a lot right now.)

Can't entersub leave the arguments (and mark) on the stack for a
signature sub and pp_argelem access the arguments on the stack
instead?

The final argelem would need to pop the mark and clean up the stack.

This would save the cost of setting up the AV.

Tony
Re: Benchmarking a 'no-snails' world (was: Re: PSC #049 2022-01-07)
On Tue, 18 Jan 2022 at 06:52, Tony Cook <tony@develop-help.com> wrote:

> On Mon, Jan 17, 2022 at 06:28:19PM +0000, Paul "LeoNerd" Evans wrote:
> > since in order to make pp_arg* work at all we still have to
> > copy the args to the AV found in PAD_SVl(0).
>
> (Sorry I can't type a lot right now.)
>
> Can't entersub leave the arguments (and mark) on the stack for a
> signature sub and pp_argelem access the arguments on the stack
> instead?
>
> The final argelem would need to pop the mark and clean up the stack.
>
> This would save the cost of setting up the AV.
>

That sounds like a big change in behaviour? It seems like this would lead
to the following situation:

- everything in the signature is now an alias instead of a copy
- signature args are no longer refcounted (stack-not-refcounted has long
been a source of various problems!)

The first one would lead to very surprising behaviour with cases like
`sub ltrim ($x) { $x =~ s{^\s+}{}; $x }`, and the second likewise for
lexical captures such as
`(sub ($x) { sub { $x } })->($thing_that_drops_out_of_scope_shortly_after)`.
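
For readers less familiar with the distinction, the difference between a copy and an alias can be seen in current perl without any patches. This is only an illustration of existing behaviour (signature parameters are copies, elements of @_ are aliases), not of the proposed change; the sub and variable names are invented for the example.

```perl
use v5.26;
use warnings;
use feature 'signatures';
no warnings 'experimental::signatures';

# Signature parameter: $x is a fresh lexical holding a copy of the argument.
sub ltrim_copy ($x) { $x =~ s{^\s+}{}; return $x }

# Direct @_ access: $_[0] is an alias to the caller's variable.
sub ltrim_alias { $_[0] =~ s{^\s+}{}; return $_[0] }

my $first  = "  padded";
my $second = "  padded";

ltrim_copy($first);     # caller's variable is left untouched
ltrim_alias($second);   # caller's variable is modified in place

say "after ltrim_copy:  '$first'";
say "after ltrim_alias: '$second'";
```

If signature parameters became aliases, ltrim_copy would start behaving like ltrim_alias, which is the surprise being described.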
Re: Benchmarking a 'no-snails' world (was: Re: PSC #049 2022-01-07)
On Tue, Jan 18, 2022 at 07:17:13AM +0800, Tom Molesworth wrote:
> On Tue, 18 Jan 2022 at 06:52, Tony Cook <tony@develop-help.com> wrote:
>
> > On Mon, Jan 17, 2022 at 06:28:19PM +0000, Paul "LeoNerd" Evans wrote:
> > > since in order to make pp_arg* work at all we still have to
> > > copy the args to the AV found in PAD_SVl(0).
> >
> > (Sorry I can't type a lot right now.)
> >
> > Can't entersub leave the arguments (and mark) on the stack for a
> > signature sub and pp_argelem access the arguments on the stack
> > instead?
> >
> > The final argelem would need to pop the mark and clean up the stack.
> >
> > This would save the cost of setting up the AV.
> >
>
> That sounds like a big change in behaviour? It seems like this would lead
> to the following situation:
>
> - everything in the signature is now an alias instead of a copy
> - signature args are no longer refcounted (stack-not-refcounted has long
> been a source of various problems!)

The values would be copied to the lexicals just as they are now.

The only change is the source of the values: right now it's the
unrefcounted @_; with my suggestion above they would come from the
unrefcounted stack.

Tony
Re: Benchmarking a 'no-snails' world (was: Re: PSC #049 2022-01-07)
On 2022-01-17 3:17 p.m., Tom Molesworth via perl5-porters wrote:
> That sounds like a big change in behaviour? It seems like this would lead to the
> following situation:
>
> - everything in the signature is now an alias instead of a copy
> - signature args are no longer refcounted (stack-not-refcounted has long been a
> source of various problems!)
>
> The first one would lead to very surprising behaviour with cases like `sub ltrim
> ($x) { $x =~ s{^\s+}{}; $x }`, and the second likewise for lexical captures such
> as `(sub ($x) { sub { $x } })->($thing_that_drops_out_of_scope_shortly_after)`.

Surprising behavior can be avoided by forbidding assignment to routine
parameters. I believe it is a best practice to forbid such assignments anyway,
even where the language supports them. Assigning to parameters makes code
harder to understand: one can't reliably look at a parameter anywhere in the
code and know that it still holds the argument that was passed in.

-- Darren Duncan
Re: Benchmarking a 'no-snails' world (was: Re: PSC #049 2022-01-07)
On Mon, Jan 17, 2022 at 10:15:33PM +0000, Paul "LeoNerd" Evans wrote:
[lots of benchmarking stuff]

I'm going to have to respectfully disagree with your benchmarking results
and conclusions for populating @_ :-).

First off, I think (if I am reading your diffs correctly), you have
missed cutting out all the @_ tearing down at sub exit that appears
in Perl_cx_popsub_args().

But more generally, note that perl already has a mechanism for calling a
sub without populating @_: the &foo; calling convention. This is already
special-cased in pp_entersub and elsewhere with the hasargs and CxHASARGS
flags. So it should be possible (in theory) to exploit the existing non-@_
entry and exit code paths without adding extra overhead.
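
As a reminder of what the &foo; form does at the Perl level, here is a minimal sketch showing that the callee sees the caller's @_ rather than a freshly populated one. The sub names are invented for the example; it says nothing about how pp_entersub implements this internally.

```perl
use strict;
use warnings;

# Reports how many elements are visible in @_ inside the callee.
sub count_args { return scalar @_ }

# foo() style: the callee gets a fresh @_ holding the call's (zero) arguments.
sub via_parens { return count_args() }

# &foo; style: no new @_ is set up, so the callee sees the caller's @_.
sub via_ampersand { return &count_args; }

print "parens:    ", via_parens(1, 2, 3), "\n";      # prints "parens:    0"
print "ampersand: ", via_ampersand(1, 2, 3), "\n";   # prints "ampersand: 3"
```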

Since this pathway already exists, it's possible to benchmark it without
hacking the perl interpreter itself:

use Benchmark ':all';

use feature 'signatures';
no warnings 'experimental';

sub foo0 { }
sub foo2 ($x,$y) { }

cmpthese(-3, {
    ampersand => sub { &foo0; },
    args0     => sub { foo0(); },
    args2     => sub { foo2(1,2); },
});

with that I get:

                Rate     args2     args0 ampersand
args2     15642116/s        --      -64%      -69%
args0     43208035/s      176%        --      -14%
ampersand 50372768/s      222%       17%        --

converting that into microseconds per call (1e6 / rate), I get

0.01985199622145 ampersand
0.02314384350040 args0
0.06392996957700 args2

that seems to me to show that about 0.0033 us is spent per call just setting
up and tearing down @_ itself, even in the absence of any arguments. That
represents about 5% of the total overhead of calling a 2-arg signature
sub.
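
The conversion and the 5% estimate can be reproduced from the rates in the table. The sketch below works in nanoseconds per call (1e9 / rate, which is the same as 1e6 / rate microseconds); the share attributed to @_ is a ratio, so it does not depend on the unit chosen. The helper name ns_per_call and the variable names are invented for the example.

```perl
use strict;
use warnings;

# Calls per second, copied from the cmpthese table above.
my %rate = (ampersand => 50372768, args0 => 43208035, args2 => 15642116);

# Nanoseconds per call is 1e9 divided by calls per second.
sub ns_per_call { my ($per_second) = @_; return 1e9 / $per_second }

my %ns = map { $_ => ns_per_call($rate{$_}) } keys %rate;

# Cost attributed to setting up and tearing down @_: args0 minus ampersand.
my $snail_cost = $ns{args0} - $ns{ampersand};

# That cost as a fraction of one call to the 2-arg signature sub.
my $share = $snail_cost / $ns{args2};

printf "%-9s %6.2f ns/call\n", $_, $ns{$_} for qw(ampersand args0 args2);
printf "\@_ setup and teardown: %.2f ns/call, %.1f%% of an args2 call\n",
    $snail_cost, 100 * $share;
```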

But it is important to note that signature sub arg processing is NOT yet
optimised. It's always been my long-term plan that the current arrangement
of an OP_ARGELEM (plus nextstate) per arg will be replaced (by the
peephole optimiser) by a single OP_SIGNATURE op which implements a simple
FSM to populate all args. This will be much faster than the current
arrangement. When that comes to pass, the overhead of populating @_ will
then represent considerably more than 5% of the total sub calling overhead.


--
Music lesson: a symbiotic relationship whereby a pupil's embellishments
concerning the amount of practice performed since the last lesson are
rewarded with embellishments from the teacher concerning the pupil's
progress over the corresponding period.