Mailing List Archive

Optimised lexical assignments / faster macroöps
A question in two parts:
1. A specific question,
2. ... which leads on to a broader generalisation

-----

I've been looking lately into the details of how the optree is formed
around regular assignments into lexical variables. Assignments into
scalar lexicals aren't too bad; they look like:

    my $lex = (some RHS scalar value);

    OP_SASSIGN=BINOP:
        (whatever ops for the RHS value)
        OP_PADSV [targ = $lex + OPf_REF|OPf_MOD flags]

Assignments into arrays or hashes look a little more complex:

    my @lex = (some RHS list value);

    OP_AASSIGN=BINOP:
        OP_LIST
            OP_PUSHMARK
            (whatever ops for the RHS value)
        OP_LIST
            OP_PUSHMARK
            OP_PADAV [targ = @lex + OPf_REF|OPf_MOD flags]

and similarly for hashes, except using OP_PADHV.

I can't help thinking that this puts quite a bit of churn on the value
stack, and also on the markstack in the list-assignment case. I wonder
whether in these relatively common cases it might make sense to
peephole-optimise these into a set of three specialised ops that use
op_targ to store the pad index of the variable to be assigned into;
thus turning these cases into:

    OP_PADSV_STORE=UNOP [targ = $lex]
        (whatever ops for the RHS value)

    OP_PADAV_STORE=LISTOP [targ = @lex]
        OP_PUSHMARK
        (whatever ops for RHS value)

(plus OP_PADHV_STORE which would look similar)
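
For concreteness, a pp function for the proposed scalar op might look
roughly like this. A minimal sketch using the usual core macros; the
op itself, its name and its exact stack behaviour are my proposal
here, not existing core code:

    #include "EXTERN.h"
    #include "perl.h"

    /* Minimal sketch of a pp function for the proposed OP_PADSV_STORE.
     * The kid op has already pushed the RHS value; copy it into the
     * pad lexical named by op_targ and leave that lexical on the
     * stack as the expression's result - no OP_SASSIGN needed. */
    PP(pp_padsv_store)
    {
        dSP;
        SV *val  = TOPs;                     /* RHS value from the kid op */
        SV *targ = PAD_SVl(PL_op->op_targ);  /* the lexical, e.g. $lex */

        SvSetMagicSV(targ, val);             /* assign, honouring set-magic */
        SETs(targ);                          /* result is the lexical itself */
        RETURN;
    }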

To this end I might have a go at making a little CPAN module for doing
this.
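
The plumbing for such a module would presumably follow the usual
shape: chain a custom pass onto PL_rpeepp from XS, the way
Faster::Maths and Devel::GoFaster hook in. A sketch, with the actual
match-and-rewrite logic elided:

    #include "EXTERN.h"
    #include "perl.h"
    #include "XSUB.h"

    /* Sketch of chaining a custom peephole pass from a CPAN module;
     * only the hook plumbing is shown. */
    static peep_t prev_rpeepp;

    static void my_rpeep(pTHX_ OP *o)
    {
        OP *op;

        prev_rpeepp(aTHX_ o);   /* let the core peephole pass run first */

        for (op = o; op; op = op->op_next) {
            /* spot e.g. an OP_SASSIGN whose target kid is an OP_PADSV
             * and splice in a custom PADSV_STORE-style op here */
        }
    }

    /* then, in the module's BOOT: section:
     *   prev_rpeepp = PL_rpeepp;
     *   PL_rpeepp   = my_rpeep;
     */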

It would also be useful to measure whether it actually yields any
performance benefit. If so, it might become a useful core addition.

-----

Except, now this leads me onto the larger question. There's nothing
*particularly* specific to lexical assignments about this. I'm sure
similar optimisations could be argued for in many other situations
in core perl.

Right now, we have a few bits of core already that do things like this;
e.g. OP_MULTICONCAT or OP_AELEMFAST_LEX, which are just high-speed
optimisations of common optree shapes. They exploit the observation that
running a few larger ops ends up being faster overall than lots of
small ones.

It's a nice pattern - I've already written a module for doing similar
things to MULTICONCAT with maths operations:

https://metacpan.org/pod/Faster::Maths

This is also loosely inspired by Zefram's

https://metacpan.org/pod/Devel::GoFaster

I wonder if there is scope for creating a general way to do this sort
of thing, and a way to measure the performance boost you get from doing
that in any reasonable workflow, to judge whether it's worth doing.

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/
Re: Optimised lexical assignments / faster macroöps [ In reply to ]
On 2021-10-06 20:18, Paul "LeoNerd" Evans <leonerd@leonerd.org.uk> wrote:

> A question in two parts:
> 1. A specific question,
> 2. ... which leads on to a broader generalisation
>
> -----
>
> I've been looking lately into the details of how the optree is formed
> around regular assignments into lexical variables. Assignments into
> scalar lexicals aren't too bad; they look like:
>
>     my $lex = (some RHS scalar value);
>
>     OP_SASSIGN=BINOP:
>         (whatever ops for the RHS value)
>         OP_PADSV [targ = $lex + OPf_REF|OPf_MOD flags]
>
> Assignments into arrays or hashes look a little more complex:
>
>     my @lex = (some RHS list value);
>
>     OP_AASSIGN=BINOP:
>         OP_LIST
>             OP_PUSHMARK
>             (whatever ops for the RHS value)
>         OP_LIST
>             OP_PUSHMARK
>             OP_PADAV [targ = @lex + OPf_REF|OPf_MOD flags]
>
> and similarly for hashes, except using OP_PADHV.
>
> I can't help thinking that this puts quite a bit of churn on the value
> stack, and also on the markstack in the list-assignment case. I wonder
> whether in these relatively common cases it might make sense to
> peephole-optimise these into a set of three specialised ops that use
> op_targ to store the pad index of the variable to be assigned into;
> thus turning these cases into:
>
>     OP_PADSV_STORE=UNOP [targ = $lex]
>         (whatever ops for the RHS value)
>
>     OP_PADAV_STORE=LISTOP [targ = @lex]
>         OP_PUSHMARK
>         (whatever ops for RHS value)
>
> (plus OP_PADHV_STORE which would look similar)
>
> To this end I might have a go at making a little CPAN module for doing
> this.
>
> It would also be useful to measure whether it actually yields any
> performance benefit. If so, it might become a useful core addition.
>
> -----
>
> Except, now this leads me onto the larger question. There's nothing
> *particularly* specific to lexical assignments about this. I'm sure
> similar optimisations could be argued for in many other situations
> in core perl.
>
> Right now, we have a few bits of core already that do things like this;
> e.g. OP_MULTICONCAT or OP_AELEMFAST_LEX, which are just high-speed
> optimisations of common optree shapes. They exploit the observation that
> running a few larger ops ends up being faster overall than lots of
> small ones.
>
> It's a nice pattern - I've already written a module for doing similar
> things to MULTICONCAT with maths operations:
>
> https://metacpan.org/pod/Faster::Maths
>
> This is also loosely inspired by Zefram's
>
> https://metacpan.org/pod/Devel::GoFaster
>
> I wonder if there is scope for creating a general way to do this sort
> of thing, and a way to measure the performance boost you get from doing
> that in any reasonable workflow, to judge whether it's worth doing.
>
> --
> Paul "LeoNerd" Evans
>
> leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
> http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/


Sorry, this topic is too difficult for me.

It would be good to talk to the core team members who know the details
of the op tree.
Re: Optimised lexical assignments / faster macroöps [ In reply to ]
On Wed, 6 Oct 2021 12:18:09 +0100
"Paul \"LeoNerd\" Evans" <leonerd@leonerd.org.uk> wrote:

> Except, now this leads me onto the larger question. There's nothing
> *particularly* specific to lexical assignments about this. I'm sure
> similar optimisations could be argued for in many other situations
> in core perl.

A few other situations that come to mind:

* Giving OP_CONST the OA_TARGMY flag; so that in situations like

      $idx = 1;

  the OP_PADSV and OP_SASSIGN could be optimised away into a single
  void-context OP_CONST with a TARG set in it.

* Permitting (somehow? - no idea how) OA_TARGMY to also apply in
  LVINTRO situations, meaning that everywhere OA_TARGMY currently
  applies, it would also apply in situations like

      my $idx = 1;

* Extending the "op_targ" concept into the incoming argument. I
  notice that a few ops currently abuse that field, which is supposed
  to be about the "target" (i.e. output destination) of an op, to
  really encode where its source argument comes from. A few ops have
  the OPf_STACKED flag optionally set, and if absent it means they
  take their incoming argument from the pad in op_targ.

  This would add a new field to every UNOP to store the pad offset
  of its incoming argument, and then optimise every UNOP whose
  op_first is a PADSV into having no KID and putting the pad offset
  in that field instead. I suspect this would apply in many more
  cases in real code than the current OA_TARGMY optimisation does
  now, and would give benefits to optree size and execution time, at
  the small cost of memory for every UNOP. This one would need some
  more in-depth benchmarking to make a case justifying it, but if we
  had some good benchmarking test cases it could show a good win.
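
To make that third idea concrete, the extended UNOP might carry one
extra field along these lines; a hypothetical layout, with the struct
and field names invented here for illustration:

    #include "EXTERN.h"
    #include "perl.h"

    /* Hypothetical layout for the extended-UNOP idea: when the
     * incoming argument was an OP_PADSV kid, drop the kid and record
     * its pad offset directly in the parent op. */
    struct unop_padarg {
        BASEOP
        OP *      op_first;  /* kid producing the argument, or NULL  */
        PADOFFSET op_argix;  /* pad offset of the argument, used when
                                op_first is NULL (was an OP_PADSV)   */
    };

    /* At runtime such an op would fetch its argument as either the
     * top of the stack (kid present) or straight from the pad:
     *   SV *arg = op->op_first ? POPs : PAD_SVl(op->op_argix);
     */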

> I wonder if there is scope for creating a general way to do this sort
> of thing, and a way to measure the performance boost you get from
> doing that in any reasonable workflow, to judge whether it's worth
> doing.

So yes I want to reïterate this part of the question here. I can easily
write code that "eh, might be better" but I'd want to have real-life
test cases to measure with benchmarks to demonstrate "this change makes
real programs X% faster", for some X, so we can evaluate the magnitude
of the improvement versus the additional code complexity costs.

I would really like people to supply some interesting test cases. Right
now for my Faster::Maths module, I have a single, small script that
attempts to generate Julia fractals, to demonstrate that the module
makes it run about 30% faster:

https://metacpan.org/release/PEVANS/Faster-Maths-0.02/source/t/95benchmark.t

In summary:

If you want perl to run faster, send me examples of your slow
programs so I can optimise them :)

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/
RE: Optimised lexical assignments / faster macroöps [ In reply to ]
macroöps:
Note that only a few human languages treat ö as an "o" that isn't part of a diphthong.
German (and languages it has influenced, like Estonian) treats it as a vowel similar to o but pronounced differently.

How will you read the word öö, which is a normal Estonian word? A non-exotic one; it translates as "night".

Russian has ё, which is read differently compared to Moët.
Ukrainian has ї, which appears in the name of the country, "Україна".

Also, search engines could misinterpret your ö.

I think no one is going to read "macroops" as "macru:ps".

My 2 roubles on that.

-----Original Message-----
From: Paul "LeoNerd" Evans <leonerd@leonerd.org.uk>
Sent: Thursday, December 2, 2021 6:56 PM
To: perl5-porters@perl.org
Subject: Re: Optimised lexical assignments / faster macroöps

On Wed, 6 Oct 2021 12:18:09 +0100
"Paul \"LeoNerd\" Evans" <leonerd@leonerd.org.uk> wrote:

> Except, now this leads me onto the larger question. There's nothing
> *particularly* specific to lexical assignments about this. I'm sure
> similar optimisations could be argued for in many other situations
> in core perl.

A few other situations that come to mind:

* Giving OP_CONST the OA_TARGMY flag; so that in situations like

      $idx = 1;

  the OP_PADSV and OP_SASSIGN could be optimised away into a single
  void-context OP_CONST with a TARG set in it.

* Permitting (somehow? - no idea how) OA_TARGMY to also apply in
  LVINTRO situations, meaning that everywhere OA_TARGMY currently
  applies, it would also apply in situations like

      my $idx = 1;

* Extending the "op_targ" concept into the incoming argument. I
  notice that a few ops currently abuse that field, which is supposed
  to be about the "target" (i.e. output destination) of an op, to
  really encode where its source argument comes from. A few ops have
  the OPf_STACKED flag optionally set, and if absent it means they
  take their incoming argument from the pad in op_targ.

  This would add a new field to every UNOP to store the pad offset
  of its incoming argument, and then optimise every UNOP whose
  op_first is a PADSV into having no KID and putting the pad offset
  in that field instead. I suspect this would apply in many more
  cases in real code than the current OA_TARGMY optimisation does
  now, and would give benefits to optree size and execution time, at
  the small cost of memory for every UNOP. This one would need some
  more in-depth benchmarking to make a case justifying it, but if we
  had some good benchmarking test cases it could show a good win.

> I wonder if there is scope for creating a general way to do this sort
> of thing, and a way to measure the performance boost you get from
> doing that in any reasonable workflow, to judge whether it's worth
> doing.

So yes I want to reïterate this part of the question here. I can easily write code that "eh, might be better" but I'd want to have real-life test cases to measure with benchmarks to demonstrate "this change makes real programs X% faster", for some X, so we can evaluate the magnitude of the improvement versus the additional code complexity costs.

I would really like people to supply some interesting test cases. Right now for my Faster::Maths module, I have a single, small script that attempts to generate Julia fractals, to demonstrate that the module makes it run about 30% faster:

https://metacpan.org/release/PEVANS/Faster-Maths-0.02/source/t/95benchmark.t

In summary:

If you want perl to run faster, send me examples of your slow
programs so I can optimise them :)

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/
Re: Optimised lexical assignments / faster macroöps [ In reply to ]
On Thu, Dec 02, 2021 at 03:55:56PM +0000, Paul "LeoNerd" Evans wrote:
> I would really like people to supply some interesting test cases. Right
> now for my Faster::Maths module, I have a single, small script that
> attempts to generate Julia fractals, to demonstrate that the module
> makes it run about 30% faster:
>
> https://metacpan.org/release/PEVANS/Faster-Maths-0.02/source/t/95benchmark.t

Simply adding float-only versions of the arithmetic OPs gives you most
of that performance improvement; see:

https://github.com/Perl/perl5/pull/17831

Tony
Re: Optimised lexical assignments / faster macroöps [ In reply to ]
On Thu, 2 Dec 2021 15:55:56 +0000
"Paul \"LeoNerd\" Evans" <leonerd@leonerd.org.uk> wrote:

> A few other situations that come to mind:
>
> * Giving OP_CONST the OA_TARGMY flag; so that in situations like
>
>       $idx = 1;
>
>   the OP_PADSV and OP_SASSIGN could be optimised away into a single
>   void-context OP_CONST with a TARG set in it.

Well... This did not go well.

Having made this the subject of my latest TPF grant proposal[1], I
didn't get very far into implementing it when I hit upon a huge snag.
Namely, that OP_CONST is an SVOP and so on threaded builds of perl, the
op_targ field is (ab)used to store not the target of the op's result,
but the source SV holding the actual constant.

Hrm.
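
For reference, the fetch path that causes the clash looks essentially
like this, modelled on op.h's cSVOPx_sv (CONST_SV is just a local
alias for illustration):

    #include "EXTERN.h"
    #include "perl.h"

    /* On threaded builds a NULL op_sv means "the constant lives in
     * the pad slot named by op_targ" - so op_targ is not free to
     * name an assignment target for OP_CONST there. */
    #ifdef USE_ITHREADS
    #  define CONST_SV(o) \
           (cSVOPx(o)->op_sv ? cSVOPx(o)->op_sv : PAD_SVl((o)->op_targ))
    #else
    #  define CONST_SV(o) (cSVOPx(o)->op_sv)
    #endif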

> * Permitting (somehow? - no idea how) OA_TARGMY to also apply in
>   LVINTRO situations, meaning that everywhere OA_TARGMY currently
>   applies, it would also apply in situations like
>
>       my $idx = 1;

That clearly makes this a non-starter as well.

> * Extending the "op_targ" concept into the incoming argument. I
>   notice that a few ops currently abuse that field, which is supposed
>   to be about the "target" (i.e. output destination) of an op, to
>   really encode where its source argument comes from. A few ops have
>   the OPf_STACKED flag optionally set, and if absent it means they
>   take their incoming argument from the pad in op_targ.
>
>   This would add a new field to every UNOP to store the pad offset
>   of its incoming argument, and then optimise every UNOP whose
>   op_first is a PADSV into having no KID and putting the pad offset
>   in that field instead. I suspect this would apply in many more
>   cases in real code than the current OA_TARGMY optimisation does
>   now, and would give benefits to optree size and execution time, at
>   the small cost of memory for every UNOP. This one would need some
>   more in-depth benchmarking to make a case justifying it, but if we
>   had some good benchmarking test cases it could show a good win.

I therefore wonder whether this part might need to be done first,
performing a somewhat more radical overhaul of the op system by tidying
up the meaning of op_targ to really mean the "target" (i.e. output
destination) of every op, and never to mean the source even in UNOP or
SVOP cases. Having done *that*, an assigning form of OP_CONST
could then use both fields in this case.

In addition I suspect I might also have a go at the
OP_PAD{S,A,H}V_STORE ops I previously mentioned in the first mail of
this thread, as another way to optimise these assignments in a more
general way.


> In summary:
>
> If you want perl to run faster, send me examples of your slow
> programs so I can optimise them :)

Still very much looking for this. So far I've received... er... nothing.
Can I conclude from that that everybody thinks perl is already
quite fast enough, and not a single person has a single program they'd
ever want to run faster?

For me I know that's not true, so unless I receive any other examples
from anyone I'm going to use my `sdview` program[2] while parsing
complicated POD or nroff markup and rendering manpages from them. This
does perhaps feel like an odd thing to optimise perl for, but I don't
currently have any other examples ;)

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/
Re: Optimised lexical assignments / faster macroöps [ In reply to ]
On Mon, 6 Jun 2022 11:44:13 +0100
"Paul \"LeoNerd\" Evans" <leonerd@leonerd.org.uk> wrote:

> Having made this the subject of my latest TPF grant proposal[1],

> For me I know that's not true, so unless I receive any other examples
> from anyone I'm going to use my `sdview` program[2]

Ugh, I forgot to apply the footnotes:

[1]: https://news.perlfoundation.org/post/grant_proposal_optree_optimisation_paul_evans

[2]: https://metacpan.org/dist/App-sdview/view/bin/sdview

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/
Re: Optimised lexical assignments / faster macroöps [ In reply to ]
On Mon, Jun 06, 2022 at 11:44:13AM +0100, Paul "LeoNerd" Evans wrote:
> On Thu, 2 Dec 2021 15:55:56 +0000
> "Paul \"LeoNerd\" Evans" <leonerd@leonerd.org.uk> wrote:
>
> > A few other situations that come to mind:
> >
> > * Giving OP_CONST the OA_TARGMY flag; so that in situations like
> >
> >       $idx = 1;
> >
> >   the OP_PADSV and OP_SASSIGN could be optimised away into a single
> >   void-context OP_CONST with a TARG set in it.
>
> Well... This did not go well.
>
> Having made this the subject of my latest TPF grant proposal[1], I
> didn't get very far into implementing it when I hit upon a huge snag.
> Namely, that OP_CONST is an SVOP and so on threaded builds of perl, the
> op_targ field is (ab)used to store not the target of the op's result,
> but the source SV holding the actual constant.

The situation with consts on threaded builds is a bit of a mess. For OPs
containing a pointer to a GV (such as OP_GVSV), the type of the op varies
between unthreaded and threaded builds, from an SVOP (which has an op_sv
field) to a PADOP (which has an op_padix field) - for both these op
types, that field is in addition to the always-present op_targ field.

However, for OP_CONST it's always an SVOP, with a NULL op_sv indicating
that op_targ should be used instead.

In theory the way forward would be to first regularise OP_CONST to be
like the GV ops and become a PADOP on threaded builds. I think this may be
complicated by when pad slots are allocated for consts - I think this is
done quite late in the compilation of a sub. If it were done earlier (like
however GVs are moved to the pad) then perhaps this becomes possible.

Even better might be to eliminate the PADOP type altogether, and just
make the op_sv field of the SVOP an SV* or PADOFFSET depending on whether
the build is threaded.
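
That suggestion would collapse the two op types into one layout,
something like this sketch (hypothetical; core currently keeps struct
svop and struct padop distinct):

    #include "EXTERN.h"
    #include "perl.h"

    /* Sketch of folding PADOP into SVOP: one payload field whose
     * meaning depends on whether the build is threaded. */
    struct svop_unified {
        BASEOP
    #ifdef USE_ITHREADS
        PADOFFSET op_padix;  /* const/GV lives in a pad slot */
    #else
        SV *      op_sv;     /* const/GV pointed to directly */
    #endif
    };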

> > * Extending the "op_targ" concept into the incoming argument. I
> >   notice that a few ops currently abuse that field, which is supposed
> >   to be about the "target" (i.e. output destination) of an op, to
> >   really encode where its source argument comes from. A few ops have
> >   the OPf_STACKED flag optionally set, and if absent it means they
> >   take their incoming argument from the pad in op_targ.
> >
> >   This would add a new field to every UNOP to store the pad offset
> >   of its incoming argument, and then optimise every UNOP whose
> >   op_first is a PADSV into having no KID and putting the pad offset
> >   in that field instead. I suspect this would apply in many more
> >   cases in real code than the current OA_TARGMY optimisation does
> >   now, and would give benefits to optree size and execution time, at
> >   the small cost of memory for every UNOP. This one would need some
> >   more in-depth benchmarking to make a case justifying it, but if we
> >   had some good benchmarking test cases it could show a good win.
>
> I therefore wonder whether this part might need to be done first,
> performing a somewhat more radical overhaul of the op system by tidying
> up the meaning of op_targ to really mean the "target" (i.e. output
> destination) of every op, and never to mean the source even in UNOP or
> SVOP cases. Having done *that*, an assigning form of OP_CONST
> could then use both fields in this case.

Not sure I understand that proposal.

--
If life gives you lemons, you'll probably develop a citric acid allergy.
Re: Optimised lexical assignments / faster macroöps [ In reply to ]
On Tue, 7 Jun 2022 14:46:32 +0100
Dave Mitchell <davem@iabyn.com> wrote:

> In theory the way forward would be to first regularise OP_CONST to be
> like the GV ops and become a PADOP on threaded builds. I think this
> may be complicated by when pad slots are allocated for consts - I think
> this is done quite late in the compilation of a sub. If it were done
> earlier (like however GVs are moved to the pad) then perhaps this
> becomes possible.
>
> Even better might be to eliminate the PADOP type altogether, and just
> make the op_sv field of the SVOP an SV* or PADOFFSET depending on
> whether the build is threaded.

Yes I think all of that sounds like some useful neatenings that can be
performed. It'd be nice to get rid of the weird way OP_CONST works.

> > I therefore wonder whether this part might need to be done first,
> > performing a somewhat more radical overhaul of the op system by
> > tidying up the meaning of op_targ to really mean the "target" (i.e.
> > output destination) of every op, and never to mean the source even
> > in UNOP or SVOP cases. Having done *that*, an assigning form
> > of OP_CONST could then use both fields in this case.
>
> Not sure I understand that proposal.

Perhaps it's mostly covered by your suggestion above.

My observation is that for most ops, the "op_targ" does mean the target
- i.e. the output destination where the result of the op will be
written. But OP_CONST acts differently; its op_targ holds the input
source value, which it reads and pushes to the (data) stack. Maybe
there are other similar ops; or maybe it really
is the only one. But the plan would be to tidy up this op so it's
nicely consistent - op_targ would always be the output then. Having
done that, a statement like

    $x = 123;

could be represented by a single OP_CONST having the padix of $x in its
op_targ, and the {SV pointer or pad offset of the constant} in the
other op field:

    OP_CONST [sv=123 into targ=$x]

There'd be no need for the full 3-node tree of:

    OP_SASSIGN
        OP_PADSV [$x]
        OP_CONST [123]
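
A pp function for that assigning OP_CONST might then look like this
sketch. It is hypothetical, and presupposes the op_targ tidy-up
above, so that the constant comes from the op's own SV (or pad-index)
field and op_targ is free to name the destination:

    #include "EXTERN.h"
    #include "perl.h"

    /* Sketch of an assigning OP_CONST: write the constant straight
     * into the lexical named by op_targ, with no OP_SASSIGN/OP_PADSV
     * pair in the tree at all. */
    PP(pp_const_store)
    {
        dSP;
        SV *val  = cSVOPx_sv(PL_op);         /* the constant, e.g. 123 */
        SV *targ = PAD_SVl(PL_op->op_targ);  /* the lexical, e.g. $x   */

        SvSetMagicSV(targ, val);
        if (GIMME_V != G_VOID)
            XPUSHs(targ);                    /* non-void: yield the lexical */
        RETURN;
    }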

--
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk | https://metacpan.org/author/PEVANS
http://www.leonerd.org.uk/ | https://www.tindie.com/stores/leonerd/