Mailing List Archive: Re: Optimising perl: 26% speed-up

Neat. I'm glad to see someone playing with this sort of thing.

: Now for a summary and some thoughts on what to do for other
: architectures/compilers.
: (1) Keep op in a global register.
: That ought to provide a good speed-up for almost any architecture.
: The problem is how to arrange for a global variable to be kept in
: a register. With GNU cc you do
: register struct op * op asm("ebx"); /* current op--in a global register */
: (before any functions have been mentioned) but you need to change
: the register name to an "appropriate" one for your architecture.
: More #ifdefs, I'm afraid, and probably some Configure support would
: be nice/necessary too. I can't find out how to do the same for DEC's
: C compiler for OSF/1 (and it may not even be possible). I suspect
: those using other vendors' non-GNU compilers may have a similar problem.

An alternate approach on some systems would be to post-process the
assembler code, presuming there's some register you can steal easily.
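
For illustration, here is a minimal sketch of the GNU C global register
declaration with the #ifdef fallback it tends to need. The register
choice (r15 on x86-64, rather than the ebx quoted above), the name
cur_op, the op_value field and sum_ops() are all assumptions made for
the example, not anything from the perl source.

```c
/* Hypothetical op structure, reduced to what the example needs. */
struct op {
    struct op *op_next;   /* next op to execute, or a null pointer at the end */
    int        op_value;  /* made-up payload, for illustration only */
};

#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__)
/* GNU C global register variable: reserves r15 for cur_op in this file.
 * __asm__ is the spelling that also works under strict -std= modes. */
register struct op *cur_op __asm__("r15");
#else
/* Fallback for other compilers and architectures: an ordinary global. */
struct op *cur_op;
#endif

/* Walk a chain of ops through the global, summing the payloads. */
int sum_ops(struct op *first)
{
    struct op *saved = cur_op;  /* callers may expect r15 to be preserved */
    int total = 0;

    for (cur_op = first; cur_op != 0; cur_op = cur_op->op_next)
        total += cur_op->op_value;

    cur_op = saved;
    return total;
}
```

The declaration has to appear before any function definitions in the
file, and every file that touches the variable has to see the same
declaration, which is why this ends up in a header behind #ifdefs.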

: (2) Keep stack_sp in a global register.
: I'm less sure that this will provide a universal speed-up.

Especially if you do (3).

: (3) Remove stack-overflow checking code by arranging for an
: automagically growing stack. I'm going to code this myself but
: a single mmap(...MAP_ANON) should do the trick with possibly some
: checking elsewhere that nothing tries to realloc the stack.
: Older Unices without mmap or similar may have to do without this.

The old Bourne shell used to do this sort of thing, actually. Made it
hell to port to systems that couldn't reliably restart from SIGSEGV
or SIGBUS...
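
A rough sketch of the mmap idea, assuming a system with anonymous
mappings: reserve one large region up front and let the kernel supply
pages as they are first touched, so push needs no overflow test. The
names (stack_init, push, pop), the element type and the 16MB
reservation are made up for the example.

```c
#define _DEFAULT_SOURCE   /* asks glibc to expose MAP_ANONYMOUS in strict modes */
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON   /* older BSD spelling */
#endif

#define STACK_RESERVE ((size_t)16 << 20)   /* 16MB of address space (assumed) */

static long *stack_base;   /* start of the mapping */
static long *stack_sp;     /* next free slot */

/* Map the whole stack once. Returns 0 on success, -1 on failure. */
int stack_init(void)
{
    void *p = mmap(NULL, STACK_RESERVE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    stack_base = p;
    stack_sp = stack_base;
    return 0;
}

/* No bounds check: the region is assumed large enough for any program. */
void push(long v) { *stack_sp++ = v; }
long pop(void)    { return *--stack_sp; }
```

Nothing here may realloc the stack, since the mapping never moves. A
program that does run past the end of the reservation simply faults.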

: Finally, some of the things that ought to be looked at but I
: didn't get round to doing at the weekend:
:
: (4) Tail-recursion for ops so that they jump straight from one
: to the next.

This was the major motivation for writing pp.h in the first place. I
think the main thing that's needed is that currently some pp functions
call other pp functions as subroutines--see pp_dump(), which calls
pp_goto(). Or they call a common routine assuming they can just
return--see pp_keys(), for instance, which calls do_kv(). The compiler
just needs to remap the destination addresses to remove these
assumptions about the calling mechanism.

There are some other places outside of the pp files that call into
various pp_functions. These just need to be factored out into
service routines that can be called by the pp functions. Or the
code can be duplicated if it's small. I'm not sure what's best
to do about pp_entersub though.

Another thing is that running into an op_next of 0 currently returns
out of run(). We'd have to always make sure that op_next points
to an explicit pp function that knows how to return to run(). We don't
want to test op at the end of every pp routine.
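
To make that concrete, here is a small sketch in which every op jumps
straight to its successor and the chain always ends in an explicit
halt op, so no routine ever tests for a null op_next. The signatures
are simplified for the example (the pp functions here take the op as
an argument and return an int), and pp_add, pp_mul, pp_halt and acc are
hypothetical. C doesn't promise to turn these calls into jumps, so
whether the stack stays flat depends on the compiler and its
optimisation settings.

```c
struct op;
typedef int (*pp_fn)(struct op *);

struct op {
    pp_fn      op_ppaddr;  /* routine that implements this op */
    struct op *op_next;    /* successor; never null except on the halt op */
    int        op_arg;     /* made-up operand, for illustration only */
};

static int acc;            /* stand-in for interpreter state */

/* Finish an op by calling the next one in tail position. */
#define NEXT(o) return (o)->op_next->op_ppaddr((o)->op_next)

int pp_add(struct op *o)  { acc += o->op_arg; NEXT(o); }
int pp_mul(struct op *o)  { acc *= o->op_arg; NEXT(o); }

/* The only op that returns: it knows how to get back to run(). */
int pp_halt(struct op *o) { (void)o; return acc; }

/* Entry point: start the chain and wait for the halt op to unwind it. */
int run(struct op *first)
{
    acc = 0;
    return first->op_ppaddr(first);
}
```

The cost of removing the test is that whatever builds the op chain
must guarantee that every path ends in the halt op.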

But by and large, I was trying to get close to the notion of threaded
code. (Some PDP-11 junkies will remember the instruction JMP @R4++
from DEC's compilers.) It'll take a little work, but it shouldn't be
excruciatingly difficult to make Perl do this, since I was trying to
design the capability into Perl 5, and only cheated in a few spots.

: (5) Optimisation (if possible) of clear_scope and sv_setsv.

I don't think there's going to be much to do there. It may be that
one could do statistical analysis on the number of scalars of various
sorts that go through sv_setsv(), and on machines where switch statements
are slow to set up, throw in a conditional for the most common case.
But it's probably not going to get you a whole lot, since I always
knew that sv_setsv() was a bottleneck, and cringe every time anyone
inserts new code into it. (Hi there, Ilya, Gurusamy. :-)
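
The shape of that conditional, sketched on a toy scalar type rather
than the real SV: test for the case assumed to be most common before
falling into the switch. The struct, the enum and the guess that plain
integers dominate are all assumptions; the point of the statistics
would be to find out which case really does.

```c
/* Toy scalar, standing in for a real SV. */
enum sv_type { SVT_NULL, SVT_IV, SVT_NV, SVT_PV };

struct sv {
    enum sv_type type;
    long         iv;   /* valid when type == SVT_IV */
    double       nv;   /* valid when type == SVT_NV */
    const char  *pv;   /* valid when type == SVT_PV (not copied here) */
};

void sv_copy(struct sv *dst, const struct sv *src)
{
    /* Fast path for the case assumed to be most common. */
    if (src->type == SVT_IV) {
        dst->type = SVT_IV;
        dst->iv = src->iv;
        return;
    }

    /* General path: the full dispatch on the source type. */
    switch (src->type) {
    case SVT_NV:
        dst->type = SVT_NV;
        dst->nv = src->nv;
        break;
    case SVT_PV:
        dst->type = SVT_PV;
        dst->pv = src->pv;
        break;
    default:
        dst->type = SVT_NULL;
        break;
    }
}
```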

You might get a little in leave_scope() by identifying scalars that
can be processed very simply and making more "save" codes. On the
pushing end, we might save a lot by inlining some of the SAVE* macros.
It doesn't stick out as much as leave_scope() because it's
more distributed, but there's some overhead, depending on
how efficiently your architecture does function calls.

One might also think about factoring out the stack checks on
the save stack. See the SSCHECK() macro in sv.h.
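
One way to read "factoring out" is to check once for a whole group of
pushes and then push without further tests. The sketch below does that
on a toy save stack. The ss_* names and SS_RESERVE/SS_PUSH are
hypothetical and are not the real macros.

```c
#include <stdlib.h>

static long *ss;       /* toy save stack */
static int   ss_ix;    /* next free slot */
static int   ss_max;   /* slots currently allocated */

/* Grow the stack until at least `need` more slots fit. */
static void ss_grow(int need)
{
    int new_max = ss_max ? ss_max * 2 : 16;
    long *p;

    while (ss_ix + need > new_max)
        new_max *= 2;
    p = realloc(ss, (size_t)new_max * sizeof *p);
    if (p == NULL)
        abort();
    ss = p;
    ss_max = new_max;
}

/* One check covering the next n pushes. */
#define SS_RESERVE(n) \
    do { if (ss_ix + (n) > ss_max) ss_grow(n); } while (0)

/* Unchecked push: only valid after a covering SS_RESERVE. */
#define SS_PUSH(v) (ss[ss_ix++] = (v))

/* Save three values with a single check instead of three. */
void save_three(long a, long b, long c)
{
    SS_RESERVE(3);
    SS_PUSH(a);
    SS_PUSH(b);
    SS_PUSH(c);
}

int  ss_depth(void) { return ss_ix; }
long ss_pop(void)   { return ss[--ss_ix]; }
```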

: (6) Optimisation of other ops in pp_hot.c. It may even be worth
: either hand-coding some in assembler or else tweaking the C
: enough to confuse the compiler into generating better code.
: pp_hot.c isn't called pp_hot for nothing.

I put them all together in hopes of staying in a cache, but for
specific architectures there might be better arrangements, especially
if we can avoid register window dumps.

: Most of the above is going to cause more portability problems.
: That's not to say that perl is any *less* portable in itself,
: but it means lots of #ifdefs or Configure support for the
: platform/architecture/compiler-dependent features.

To the extent that we can isolate the #ifdefs to header files, they're
relatively painless. That's why I've got all the funky macros in
there. They're there to be redefined someday.

Larry