Mailing List Archive

Where the speed is lost! (was: 1.6 speed)
> Christian Tismer wrote:
> >
> > "A.M. Kuchling" wrote:
> > >
> > > Python 1.6a2 is around 10% slower than 1.5 on pystone.
> > > Any idea why?
...
> > Stackless 1.5.2+ is 10 percent faster than Stackless 1.6a2.
> >
> > Claim:
> > This is not related to ceval.c .
> > Something else must have introduced a significant speed loss.

I think I can now explain what's happening, at least
on the Windows platform.
Python 1.5.2's .dll was about 512K, perhaps a little more.
I seem to remember that 512K is a common size for the
secondary cache.
Now, linking with the MS linker does not give you any
particularly useful ordering of modules. When I look into
the map file, the modules appear sorted by name.
This is certainly not optimal for performance.
As I read the docs, explicit ordering of the linkage
would only make sense for C++ and wouldn't work out
for C, since we could order the exported functions, but
not the private ones, putting even more distance between
related code.

To test whether I might be right, I did this:
I ripped out almost all builtin extension modules
and compiled/linked without them. This shrank
the dll from 647K down to 557K, very close
to the 1.5.2 size.
Now I get the following figures:

Python 1.6, with stackless patches:

D:\python\spc\Python-slp\PCbuild>python /python/lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 1.95468
This machine benchmarks at 5115.92 pystones/second

Python 1.6, from the dist:

D:\Python16>python /python/lib/test/pystone.py
Pystone(1.1) time for 10000 passes = 2.09214
This machine benchmarks at 4779.8 pystones/second

That means my optimizations take effect again
once the overall code size drops below about 512K.

I think these 10 percent are quite valuable.
These options come to my mind:

a) Try to achieve an optimum code ordering within the
too-large .dll. This seems hard to do.
b) Split the dll into two dlls in such a way that all the
performance-critical internal stuff sits close together in one of them.
c) Try to split the library as above, but build one part as
a static library and link that static library into the
final dll. This would hopefully keep related things together.

I don't know if c) is possible, but it might be tried.

Any thoughts?

ciao - chris

--
Christian Tismer :^) <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH : Have a break! Take a ride on Python's
Kaunstr. 26 : *Starship* http://starship.python.net
14163 Berlin : PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint E182 71C7 1A9D 66E9 9D15 D3CC D4D7 93E2 1FAE F6DF
where do you want to jump today? http://www.stackless.com
Re: Where the speed is lost! (was: 1.6 speed)
Sorry, it was not really found...

Christian Tismer wrote:
[thought he had found the speed leak]

After re-inserting all the builtin modules,
I got nearly the same result after a complete
re-build, just marginally slower.

Something else must be happening that I don't
understand. Stackless Python based on 1.5.2+ is still
nearly 10 percent faster, regardless of what I do to
Python 1.6.

Could Unicode have some effect?
I changed PyUnicode_Check to always return 0,
which should optimize most of the related code away.
Result: no change at all!

Which changes made after the pre-unicode tag
might really matter for performance?

I'm quite desperate, any ideas?

ciao - chris

Re: Where the speed is lost! (was: 1.6 speed)
The performance difference I see on my Sparc is smaller. The machine
is a 200MHz Ultra Sparc 2 with 256MB of RAM; I built both versions with
GCC 2.8.1. It appears that 1.6a2 is about 3.3% slower.

The median pystone times, taken from 10 measurements, are:
1.5.2 4.87
1.6a2 5.035

For comparison, the numbers I see on my Linux box (dual PII 266) are:

1.5.2 3.18
1.6a2 3.53

That's about 10% faster under 1.5.2.
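Jeremy's median-of-10 methodology can be sketched in modern Python. This is only an illustration (time.perf_counter and the statistics module did not exist in 1.6, and median_time and the workload are made up here), but it shows why the median is the right summary statistic for repeated timing runs:

```python
import statistics
import time

def median_time(func, repeats=10):
    """Time func over several runs and return the median, which is
    less sensitive to one-off outliers (GC pauses, cache warm-up,
    other processes) than the mean."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        times.append(time.perf_counter() - start)
    return statistics.median(times)

# Hypothetical workload standing in for pystone.
t = median_time(lambda: sum(range(100_000)))
```

The first run or two typically pays cache and allocator warm-up costs, so the median of ten runs gives a much more stable figure than a single measurement.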

I'm not sure how important this change is. Three percent isn't enough
for me to worry about, but the Sparc is a minority platform. I suppose 10
percent is right on the cusp. If the performance difference is the
cost of the many improvements in 1.6, I think it's worth the price.

Jeremy
Re: Where the speed is lost! (was: 1.6 speed)
Jeremy Hylton wrote:
>
> The performance difference I see on my Sparc is smaller. The machine
> is a 200MHz Ultra Sparc 2 with 256MB of RAM, built both versions with
> GCC 2.8.1. It appears that 1.6a2 is about 3.3% slower.
>
> The median pystone times, taken from 10 measurements, are:
> 1.5.2 4.87
> 1.6a2 5.035
>
> For comparison, the numbers I see on my Linux box (dual PII 266) are:
>
> 1.5.2 3.18
> 1.6a2 3.53
>
> That's about 10% faster under 1.5.2.

Which GCC was it on the Linux box,
and how much RAM does it have?

> I'm not sure how important this change is. Three percent isn't enough
> for me to worry about, but the Sparc is a minority platform. I suppose 10
> percent is right on the cusp. If the performance difference is the
> cost of the many improvements in 1.6, I think it's worth the price.

Yes, and I'm happy to pay the price if I can see where I pay.
That's the problem: the changes between the pre-unicode tag
and the current CVS don't seem enough to justify that speed loss.
There must be something substantial.
I also don't understand why my optimizations are so much more
effective on 1.5.2+ than on 1.6.

Mark Hammond pointed me to the int/long unification.
Was this done *after* the unicode patches?

ciao - chris

Re: Where the speed is lost! (was: 1.6 speed)
Christian Tismer writes:
>Mark Hammond pointed me to the int/long unification.
>Was this done *after* the unicode patches?

Before. It seems unlikely they're the cause; they just add an
if (PyLong_Check(key)) branch to the slicing functions in abstract.c.
OTOH, if pystone really exercises sequence multiplication, maybe
they're related (but 10% worth?).

--
A.M. Kuchling http://starship.python.net/crew/amk/
I know flattery when I hear it; but I do not often hear it.
-- Robertson Davies, _Fifth Business_
Re: Where the speed is lost! (was: 1.6 speed)
>>>>> "CT" == Christian Tismer <tismer@tismer.com> writes:

CT> Summary: We had two effects here. Effect 1: Wasting time with
CT> extra errors in instance creation. Effect 2: Loss of locality
CT> due to code size increase.

CT> Solution to 1 is Jeremy's patch. Solution to 2 could be a
CT> little renaming of the one or the other module, in order to get
CT> the default link order to support locality better.

CT> Now everything is clear to me. My first attempts with reordering
CT> could not reveal the loss with the instance stuff.

CT> All together, Python 1.6 is a bit faster than 1.5.2 if we try to
CT> get related code ordered better.

I reach a different conclusion. The performance difference between
1.5.2 and 1.6, measured with pystone and pybench, is so small that
effects like the order in which the linker lays out the code make a
difference. I don't think we should make any non-trivial effort to
improve performance based on this kind of voodoo.

I also question the claim that the two effects here explain the
performance difference between 1.5.2 and 1.6. Rather, they explain
the performance difference of pystone and pybench running on different
versions of the interpreter. Saying that pystone is the same speed is
a far cry from saying that Python is the same speed! Remember that
performance on a benchmark is just that. (It's like the old joke
about a person's IQ: It is a very good indicator of how well they did
on the IQ test.)

I think we could use better benchmarks of two sorts. The pybench
microbenchmarks are quite helpful individually, though the overall
number isn't particularly meaningful. However, these benchmarks are
sometimes a little too big to be useful. For example, the instance
creation effect was tracked down by running this code:

class Foo:
    pass

for i in range(big_num):
    Foo()

The pybench test "CreateInstance" does all sorts of other stuff. It
tests creation with and without an __init__ method. It tests instance
deallocation (because all the created objects need to be deallocated,
too). It also tests attribute assignment, since many of the __init__
methods make assignments.

What would be better (and I'm not sure what priority should be placed
on doing it) is a set of nano-benchmarks that try to limit themselves
to a single feature or small set of features. Guido suggested having
a hierarchy so that there are multiple nano-benchmarks for instance
creation, each identifying a particular effect, and a micro-benchmark
that is the aggregate of all these nano-benchmarks.
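A nano-benchmark of that shape could look like the following modern-Python sketch. It is only illustrative (timeit postdates this thread, and the class names and repeat counts are made up), but it shows how to isolate bare instance creation from the related effects pybench mixes in:

```python
import timeit

# Nano-benchmark 1: bare instance creation, nothing else.
create = timeit.timeit(
    "Foo()",
    setup="class Foo:\n    pass",
    number=100_000,
)

# Nano-benchmark 2: instance creation plus an __init__ call
# that makes one attribute assignment.
create_with_init = timeit.timeit(
    "Bar()",
    setup=(
        "class Bar:\n"
        "    def __init__(self):\n"
        "        self.x = 1\n"
    ),
    number=100_000,
)
```

Comparing the two numbers attributes the difference to the __init__ call and a single attribute assignment alone, rather than to an aggregate of several effects.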

We could also use some better large benchmarks. Using pystone is
pretty crude, because it doesn't necessarily measure the performance
of things we care about. It would be better to have a collection of
5-10 apps that each do something we care about -- munging text files
or XML data, creating lots of objects, etc.

For example, I used the compiler package (in nondist/src/Compiler) to
compile itself. Based on that benchmark, an interpreter built from
the current CVS tree is still 9-11% slower than 1.5.

Jeremy
Re: Where the speed is lost! (was: 1.6 speed)
Jeremy Hylton wrote:
>
> >>>>> "CT" == Christian Tismer <tismer@tismer.com> writes:
>
> CT> Summary: We had two effects here. Effect 1: Wasting time with
> CT> extra errors in instance creation. Effect 2: Loss of locality
> CT> due to code size increase.
>
> CT> Solution to 1 is Jeremy's patch. Solution to 2 could be a
> CT> little renaming of the one or the other module, in order to get
> CT> the default link order to support locality better.
>
> CT> Now everything is clear to me. My first attempts with reordering
> CT> could not reveal the loss with the instance stuff.

from here...
> CT> All together, Python 1.6 is a bit faster than 1.5.2 if we try to
> CT> get related code ordered better.
...to here

I was not clear. The rest of it is at least 100% correct.

> I reach a different conclusion. The performance difference 1.5.2 and
> 1.6, measured with pystone and pybench, is so small that effects like
> the order in which the compiler assembles the code make a difference.

Sorry, it is 10 percent. Please do not shift the topic.
I agree that better measurements are needed to support my
thoughtless claim ...from here to here..., but the question
was raised in the py-dev thread "Python 1.6 speed" by Andrew,
who was asking exactly why pystone gets 10 percent slower.
I have been hunting that for a week now, and with your
help it is solved.

> I don't think we should make any non-trivial effort to improve
> performance based on this kind of voodoo.

Thanks. I've already built it in - it was trivial,
but I'll keep it for my version.

> I also question the claim that the two effects here explain the
> performance difference between 1.5.2 and 1.6. Rather, they explain
> the performance difference of pystone and pybench running on different
> versions of the interpreter.

Exactly. I didn't want to claim anything else; it was all
in the context of the initial thread.

ciao - chris

Oops, P.S.: interesting:
...
> For example, I used the compiler package (in nondist/src/Compiler) to
> compile itself. Based on that benchmark, an interpreter built from
> the current CVS tree is still 9-11% slower than 1.5.

Did you adjust the string methods? I don't believe these are still fast.
