Mailing List Archive

Fwd: Python 3.11 performance with frame pointers
Hi,

As part of the proposal to enable frame pointers by default in Fedora
(https://fedoraproject.org/wiki/Changes/fno-omit-frame-pointer), we
did some benchmarking to figure out the expected performance impact.
The performance impact was generally minimal, except for the
pyperformance benchmark suite where we noticed a more substantial
difference between a system built with frame pointers and a system
built without frame pointers. The results can be found here:
https://github.com/DaanDeMeyer/fpbench (look at the mean difference
column for the pyperformance results where the percentage is the
slowdown compared to a system built without frame pointers). One of
the biggest slowdowns was on the scimark_sparse_mat_mult benchmark
which slowed down 9.5% when the system (including python) was built
with frame pointers. Note that these benchmarks were run against
Python 3.11 on a Fedora 37 x86_64 system (one built with frame
pointers, another built without frame pointers). The system used to
run the benchmarks was an Amazon EC2 machine.

We did look a bit into the reasons behind this slowdown. I'll quote
the investigation by Andrii on the FESCo issue thread here
(https://pagure.io/fesco/issue/2817):

> So I did look a bit at Python with and without frame pointers trying to
> understand pyperformance regressions.

> First, perf data suggests that big chunk of CPU is spent in _PyEval_EvalFrameDefault,
> so I looked specifically into it (also we had to use DWARF mode for perf for apples-to-apples
> comparison, and a bunch of stack traces weren't symbolized properly, which just again
> reminds why having frame pointers is important).

> perf annotation of _PyEval_EvalFrameDefault didn't show any obvious hot spots, the work
> seemed to be distributed pretty similarly with or without frame pointers. Also scrolling through
> _PyEval_EvalFrameDefault disassembly also showed that instruction patterns between fp
> and no-fp versions are very similar.

> But just a few interesting observations.

> The size of _PyEval_EvalFrameDefault function specifically (and all the other functions didn't
> change much in that regard) increased very significantly from 46104 to 53592 bytes, which is a
> considerable 16% increase. Looking deeper, I believe it's all due to more stack spills and
> reloads due to one less register available to keep local variables in registers instead of on the stack.

> Looking at _PyEval_EvalFrameDefault C code, it is a humongous one function with gigantic switch
> statement that implements Python instruction handling logic. So the function itself is big and it has
> a lot of local state in different branches, which to me explained why there is so much stack spill/load.

> Grepping for instruction of the form mov -0xf0(%rbp),%rcx or mov 0x50(%rsp),%r10 (and their reverse
> variants), I see that there is a substantial amount of stack spill/load in _PyEval_EvalFrameDefault
> disassembly already in default no frame pointer variant (1870 out of 11181 total instructions in that
> function, 16.7%), and it just increases further in frame pointer version (2341 out of 11733 instructions, 20%).
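As an aside on the count quoted above: the exact grep pattern is not given in the thread, so the following is only a minimal sketch of how such a count could be reproduced. The regular expression, the helper name count_stack_moves, and the SAMPLE listing are all assumptions made for illustration; real objdump or gdb output would be fed in instead of the sample.

```python
import re

# Matches AT&T-syntax mov instructions that read or write a stack slot
# addressed relative to %rsp or %rbp, e.g.
#   mov 0x50(%rsp),%r10      (reload)
#   mov %rcx,-0xf0(%rbp)     (spill)
# The pattern is an assumption; the original grep is not shown in the thread.
STACK_SLOT = r"-?(?:0x[0-9a-f]+)?\(%r[sb]p\)"
REGISTER = r"%[a-z0-9]+"
STACK_MOV = re.compile(
    rf"\bmov[a-z]*\s+(?:{STACK_SLOT},{REGISTER}|{REGISTER},{STACK_SLOT})\s*$"
)


def count_stack_moves(disassembly: str) -> tuple[int, int]:
    """Return (stack_moves, total_instructions) for a disassembly listing."""
    lines = [line for line in disassembly.splitlines() if line.strip()]
    stack_moves = sum(1 for line in lines if STACK_MOV.search(line))
    return stack_moves, len(lines)


# Hypothetical five-instruction listing, one instruction per line.
SAMPLE = """\
mov    0x50(%rsp),%r10
mov    %rcx,-0xf0(%rbp)
add    $0x1,%rax
mov    %rdi,%rsi
mov    (%rsp),%rdx
"""

stack_moves, total = count_stack_moves(SAMPLE)
print(f"{stack_moves} of {total} instructions ({stack_moves / total:.1%})")
```

The sketch treats every non-empty line as one instruction, which holds for the sample but would need adjusting for listings that contain labels or section headers.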

> One more interesting observation. With no frame pointers, GCC generates stack accesses using %rsp
> with small positive offsets, which results in pretty compact binary instruction representation, e.g.:

> 0x00000000001cce40 <+44160>: 4c 8b 54 24 50 mov 0x50(%rsp),%r10

> This uses 5 bytes. But if frame pointers are enabled, GCC switches to using %rbp-relative offsets,
> which are all negative. And that seems to result in much bigger instructions, taking now 7 bytes instead of 5:

> 0x00000000001d3969 <+53065>: 48 8b 8d 10 ff ff ff mov -0xf0(%rbp),%rcx

> I found it pretty interesting. I'd imagine GCC should be capable to keep using %rsp addressing just fine
> regardless of %rbp and save on instruction sizes, but apparently it doesn't. Not sure why. But this instruction
> increase, coupled with increase of number of spills/reloads, actually explains huge increase in byte size of
> _PyEval_EvalFrameDefault: (2341 - 1870) * 7 + 1870 * 2 = 7037 (2 extra bytes for existing 1870 instructions
> that were switched from %rsp+positive offset to %rbp + negative offset, plus 7 bytes for each of new 471 instructions).
> I'm no compiler expert, but it would be nice for someone from GCC community to check this as well (please CC
> relevant folks, if you know them).
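As an aside, the two instruction sizes and the byte estimate quoted above can be checked with a small sketch. It assumes a deliberately simplified x86-64 encoding model that covers only 64-bit loads of the form mov disp(%rsp),reg and mov disp(%rbp),reg; the function name mov_load_length is made up for illustration. Under this model the 7-byte form comes from -0xf0 not fitting into a signed 8-bit displacement, so an %rbp offset between -128 and 127 would encode in 4 bytes.

```python
def mov_load_length(base: str, displacement: int) -> int:
    """Length in bytes of a 64-bit `mov disp(base),reg` on x86-64.

    Simplified model: REX prefix + opcode + ModRM, plus a SIB byte when the
    base is %rsp, plus a displacement of 0, 1 or 4 bytes.
    """
    if base not in ("rsp", "rbp"):
        raise ValueError("this sketch only models %rsp and %rbp bases")
    length = 3  # REX.W prefix, opcode 0x8b, ModRM
    if base == "rsp":
        length += 1  # %rsp as a base register always needs a SIB byte
    if displacement == 0 and base != "rbp":
        return length  # no displacement needed (%rbp always carries one)
    if -128 <= displacement <= 127:
        return length + 1  # 8-bit displacement
    return length + 4  # 32-bit displacement


for base, disp in (("rsp", 0x50), ("rbp", -0xf0), ("rbp", -0x70)):
    print(f"mov {disp:#x}(%{base}),reg -> {mov_load_length(base, disp)} bytes")

# Figures quoted in the message.
spills_no_fp, spills_fp = 1870, 2341
size_no_fp, size_fp = 46104, 53592

new_spills = spills_fp - spills_no_fp                  # 471
estimated_growth = new_spills * 7 + spills_no_fp * 2   # 7037
observed_growth = size_fp - size_no_fp                 # 7488

print(f"estimated growth: {estimated_growth} bytes")
print(f"observed growth:  {observed_growth} bytes "
      f"({observed_growth / size_no_fp:.1%} of the original size)")
```

The estimate treats every spill or reload as 5 bytes without frame pointers and 7 bytes with them. Actual sizes vary with the offset, so it is a rough bound rather than an exact accounting.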

> In summary, to put it bluntly, there is just more work to do for CPU saving/restoring state to/from stack. But I don't
> think _PyEval_EvalFrameDefault example is typical of how application code is written, nor is it, generally speaking,
> a good idea to do so much within single gigantic function. So I believe it's more of an outlier than a typical case.

We have a few questions:
- Is this slowdown when Python is built with frame pointers to be
expected? Has the Python community done any of their own experiments
with building Python with and without frame pointers?
- Is there anything we can do to fix the slowdown when Python is built
with frame pointers?
- Should we expect any change in benchmark results if we benchmark
against Python 3.12? Supposedly there are changes in Python 3.12
related to frame pointers so we're wondering if those changes might
affect these results in any way.
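
For anyone reproducing the comparison, it may help to confirm which frame-pointer flags a given interpreter was built with. The following is a minimal sketch using sysconfig.get_config_var; the helper frame_pointer_setting is hypothetical, and the recorded variables only reflect what was captured at build time (they may be missing on some platforms and do not show per-file overrides).

```python
import sysconfig


def frame_pointer_setting(cflags: str) -> str:
    """Classify compiler flags by the last frame-pointer option they contain."""
    setting = "unspecified (compiler default applies)"
    for flag in cflags.split():
        # With GCC and Clang the last occurrence of an -f option wins.
        if flag == "-fno-omit-frame-pointer":
            setting = "kept"
        elif flag == "-fomit-frame-pointer":
            setting = "omitted"
    return setting


# These variables are recorded on POSIX builds; get_config_var returns None
# when a variable is not available, hence the fallback to an empty string.
for name in ("CFLAGS", "CONFIGURE_CFLAGS", "PY_CFLAGS"):
    value = sysconfig.get_config_var(name) or ""
    print(f"{name}: frame pointers {frame_pointer_setting(value)}")
```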

Cheers,

Daan De Meyer
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/LVRUY7KAJ5I532NHMDWJIS5H4HXSGBWD/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Fwd: Python 3.11 performance with frame pointers
I suggest re-posting this on discuss.python.org, where more of the
active core devs will see it.

On Wed, Jan 4, 2023 at 11:12 AM Daan De Meyer <daan.j.demeyer@gmail.com>
wrote:

> [...]