Mailing List Archive

Performance issue with CPython 3.10 + Cython
Hi all,

I am wrapping an embedded application (, which does not use any dynamic
memory management,) using Cython to call it from CPython. The wrapped
application uses a cyclic executive, i.e. everything is done in the
input-logic-output design, typical for some real-time related domains.
Consequentially, every application cycle executes more or less the very
same code. As I am still in a prototyping stadium, the wrapped process is
completely CPU-bound, i.e. except of some logging output there is no I/O
whatsoever.

During one second of "application time", I am doing exactly 120 calls into
the application through three Cython-wrapped API functions. The
application uses some platform-dependent APIs, which I have also wrapped
with Cython, so that there are numerous callbacks into the Python realm per
call into the application. What I am observing now, is that the performance
per "application second" decreases (remember: executing code that does the
same thing on every cycle) and extending the number of loop iterations does
not seem to cause any bound to this decrease. In the log ouput below, you
can see the GC counts, which look innocent to me. The "real time" is
measured using "time.time()". The "top" utility does not suggest any memory
leak either. I am developing on WSL2, but I have checked that this
performance effect also happens on physical machines. Right now, I am
staring at "kcachegrind", but I have no idea, how to determine time series
for the performance of functions (I am not looking for those functions,
which need the most time, but for those, which consume more and more
execution time).

One more thing to look for could be memory fragmentation, but before that I
would like to ask the experts here for their ideas and experiences and/or
for tools, which could help to find the culprit.

2022-10-04 10:40:50|INFO |__main__ |Execution loop 0 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.06862711906433105 seconds real time.
> 2022-10-04 10:40:51|INFO |__main__ |Execution loop 100 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.08224177360534668 seconds real time.
> 2022-10-04 10:40:52|INFO |__main__ |Execution loop 200 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.08225250244140625 seconds real time.
> 2022-10-04 10:40:53|INFO |__main__ |Execution loop 300 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.10176873207092285 seconds real time.
> 2022-10-04 10:40:54|INFO |__main__ |Execution loop 400 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.10900592803955078 seconds real time.
> 2022-10-04 10:40:55|INFO |__main__ |Execution loop 500 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.12233948707580566 seconds real time.
> 2022-10-04 10:40:56|INFO |__main__ |Execution loop 600 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.14058256149291992 seconds real time.
> 2022-10-04 10:40:58|INFO |__main__ |Execution loop 700 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.14777183532714844 seconds real time.
> 2022-10-04 10:40:59|INFO |__main__ |Execution loop 800 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.15729451179504395 seconds real time.
> 2022-10-04 10:41:01|INFO |__main__ |Execution loop 900 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.17365813255310059 seconds real time.
> 2022-10-04 10:41:03|INFO |__main__ |Execution loop 1000 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.17772984504699707 seconds real time.
> 2022-10-04 10:41:05|INFO |__main__ |Execution loop 1100 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.1955263614654541 seconds real time.
> 2022-10-04 10:41:07|INFO |__main__ |Execution loop 1200 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.20046710968017578 seconds real time.
> 2022-10-04 10:41:09|INFO |__main__ |Execution loop 1300 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.22513842582702637 seconds real time.
> 2022-10-04 10:41:11|INFO |__main__ |Execution loop 1400 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.23578548431396484 seconds real time.
> 2022-10-04 10:41:13|INFO |__main__ |Execution loop 1500 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.24581527709960938 seconds real time.
> 2022-10-04 10:41:16|INFO |__main__ |Execution loop 1600 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.2541334629058838 seconds real time.
> 2022-10-04 10:41:19|INFO |__main__ |Execution loop 1700 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.26812195777893066 seconds real time.
> 2022-10-04 10:41:21|INFO |__main__ |Execution loop 1800 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.28777456283569336 seconds real time.
> 2022-10-04 10:41:24|INFO |__main__ |Execution loop 1900 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.3005337715148926 seconds real time.
> 2022-10-04 10:41:28|INFO |__main__ |Execution loop 2000 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.30992960929870605 seconds real time.
> 2022-10-04 10:41:31|INFO |__main__ |Execution loop 2100 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.32058119773864746 seconds real time.
> 2022-10-04 10:41:34|INFO |__main__ |Execution loop 2200 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.33020949363708496 seconds real time.
> 2022-10-04 10:41:37|INFO |__main__ |Execution loop 2300 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.34426307678222656 seconds real time.
> 2022-10-04 10:41:41|INFO |__main__ |Execution loop 2400 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.35851263999938965 seconds real time.
> 2022-10-04 10:41:45|INFO |__main__ |Execution loop 2500 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.3664553165435791 seconds real time.
> 2022-10-04 10:41:48|INFO |__main__ |Execution loop 2600 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.3767662048339844 seconds real time.
> 2022-10-04 10:41:52|INFO |__main__ |Execution loop 2700 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.39760732650756836 seconds real time.
> 2022-10-04 10:41:56|INFO |__main__ |Execution loop 2800 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.42221736907958984 seconds real time.
> 2022-10-04 10:42:01|INFO |__main__ |Execution loop 2900 done. GC
> counts = (381, 9, 3); 1 second of application time corresponds to
> 0.4237234592437744 seconds real time.


Thanks in advance,

Andreas
--
https://mail.python.org/mailman/listinfo/python-list
Re: Performance issue with CPython 3.10 + Cython [ In reply to ]
Answering to myself, just for the records:

1. The culprit was me. As lazy as I am, I have used f-strings all over the
place in calls to `logging.logger.debug()` and friends, evaluating all
arguments regardless of whether the logger was enabled or not. Replacing
these f-strings by regular printf-like format strings solved the issue.
Now the application bowls happily along, consistently below 0.02 seconds
per second application time.
2. Valgrind + callgrind is an awesome toolchain to spot performance issues,
even on VMs.


Am Di., 4. Okt. 2022 um 11:05 Uhr schrieb Andreas Ames <
andreas.0815.qwertz@gmail.com>:

> Hi all,
>
> I am wrapping an embedded application (, which does not use any dynamic
> memory management,) using Cython to call it from CPython. The wrapped
> application uses a cyclic executive, i.e. everything is done in the
> input-logic-output design, typical for some real-time related domains.
> Consequentially, every application cycle executes more or less the very
> same code. As I am still in a prototyping stadium, the wrapped process is
> completely CPU-bound, i.e. except of some logging output there is no I/O
> whatsoever.
>
> During one second of "application time", I am doing exactly 120 calls into
> the application through three Cython-wrapped API functions. The
> application uses some platform-dependent APIs, which I have also wrapped
> with Cython, so that there are numerous callbacks into the Python realm per
> call into the application. What I am observing now, is that the performance
> per "application second" decreases (remember: executing code that does the
> same thing on every cycle) and extending the number of loop iterations does
> not seem to cause any bound to this decrease. In the log ouput below, you
> can see the GC counts, which look innocent to me. The "real time" is
> measured using "time.time()". The "top" utility does not suggest any memory
> leak either. I am developing on WSL2, but I have checked that this
> performance effect also happens on physical machines. Right now, I am
> staring at "kcachegrind", but I have no idea, how to determine time series
> for the performance of functions (I am not looking for those functions,
> which need the most time, but for those, which consume more and more
> execution time).
>
> One more thing to look for could be memory fragmentation, but before that
> I would like to ask the experts here for their ideas and experiences and/or
> for tools, which could help to find the culprit.
>
> 2022-10-04 10:40:50|INFO |__main__ |Execution loop 0 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.06862711906433105 seconds real time.
>> 2022-10-04 10:40:51|INFO |__main__ |Execution loop 100 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.08224177360534668 seconds real time.
>> 2022-10-04 10:40:52|INFO |__main__ |Execution loop 200 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.08225250244140625 seconds real time.
>> 2022-10-04 10:40:53|INFO |__main__ |Execution loop 300 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.10176873207092285 seconds real time.
>> 2022-10-04 10:40:54|INFO |__main__ |Execution loop 400 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.10900592803955078 seconds real time.
>> 2022-10-04 10:40:55|INFO |__main__ |Execution loop 500 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.12233948707580566 seconds real time.
>> 2022-10-04 10:40:56|INFO |__main__ |Execution loop 600 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.14058256149291992 seconds real time.
>> 2022-10-04 10:40:58|INFO |__main__ |Execution loop 700 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.14777183532714844 seconds real time.
>> 2022-10-04 10:40:59|INFO |__main__ |Execution loop 800 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.15729451179504395 seconds real time.
>> 2022-10-04 10:41:01|INFO |__main__ |Execution loop 900 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.17365813255310059 seconds real time.
>> 2022-10-04 10:41:03|INFO |__main__ |Execution loop 1000 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.17772984504699707 seconds real time.
>> 2022-10-04 10:41:05|INFO |__main__ |Execution loop 1100 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.1955263614654541 seconds real time.
>> 2022-10-04 10:41:07|INFO |__main__ |Execution loop 1200 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.20046710968017578 seconds real time.
>> 2022-10-04 10:41:09|INFO |__main__ |Execution loop 1300 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.22513842582702637 seconds real time.
>> 2022-10-04 10:41:11|INFO |__main__ |Execution loop 1400 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.23578548431396484 seconds real time.
>> 2022-10-04 10:41:13|INFO |__main__ |Execution loop 1500 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.24581527709960938 seconds real time.
>> 2022-10-04 10:41:16|INFO |__main__ |Execution loop 1600 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.2541334629058838 seconds real time.
>> 2022-10-04 10:41:19|INFO |__main__ |Execution loop 1700 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.26812195777893066 seconds real time.
>> 2022-10-04 10:41:21|INFO |__main__ |Execution loop 1800 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.28777456283569336 seconds real time.
>> 2022-10-04 10:41:24|INFO |__main__ |Execution loop 1900 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.3005337715148926 seconds real time.
>> 2022-10-04 10:41:28|INFO |__main__ |Execution loop 2000 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.30992960929870605 seconds real time.
>> 2022-10-04 10:41:31|INFO |__main__ |Execution loop 2100 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.32058119773864746 seconds real time.
>> 2022-10-04 10:41:34|INFO |__main__ |Execution loop 2200 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.33020949363708496 seconds real time.
>> 2022-10-04 10:41:37|INFO |__main__ |Execution loop 2300 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.34426307678222656 seconds real time.
>> 2022-10-04 10:41:41|INFO |__main__ |Execution loop 2400 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.35851263999938965 seconds real time.
>> 2022-10-04 10:41:45|INFO |__main__ |Execution loop 2500 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.3664553165435791 seconds real time.
>> 2022-10-04 10:41:48|INFO |__main__ |Execution loop 2600 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.3767662048339844 seconds real time.
>> 2022-10-04 10:41:52|INFO |__main__ |Execution loop 2700 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.39760732650756836 seconds real time.
>> 2022-10-04 10:41:56|INFO |__main__ |Execution loop 2800 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.42221736907958984 seconds real time.
>> 2022-10-04 10:42:01|INFO |__main__ |Execution loop 2900 done. GC
>> counts = (381, 9, 3); 1 second of application time corresponds to
>> 0.4237234592437744 seconds real time.
>
>
> Thanks in advance,
>
> Andreas
>
>
--
https://mail.python.org/mailman/listinfo/python-list