Mailing List Archive

[issue45261] Unreliable (?) results from timeit (cache issue?)
New submission from Tim Holy <tim.holy@gmail.com>:

This is a lightly-edited reposting of https://stackoverflow.com/questions/69164027/unreliable-results-from-timeit-cache-issue

I am interested in timing certain operations in numpy and skimage, but I'm occasionally seeing surprising transitions (not entirely reproducible) in the reported times. Briefly, sometimes timeit returns results that differ by about 5-fold from earlier runs. Here's the setup:

import skimage
import numpy as np
import timeit

nrep = 16

def run_complement(img):
    def inner():
        skimage.util.invert(img)
    return inner

img = np.random.randint(0, 65535, (512, 512, 3), dtype='uint16')

and here's an example session:

In [1]: %run python_timing_bug.py

In [2]: t = timeit.Timer(run_complement(img))

In [3]: t.repeat(nrep, number=1)
Out[3]:
[0.0024439050030196086,
0.0020311699918238446,
0.00033007100864779204,
0.0002889479947043583,
0.0002851780009223148,
0.0002851030003512278,
0.00028487699455581605,
0.00032116699730977416,
0.00030912700458429754,
0.0002877369988709688,
0.0002840430097421631,
0.00028515000303741544,
0.00030791999597568065,
0.00029302599432412535,
0.00030723700183443725,
0.0002916679950430989]

In [4]: t = timeit.Timer(run_complement(img))

In [5]: t.repeat(nrep, number=1)
Out[5]:
[0.0006320849934127182,
0.0004014919977635145,
0.00030359599622897804,
0.00029224599711596966,
0.0002907510061049834,
0.0002920039987657219,
0.0002918920072261244,
0.0003095199936069548,
0.00029789700056426227,
0.0002885590074583888,
0.00040198900387622416,
0.00037131100543774664,
0.00040271600300911814,
0.0003492849937174469,
0.0003378120018169284,
0.00029762100894004107]

In [6]: t = timeit.Timer(run_complement(img))

In [7]: t.repeat(nrep, number=1)
Out[7]:
[0.00026428700948599726,
0.00012682100350502878,
7.380900206044316e-05,
6.346100417431444e-05,
6.29679998382926e-05,
6.278700311668217e-05,
6.320899410638958e-05,
6.25409884378314e-05,
6.262199894990772e-05,
6.247499550227076e-05,
6.293901242315769e-05,
6.259800284169614e-05,
6.285199197009206e-05,
6.293600017670542e-05,
6.309800664894283e-05,
6.248900899663568e-05]

Notice that in the final run the minimum times are on the order of 0.6e-4 s, about 5x smaller than the previous minimum of ~3e-4 s measured in the earlier runs. It's not entirely predictable when this "kicks in."

The faster time corresponds to 0.08 ns/element of the array, which, given that the 2.6 GHz clock on my i7-8850H CPU ticks every ~0.4 ns, seems to be pushing the limits of credibility (though thanks to SIMD on my AVX2 CPU, it cannot be entirely ruled out). My understanding is that this operation is implemented as a subtraction and most likely gets reduced to a bitwise-not by the compiler. So you do indeed expect it to be fast, but it's not certain it should be this fast, and in either case the non-reproducibility is problematic.
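
For reference, here is a rough sketch of the arithmetic behind that 0.08 ns/element figure (the names are mine; `times` stands for the list returned by t.repeat above):

# `times` is assumed to be the output of t.repeat(nrep, number=1),
# and `img` the (512, 512, 3) uint16 array defined above.
per_element_ns = min(times) / img.size * 1e9   # seconds -> nanoseconds per element
clock_ns = 1e9 / 2.6e9                         # one tick of a 2.6 GHz clock
print(f"{per_element_ns:.3f} ns/element vs {clock_ns:.3f} ns per clock tick")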

It may be relevant to note that the total amount of data is

In [15]: img.size * 2
Out[15]: 1572864

and lshw reports that I have 384KiB L1 cache and 1536KiB of L2 cache:

In [16]: 384*1024
Out[16]: 393216

In [17]: 1536*1024
Out[17]: 1572864

So it seems possible that this result is being influenced by just fitting in L2 cache. (Note that even in the "fast block," the first run was not fast.) If I increase the size of the image:

img = np.random.randint(0, 65535, (2048, 2048, 3), dtype='uint16')

then my results seem more reproducible in the sense that I have not yet seen one of these transitions.

----------
components: Library (Lib)
messages: 402411
nosy: timholy
priority: normal
severity: normal
status: open
title: Unreliable (?) results from timeit (cache issue?)
type: behavior
versions: Python 3.8

[issue45261] Unreliable (?) results from timeit (cache issue?) [ In reply to ]
Steven D'Aprano <steve+python@pearwood.info> added the comment:

Running timeit with number=1 for fast-running code is not likely to be reliable. The problem is that modern computers are *incredibly noisy*; there are typically hundreds of processes running on them at any one time.

Trying to get deterministic times from something that runs in 0.3 ms or less is not easy. You can see yourself that the first couple of runs are of the order of 2-10 times higher before settling down. And, as you point out, there is then a further drop by a factor of 5.

I agree that a CPU cache kicking in is a likely explanation.

When I use timeit, I generally try to use a large enough `number` that the time is at least 0.1 s per loop. To be perfectly honest, I don't know whether that is actually helpful or just superstition, but using it as a guide, I've never seen the sort of large drop in timing results that you are getting.
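
As an illustration of that guideline, here is a minimal sketch (mine, not from the original report) that uses Timer.autorange() to pick a loop count automatically and then reports per-loop times; run_complement, img and nrep are assumed to be those from the original post:

import timeit

t = timeit.Timer(run_complement(img))

# autorange() increases the loop count until one run takes at least 0.2 s
number, elapsed = t.autorange()
print(f"autorange chose number={number} ({elapsed:.3f} s total)")

# Divide by `number` so runs with different loop counts stay comparable
per_loop = [total / number for total in t.repeat(repeat=nrep, number=number)]
print(min(per_loop))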

I presume you have read the notes in the doc about the default timer?

"default_timer() measurements can be affected by other programs running on the same machine"

https://docs.python.org/3/library/timeit.html

There are more comments about timing in the source code and the commentary by Tim Peters in the Cookbook.

Another alternative is to try Victor Stinner's pyperf tool. I don't know how it works though.

----------
nosy: +steven.daprano, tim.peters, vstinner

[issue45261] Unreliable (?) results from timeit (cache issue?) [ In reply to ]
STINNER Victor <vstinner@python.org> added the comment:

PyPy emits a warning when the timeit module is used, suggesting to use pyperf.

timeit uses the minimum, whereas pyperf uses the average (arithmetic mean).

timeit uses a single process, pyperf spawns 21 processes: 1 just for the loop calibration, 20 to compute values.

timeit computes 5 values, pyperf computes 60 values.

timeit uses all computed values; pyperf ignores the first value, which is treated as a "warmup value" (the number of warmup values can be configured).

timeit doesn't compute the standard deviation, pyperf does. The standard deviation gives an idea of whether the benchmark is reliable. IMO results without a standard deviation should not be trusted.

pyperf also emits a warning when a benchmark doesn't look reliable, for example if the user ran other workloads while the benchmark was running.

pyperf also supports storing results in a JSON file, which keeps all values as well as metadata.

I cannot force people to stop using timeit. But there are reasons why pyperf is more reliable than timeit.

Benchmarking is hard. See the pyperf documentation, which gives hints on how to get reproducible benchmark results:
https://pyperf.readthedocs.io/en/latest/run_benchmark.html#how-to-get-reproducible-benchmark-results
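
For comparison, here is a minimal sketch (mine, untested, assuming pyperf and scikit-image are installed) of how the same measurement might look with pyperf. pyperf re-runs the script in worker processes, so everything is set up at module level:

import numpy as np
import pyperf
import skimage.util

# A fixed seed may be preferable so every worker process sees the same data.
img = np.random.randint(0, 65535, (512, 512, 3), dtype='uint16')

runner = pyperf.Runner()
runner.bench_func('invert_512x512x3_uint16', skimage.util.invert, img)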

Read also this important article ;-)
"Biased Benchmarks (honesty is hard)"
http://matthewrocklin.com/blog/work/2017/03/09/biased-benchmarks

----------

[issue45261] Unreliable (?) results from timeit (cache issue?) [ In reply to ]
Steven D'Aprano <steve+python@pearwood.info> added the comment:

Thanks Victor for the explanation about pyperf's additional features.
They do sound very useful. Perhaps we should consider adding some of
them to timeit?

However, in my opinion using the average is statistically wrong. Using
the mean is good when errors are two-sided, that is, when your measured
value can be either too low or too high compared to the true value:

measurement = true value ± random error

If the random errors are symmetrically distributed, then taking the
average tends to cancel them out and give you a better estimate of the
true value. Even if the errors aren't symmetrical, the mean will still
be a better estimate of the true value. (Or perhaps a trimmed mean, or
the median, if there are a lot of outliers.)

But timing results are not like that; the measurement errors are
one-sided, not two-sided:

measurement = true value + random error

So by taking the average, all you are doing is averaging the errors, not
cancelling them. The result you get is *worse* as an estimate of the
true value than the minimum.

All those other factors (ignore the warmup, check for a small stdev,
etc) seem good to me. But the minimum, not the mean, is still going to
be closer to the true cost of running the code.
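
A quick simulation (my sketch, with made-up noise parameters)
illustrates the point: with purely additive, one-sided noise, the
minimum recovers the true value much more closely than the mean does.

import random

TRUE_VALUE = 1.0e-4   # hypothetical "ideal" run time, in seconds
# One-sided noise: exponentially distributed delays with mean 2e-5 s.
samples = [TRUE_VALUE + random.expovariate(1 / 2e-5) for _ in range(1000)]

mean = sum(samples) / len(samples)
print(f"true     {TRUE_VALUE:.6e}")
print(f"minimum  {min(samples):.6e}")   # close to the true value
print(f"mean     {mean:.6e}")           # roughly true value + mean noise

This is of course an idealized model of the noise.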

----------

[issue45261] Unreliable (?) results from timeit (cache issue?) [ In reply to ]
Change by Dong-hee Na <donghee.na@python.org>:


----------
nosy: +corona10

[issue45261] Unreliable (?) results from timeit (cache issue?) [ In reply to ]
STINNER Victor <vstinner@python.org> added the comment:

> But timing results are not like that, the measurement errors are
one-sided, not two: (..)

I suggest you run tons of benchmarks and look at the distribution. The reality is more complex than you may think.


> measurement = true value + random error

In my experience, there is no single "true value"; there are multiple values. A concrete example is the Python randomized hash function, which gives a different value each time you spawn a new Python process:
https://vstinner.github.io/journey-to-stable-benchmark-average.html

Each process has its own "true value", but pyperf spawns 20 Python processes :-)
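
For example (my sketch, not from the article), hash randomization can be seen directly by spawning fresh interpreters:

import subprocess
import sys

# Each new interpreter uses a different hash seed (unless PYTHONHASHSEED
# is fixed), so hash('abc') changes from process to process.
for _ in range(3):
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('abc'))"],
        capture_output=True, text=True,
    )
    print(result.stdout.strip())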

There are multiple sources of randomness, not only the Python randomized hash function. On Linux, the process address space is randomized by ASLR. It may give different timings at each run.

Code placement, exact memory addresses, etc. Many things come into play when you look at functions which take less than 100 ns.

The report here is about a value lower than a single nanosecond: "0.08 ns/element".

--

I wrote articles about benchmarking:
https://vstinner.readthedocs.io/benchmark.html#my-articles

I gave a talk about it:

* https://raw.githubusercontent.com/vstinner/talks/main/2017-FOSDEM-Brussels/howto_run_stable_benchmarks.pdf
* https://archive.fosdem.org/2017/schedule/event/python_stable_benchmark/

Again, good luck with benchmarking, it's a hard problem ;-)

--

Once you think you know everything about benchmarking, you should read the following paper and cry:
https://arxiv.org/abs/1602.00602

See also my analysis of PyPy performance:
https://vstinner.readthedocs.io/pypy_warmups.html

----------

[issue45261] Unreliable (?) results from timeit (cache issue?) [ In reply to ]
Steven D'Aprano <steve+python@pearwood.info> added the comment:

Thank you Victor, I don't know whether I can convince you or you will
convince me, but this is very interesting one way or the other. And I
hope that Tim Holy is interested too :-)

On Wed, Sep 22, 2021 at 01:51:35PM +0000, STINNER Victor wrote:

> > But timing results are not like that, the measurement errors are
> > one-sided, not two: (..)
>
> I suggest you to run tons of benchmarks and look at the distribution.
> The reality is more complex than what you may think.

We are trying to find the ideal time that the code would take with no
external factors slowing it. No noise, no other processes running, no
operating system overhead. The baseline time that it would take to run
the code if the CPU could dedicate every cycle to your code and nothing
else.

(Obviously we cannot measure that time on a real computer where there is
always other processes running, but we want the best estimate of that
time we can reach.)

There may still be conditions that can shift the baseline, e.g. the data
fits in the CPU cache or it doesn't, but that's not *environmental*
noise except insofar as other processes might push your data out of the
cache.

So in a noisy system (a real computer) there is some ideal time that the
code would take to run if the CPU was dedicated to running your code
only, and some actual time (ideal time + environmental noise) that we
can measure.

The environmental noise is never negative. That would mean that running
more processes on the machine could make it faster. That cannot be.

> > measurement = true value + random error
>
> In my experience, there is no single "true value", they are multiple
> values. Concrete example where the Python randomized hash function
> gives different value each time you spawn a new Python process:
>
> https://vstinner.github.io/journey-to-stable-benchmark-average.html
>
> Each process has its own "true value", but pyperf spawns 20 Python
> processes :-)

Thank you for the link, I have read it.

Of course the more processes you spawn, the more environmental noise
there is, which means more noise in your measurements.

If you run with 10 processes, or 100 processes, you probably won't get
the same results for the two runs. (Sorry, I have not tried this myself.
Consider this a prediction.)

You say:

"But how can we compare performances if results are random? Take the
minimum? No! You must never (ever again) use the minimum for
benchmarking! Compute the average and some statistics like the standard
deviation"

but you give no reason for why you should not use the minimum. Without a
reason why the minimum is *wrong*, I cannot accept that prohibition.

> Here is report is about a value lower than a single nanosecond:
> "0.08ns/element".

An Intel i7 CPU performs at approximately 220000 MIPS, so you can
execute about 17 or 18 CPU instructions in 0.08 ns. (Depends on the
instruction, of course.) AMD Ryzen Threadripper may be ten times as
fast. Without knowing the CPU and code I cannot tell whether to be
impressed or shocked at the time.
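
The arithmetic behind that estimate, as a rough check using the figures
quoted above (a sketch of mine, not a precise model of the CPU):

mips = 220_000                     # quoted rate, millions of instructions/s
instructions_per_second = mips * 1e6
window = 0.08e-9                   # 0.08 ns, in seconds
print(instructions_per_second * window)   # roughly 17-18 instructions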

> Again, good luck with benchmarking, it's a hard problem ;-)

Thank you, and I do appreciate all the work you have put into it, even
if you have not convinced me to stop using the minimum.

----------

[issue45261] Unreliable (?) results from timeit (cache issue?) [ In reply to ]
STINNER Victor <vstinner@python.org> added the comment:

> but you give no reason for why you should not use the minimum.

See https://pyperf.readthedocs.io/en/latest/analyze.html#minimum-vs-average

I'm not really interested in convincing you. Use the minimum if you believe that it better fits your needs.

But pyperf will stick to the mean to get more reproducible benchmark results ;-)

There are many beliefs and assumptions made by people running benchmarks. As I wrote, do your own experiments ;-) Enjoy multimodal distributions ;-)

----------

[issue45261] Unreliable (?) results from timeit (cache issue?) [ In reply to ]
STINNER Victor <vstinner@python.org> added the comment:

I suggest closing this issue. timeit works as expected. It has limitations, but that's OK. Previous attempts to enhance timeit were rejected.

----------

[issue45261] Unreliable (?) results from timeit (cache issue?) [ In reply to ]
Tim Holy <tim.holy@gmail.com> added the comment:

To make sure it's clear, it's not 0.08 ns per function call; it's a loop, and it comes out to 0.08 ns/element. The purpose of quoting that number was to compare it to the CPU clock interval, which on my machine is ~0.4 ns. Any operation that takes less than one clock cycle is suspicious, but not automatically wrong, because of SIMD (if the compiler generates such instructions for this operation, though I'm not sure how one checks that from Python). On my AVX2 processor, as many as 16 `uint16` values could fit in a register simultaneously, so you can't entirely rule out times well below one clock cycle (although the need for load, manipulation, store, and increment means that it's not plausible to be 1/16th of a clock cycle).
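
The lane-count arithmetic, for reference (a rough check of mine; AVX2 vector registers are 256 bits wide):

register_bits = 256
element_bits = 16          # one uint16 element
print(register_bits // element_bits)   # 16 elements per SIMD register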

Interestingly, increasing `number` does seem to make it consistent, without obvious transitions. I'm curious why the reported times are not "per number"; I find myself making comparisons using

list(map(lambda tm: tm / 1000, t.repeat(repeat=nrep, number=1000)))

Should the documentation mention that the timing of the core operation should be divided by `number`?

However, in the bigger picture of things I suspect this should be closed. I'll let others chime in first, in case they think documentation or other things need to be changed.

----------

[issue45261] Unreliable (?) results from timeit (cache issue?) [ In reply to ]
Marc-Andre Lemburg <mal@egenix.com> added the comment:

On the topic of average vs. minimum, it's interesting to see this pop up
every now and then. When I originally wrote pybench late in 1997, I used
the average, since it gave good results on my PC at the time.

Later on, before pybench was added to Tools/ in Python 2.5 in 2006, people
started complaining about sometimes getting weird results (e.g. negative
times due to the calibration runs not being stable enough). A lot of noise
was made by the minimum fans, so I switched to the minimum, which then made
the results more stable, but I left in the average figures as well.

Then some years later, people complained about pybench not being
good enough for comparing to e.g. PyPy, since those other implementations
optimized away some of the micro-benchmarks used in pybench.
It was eventually removed, so as not to confuse people unwilling
to try to understand what the benchmark suite was meant for, or the
issues around running such benchmarks on more modern CPUs.

CPUs have advanced a lot since the days pybench was written and so
reliable timings are not easy to get unless you invest in dedicated
hardware, custom OS and CPU settings and lots of time to calibrate
everything. See Victor's research for more details.

What we have here is essentially the same issue. timeit() is mostly
being used for micro-benchmarks, but those need to be run in dedicated
environments. timeit() is good for quick checks, but not really up
to the task of providing reliable timing results.

One of these days, we should ask the PSF or one of its sponsors to
provide funding and dev time to set up such a reliable testing
environment: one which runs not only high-end machines, but also
average and lower-end machines, on different OSes,
so that we can detect performance regressions early and easily
on different platforms.

----------
nosy: +lemburg

[issue45261] Unreliable (?) results from timeit (cache issue?) [ In reply to ]
Tim Holy <tim.holy@gmail.com> added the comment:

> And I hope that Tim Holy is interested too :-)

Sure, I'll bite :-). On the topic of which statistic to show, I am a real fan of the histogram. As has been pointed out, timing in the real world can be pretty complicated, and I don't think it does anyone good to hide that complexity. Even in cases where machines aren't busy doing other things, you can get weird multimodal distributions. A great example (maybe not relevant to a lot of Python benchmarks...) is in multithreaded algorithms where the main thread is both responsible for scheduling other threads and for a portion of the work. Even in an almost-idle machine, you can get little race conditions where the main thread decides to pick up some work instants before another thread starts looking for more work. That can generate peaks in the histogram that are separated by the time for one "unit" of work.

But if you have to quote one and only one number, I'm a fan of the minimum (as long as you can trust it---which relies on assuming that you've accurately calibrated away all the overhead of your timing apparatus, and nothing more).

----------

[issue45261] Unreliable (?) results from timeit (cache issue?) [ In reply to ]
Raymond Hettinger <raymond.hettinger@gmail.com> added the comment:

Given that there isn't an actual bug here, please move this discussion to python-ideas.

----------
nosy: +rhettinger
resolution: -> not a bug
stage: -> resolved
status: open -> closed

[issue45261] Unreliable (?) results from timeit (cache issue?) [ In reply to ]
Change by Tim Holy <tim.holy@gmail.com>:


----------
resolution: not a bug ->

[issue45261] Unreliable (?) results from timeit (cache issue?) [ In reply to ]
Change by STINNER Victor <vstinner@python.org>:


----------
resolution: -> not a bug
