Mailing List Archive

Re: Python multithreading without the GIL [ In reply to ]
I have a PR to remove this FAQ entry: https://github.com/python/cpython/pull/28886
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/YP3QZ7ZLMMQUAWVQRGAGNNETA6IDXP4P/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Python multithreading without the GIL [ In reply to ]
On Mon, Oct 11, 2021 at 7:04 AM Antoine Pitrou <antoine@python.org> wrote:

> It's crude, but you can take a look at `ccbench` in the Tools directory.
>

Thanks, I wasn't familiar with this. The ccbench results look pretty good:
about 18.1x speed-up on "pi calculation" and 19.8x speed-up on "regular
expression" with 20 threads (turbo off). The latency and throughput results
look good too. With the GIL enabled (3.11), the compute intensive
background task increases latency and dramatically decreases throughput.
With the GIL disabled, latency remains low and throughput high.
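For readers who haven't used ccbench: its compute-bound tests boil down to running N copies of a CPU-bound task in threads and comparing against the single-threaded time. A rough sketch of that idea (not ccbench itself; names are illustrative):

```python
import threading
import time

def pi_partial(n):
    # CPU-bound work: partial sum of the Leibniz series for pi
    return 4.0 * sum((-1.0) ** k / (2 * k + 1) for k in range(n))

def run_threads(nthreads, n=100_000):
    threads = [threading.Thread(target=pi_partial, args=(n,))
               for _ in range(nthreads)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0

# Speedup = (nthreads * single-thread time) / multi-thread time.
# With the GIL this stays near 1x for compute-bound work; without it,
# it can approach the core count (e.g. the ~18x on 20 threads above).
speedup = 4 * run_threads(1) / run_threads(4)
```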

Here are the full results for 20 threads without the GIL:
https://gist.github.com/colesbury/8479ee0246558fa1ab0f49e4c01caeed (nogil,
20 threads)

Here are the results for 4 threads (the default) for comparison with
upstream:
https://gist.github.com/colesbury/8479ee0246558fa1ab0f49e4c01caeed (nogil,
4 threads)
https://gist.github.com/colesbury/c0b89f82e51779670265fb7c7cd37114
(3.11/b108db63e0, 4 threads)
Re: Python multithreading without the GIL [ In reply to ]
I've updated the linked gists with the results from interpreters compiled
with PGO, so the numbers have slightly changed.
Re: Python multithreading without the GIL [ In reply to ]
Thank you Sam, this additional detail really helps me understand your proposal.

-Barry

> On Oct 11, 2021, at 12:06, Sam Gross <colesbury@gmail.com> wrote:
>
> I’m unclear on what is actually retried. You use this note throughout the document, so I think it would help to clarify exactly what is retried and why that solves the particular problem. I’m confused: is it the refcount increment that’s retried, or the entire sequence of steps (i.e., do you go back and reload the address of the item)? Is there some kind of waiting period before the retry? I would infer that if you’re retrying the refcount incrementing, it’s because you expect subsequent retries to transition from zero to non-zero, but is that guaranteed? Are there possibilities of deadlocks or race conditions?
>
> The entire operation is retried (not just the refcount). For "dict", this means going back to step 1 and reloading the version tag and PyDictKeysObject. The operation can fail (and need to be retried) only when some other thread is concurrently modifying the dict. The reader needs to perform the checks (and retry) to avoid returning inconsistent data, such as an object that was never in the dict. With the checks and retry, returning inconsistent or garbage data is not possible.
>
> The retry is performed after locking the dict, so the operation is retried at most once -- the read operation can't fail when it holds the dict's lock because the lock prevents concurrent modifications. It would have also been possible to retry the operation in a loop without locking the dict, but I was concerned about reader starvation. (In the doc I wrote "livelock", but "reader starvation" is more accurate.) In particular, I was concerned that a thread repeatedly modifying a dict might prevent other threads reading the dict from making progress. I hadn't seen this in practice, but I'm aware that reader starvation can be an issue for similar designs like Linux's seqlock. Acquiring the dict's lock when retrying avoids the reader starvation issue.
>
> Deadlock isn't possible because the code does not acquire any other locks while holding the dict's lock. For example, the code releases the dict's lock before calling Py_DECREF or PyObject_RichCompareBool.
>
> The race condition question is a bit harder to answer precisely. Concurrent reads and modifications of a dict won't cause the program to segfault, return garbage data, or return items that were never in the dict.
>
> Regards,
> Sam
>
>
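The scheme Sam describes above (an optimistic lock-free read validated by a version tag, falling back to the lock for the single retry) can be sketched roughly like this. The class and names are purely illustrative, not the actual CPython internals:

```python
import threading

class OptimisticDict:
    """Sketch of a nogil-style dict read: try a lock-free optimistic
    read once, validated by a version tag; on failure, retry under the
    lock, where concurrent modification is impossible."""

    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0          # bumped on every mutation
        self._data = {}

    def set(self, key, value):
        with self._lock:
            self._version += 1     # writers invalidate in-flight readers
            self._data[key] = value

    def get(self, key, default=None):
        # Step 1: load the version tag, then read without the lock.
        v = self._version
        value = self._data.get(key, default)
        # Step 2: if no writer intervened, the read is consistent.
        if v == self._version:
            return value
        # Step 3: otherwise retry at most once, under the lock; taking
        # the lock (rather than looping) avoids reader starvation.
        with self._lock:
            return self._data.get(key, default)
```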
Re: Python multithreading without the GIL [ In reply to ]
(off-list)


On 10/11/21 2:09 PM, Sam Gross wrote:
> The ccbench results look pretty good: about 18.1x speed-up on "pi
> calculation" and 19.8x speed-up on "regular expression" with 20
> threads (turbo off). The latency and throughput results look good too.


JESUS CHRIST



//arry/
Re: Python multithreading without the GIL [ In reply to ]
Oops! Sorry everybody, I meant that to be off-list.

Still, I hope you at least enjoyed my enthusiasm!


/arry

Re: Python multithreading without the GIL [ In reply to ]
> Still, I hope you at least enjoyed my enthusiasm!

I did!
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/BJFDVRCZMEDOHEMCCIJJP6NTX6HOGC5L/
Re: Python multithreading without the GIL [ In reply to ]
I love everything about this - but I expect some hesitancy due to this line: "Multithreaded programs are prone to concurrency bugs."

If there is significant pushback, I have one suggestion:

Would it be helpful to think of the python concurrency mode as a property of interpreters?
`interp = interpreters.create(concurrency_mode=interpreters.GIL)`
or
`interp = interpreters.create(concurrency_mode=interpreters.NOGIL)`

and subsequently python _environments_ can make different choices about what to use for the 0th interpreter, via some kind of configuration.
Python modules can declare which concurrency modes they support. Future concurrency modes that address specific use cases could be added.

This would allow Python environments that would rather not audit their code for concurrency issues to opt out, and would allow incremental adoption. I can't intuit whether this indirection would cause a performance problem in the C implementation, or whether there is some clever way to build different variants of the relevant objects at compile time and switch between them based on the interpreter's concurrency mode.
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ZUEWHEOW34MNHKOY2TLTFI4LHYJX4YDW/
Re: Python multithreading without the GIL [ In reply to ]
The way I see it, the concurrency model to be used is selected by
developers. They can choose between multi-threading, multi-process, or
asyncio, or even a hybrid. If developers select multithreading, then
they carry the burden of ensuring mutual exclusion and avoiding race
conditions, deadlocks, livelocks, etc.


On Mon, 2021-10-18 at 13:17 +0000, Mohamed Koubaa wrote:
> I love everything about this - but I expect some hesitancy due to
> this "Multithreaded programs are prone to concurrency bugs.".
Re: Python multithreading without the GIL [ In reply to ]
Mohamed> I love everything about this - but I expect some hesitancy
due to this "Multithreaded programs are prone to concurrency bugs.".

Paul> The way I see it, the concurrency model to be used is selected
by developers. They can choose between ...

I think the real intent of the statement Mohamed quoted is that just
because your program works in a version of Python with the GIL doesn't
mean it will work unchanged in a GIL-free world. As we all know, the
GIL can hide a multitude of sins. I could be paraphrasing Tim Peters
here without realizing it explicitly. It kinda sounds like something
he might say.

Skip
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/4OWK2DQKQOZZDPNWA7KC3NAUTWOBFOND/
Re: Python multithreading without the GIL [ In reply to ]
Guido> To be clear, Sam’s basic approach is a bit slower for
single-threaded code, and he admits that. But to sweeten the pot he has
also applied a bunch of unrelated speedups that make it faster in general,
so that overall it’s always a win. But presumably we could upstream the
latter easily, separately from the GIL-freeing part.

Something just occurred to me. If you upstream all the other goodies
(register VM, etc), when the time comes to upstream the no-GIL parts won't
the complaint then be (again), "but it's slower for single-threaded code!"
? ;-)

Onto other things. For about as long as I can remember, the biggest knock
against Python was, "You can never do any serious multi-threaded
programming with it. It has this f**king GIL!" I know that attempts to
remove it have been made multiple times, beginning with (I think) Greg
Stein in the 1.4 timeframe. In my opinion, Sam's work finally solves the
problem.

Not being a serious parallel programming person (I have used
multi-threading a bit in Python, but only for obviously I/O-bound tasks), I
thought it might be instructive — for me, at least — to kick the no-GIL
tires a bit. Not having any obvious application in mind, I decided to
implement a straightforward parallel matrix multiply. (I think I wrote
something similar back in the mid-80s in a now defunct Smalltalk-inspired
language while at GE.) Note that this was just for my own edification. I
have no intention of trying to supplant numpy.matmul() or anything like that.
It splits up the computation in the most straightforward (to me) way,
handing off the individual vector multiplications to a variable-sized
thread pool. The code is here:

https://gist.github.com/smontanaro/80f788a506d2f41156dae779562fd08d
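In outline, the split is something like the following (a minimal sketch of the same idea, not the gist's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

def vecmul(row, b_cols):
    # one result row: dot product of `row` with every column of B
    return [sum(x * y for x, y in zip(row, col)) for col in b_cols]

def matmul(a, b, nthreads=4):
    b_cols = list(zip(*b))  # transpose B once so columns are reusable
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        # each output row is independent, so rows fan out to the pool;
        # with the GIL these pure-Python tasks can't run in parallel,
        # without it they can
        return list(pool.map(lambda row: vecmul(row, b_cols), a))
```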

Here is a graph of some timings. My machine is a now decidedly
long-in-the-tooth Dell Precision 5520 with a 7th Gen Core i7 processor
(four cores + hyperthreading). The data for the graph come from the
built-in bash time(1) command. As expected, wall clock time drops as you
increase the number of cores until you reach four. After that, nothing
improves, since the logical HT cores don't actually have their own ALU
(just instruction fetch/decode I think). The slope of the real time
improvement from two cores to four isn't as great as one to two, probably
because I wasn't careful about keeping the rest of the system quiet. It was
running my normal mix, Brave with many open tabs + Emacs. I believe I used
A=240x3125, B=3125x480, giving a 240x480 result, so 115,200 vector multiplies.

[image: matmul.png]

All-in-all, I think Sam's effort is quite impressive. I got things going in
fits and starts, needing a bit of help from Sam and Vadym Stupakov
to get the modified numpy implementation working (crosstalk between my usual
environment and the no-GIL stuff). I'm sure there are plenty of problems
yet to be solved related to extension modules, but I trust smarter people
than me can solve them without a lot of fuss. Once nogil is up-to-date with
the latest 3.9 release I hope these changes can start filtering into main.
Hopefully that means a 3.11 release. In fact, I'd vote for pushing back the
usual release cycle to accommodate inclusion. Sam has gotten this so close
it would be a huge disappointment to abandon it now. The problems faced at
this point would have been amortized over years of development if the GIL
had been removed 20 years ago. I say go for it.

Skip
Re: Python multithreading without the GIL [ In reply to ]
Thanks Skip — nice to see some examples.

Did you try running the same code with stock Python?

One reason I ask is that, IIUC, you are using numpy for the individual
vector operations, and numpy already releases the GIL in some
circumstances.

It would also be fun to see David Beazley’s example from his seminal talk:


https://youtu.be/ph374fJqFPE

-CHB



--
Christopher Barker, PhD (Chris)

Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
Re: Python multithreading without the GIL [ In reply to ]
>
> Did you try running the same code with stock Python?
>
> One reason I ask is the IIUC, you are using numpy for the individual
> vector operations, and numpy already releases the GIL in some
> circumstances.
>

I had not run the same code with stock Python (but see below). Also, I only
used numpy for two bits:

1. I use numpy arrays filled with random values, and the output array is
also a numpy array. The vector multiplication is done in a simple for loop
in my vecmul() function.

2. Early on I compared my results with the result of numpy.matmul just to
make sure I had things right.

That said, I have now run my example code using both PYTHONGIL=0 and
PYTHONGIL=1 of Sam's nogil branch as well as the following other Python3
versions:

* Conda Python3 (3.9.7)
* /usr/bin/python3 (3.9.1 in my case)
* 3.9 branch tip (3.9.7+)

The results were confusing, so I dredged up a copy of pystone to make sure
I wasn't missing anything w.r.t. basic execution performance. I'm still
confused, so will keep digging.

> It would also be fun to see David Beazley’s example from his seminal talk:
>
> https://youtu.be/ph374fJqFPE
>

Thanks, I'll take a look when I get a chance. Might give me the excuse I
need to wake up extra early and tag along with Dave on an early morning
bike ride.

Skip
Re: Python multithreading without the GIL [ In reply to ]
On Fri, Oct 29, 2021 at 6:10 AM Skip Montanaro <skip.montanaro@gmail.com>
wrote:

> 1. I use numpy arrays filled with random values, and the output array is
> also a numpy array. The vector multiplication is done in a simple for loop
> in my vecmul() function.
>

probably doesn't make a difference for this exercise, but numpy arrays make
lousy replacements for a regular list -- i.e. as a container alone. The
issue is that floats need to be "boxed" and "unboxed" as you put them in
and pull them out of an array, whereas with lists, the float objects
themselves are already there.

OK, maybe not as bad as I remember, but not great:

In [61]: def multiply(vect, scalar, out):
...: """
...: multiply all the elements in vect by a scalar in place
...: """
...: for i, val in enumerate(vect):
...: out[i] = val * scalar
...:

In [62]: arr = np.random.random((100000,))

In [63]: arrout = np.zeros_like(arr)

In [64]: l = list(arr)

In [65]: lout = [None] * len(l)

In [66]: %timeit multiply(arr, 1.1, arrout)
19.3 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [67]: %timeit multiply(l, 1.1, lout)
12.8 ms ± 83.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

> That said, I have now run my example code using both PYTHONGIL=0 and
> PYTHONGIL=1 of Sam's nogil branch as well as the following other Python3
> versions:
>
> * Conda Python3 (3.9.7)
> * /usr/bin/python3 (3.9.1 in my case)
> * 3.9 branch tip (3.9.7+)
>
> The results were confusing, so I dredged up a copy of pystone to make sure
> I wasn't missing anything w.r.t. basic execution performance. I'm still
> confused, so will keep digging.

I'll be interested to see what you find out :-)

> It would also be fun to see David Beazley’s example from his seminal talk:
>
> https://youtu.be/ph374fJqFPE
>

Thanks, I'll take a look when I get a chance

That may not be the best source of the talk -- just the one I found first
:-)

-CHB

--
Christopher Barker, PhD (Chris)

Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
Re: Python multithreading without the GIL [ In reply to ]
Skip> 1. I use numpy arrays filled with random values, and the output array
is also a numpy array. The vector multiplication is done in a simple for
loop in my vecmul() function.

CHB> probably doesn't make a difference for this exercise, but numpy arrays
make lousy replacements for a regular list ...

Yeah, I don't think it should matter here. Both versions should be
similarly penalized.

Skip> The results were confusing, so I dredged up a copy of pystone to make
sure I wasn't missing anything w.r.t. basic execution performance. I'm
still confused, so will keep digging.

CHB> I'll be interested to see what you find out :-)

I'm still scratching my head. I was thinking there was something about the
messaging between the main and worker threads, so I tweaked matmul.py to
accept 0 as the number of threads. That means it would call matmul, which
would call vecmul directly. The original queue-using versions were simply
renamed to matmul_t and vecmul_t.

I am still confused. Here are the pystone numbers, nogil first, then the
3.9 git tip:

(base) nogil_build% ./bin/python3 ~/cmd/pystone.py
Pystone(1.1.1) time for 50000 passes = 0.137658
This machine benchmarks at 363218 pystones/second

(base) 3.9_build% ./bin/python3 ~/cmd/pystone.py
Pystone(1.1.1) time for 50000 passes = 0.207102
This machine benchmarks at 241427 pystones/second

That suggests nogil is indeed a definite improvement over vanilla 3.9.
However, here's a quick nogil v 3.9 timing run of my matrix multiplication,
again, nogil followed by 3.9 tip:

(base) nogil_build% time ./bin/python3 ~/tmp/matmul.py 0 100000
a: (160, 625) b: (625, 320) result: (160, 320) -> 51200

real 0m9.314s
user 0m9.302s
sys 0m0.012s

(base) 3.9_build% time ./bin/python3 ~/tmp/matmul.py 0 100000
a: (160, 625) b: (625, 320) result: (160, 320) -> 51200

real 0m4.918s
user 0m5.180s
sys 0m0.380s

What's up with that? Suddenly nogil is much slower than 3.9 tip. No threads
are in use. I thought perhaps the nogil run somehow didn't use Sam's VM
improvements, so I disassembled the two versions of vecmul. I won't bore
you with the entire dis.dis output, but suffice it to say that Sam's
instruction set appears to be in play:

(base) nogil_build% PYTHONPATH=$HOME/tmp ./bin/python3
Python 3.9.0a4+ (heads/nogil:b0ee2c4740, Oct 30 2021, 16:23:03)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import matmul, dis
>>> dis.dis(matmul.vecmul)
26 0 FUNC_HEADER 11 (11)

28 2 LOAD_CONST 2 (0.0)
4 STORE_FAST 2 (result)

29 6 LOAD_GLOBAL 3 254 ('len'; 254)
9 STORE_FAST 8 (.t3)
11 COPY 9 0 (.t4 <- a)
14 CALL_FUNCTION 9 1 (.t4 to .t5)
18 STORE_FAST 5 (.t0)
...

So I unboxed the two numpy arrays once and used lists of lists for the
actual work. The nogil version still performs worse by about a factor of
two:

(base) nogil_build% time ./bin/python3 ~/tmp/matmul.py 0 100000
a: (160, 625) b: (625, 320) result: (160, 320) -> 51200

real 0m9.537s
user 0m9.525s
sys 0m0.012s

(base) 3.9_build% time ./bin/python3 ~/tmp/matmul.py 0 100000
a: (160, 625) b: (625, 320) result: (160, 320) -> 51200

real 0m4.836s
user 0m5.109s
sys 0m0.365s

Still scratching my head and am open to suggestions about what to try next.
If anyone is playing along from home, I've updated my script:

https://gist.github.com/smontanaro/80f788a506d2f41156dae779562fd08d

I'm sure there are things I could have done more efficiently, but I would
think both Python versions would be similarly penalized by dumb s**t I've
done.

Skip
Re: Python multithreading without the GIL [ In reply to ]
Remember that pystone is a terrible benchmark. It only exercises a few
bytecodes, and a modern CPU’s caching and branch prediction make mincemeat
of those. Sam wrote a whole new register-based VM, so perhaps that
exercises different bytecodes.

--
--Guido (mobile)
Re: Python multithreading without the GIL [ In reply to ]
> Remember that pystone is a terrible benchmark.

I understand that. I was only using it as a spot check. I was surprised at
how much slower my (threaded or unthreaded) matrix multiply was on nogil vs
3.9+. I went into it thinking I would see an improvement. The Performance
section of Sam's design document starts:

As mentioned above, the no-GIL proof-of-concept interpreter is about 10%
faster than CPython 3.9 (and 3.10) on the pyperformance benchmark suite.


so it didn't occur to me that I'd be looking at a slowdown, much less by as
much as I'm seeing.

Maybe I've somehow stumbled on some instruction mix for which the nogil VM
is much worse than the stock VM. For now, I prefer to think I'm just doing
something stupid. It certainly wouldn't be the first time.

Skip

P.S. I suppose I should have cc'd Sam when I first replied to this
thread, but I'm doing so now. I figured my mistake would reveal itself
early on. Sam, here's my first post about my little "project."
https://mail.python.org/archives/list/python-dev@python.org/message/WBLU6PZ2RDPEMG3ZYBWSAXUGXCJNFG4A/
Re: Python multithreading without the GIL [ In reply to ]
Hi Skip,

I think the performance difference is because of different versions of
NumPy. Python 3.9 installs NumPy 1.21.3 by default for "pip install numpy".
I've only built and packaged NumPy 1.19.4 for "nogil" Python. There are
substantial performance differences between the two NumPy builds for this
matmul script.

With NumPy 1.19.4, I get practically the same results for both Python 3.9.2
and "nogil" Python for "time python3 matmul.py 0 100000".

I'll update the version of NumPy for "nogil" Python if I have some time
this week.

Best,
Sam

Re: Python multithreading without the GIL [ In reply to ]
> I think the performance difference is because of different versions of
> NumPy.
>

Good reason to leave numpy completely out of it. Unless you want to test
nogil’s performance effects on numpy code — an interesting exercise in
itself.

Also — sorry I didn’t look at your code before, but you really want to keep
the generation of large random arrays out of your benchmark if you can. I
suspect that’s what’s changed in numpy versions.

In any case, do time the random number generation…

-CHB
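(Chris's advice can be followed by building the inputs once, outside the timed region, so only the multiply is measured. A minimal stdlib-only sketch; the helper names here are illustrative, not taken from Skip's actual script.)

```python
import random
import time

def make_matrix(rows, cols, seed=0):
    """Build a rows x cols matrix of floats -- done once, outside the timing."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Plain-Python matrix multiply; transpose b once so the inner loop zips rows."""
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

# Setup is NOT timed: random-array generation cost stays out of the benchmark.
a = make_matrix(50, 60)
b = make_matrix(60, 50)

t0 = time.perf_counter()
c = matmul(a, b)
elapsed = time.perf_counter() - t0
print(f"matmul only: {elapsed * 1000:.1f} ms")
```

This way a NumPy version change (or any change in random-number generation speed) cannot leak into the reported multiply time.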



Python 3.9 installs NumPy 1.21.3 by default for "pip install numpy". I've
> only built and packaged NumPy 1.19.4 for "nogil" Python. There are
> substantial performance differences between the two NumPy builds for this
> matmul script.
>
> With NumPy 1.19.4, I get practically the same results for both Python
> 3.9.2 and "nogil" Python for "time python3 matmul.py 0 100000".
>
> I'll update the version of NumPy for "nogil" Python if I have some time
> this week.
>
> Best,
> Sam
>
> On Sun, Oct 31, 2021 at 5:46 PM Skip Montanaro <skip.montanaro@gmail.com>
> wrote:
>
>> > Remember that pystone is a terrible benchmark.
>>
>> I understand that. I was only using it as a spot check. I was surprised
>> at how much slower my (threaded or unthreaded) matrix multiply was on nogil
>> vs 3.9+. I went into it thinking I would see an improvement. The
>> Performance section of Sam's design document starts:
>>
>> As mentioned above, the no-GIL proof-of-concept interpreter is about 10%
>> faster than CPython 3.9 (and 3.10) on the pyperformance benchmark suite.
>>
>>
>> so it didn't occur to me that I'd be looking at a slowdown, much less by
>> as much as I'm seeing.
>>
>> Maybe I've somehow stumbled on some instruction mix for which the nogil
>> VM is much worse than the stock VM. For now, I prefer to think I'm just
>> doing something stupid. It certainly wouldn't be the first time.
>>
>> Skip
>>
>> P.S. I suppose I should have cc'd Sam when I first replied to this
>> thread, but I'm doing so now. I figured my mistake would reveal itself
>> early on. Sam, here's my first post about my little "project."
>> https://mail.python.org/archives/list/python-dev@python.org/message/WBLU6PZ2RDPEMG3ZYBWSAXUGXCJNFG4A/
>>
>>
>> --
Christopher Barker, PhD (Chris)

Python Language Consulting
- Teaching
- Scientific Software Development
- Desktop GUI and Web Development
- wxPython, numpy, scipy, Cython
Re: Python multithreading without the GIL [ In reply to ]
Sam> I think the performance difference is because of different
versions of NumPy.

Thanks all for the help/input/advice. It never occurred to me that two
relatively recent versions of numpy would differ so much for the
simple tasks in my script (array creation & transform). I confirmed
this by removing 1.21.3 and installing 1.19.4 in my 3.9 build.

I also got a little bit familiar with pyperf, and as a "stretch" goal
completely removed random numbers and numpy from my script. (Took me a
couple tries to get my array init and transposition correct. Let's
just say that it's been a while. Numpy *was* a nice crutch...) With no
trace of numpy left I now get identical results for single-threaded
matrix multiply (a size==10000, b size==20000):

3.9: matmul: Mean +- std dev: 102 ms +- 1 ms
nogil: matmul: Mean +- std dev: 103 ms +- 2 ms

and a nice speedup for multi-threaded (a size==30000, b size=60000, nthreads=3):

3.9: matmul_t: Mean +- std dev: 290 ms +- 13 ms
nogil: matmul_t: Mean +- std dev: 102 ms +- 3 ms

Sam> I'll update the version of NumPy for "nogil" Python if I have
some time this week.

I think it would be sufficient to alert users to the 1.19/1.21
performance differences and recommend they force install 1.19 in
non-nogil builds for testing purposes. Hopefully adding a simple note
to your README will take less time than porting your changes to numpy
1.21 and adjusting your build configs/scripts.

Skip
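
(Skip's script isn't reproduced in the thread; a minimal sketch of the threaded case he reports as `matmul_t` might look like the following row-partitioned multiply, where each thread computes a disjoint slice of the result. The function and parameter names beyond `matmul_t`/`nthreads` are my assumptions, not his code.)

```python
import threading

def matmul_rows(a, bt, out, lo, hi):
    """Compute result rows lo..hi-1; bt is b already transposed."""
    for i in range(lo, hi):
        out[i] = [sum(x * y for x, y in zip(a[i], col)) for col in bt]

def matmul_t(a, b, nthreads=3):
    bt = list(zip(*b))
    out = [None] * len(a)
    chunk = (len(a) + nthreads - 1) // nthreads  # rows per thread, rounded up
    threads = [
        threading.Thread(
            target=matmul_rows,
            args=(a, bt, out, t * chunk, min((t + 1) * chunk, len(a))),
        )
        for t in range(nthreads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out
```

On stock CPython the pure-Python inner loops serialize on the GIL, so wall time barely improves; the roughly 3x speedup Skip measures with nthreads=3 on nogil is the behavior one would hope for once the GIL is removed.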
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/5RXRTNNCYBCILMVATHODFGAZ5ZEQXRZI/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Python multithreading without the GIL [ In reply to ]
Hello all,

I am very excited about a future multithreaded Python. I managed to postpone some rewrites to Rust/Go at the company I work for, precisely because of the potential to have a Python solution in the medium term.

I was wondering. Is Sam Gross' nogil merge being seriously considered by the core Python team?
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/SNSLKDHCE3J2VQHZCWFHNPDAEWGKEWN6/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Python multithreading without the GIL [ In reply to ]
On Sat, Apr 23, 2022 at 8:31 AM <brataodream@gmail.com> wrote:

> Hello all,
>
> I am very excited about a future multithreaded Python. I managed to
> postpone some rewrites to Rust/Go at the company I work for, precisely
> because of the potential to have a Python solution in the medium term.
>
> I was wondering. Is Sam Gross' nogil merge being seriously considered by
> the core Python team?
>

Yes, although we have no timeline as to when we will make a decision about
whether we will accept it or not. The last update we had on the work was
Sam was upstreaming the performance improvements he made that were not
nogil-specific. The nogil work was also being updated for the `main`
branch. Once that's all done we will probably start a serious discussion as
to whether we want to accept it.
Re: Python multithreading without the GIL [ In reply to ]
On Mon, Apr 25, 2022 at 2:33 PM Brett Cannon <brett@python.org> wrote:

>
>
> On Sat, Apr 23, 2022 at 8:31 AM <brataodream@gmail.com> wrote:
>
>> Hello all,
>>
>> I am very excited about a future multithreaded Python. I managed to
>> postpone some rewrites to Rust/Go at the company I work for, precisely
>> because of the potential to have a Python solution in the medium term.
>>
>> I was wondering. Is Sam Gross' nogil merge being seriously considered by
>> the core Python team?
>>
>
> Yes, although we have no timeline as to when we will make a decision about
> whether we will accept it or not.
>

We haven't even discussed a *process* for how to decide. OTOH, in two days
at the Language Summit at PyCon, Sam will give a presentation to the core
devs present (which is far from all of us, alas).


> The last update we had on the work was Sam was upstreaming the performance
> improvements he made that were not nogil-specific. The nogil work was also
> being updated for the `main` branch. Once that's all done we will probably
> start a serious discussion as to whether we want to accept it.
>

It's possible that I've missed those code reviews, but I haven't seen a
single PR from Sam, nor have there been any messages from him in this forum
or in any other forums I'm monitoring. I'm hoping that the Language Summit
will change this, but I suspect that there aren't that many perf
improvements in Sam's work that are easily separated from the nogil work.
(To be sure, Christian Heimes seems to have made progress with introducing
mimalloc, which is one of Sam's dependencies, but AFAIK that work hasn't
been finished yet.)

--
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)*
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
Re: Python multithreading without the GIL [ In reply to ]
I'm suspicious of pyperformance testing for this reason:

The conventional wisdom is that Python performs acceptably despite the GIL because "most of the time is spent in 'external' libraries."

Pyperformance measures "typical" Python workloads, where most benchmarks are presumably fine despite the GIL. But you need multithreading precisely in the atypical situations that involve a lot of raw-Python-object "thrashing": heavy reference counting, lock contention, etc. How do we know that pyperformance actually tests these cases well?
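
One way to probe that concern (my sketch, not something from the thread) is a microbenchmark where threads do nothing but churn reference counts and temporary objects on shared data, the pattern that pyperformance's mostly single-threaded workloads don't stress:

```python
import threading
import time

SHARED = list(range(100))  # shared objects: every access touches their refcounts

def thrash(n):
    """Churn refcounts: bind shared objects and build throwaway tuples."""
    acc = 0
    for _ in range(n):
        for obj in SHARED:      # INCREF/DECREF of a shared int per iteration
            pair = (obj, obj)   # allocate, refcount, and free a temporary
            acc += pair[0]
    return acc

def run(nthreads, n=2000):
    """Wall time for nthreads threads each doing n rounds of thrashing."""
    threads = [threading.Thread(target=thrash, args=(n,)) for _ in range(nthreads)]
    t0 = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - t0

# On stock CPython this refcount-bound loop scales poorly (4 threads take
# roughly 4x the 1-thread wall time); a scalable nogil build should stay
# much closer to flat.
print(f"1 thread: {run(1):.3f}s  4 threads: {run(4):.3f}s")
```

A benchmark suite that wanted to answer the question above would need workloads like this one alongside the "typical" single-threaded ones.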
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/UA4VX3JFRRHI6TXPEBCZRWSPDOWQWM2G/
Code of Conduct: http://python.org/psf/codeofconduct/
