Mailing List Archive: Python multithreading without the GIL

Python multithreading without the GIL

Oct 7, 2021, 12:52 PM

Post #1 of 49 (3483 views)

Hi,

I've been working on changes to CPython to allow it to run without the
global interpreter lock. I'd like to share a working proof-of-concept that
can run without the GIL. The proof-of-concept involves substantial changes
to CPython internals, but relatively few changes to the C-API. It is
compatible with many C extensions: extensions must be rebuilt, but usually
require small or no modifications to source code. I've built compatible
versions of packages from the scientific Python ecosystem, and they are
installable through the bundled "pip".

Source code:
https://github.com/colesbury/nogil

Design overview:
https://docs.google.com/document/d/18CXhDb1ygxg-YXNBJNzfzZsDFosB5e6BfnXLlejd9l0/edit

My goal with the proof-of-concept is to demonstrate that removing the GIL
is feasible and worthwhile, and that the technical ideas of the project
could serve as a basis of such an effort.

I'd like to start a discussion about these ideas and gauge the community's
interest in this approach to removing the GIL.

Regards,
Sam Gross
colesbury@gmail.com / sgross@fb.com

Re: Python multithreading without the GIL [ In reply to ]

rosuav at gmail

Oct 7, 2021, 9:02 PM

Post #2 of 49 (3483 views)

Permalink

On Fri, Oct 8, 2021 at 1:51 PM Sam Gross <colesbury@gmail.com> wrote:
>
> Hi,
>
> I've been working on changes to CPython to allow it to run without the global interpreter lock. I'd like to share a working proof-of-concept that can run without the GIL. The proof-of-concept involves substantial changes to CPython internals, but relatively few changes to the C-API. It is compatible with many C extensions: extensions must be rebuilt, but usually require small or no modifications to source code. I've built compatible versions of packages from the scientific Python ecosystem, and they are installable through the bundled "pip".
>
> Source code:
> https://github.com/colesbury/nogil
>
> Design overview:
> https://docs.google.com/document/d/18CXhDb1ygxg-YXNBJNzfzZsDFosB5e6BfnXLlejd9l0/edit
>
> My goal with the proof-of-concept is to demonstrate that removing the GIL is feasible and worthwhile, and that the technical ideas of the project could serve as a basis of such an effort.
>

Thanks for doing this, and thanks for the detailed writeup! I'd like
to offer a perspective from observing the ongoing project of a brother
of mine; he does not have the concurrency experience that I have, and
it's been instructive to see what he has trouble with. For reference,
the project involves GTK (which only works on the main thread),
multiple threads for I/O (eg a socket read/parse/process thread), and
one thread managed by asyncio using async/await functions.

At no point has he ever had a problem with performance, because the
project is heavily I/O based, spending most of its time waiting for
events. So this is not going to touch on the question of
single-threaded vs multi-threaded performance.

To him, an async function and a thread function are exactly
equivalent. He doesn't think in terms of yield points or anything;
they are simply two ways of doing parallelism and are, to his code,
equivalent.

Mutable shared state is something to get your head around with *any*
sort of parallelism, and nothing will change that. Whether it's
asyncio, GUI callbacks, or actual threads, the considerations have
been exactly the same. Threads neither gain nor lose compared to other
options.

Not being a low-level programmer, he has, I believe, an inherent
assumption that any operation on a built-in type will be atomic. He's
never stated this but I suspect he's assuming that. It's an assumption
that Python is never going to violate.

Concurrency is *hard*. There's no getting around it, there's no
sugar-coating it. There are concepts that simply have to be learned,
and the failures can be extremely hard to track down. Instantiating an
object on the wrong thread can crash GTK, but maybe not immediately.
Failing to sleep in one thread results in other threads stalling. I
don't think any of this is changed by different modes (with the
exception of process-based parallelism, which fixes a lot of
concurrency at the cost of explicit IPC), and the more work
programmers want their code to do, the more likely that they'll run
into this.

Glib.idle_add is really just a magic incantation to make the GUI work. :)

Spawning a thread for asyncio isn't too hard as long as you don't have
to support older Python versions.... sadly, not every device updated
at the same time. But in a few years, we will be able to ignore Python
versions pre-3.7.

Most likely, none of his code would be affected by the removal of the
GIL, since (as I understand it) the guarantees as seen in Python code
won't change. Will there be impact on lower-memory systems? As small
devices go, the Raspberry Pi is one of the largest, but it's still a
lot smaller than a full PC, and adding overhead to every object would
be costly (I'm not sure what the cost of local reference counting is,
but it can't be none). Threading is perfectly acceptable for a project
like this, so I'm hoping that GIL removal won't unnecessarily penalize
this kind of thread usage.

Speaking of local refcounting, how does that affect objects that get
passed as thread arguments? Initially, an object is owned by the
creating thread, which can relinquish ownership if its local refcount
drops to zero; does another thread then take it over?

I'm excited by anything that helps parallelism in Python, and very
curious to see where this effort will go. If you need a hand with
testing, I'd be happy to help out.

ChrisA
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/PGUFHTN4YY4X4OOTCXYPVEJVAWBZZS4K/
Code of Conduct: http://python.org/psf/codeofconduct/

Re: Python multithreading without the GIL [ In reply to ]

steve at pearwood

Oct 8, 2021, 1:13 AM

Post #3 of 49 (3483 views)

Permalink

Hi Sam,

On Thu, Oct 07, 2021 at 03:52:56PM -0400, Sam Gross wrote:

> I've been working on changes to CPython to allow it to run without the
> global interpreter lock. I'd like to share a working proof-of-concept that
> can run without the GIL.

Getting Python to run without the GIL has never been a major problem for
CPython (and of course some other Python interpreters don't have a GIL
at all). I think the first attempt was in 1999, a mere handful of years
after Python was released.

https://www.artima.com/weblogs/viewpost.jsp?thread=214235

The problem has been removing the GIL without seriously degrading
performance. How does your GIL-less CPython fork perform? Especially for
single-threaded code.

Have you been following progress of the GILectomy?

https://pythoncapi.readthedocs.io/gilectomy.html

Single threaded code is still, and always will be, an important part of
Python's ecosystem. A lot of people would be annoyed if the cost of
speeding up heavily threaded Python by a small percentage would be to
slow down single-threaded Python by a large percentage.

--
Steve
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/CFPTMVFVPYHCGUPXYVBUYZBPLEVLSFIM/
Code of Conduct: http://python.org/psf/codeofconduct/

Re: Python multithreading without the GIL [ In reply to ]

lukasz at langa

Oct 8, 2021, 7:40 AM

Post #4 of 49 (3483 views)

Permalink

> On 8 Oct 2021, at 10:13, Steven D'Aprano <steve@pearwood.info> wrote:
>
> Hi Sam,
>
> On Thu, Oct 07, 2021 at 03:52:56PM -0400, Sam Gross wrote:
>
>> I've been working on changes to CPython to allow it to run without the
>> global interpreter lock. I'd like to share a working proof-of-concept that
>> can run without the GIL.
>
> Getting Python to run without the GIL has never been a major problem for
> CPython (and of course some other Python interpreters don't have a GIL
> at all).

On the first page of Sam's design overview he references Gilectomy by name.

> Single threaded code is still, and always will be, an important part of
> Python's ecosystem. A lot of people would be annoyed if the cost of
> speeding up heavily threaded Python by a small percentage would be to
> slow down single-threaded Python by a large percentage.

Quoting that same design document, Sam writes: "The new interpreter
(together with the GIL changes) is about 10% faster than CPython 3.9
on the single-threaded pyperformance benchmarks."

- ?

Re: Python multithreading without the GIL [ In reply to ]

guido at python

Oct 8, 2021, 8:09 AM

Post #5 of 49 (3483 views)

Permalink

To be clear, Sam’s basic approach is a bit slower for single-threaded code,
and he admits that. But to sweeten the pot he has also applied a bunch of
unrelated speedups that make it faster in general, so that overall it’s
always a win. But presumably we could upstream the latter easily,
separately from the GIL-freeing part.

On Fri, Oct 8, 2021 at 07:42 ?ukasz Langa <lukasz@langa.pl> wrote:

>
> > On 8 Oct 2021, at 10:13, Steven D'Aprano <steve@pearwood.info> wrote:
> >
> > Hi Sam,
> >
> > On Thu, Oct 07, 2021 at 03:52:56PM -0400, Sam Gross wrote:
> >
> >> I've been working on changes to CPython to allow it to run without the
> >> global interpreter lock. I'd like to share a working proof-of-concept
> that
> >> can run without the GIL.
> >
> > Getting Python to run without the GIL has never been a major problem for
> > CPython (and of course some other Python interpreters don't have a GIL
> > at all).
>
> On the first page of Sam's design overview he references Gilectomy by name.
>
> > Single threaded code is still, and always will be, an important part of
> > Python's ecosystem. A lot of people would be annoyed if the cost of
> > speeding up heavily threaded Python by a small percentage would be to
> > slow down single-threaded Python by a large percentage.
>
> Quoting that same design document, Sam writes: "The new interpreter
> (together with the GIL changes) is about 10% faster than CPython 3.9
> on the single-threaded pyperformance benchmarks."
>
> - ?
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-leave@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/JO7OQCHZKIFNKSXTTXT2JBCF5H47M7OO/
> Code of Conduct: http://python.org/psf/codeofconduct/
>
--
--Guido (mobile)

Re: Python multithreading without the GIL [ In reply to ]

chris.jerdonek at gmail

Oct 8, 2021, 8:30 AM

Post #6 of 49 (3483 views)

Permalink

On Fri, Oct 8, 2021 at 8:11 AM Guido van Rossum <guido@python.org> wrote:

> To be clear, Sam’s basic approach is a bit slower for single-threaded
> code, and he admits that.
>

Is it also slower even when running with PYTHONGIL=1? If it could be made
the same speed for single-threaded code when running in GIL-enabled mode,
that might be an easier intermediate target while still adding value.

—Chris

But to sweeten the pot he has also applied a bunch of unrelated speedups
> that make it faster in general, so that overall it’s always a win. But
> presumably we could upstream the latter easily, separately from the
> GIL-freeing part.
>
> On Fri, Oct 8, 2021 at 07:42 ?ukasz Langa <lukasz@langa.pl> wrote:
>
>>
>> > On 8 Oct 2021, at 10:13, Steven D'Aprano <steve@pearwood.info> wrote:
>> >
>> > Hi Sam,
>> >
>> > On Thu, Oct 07, 2021 at 03:52:56PM -0400, Sam Gross wrote:
>> >
>> >> I've been working on changes to CPython to allow it to run without the
>> >> global interpreter lock. I'd like to share a working proof-of-concept
>> that
>> >> can run without the GIL.
>> >
>> > Getting Python to run without the GIL has never been a major problem for
>> > CPython (and of course some other Python interpreters don't have a GIL
>> > at all).
>>
>> On the first page of Sam's design overview he references Gilectomy by
>> name.
>>
>> > Single threaded code is still, and always will be, an important part of
>> > Python's ecosystem. A lot of people would be annoyed if the cost of
>> > speeding up heavily threaded Python by a small percentage would be to
>> > slow down single-threaded Python by a large percentage.
>>
>> Quoting that same design document, Sam writes: "The new interpreter
>> (together with the GIL changes) is about 10% faster than CPython 3.9
>> on the single-threaded pyperformance benchmarks."
>>
>> - ?
>> _______________________________________________
>> Python-Dev mailing list -- python-dev@python.org
>> To unsubscribe send an email to python-dev-leave@python.org
>> https://mail.python.org/mailman3/lists/python-dev.python.org/
>> Message archived at
>> https://mail.python.org/archives/list/python-dev@python.org/message/JO7OQCHZKIFNKSXTTXT2JBCF5H47M7OO/
>> Code of Conduct: http://python.org/psf/codeofconduct/
>>
> --
> --Guido (mobile)
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-leave@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/XQOOGKH5PIFBHJRK7W2LMX32DIGIH4KX/
> Code of Conduct: http://python.org/psf/codeofconduct/
>

Re: Python multithreading without the GIL [ In reply to ]

njs at pobox

Oct 8, 2021, 9:04 AM

Post #7 of 49 (3483 views)

Permalink

On Thu, Oct 7, 2021 at 7:54 PM Sam Gross <colesbury@gmail.com> wrote:
> Design overview:
> https://docs.google.com/document/d/18CXhDb1ygxg-YXNBJNzfzZsDFosB5e6BfnXLlejd9l0/edit

Whoa, this is impressive work.

I notice the fb.com address -- is this a personal project or something
facebook is working on? what's the relationship to Cinder, if any?

Regarding the tricky lock-free dict/list reads: I guess the more
straightforward approach would be to use a plain ol' mutex that's
optimized for this kind of fine-grained per-object lock with short
critical sections and minimal contention, like WTF::Lock. Did you try
alternatives like that? If so, I assume they didn't work well -- can
you give more details?

-n

--
Nathaniel J. Smith -- https://vorpus.org
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/XAZYWRYXKIVUSMRSMFAETKQDLGL27L7X/
Code of Conduct: http://python.org/psf/codeofconduct/

Re: Python multithreading without the GIL [ In reply to ]

pablogsal at gmail

Oct 8, 2021, 9:24 AM

Post #8 of 49 (3483 views)

Permalink

>
> To speed-up function calls, the interpreter uses a linear, resizable stack
> to store function call frames, an idea taken from LuaJIT. The stack stores
> the interpreter registers (local variables + space for temporaries) plus
> some extra information per-function call. This avoids the need for
> allocating PyFrameObjects for each call. For compatibility, the
> PyFrameObject type still exists, but they are created lazily as-needed
> (such as for exception handling and for sys._getframe).

The optimized function calls have about an order of magnitude less overhead
> than the current CPython implementation.

The change also simplifies the use of deferred reference counting with the
> data that is stored per-call like the function object. The interpreter can
> usually avoid incrementing the reference count of the function object
> during a call. Like other objects on the stack, a borrowed reference to the
> function is indicated by setting the least-significant-bit.

Congrats Sam! This is incredible work! One quick question after reading the
design doc:

When you mean "an order of magnitude less overhead than the current CPython
implementation" do you mean compared with the main branch? We recently
implemented already almost everything is listed in this paragraph:

https://github.com/python/cpython/pull/27077

We also pack some extra similar optimizations in this other PR, including
stealing the frame arguments from python to python calls:

https://github.com/python/cpython/pull/28488

This could explain why the performance is closer to the current master
branch as you indicate:

It gets about the same average performance as the “main” branch of CPython
> 3.11 as of early September 2021.

Cheers from cloudy London,
Pablo Galindo Salgado

On Fri, 8 Oct 2021 at 03:49, Sam Gross <colesbury@gmail.com> wrote:

> Hi,
>
> I've been working on changes to CPython to allow it to run without the
> global interpreter lock. I'd like to share a working proof-of-concept that
> can run without the GIL. The proof-of-concept involves substantial changes
> to CPython internals, but relatively few changes to the C-API. It is
> compatible with many C extensions: extensions must be rebuilt, but usually
> require small or no modifications to source code. I've built compatible
> versions of packages from the scientific Python ecosystem, and they are
> installable through the bundled "pip".
>
> Source code:
> https://github.com/colesbury/nogil
>
> Design overview:
>
> https://docs.google.com/document/d/18CXhDb1ygxg-YXNBJNzfzZsDFosB5e6BfnXLlejd9l0/edit
>
> My goal with the proof-of-concept is to demonstrate that removing the GIL
> is feasible and worthwhile, and that the technical ideas of the project
> could serve as a basis of such an effort.
>
> I'd like to start a discussion about these ideas and gauge the community's
> interest in this approach to removing the GIL.
>
> Regards,
> Sam Gross
> colesbury@gmail.com / sgross@fb.com
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-leave@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/ABR2L6BENNA6UPSPKV474HCS4LWT26GY/
> Code of Conduct: http://python.org/psf/codeofconduct/
>

Re: Python multithreading without the GIL [ In reply to ]

lord.mauve at gmail

Oct 8, 2021, 9:55 AM

Post #9 of 49 (3483 views)

Permalink

On Fri, 8 Oct 2021 at 03:50, Sam Gross <colesbury@gmail.com> wrote:
> My goal with the proof-of-concept is to demonstrate that removing the GIL is feasible and worthwhile, and that the technical ideas of the project could serve as a basis of such an effort.

I'm a novice C programmer, but I'm unsure about the safety of your
thread-safe collections description. You describe an algorithm for
lock-free read access to list items as

1. Load the version counter from the collection
2. Load the “backing array” from the collection
3. Load the address of the item (from the “backing array”)
4. Increment the reference count of the item, if it is non-zero
(otherwise retry)
5. Verify that the item still exists at the same location in the
collection (otherwise retry)
6. Verify that the version counter did not change (otherwise retry)
7. Return the address of the item

But you do the bounds check for the index before this, here[1]. If the
thread is suspended after this and before you read the address of the
backing array [2], the list could have been resized (shrunk), and the
backing array reallocated from a new memory block. So the pointer you
read at 3 could be from uninitialized memory that is beyond the size
of the array (or within the array but larger than the current number
of items). And then you write to it at 4 which is then a write into a
random memory location.

[1] https://github.com/colesbury/nogil/blob/fb6aabede5f7f1936a21c2f48ec7fcc0848d74bf/Objects/listobject.c#L137
[2] https://github.com/colesbury/nogil/blob/fb6aabede5f7f1936a21c2f48ec7fcc0848d74bf/Objects/listobject.c#L141
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/6J6XFEACF2C6XPLZRVABUFFHJICUTZCS/
Code of Conduct: http://python.org/psf/codeofconduct/

Re: Python multithreading without the GIL [ In reply to ]

larry at hastings

Oct 8, 2021, 4:10 PM

Post #10 of 49 (3483 views)

Permalink

On 10/7/21 8:52 PM, Sam Gross wrote:
> I've been working on changes to CPython to allow it to run without the
> global interpreter lock.

Before anybody asks: Sam contacted me privately some time ago to pick my
brain a little. But honestly, Sam didn't need any help--he'd already
taken the project further than I'd ever taken the Gilectomy. I have
every confidence in Sam and his work, and I'm excited he's revealed it
to the world!

Best wishes,

//arry/

Re: Python multithreading without the GIL [ In reply to ]

colesbury at gmail

Oct 8, 2021, 8:02 PM

Post #11 of 49 (3483 views)

Permalink

On Fri, Oct 8, 2021 at 12:55 PM Daniel Pope <lord.mauve@gmail.com> wrote:

> I'm a novice C programmer, but I'm unsure about the safety of your
> thread-safe collections description.
>

The "list" class uses a slightly different strategy than "dict", which I
forgot about
when writing the design overview. List relies on the property that the
backing array
of a given list can only grow (never shrink) [1]. This is different from
upstream CPython.

Dict stores the capacity inside PyDictKeysObject (the backing array). The
capacity
never changes, so if you have a valid pointer to the PyDictKeysObject you
load the
correct capacity.

I've been meaning to change "list" to use the same strategy as "dict". I
think that would
simplify the overall design and let "list" shrink the backing array again.

[1]
https://github.com/colesbury/nogil/blob/fb6aabede5f7f1936a21c2f48ec7fcc0848d74bf/Objects/listobject.c#L46-L49
(

Re: Python multithreading without the GIL [ In reply to ]

colesbury at gmail

Oct 8, 2021, 8:52 PM

Post #12 of 49 (3483 views)

Permalink

On Fri, Oct 8, 2021 at 12:24 PM Pablo Galindo Salgado <pablogsal@gmail.com>
wrote:

> When you mean "an order of magnitude less overhead than the current
> CPython implementation" do you mean compared with the main branch? We
> recently implemented already almost everything is listed in this paragraph.
>

I think I wrote that in August when "current CPython" meant something
different from today :) I'll update it.

Thanks for the links to the PRs. I'll need to look at them more closely,
but one I think one remaining difference is that
the "nogil" interpreter stays within the same interpreter loop for many
Python function calls, while upstream CPython
recursively calls into _PyEval_EvalFrameDefault.

I've been using this mini-benchmark to measure the overhead of Python
function calls for various numbers of
arguments and keywords:
https://github.com/colesbury/nogil/blob/fb6aabede5f7f1936a21c2f48ec7fcc0848d74bf/benchmarks/call_benchmark.py

For zero, two, and four argument functions, I get:
nogil (nogil/fb6aabed): 10ns, 14ns, 18ns
3.11 (main/b108db63): 47ns, 54ns, 63ns

Re: Python multithreading without the GIL [ In reply to ]

guido at python

Oct 8, 2021, 9:45 PM

Post #13 of 49 (3483 views)

Permalink

On Fri, Oct 8, 2021 at 8:55 PM Sam Gross <colesbury@gmail.com> wrote:

> the "nogil" interpreter stays within the same interpreter loop for many
> Python function calls, while upstream CPython
> recursively calls into _PyEval_EvalFrameDefault.
>

Not for much longer though. https://github.com/python/cpython/pull/28488

--
--Guido van Rossum (python.org/~guido)
*Pronouns: he/him **(why is my pronoun here?)*
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>

Re: Python multithreading without the GIL [ In reply to ]

drsalists at gmail

Oct 9, 2021, 8:31 PM

Post #14 of 49 (3481 views)

Permalink

On Thu, Oct 7, 2021 at 9:10 PM Chris Angelico <rosuav@gmail.com> wrote:

> Concurrency is *hard*. There's no getting around it, there's no
> sugar-coating it. There are concepts that simply have to be learned,
> and the failures can be extremely hard to track down. Instantiating an
> object on the wrong thread can crash GTK, but maybe not immediately.
> Failing to sleep in one thread results in other threads stalling. I
> don't think any of this is changed by different modes (with the
> exception of process-based parallelism, which fixes a lot of
> concurrency at the cost of explicit IPC), and the more work
> programmers want their code to do, the more likely that they'll run
> into this.
>

I'd like to encourage folks not to give up on looking for new, simpler
parallelism/concurrency formalisms.

They're out there - consider how well bash does with its parallelism in
pipelines.

The truly general ones may end up looking like Java, but I don't think they
have to be fully general to be useful.

Re: Python multithreading without the GIL [ In reply to ]

rosuav at gmail

Oct 9, 2021, 8:36 PM

Post #15 of 49 (3481 views)

Permalink

On Sun, Oct 10, 2021 at 2:31 PM Dan Stromberg <drsalists@gmail.com> wrote:
>
>
> On Thu, Oct 7, 2021 at 9:10 PM Chris Angelico <rosuav@gmail.com> wrote:
>>
>> Concurrency is *hard*. There's no getting around it, there's no
>> sugar-coating it. There are concepts that simply have to be learned,
>> and the failures can be extremely hard to track down. Instantiating an
>> object on the wrong thread can crash GTK, but maybe not immediately.
>> Failing to sleep in one thread results in other threads stalling. I
>> don't think any of this is changed by different modes (with the
>> exception of process-based parallelism, which fixes a lot of
>> concurrency at the cost of explicit IPC), and the more work
>> programmers want their code to do, the more likely that they'll run
>> into this.
>
>
> I'd like to encourage folks not to give up on looking for new, simpler parallelism/concurrency formalisms.
>
> They're out there - consider how well bash does with its parallelism in pipelines.

That's process-based parallelism with unidirectional byte-stream IPC.
It's incredibly specific, but also incredibly useful in its place :)

Simpler parallelism techniques are always possible if you don't need
much interaction between the processes. The challenge isn't making the
simple cases work, but making the harder ones efficient.

ChrisA
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/UPR7EIZTVKESA2ND4ISFXEYBZYLNSTCE/
Code of Conduct: http://python.org/psf/codeofconduct/

Re: Python multithreading without the GIL [ In reply to ]

barry at python

Oct 10, 2021, 1:34 PM

Post #16 of 49 (3467 views)

Permalink

Congrats on this impressive work Sam. I enjoyed the thorough write up of the design. There’s one aspect that I don’t quite understand. Maybe I missed the explanation. For example:

```
• Load the address of the item
• Increment the reference count of the item, if it is non-zero (otherwise retry)
• Return the address of the item

(We typically acquire the per-collection mutex when retrying operations to avoid potential livelock issues.)
```

I’m unclear what is actually retried. You use this note throughout the document, so I think it would help to clarify exactly what is retried and why that solves the particular problem. I’m confused because, is it the refcount increment that’s retried or the entire sequence of steps (i.e. do you go back and reload the address of the item)? Is there some kind of waiting period before the retry? I would infer that if you’re retrying the refcount incrementing, it’s because you expect subsequent retries to transition from zero to non-zero, but is that guaranteed? Are there possibilities of deadlocks or race conditions?

Can you go into some more detail (here or in the document) about how this works?

Cheers,
-Barry

> On Oct 7, 2021, at 12:52, Sam Gross <colesbury@gmail.com> wrote:
>
> Hi,
>
> I've been working on changes to CPython to allow it to run without the global interpreter lock. I'd like to share a working proof-of-concept that can run without the GIL. The proof-of-concept involves substantial changes to CPython internals, but relatively few changes to the C-API. It is compatible with many C extensions: extensions must be rebuilt, but usually require small or no modifications to source code. I've built compatible versions of packages from the scientific Python ecosystem, and they are installable through the bundled "pip".
>
> Source code:
> https://github.com/colesbury/nogil
>
> Design overview:
> https://docs.google.com/document/d/18CXhDb1ygxg-YXNBJNzfzZsDFosB5e6BfnXLlejd9l0/edit
>
> My goal with the proof-of-concept is to demonstrate that removing the GIL is feasible and worthwhile, and that the technical ideas of the project could serve as a basis of such an effort.
>
> I'd like to start a discussion about these ideas and gauge the community's interest in this approach to removing the GIL.
>
> Regards,
> Sam Gross
> colesbury@gmail.com / sgross@fb.com
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-leave@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/ABR2L6BENNA6UPSPKV474HCS4LWT26GY/
> Code of Conduct: http://python.org/psf/codeofconduct/

Re: Python multithreading without the GIL [ In reply to ]

antoine at python

Oct 11, 2021, 2:08 AM

Post #17 of 49 (3467 views)

Permalink

On Thu, 7 Oct 2021 15:52:56 -0400
Sam Gross <colesbury@gmail.com> wrote:
> Hi,
>
> I've been working on changes to CPython to allow it to run without the
> global interpreter lock. I'd like to share a working proof-of-concept that
> can run without the GIL. The proof-of-concept involves substantial changes
> to CPython internals, but relatively few changes to the C-API. It is
> compatible with many C extensions: extensions must be rebuilt, but usually
> require small or no modifications to source code. I've built compatible
> versions of packages from the scientific Python ecosystem, and they are
> installable through the bundled "pip".
>
> Source code:
> https://github.com/colesbury/nogil
>
> Design overview:
> https://docs.google.com/document/d/18CXhDb1ygxg-YXNBJNzfzZsDFosB5e6BfnXLlejd9l0/edit

Impressive work!

Just for the record:
"""
It’s harder to measure aggregate multi-threaded performance because
there aren’t any standard multi-threaded Python benchmarks, but the new
interpreter addresses many of the use cases that failed to scale
efficiently.
"""

It's crude, but you can take a look at `ccbench` in the Tools directory.

Regards

Antoine.

_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/WRT7F2RHHCQ3N2TYEDC6JSIJ4T2ZM6F7/
Code of Conduct: http://python.org/psf/codeofconduct/

Re: Python multithreading without the GIL [ In reply to ]

colesbury at gmail

Oct 11, 2021, 9:51 AM

Post #18 of 49 (3467 views)

Permalink

On Fri, Oct 8, 2021 at 12:04 PM Nathaniel Smith <njs@pobox.com> wrote:

> I notice the fb.com address -- is this a personal project or something
> facebook is working on? what's the relationship to Cinder, if any?
>

It is a Facebook project, at least in the important sense that I work on it
as an employee at Facebook. (I'm currently the only person working on it.)
I keep in touch with some of the Cinder devs regularly and they've advised
on the project, but otherwise the two projects are unrelated.

> Regarding the tricky lock-free dict/list reads: I guess the more
> straightforward approach would be to use a plain ol' mutex that's
> optimized for this kind of fine-grained per-object lock with short
> critical sections and minimal contention, like WTF::Lock. Did you try
> alternatives like that? If so, I assume they didn't work well -- can
> you give more details?
>

I'm using WTF::Lock style locks for dict/list mutations. I did an experiment
early on where I included locking around reads as well. I think it slowed
down
the pyperformance benchmarks by ~10% on average, but I can't find my notes
so I plan to re-run the experiment.

Additionally, because dicts are used for things like global variables, I'd
expect
that locks around reads prevent efficient scaling, but I haven't measured
this.

Re: Python multithreading without the GIL [ In reply to ]

tagrain at gmail

Oct 11, 2021, 9:58 AM

Post #19 of 49 (3467 views)

Permalink

Is D1.update(D2) still atomic with this implementation?
https://docs.python.org/3.11/faq/library.html#what-kinds-of-global-value-mutation-are-thread-safe

On Mon, 11 Oct 2021, 17:54 Sam Gross, <colesbury@gmail.com> wrote:

> On Fri, Oct 8, 2021 at 12:04 PM Nathaniel Smith <njs@pobox.com> wrote:
>
>> I notice the fb.com address -- is this a personal project or something
>> facebook is working on? what's the relationship to Cinder, if any?
>>
>
> It is a Facebook project, at least in the important sense that I work on it
> as an employee at Facebook. (I'm currently the only person working on it.)
> I keep in touch with some of the Cinder devs regularly and they've advised
> on the project, but otherwise the two projects are unrelated.
>
>
>> Regarding the tricky lock-free dict/list reads: I guess the more
>> straightforward approach would be to use a plain ol' mutex that's
>> optimized for this kind of fine-grained per-object lock with short
>> critical sections and minimal contention, like WTF::Lock. Did you try
>> alternatives like that? If so, I assume they didn't work well -- can
>> you give more details?
>>
>
> I'm using WTF::Lock style locks for dict/list mutations. I did an
> experiment
> early on where I included locking around reads as well. I think it slowed
> down
> the pyperformance benchmarks by ~10% on average, but I can't find my notes
> so I plan to re-run the experiment.
>
> Additionally, because dicts are used for things like global variables, I'd
> expect
> that locks around reads prevent efficient scaling, but I haven't measured
> this.
>
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-leave@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at
> https://mail.python.org/archives/list/python-dev@python.org/message/V76ZRBM6UMGYU7FTNENMOOW7OYEFYQ5Q/
> Code of Conduct: http://python.org/psf/codeofconduct/
>

Re: Python multithreading without the GIL [ In reply to ]

colesbury at gmail

Oct 11, 2021, 10:13 AM

Post #20 of 49 (3467 views)

Permalink

On Fri, Oct 8, 2021 at 11:35 AM Chris Jerdonek <chris.jerdonek@gmail.com>
wrote:

> Is it also slower even when running with PYTHONGIL=1? If it could be made
> the same speed for single-threaded code when running in GIL-enabled mode,
> that might be an easier intermediate target while still adding value.
>

Running with PYTHONGIL=1 is a bit less than 1% faster (on pyperformance)
than with PYTHONGIL=0. It might be possible to improve PYTHONGIL=1 by
another 1-2% by adding runtime checks for the GIL before attempting to lock
dicts and lists during mutations. I think further optimizations specific to
the PYTHONGIL=1 use case would be tricky.

Re: Python multithreading without the GIL [ In reply to ]

python-dev at python

Oct 11, 2021, 11:27 AM

Post #21 of 49 (3467 views)

Permalink

> On 11 Oct 2021, at 18:58, Thomas Grainger <tagrain@gmail.com> wrote:
>
> Is D1.update(D2) still atomic with this implementation? https://docs.python.org/3.11/faq/library.html#what-kinds-of-global-value-mutation-are-thread-safe <https://docs.python.org/3.11/faq/library.html#what-kinds-of-global-value-mutation-are-thread-safe>
AFAIK this is already only atomic in specific circumstances, that are more limited than the FAQ appears to claim.

For dict.update to be atomic I’d expect that with two threads performing an update on the same keys you’d end up the update of either thread, but not a mix.

That is:

Thread 1: d.update({“a”: 1, “b”: 1})
Thread 2: d.update({“a”: 2, “b”: 2})

The result should have d[“a”] == d[“b”].

This can already end up with a mix of the two when “d” has keys that are objects that implement __eq__ in Python, because the interpreter could switch threads while interpreting __eq__.

A pathological example:

# — start of script —

import threading
import time

stop = False
trigger = False
def runfunc():
while not stop:
if trigger:
d.update({"a": 2, "b": 2 })
print(d)
break

t = threading.Thread(target=runfunc)
t.start()

class X(str):
def __eq__(self, other):
if threading.current_thread() is t:
return str.__eq__(self, other)

global trigger
trigger = True
t.join()
return str.__eq__(self, other)

def __hash__(self):
return str.__hash__(self)

d = {X("b"):0}
print("before", d)
d.update({"a":1, "b": 1})
print("after", d)

stop = True
t.join()

# — end of script —

This prints "after {'b': 1, 'a': 2}” on my machine.

Ronald

>
> On Mon, 11 Oct 2021, 17:54 Sam Gross, <colesbury@gmail.com <mailto:colesbury@gmail.com>> wrote:
> On Fri, Oct 8, 2021 at 12:04 PM Nathaniel Smith <njs@pobox.com <mailto:njs@pobox.com>> wrote:
> I notice the fb.com <http://fb.com/> address -- is this a personal project or something
> facebook is working on? what's the relationship to Cinder, if any?
>
> It is a Facebook project, at least in the important sense that I work on it
> as an employee at Facebook. (I'm currently the only person working on it.)
> I keep in touch with some of the Cinder devs regularly and they've advised
> on the project, but otherwise the two projects are unrelated.
>
> Regarding the tricky lock-free dict/list reads: I guess the more
> straightforward approach would be to use a plain ol' mutex that's
> optimized for this kind of fine-grained per-object lock with short
> critical sections and minimal contention, like WTF::Lock. Did you try
> alternatives like that? If so, I assume they didn't work well -- can
> you give more details?
>
> I'm using WTF::Lock style locks for dict/list mutations. I did an experiment
> early on where I included locking around reads as well. I think it slowed down
> the pyperformance benchmarks by ~10% on average, but I can't find my notes
> so I plan to re-run the experiment.
>
> Additionally, because dicts are used for things like global variables, I'd expect
> that locks around reads prevent efficient scaling, but I haven't measured this.
>
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org <mailto:python-dev@python.org>
> To unsubscribe send an email to python-dev-leave@python.org <mailto:python-dev-leave@python.org>
> https://mail.python.org/mailman3/lists/python-dev.python.org/ <https://mail.python.org/mailman3/lists/python-dev.python.org/>
> Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/V76ZRBM6UMGYU7FTNENMOOW7OYEFYQ5Q/ <https://mail.python.org/archives/list/python-dev@python.org/message/V76ZRBM6UMGYU7FTNENMOOW7OYEFYQ5Q/>
> Code of Conduct: http://python.org/psf/codeofconduct/ <http://python.org/psf/codeofconduct/>
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-leave@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/5RKLUR2DYJ53OIRX74WVZCVRGW7VUXLF/
> Code of Conduct: http://python.org/psf/codeofconduct/

—

Twitter / micro.blog: @ronaldoussoren
Blog: https://blog.ronaldoussoren.net/

Re: Python multithreading without the GIL [ In reply to ]

colesbury at gmail

Oct 11, 2021, 12:06 PM

Post #22 of 49 (3467 views)

Permalink

>
> I’m unclear what is actually retried. You use this note throughout the
> document, so I think it would help to clarify exactly what is retried and
> why that solves the particular problem. I’m confused because, is it the
> refcount increment that’s retried or the entire sequence of steps (i.e. do
> you go back and reload the address of the item)? Is there some kind of
> waiting period before the retry? I would infer that if you’re retrying the
> refcount incrementing, it’s because you expect subsequent retries to
> transition from zero to non-zero, but is that guaranteed? Are there
> possibilities of deadlocks or race conditions?

The entire operation is retried (not just the refcount). For "dict", this
means going back to step 1 and reloading the version tag and
PyDictKeysObject. The operation can fail (and need to be retried) only when
some other thread is concurrently modifying the dict. The reader needs to
perform the checks (and retry) to avoid returning inconsistent data, such
as an object that was never in the dict. With the checks and retry,
returning inconsistent or garbage data is not possible.

The retry is performed after locking the dict, so the operation is retried
at most once -- the read operation can't fail when it holds the dict's lock
because the lock prevents concurrent modifications. It would have also been
possible to retry the operation in a loop without locking the dict, but I
was concerned about reader starvation. (In the doc I wrote "livelock", but
"reader starvation" is more accurate.) In particular, I was concerned that
a thread repeatedly modifying a dict might prevent other threads reading
the dict from making progress. I hadn't seen this in practice, but I'm
aware that reader starvation can be an issue for similar designs like
Linux's seqlock. Acquiring the dict's lock when retrying avoids the reader
starvation issue.

Deadlock isn't possible because the code does not acquire any other
locks while holding the dict's lock. For example, the code releases the
dict's lock before calling Py_DECREF or PyObject_RichCompareBool.

The race condition question is a bit harder to answer precisely. Concurrent
reads and modifications of a dict won't cause the program to segfault,
return garbage data, or items that were never in the dict.

Regards,
Sam

Re: Python multithreading without the GIL [ In reply to ]

arj.python at gmail

Oct 11, 2021, 12:27 PM

Post #23 of 49 (3467 views)

Permalink

When you mean "an order of magnitude less overhead than the current CPython
implementation" do you mean compared with the main branch? We recently
implemented already almost everything is listed in this paragraph:

https://github.com/python/cpython/pull/27077

We also pack some extra similar optimizations in this other PR, including
stealing the frame arguments from python to python calls:

https://github.com/python/cpython/pull/28488

This could explain why the performance is closer to the current master
branch as you indicate:

This means that if we remove the GIL + add the 3.11 improvements we should
get some more speed?

(or if those are integrated in the POC?)

Kind Regards,

Abdur-Rahmaan Janhangeer
about <https://compileralchemy.github.io/> | blog
<https://www.pythonkitchen.com>
github <https://github.com/Abdur-RahmaanJ>
Mauritius

Re: Python multithreading without the GIL [ In reply to ]

colesbury at gmail

Oct 11, 2021, 12:38 PM

Post #24 of 49 (3467 views)

Permalink

On Mon, Oct 11, 2021 at 12:58 PM Thomas Grainger <tagrain@gmail.com> wrote:

> Is D1.update(D2) still atomic with this implementation?
> https://docs.python.org/3.11/faq/library.html#what-kinds-of-global-value-mutation-are-thread-safe
>

No. For example, another thread reading from the dict concurrently may
observe a partial update.

As Ronald Oussoren points out, dict.update isn't atomic in the general
case. CPython even includes some checks for concurrent modifications:
https://github.com/python/cpython/blob/2f92e2a590f0e5d2d3093549f5af9a4a1889eb5a/Objects/dictobject.c#L2582-L2586

Re: Python multithreading without the GIL [ In reply to ]

pablogsal at gmail

Oct 11, 2021, 12:39 PM

Post #25 of 49 (3467 views)

Permalink

As far as I understand we should get a smaller improvement on single thread
because some of the optimizations listed in this work are partially or
totally implemented.

This is excluding any non linear behaviour between the different
optimizations of course, and assuming that both versions yield the same
numbers.

On Mon, 11 Oct 2021, 20:28 Abdur-Rahmaan Janhangeer, <arj.python@gmail.com>
wrote:

> When you mean "an order of magnitude less overhead than the current
> CPython implementation" do you mean compared with the main branch? We
> recently implemented already almost everything is listed in this paragraph:
>
> https://github.com/python/cpython/pull/27077
>
> We also pack some extra similar optimizations in this other PR, including
> stealing the frame arguments from python to python calls:
>
> https://github.com/python/cpython/pull/28488
>
> This could explain why the performance is closer to the current master
> branch as you indicate:
>
>
> This means that if we remove the GIL + add the 3.11 improvements we should
> get some more speed?
>
> (or if those are integrated in the POC?)
>
>
> Kind Regards,
>
> Abdur-Rahmaan Janhangeer
> about <https://compileralchemy.github.io/> | blog
> <https://www.pythonkitchen.com>
> github <https://github.com/Abdur-RahmaanJ>
> Mauritius
>

Mailing List Archive

Attached Files:

Attached Files: