Mailing List Archive

Is there a more efficient threading lock?
I have a multi-threaded program which calls out to a non-thread-safe
library (not mine) in a couple places. I guard against multiple
threads executing code there using threading.Lock. The code is
straightforward:

from threading import Lock

# Something in textblob and/or nltk doesn't play nice with no-gil, so just
# serialize all blobby accesses.
BLOB_LOCK = Lock()

def get_terms(text):
    with BLOB_LOCK:
        phrases = TextBlob(text, np_extractor=EXTRACTOR).noun_phrases
    for phrase in phrases:
        yield phrase

When I monitor the application using py-spy, that with statement is
consuming huge amounts of CPU. Does threading.Lock.acquire() sleep
anywhere? I didn't see anything obvious poking around in the C code
which implements this stuff. I'm no expert though, so could easily
have missed something.

Thx,

Skip
--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
On 2023-02-25 09:52:15 -0600, Skip Montanaro wrote:
> I have a multi-threaded program which calls out to a non-thread-safe
> library (not mine) in a couple places. I guard against multiple
> threads executing code there using threading.Lock. The code is
> straightforward:
>
> from threading import Lock
>
> # Something in textblob and/or nltk doesn't play nice with no-gil, so just
> # serialize all blobby accesses.
> BLOB_LOCK = Lock()
>
> def get_terms(text):
>     with BLOB_LOCK:
>         phrases = TextBlob(text, np_extractor=EXTRACTOR).noun_phrases
>     for phrase in phrases:
>         yield phrase
>
> When I monitor the application using py-spy, that with statement is
> consuming huge amounts of CPU.

Which OS is this?

> Does threading.Lock.acquire() sleep anywhere?

On Linux it calls futex(2), which does sleep if it can't get the lock
right away. (Of course if it does get the lock, it will return
immediately which may use a lot of CPU if you are calling it a lot.)

hp


--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
Re: Is there a more efficient threading lock?
On 2/25/2023 10:52 AM, Skip Montanaro wrote:
> I have a multi-threaded program which calls out to a non-thread-safe
> library (not mine) in a couple places. I guard against multiple
> threads executing code there using threading.Lock. The code is
> straightforward:
>
> from threading import Lock
>
> # Something in textblob and/or nltk doesn't play nice with no-gil, so just
> # serialize all blobby accesses.
> BLOB_LOCK = Lock()
>
> def get_terms(text):
>     with BLOB_LOCK:
>         phrases = TextBlob(text, np_extractor=EXTRACTOR).noun_phrases
>     for phrase in phrases:
>         yield phrase
>
> When I monitor the application using py-spy, that with statement is
> consuming huge amounts of CPU. Does threading.Lock.acquire() sleep
> anywhere? I didn't see anything obvious poking around in the C code
> which implements this stuff. I'm no expert though, so could easily
> have missed something.

I'm no expert on locks, but you don't usually want to keep a lock while
some long-running computation goes on. You want the computation to be
done by a separate thread, put its results somewhere, and then notify
the choreographing thread that the result is ready.
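
A minimal sketch of that pattern, using queue.Queue (all names below are
illustrative stand-ins, not taken from your program):

import queue
import threading

def extract_phrases(text):
    # Stand-in for the non-thread-safe TextBlob/NLTK call.
    return text.split()

work_q = queue.Queue()    # texts waiting to be processed
result_q = queue.Queue()  # extracted phrase lists

def extractor_worker():
    # A single thread owns the non-thread-safe library, so no lock is needed.
    while True:
        text = work_q.get()
        if text is None:  # sentinel: shut the worker down
            break
        result_q.put(extract_phrases(text))

worker = threading.Thread(target=extractor_worker)
worker.start()
work_q.put("some email body")
print(result_q.get())
work_q.put(None)
worker.join()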

This link may be helpful -

https://anandology.com/blog/using-iterators-and-generators/

--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
On 2023-02-25, Paul Rubin <no.email@nospam.invalid> wrote:
> Skip Montanaro <skip.montanaro@gmail.com> writes:
>> from threading import Lock
>
> 1) you generally want to use RLock rather than Lock

Why?

> 2) I have generally felt that using locks at the app level at all is an
> antipattern. The main way I've stayed sane in multi-threaded Python
> code is to have every mutable strictly owned by exactly one thread, pass
> values around using Queues, and have an event loop in each thread taking
> requests from Queues.
>
> 3) I didn't know that no-gil was a now thing and I'm used to having the
> GIL. So I would have considered the multiprocessing module rather than
> threading, for something like this.

What does this mean? Are you saying the GIL has been removed?
--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
Thanks for the responses.

Peter wrote:

> Which OS is this?

MacOS Ventura 13.1, M1 MacBook Pro (eight cores).

Thomas wrote:

> I'm no expert on locks, but you don't usually want to keep a lock while
> some long-running computation goes on. You want the computation to be
> done by a separate thread, put its results somewhere, and then notify
> the choreographing thread that the result is ready.

In this case I'm extracting the noun phrases from the body of an email
message (returned as a list). I have a collection of email messages
organized by month (typically 1000 to 3000 messages per month). I'm using
concurrent.futures.ThreadPoolExecutor() with the default number of workers (
min(32, os.cpu_count() + 4), or 12 threads on my system) to process each month, so
12 active threads at a time. Given that the process is pretty much CPU
bound, maybe reducing the number of workers to the CPU count would make
sense. Processing of each email message enters that with block once. That's
about as minimal as I can make it. I thought for a bit about pushing the
textblob stuff into a separate worker thread, but it wasn't obvious how to
set up queues to handle the communication between the threads created by
ThreadPoolExecutor() and the worker thread. Maybe I'll think about it
harder. (I have a related problem with SQLite, since an open database can't
be manipulated from multiple threads. That makes much of the program's
end-of-run processing single-threaded.)
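
For reference, capping the pool at the CPU count would be a one-line
change; process_month and the sample data below are stand-ins for my
real code:

import os
from concurrent.futures import ThreadPoolExecutor

def process_month(month):
    return f"processed {month}"  # stand-in for the real per-month work

months = ["2023-01", "2023-02"]  # stand-in data

# One worker per core instead of the executor default of
# min(32, os.cpu_count() + 4).
with ThreadPoolExecutor(max_workers=os.cpu_count()) as executor:
    for result in executor.map(process_month, months):
        print(result)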

> This link may be helpful -
>
> https://anandology.com/blog/using-iterators-and-generators/

I don't think that's where my problem is. The lock protects the generation
of the noun phrases. My loop which does the yielding operates outside of
that lock's control. The version of the code is my latest, in which I
tossed out a bunch of phrase-processing code (effectively dead end ideas
for processing the phrases). Replacing the for loop with a simple return
seems not to have any effect. In any case, the caller which uses the
phrases does a fair amount of extra work with the phrases, populating a
SQLite database, so I don't think the amount of time it takes to process a
single email message is dominated by the phrase generation.

Here's timeit output for the noun_phrases code:

% python -m timeit \
    -s 'text = """`python -m timeit --help`""" ; from textblob import TextBlob ; from textblob.np_extractors import ConllExtractor ; ext = ConllExtractor() ; phrases = TextBlob(text, np_extractor=ext).noun_phrases' \
    'phrases = TextBlob(text, np_extractor=ext).noun_phrases'
5000 loops, best of 5: 98.7 usec per loop

I process the output of timeit's help message which looks to be about the
same length as a typical email message, certainly the same order of
magnitude. Also, note that I call it once in the setup to eliminate the
initial training of the ConllExtractor instance. I don't know if ~100us
qualifies as long running or not.

I'll keep messing with it.

Skip
--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
"I'm no expert on locks, but you don't usually want to keep a lock while
some long-running computation goes on. You want the computation to be
done by a separate thread, put its results somewhere, and then notify
the choreographing thread that the result is ready."

Maybe. There are so many possible threaded application designs I'd hesitate to make a general statement.

The threading.Lock.acquire method has flags for both a non-blocking attempt and a timeout, so a valid design could include a long-running computation with a main thread or event loop polling the thread. Or the thread could signal a main loop some other way.
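
For example, both forms in one small sketch (illustrative only):

import threading

lock = threading.Lock()

# Non-blocking attempt: returns False immediately if the lock is held.
if lock.acquire(blocking=False):
    try:
        print("got the lock without waiting")
    finally:
        lock.release()

# Bounded wait: give up after half a second.
if lock.acquire(timeout=0.5):
    try:
        print("got the lock within 0.5 seconds")
    finally:
        lock.release()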

I've written some code that coordinated threads by having a process talk to itself using a socket.socketpair. The advantage is that you can bundle multiple items (sockets, file handles, a polling timeout) into a select.select call which waits without consuming resources (at least on Linux) until
something interesting happens.
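
The skeleton of that trick, condensed from memory (untested sketch):

import select
import socket
import threading
import time

# A pair of connected sockets; the worker writes, the main loop reads.
notify_send, notify_recv = socket.socketpair()

def worker():
    time.sleep(1)              # stand-in for the long-running work
    notify_send.send(b"\x00")  # wake the main loop

threading.Thread(target=worker).start()

# select() sleeps without burning CPU until a socket is readable
# or the timeout expires.
readable, _, _ = select.select([notify_recv], [], [], 5.0)
if notify_recv in readable:
    notify_recv.recv(1)
    print("worker signalled completion")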


--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
On 2023-02-25 09:52:15 -0600, Skip Montanaro wrote:
> BLOB_LOCK = Lock()
>
> def get_terms(text):
>     with BLOB_LOCK:
>         phrases = TextBlob(text, np_extractor=EXTRACTOR).noun_phrases
>     for phrase in phrases:
>         yield phrase
>
> When I monitor the application using py-spy, that with statement is
> consuming huge amounts of CPU.

Another thought:

How accurate is py-spy? Is it possible that it assigns time actually
spent in
    phrases = TextBlob(text, np_extractor=EXTRACTOR).noun_phrases
to
    with BLOB_LOCK:
?

hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
Re: Is there a more efficient threading lock?
On 2/25/2023 4:41 PM, Skip Montanaro wrote:
> Thanks for the responses.
>
> Peter wrote:
>
>> Which OS is this?
>
> MacOS Ventura 13.1, M1 MacBook Pro (eight cores).
>
> Thomas wrote:
>
> > I'm no expert on locks, but you don't usually want to keep a lock while
> > some long-running computation goes on.  You want the computation to be
> > done by a separate thread, put its results somewhere, and then notify
> > the choreographing thread that the result is ready.
>
> In this case I'm extracting the noun phrases from the body of an email
> message (returned as a list). I have a collection of email messages
> organized by month (typically 1000 to 3000 messages per month). I'm using
> concurrent.futures.ThreadPoolExecutor() with the default number of
> workers (min(32, os.cpu_count() + 4), or 12 threads on my system) to process
> each month, so 12 active threads at a time. Given that the process is
> pretty much CPU bound, maybe reducing the number of workers to the CPU
> count would make sense. Processing of each email message enters that
> with block once. That's about as minimal as I can make it. I thought for
> a bit about pushing the textblob stuff into a separate worker thread,
> but it wasn't obvious how to set up queues to handle the communication
> between the threads created by ThreadPoolExecutor() and the worker
> thread. Maybe I'll think about it harder. (I have a related problem with
> SQLite, since an open database can't be manipulated from multiple
> threads. That makes much of the program's end-of-run processing
> single-threaded.)

If the noun extractor is single-threaded (which I think you mentioned),
no amount of parallel access is going to help. The best you can do is
to queue up requests so that as soon as the noun extractor returns from
one call, it gets handed another blob. The CPU will be busy all the
time running the noun-extraction code.

If that's the case, you might just as well eliminate all the threads and
just do it sequentially in the most obvious and simple manner.

It would possibly be worth while to try this approach out and see what
happens to the CPU usage and overall computation time.

> > This link may be helpful -
> >
> > https://anandology.com/blog/using-iterators-and-generators/
>
> I don't think that's where my problem is. The lock protects the
> generation of the noun phrases. My loop which does the yielding operates
> outside of that lock's control. The version of the code is my latest, in
> which I tossed out a bunch of phrase-processing code (effectively dead
> end ideas for processing the phrases). Replacing the for loop with a
> simple return seems not to have any effect. In any case, the caller
> which uses the phrases does a fair amount of extra work with the
> phrases, populating a SQLite database, so I don't think the amount of
> time it takes to process a single email message is dominated by the
> phrase generation.
>
> Here's timeit output for the noun_phrases code:
>
> % python -m timeit \
>     -s 'text = """`python -m timeit --help`""" ; from textblob import TextBlob ; from textblob.np_extractors import ConllExtractor ; ext = ConllExtractor() ; phrases = TextBlob(text, np_extractor=ext).noun_phrases' \
>     'phrases = TextBlob(text, np_extractor=ext).noun_phrases'
> 5000 loops, best of 5: 98.7 usec per loop
>
> I process the output of timeit's help message which looks to be about
> the same length as a typical email message, certainly the same order of
> magnitude. Also, note that I call it once in the setup to eliminate the
> initial training of the ConllExtractor instance. I don't know if ~100us
> qualifies as long running or not.
>
> I'll keep messing with it.
>
> Skip

--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
Re sqlite and threads: the C API can be compiled to be thread safe, from my
reading of the sqlite docs. What I have not checked is how Python's bundled
sqlite is compiled. There are claims Python's sqlite is not thread safe.
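
One way to check what the bundled module reports (threadsafety is part of
the DB-API):

import sqlite3

# DB-API thread-safety level: 0 = not thread safe, 1 = threads may share
# the module but not connections, 3 = connections and cursors may be
# shared too. Before Python 3.11 this attribute was hard-coded to 1;
# since 3.11 it reflects how the underlying SQLite library was compiled.
print(sqlite3.threadsafety)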

Barry


--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
On 2023-02-25, Paul Rubin <no.email@nospam.invalid> wrote:
> Jon Ribbens <jon+usenet@unequivocal.eu> writes:
>>> 1) you generally want to use RLock rather than Lock
>> Why?
>
> So that a thread that tries to acquire it twice doesn't block itself,
> etc. Look at the threading lib docs for more info.

Yes, I know what the docs say, I was asking why you were making the
statement above. I haven't used Lock very often, but I've literally
never once in 25 years needed to use RLock. As you say, it's best
to keep the lock-protected code brief, so it's usually pretty
obvious that the code can't be re-entered.

>> What does this mean? Are you saying the GIL has been removed?
>
> Last I heard there was an experimental version of CPython with the GIL
> removed. It is supposed to take less of a performance hit due to
> INCREF/DECREF than an earlier attempt some years back. I don't know its
> current status.
>
> The GIL is an evil thing, but it has been around for so long that most
> of us have gotten used to it, and some user code actually relies on it.
> For example, with the GIL in place, a statement like "x += 1" is always
> atomic, I believe. But, I think it is better to not have any shared
> mutables regardless.

I think it is the case that x += 1 is atomic but foo.x += 1 is not.
Any replacement for the GIL would have to keep the former at least,
plus the fact that you can do hundreds of things like list.append(foo)
which are all effectively atomic.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
On Sat, 25 Feb 2023 15:41:52 -0600, Skip Montanaro
<skip.montanaro@gmail.com> declaimed the following:


>concurrent.futures.ThreadPoolExecutor() with the default number of workers (
>min(32, os.cpu_count() + 4), or 12 threads on my system) to process each month, so
>12 active threads at a time. Given that the process is pretty much CPU
>bound, maybe reducing the number of workers to the CPU count would make

Unless things have improved a lot over the years, the GIL still limits
active threads to the equivalent of a single CPU. The OS may swap among
which CPU as it schedules system processes, but only one thread will be
running at any moment regardless of CPU count.

Common wisdom is that Python threading works well for I/O bound
systems, where each thread spends most of its time waiting for some I/O
operation to complete -- thereby allowing the OS to schedule other threads.

For CPU bound, use of the multiprocessing package may be more suited --
though you'll have to devise a working IPC system to transfer data to/from the
separate processes (no shared objects as possible with threads).
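
A rough pool-based skeleton (the function and data are illustrative
stand-ins only):

import multiprocessing

def extract(text):
    # Stand-in for the real noun-phrase extraction; runs in a child
    # process, so the parent's GIL doesn't serialize it.
    return text.split()

if __name__ == '__main__':
    texts = ["one email body", "another email body"]
    with multiprocessing.Pool() as pool:
        # imap_unordered hands results back as they complete.
        for phrases in pool.imap_unordered(extract, texts):
            print(phrases)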


--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/
--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
On Sun, 26 Feb 2023 at 16:16, Jon Ribbens via Python-list
<python-list@python.org> wrote:
>
> On 2023-02-25, Paul Rubin <no.email@nospam.invalid> wrote:
> > The GIL is an evil thing, but it has been around for so long that most
> > of us have gotten used to it, and some user code actually relies on it.
> > For example, with the GIL in place, a statement like "x += 1" is always
> > atomic, I believe. But, I think it is better to not have any shared
> > mutables regardless.
>
> I think it is the case that x += 1 is atomic but foo.x += 1 is not.
> Any replacement for the GIL would have to keep the former at least,
> plus the fact that you can do hundreds of things like list.append(foo)
> which are all effectively atomic.

The GIL is most assuredly *not* an evil thing. If you think it's so
evil, go ahead and remove it, because we'll clearly be better off
without it, right?

As it turns out, most GIL-removal attempts have had a fairly nasty
negative effect on performance. The GIL is a huge performance boost.

As to what is atomic and what is not... it's complicated, as always.
Suppose that x (or foo.x) is a custom type:

class Thing:
    def __iadd__(self, other):
        print("Hi, I'm being added onto!")
        self.increment_by(other)
        return self

Then no, neither of these is atomic, although if the increment itself
is, it probably won't matter. As far as I know, the only way that it
would be at all different for x+=1 and foo.x+=1 would be if the
__iadd__ method both mutates and returns something other than self,
which is quite unusual. (Most incrementing is done by either
constructing a new object to return, or mutating the existing one, but
not a hybrid.)

Consider this:

import threading
d = {0:0, 1:0, 2:0, 3:0}
def thrd():
    for _ in range(10000):
        d[0] += 1
        d[1] += 1
        d[2] += 1
        d[3] += 1

threads = [threading.Thread(target=thrd) for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
print(d)

Is this code guaranteed to result in 500000 in every slot in the
dictionary? What if you replace the dictionary with a four-element
list? Do you need a GIL for this, or some other sort of lock? What
exactly is it that is needed? To answer that question, let's look at
exactly what happens in the disassembly:

>>> def thrd():
...     d[0] += 1
...     d[1] += 1
...
>>> import dis
>>> dis.dis(thrd)
  1           0 RESUME                   0

  2           2 LOAD_GLOBAL              0 (d)
             14 LOAD_CONST               1 (0)
             16 COPY                     2
             18 COPY                     2
             20 BINARY_SUBSCR
             30 LOAD_CONST               2 (1)
             32 BINARY_OP               13 (+=)
             36 SWAP                     3
             38 SWAP                     2
             40 STORE_SUBSCR

  3          44 LOAD_GLOBAL              0 (d)
             56 LOAD_CONST               2 (1)
             58 COPY                     2
             60 COPY                     2
             62 BINARY_SUBSCR
             72 LOAD_CONST               2 (1)
             74 BINARY_OP               13 (+=)
             78 SWAP                     3
             80 SWAP                     2
             82 STORE_SUBSCR
             86 LOAD_CONST               0 (None)
             88 RETURN_VALUE
>>>

(Your exact disassembly may differ, this was on CPython 3.12.)
Crucially, note these three instructions that occur in each block:
BINARY_SUBSCR, BINARY_OP, and STORE_SUBSCR. Those are a lookup
(retrieving the value of d[0]), the actual addition (adding one to the
value), and a store (putting the result back into d[0]). So it's
actually not guaranteed to be atomic; it would be perfectly reasonable
to interrupt that sequence and have something else do another
subscript.

Here's the equivalent with just incrementing a global:

>>> def thrd():
...     x += 1
...
>>> dis.dis(thrd)
  1           0 RESUME                   0

  2           2 LOAD_FAST_CHECK          0 (x)
              4 LOAD_CONST               1 (1)
              6 BINARY_OP               13 (+=)
             10 STORE_FAST               0 (x)
             12 LOAD_CONST               0 (None)
             14 RETURN_VALUE
>>>

The exact same sequence: load, add, store. Still not atomic.

General takeaway: The GIL is a performance feature, not a magic
solution, and certainly not an evil beast that must be slain at any
cost. Attempts to remove it always have to provide equivalent
protection in some other way. But the protection you think you have
might not be what you actually have.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
On Sun, 26 Feb 2023 at 16:27, Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
>
> On Sat, 25 Feb 2023 15:41:52 -0600, Skip Montanaro
> <skip.montanaro@gmail.com> declaimed the following:
>
>
> >concurrent.futures.ThreadPoolExecutor() with the default number of workers (
> >min(32, os.cpu_count() + 4), or 12 threads on my system) to process each month, so
> >12 active threads at a time. Given that the process is pretty much CPU
> >bound, maybe reducing the number of workers to the CPU count would make
>
> Unless things have improved a lot over the years, the GIL still limits
> active threads to the equivalent of a single CPU. The OS may swap among
> which CPU as it schedules system processes, but only one thread will be
> running at any moment regardless of CPU count.

Specifically, a single CPU core *executing Python bytecode*. There are
quite a few libraries that release the GIL during computation. Here's
a script that's quite capable of saturating my CPU entirely - in fact,
typing this email is glitchy due to lack of resources:

import threading
import bcrypt
results = [0, 0]
def thrd():
    for _ in range(10):
        ok = bcrypt.checkpw(b"password",
            b'$2b$15$DGDXMb2zvPotw1rHFouzyOVzSopiLIUSedO5DVGQ1GblAd6L6I8/6')
        results[ok] += 1

threads = [threading.Thread(target=thrd) for _ in range(100)]
for t in threads: t.start()
for t in threads: t.join()
print(results)

I have four cores eight threads, and yeah, my CPU's not exactly the
latest and greatest (i7 6700k - it was quite good some years ago, but
outstripped now), but feel free to crank the numbers if you want to.

I'm pretty sure bcrypt won't use more than one CPU core for a single
hashpw/checkpw call, but by releasing the GIL during the hard number
crunching, it allows easy parallelization. Same goes for numpy work,
or anything else that can be treated as a separate operation.

So it's more accurate to say that only one CPU core can be
*manipulating Python objects* at a time, although it's hard to pin
down exactly what that means, making it easier to say that there can
only be one executing Python bytecode; it should be possible for any
function call into a C library to be a point where other threads can
take over (most notably, any sort of I/O, but also these kinds of
things).

As mentioned, GIL-removal has been under discussion at times, most
recently (and currently) with PEP 703
https://peps.python.org/pep-0703/ - and the benefits in multithreaded
applications always have to be factored against quite significant
performance penalties. It's looking like PEP 703's proposal has the
least-bad performance measurements of any GILectomy I've seen so far,
showing 10% worse performance on average (possibly able to be reduced
to 5%). As it happens, a GIL just makes sense when you want pure, raw
performance, and it's only certain workloads that suffer under it.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
On 25/02/2023 23:45, Jon Ribbens via Python-list wrote:
> I think it is the case that x += 1 is atomic but foo.x += 1 is not.

No that is not true, and has never been true.

>>> def x(a):
...     a += 1
...
>>> dis.dis(x)
  1           0 RESUME                   0

  2           2 LOAD_FAST                0 (a)
              4 LOAD_CONST               1 (1)
              6 BINARY_OP               13 (+=)
             10 STORE_FAST               0 (a)
             12 LOAD_CONST               0 (None)
             14 RETURN_VALUE
>>>

As you can see there are 4 byte code ops executed.

Python's eval loop can switch to another thread between any of them.

It is not true that the GIL provides atomic operations in Python.

Barry

--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
On 2023-02-26, Chris Angelico <rosuav@gmail.com> wrote:
> On Sun, 26 Feb 2023 at 16:16, Jon Ribbens via Python-list
><python-list@python.org> wrote:
>> On 2023-02-25, Paul Rubin <no.email@nospam.invalid> wrote:
>> > The GIL is an evil thing, but it has been around for so long that most
>> > of us have gotten used to it, and some user code actually relies on it.
>> > For example, with the GIL in place, a statement like "x += 1" is always
>> > atomic, I believe. But, I think it is better to not have any shared
>> > mutables regardless.
>>
>> I think it is the case that x += 1 is atomic but foo.x += 1 is not.
>> Any replacement for the GIL would have to keep the former at least,
>> plus the fact that you can do hundreds of things like list.append(foo)
>> which are all effectively atomic.
>
> The GIL is most assuredly *not* an evil thing. If you think it's so
> evil, go ahead and remove it, because we'll clearly be better off
> without it, right?

If you say so. I said nothing whatsoever about the GIL being evil.

> As it turns out, most GIL-removal attempts have had a fairly nasty
> negative effect on performance. The GIL is a huge performance boost.
>
> As to what is atomic and what is not... it's complicated, as always.
> Suppose that x (or foo.x) is a custom type:

Yes, sure, you can make x += 1 not work even single-threaded if you
make custom types which override basic operations. I'm talking about
when you're dealing with simple atomic built-in types such as integers.

> Here's the equivalent with just incrementing a global:
>
>>>> def thrd():
> ... x += 1
> ...
>>>> dis.dis(thrd)
> 1 0 RESUME 0
>
> 2 2 LOAD_FAST_CHECK 0 (x)
> 4 LOAD_CONST 1 (1)
> 6 BINARY_OP 13 (+=)
> 10 STORE_FAST 0 (x)
> 12 LOAD_CONST 0 (None)
> 14 RETURN_VALUE
>>>>
>
> The exact same sequence: load, add, store. Still not atomic.

And yet, it appears that *something* changed between Python 2
and Python 3 such that it *is* atomic:

import sys, threading
class Foo:
    x = 0
foo = Foo()
y = 0
def thrd():
    global y
    for _ in range(10000):
        foo.x += 1
        y += 1
threads = [threading.Thread(target=thrd) for _ in range(50)]
for t in threads: t.start()
for t in threads: t.join()
print(sys.version)
print(foo.x, y)

2.7.5 (default, Jun 28 2022, 15:30:04)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-44)]
(64489, 59854)

3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0]
500000 500000

--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
On 2023-02-26, Barry Scott <barry@barrys-emacs.org> wrote:
> On 25/02/2023 23:45, Jon Ribbens via Python-list wrote:
>> I think it is the case that x += 1 is atomic but foo.x += 1 is not.
>
> No that is not true, and has never been true.
>
> >>> def x(a):
> ...     a += 1
> ...
> >>> dis.dis(x)
>   1           0 RESUME                   0
>
>   2           2 LOAD_FAST                0 (a)
>               4 LOAD_CONST               1 (1)
>               6 BINARY_OP               13 (+=)
>              10 STORE_FAST               0 (a)
>              12 LOAD_CONST               0 (None)
>              14 RETURN_VALUE
> >>>
>
> As you can see there are 4 byte code ops executed.
>
> Python's eval loop can switch to another thread between any of them.
>
> It is not true that the GIL provides atomic operations in Python.

That's oversimplifying to the point of falsehood (just as the opposite
would be too). And: see my other reply in this thread just now - if the
GIL isn't making "x += 1" atomic, something else is.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
Thanks for the various replies. The program originally started out
single-threaded. I wandered down the multi-threaded path to see if I could
get a performance boost using Sam Gross's NoGIL fork
<https://github.com/colesbury/nogil-3.12>. I was pretty sure the GIL would
limit multi-threading performance on a stock Python interpreter. When I
first switched to threads, I didn't have a lock around the one or two
places which called out to the TextBlob <https://textblob.readthedocs.io/>/NLTK
stuff. The use of threading.Lock was the obvious simplest choice, and it
solved the crash I saw without it. I'm still thinking about using queues to
communicate between the email processing threads and the TextBlob & SQLite
processing stuff.

I had been doing a bit of pre- and post-processing of the default TextBlob
noun phrase generation, but I wasn't happy with it, so I decided to
experiment with an alternate noun phrase extractor
<https://textblob.readthedocs.io/en/dev/api_reference.html?highlight=ConllExtractor#textblob.en.np_extractors.ConllExtractor>.
I was happier with that, so ripped out most of the ad hoc stuff I was
doing. While doing this code surgery, I moved back to 3.11 to have a more
trusty Python interpreter. (I've yet to encounter a problem with NoGIL,
just cutting back on moving parts, and wasn't seeing any obvious
performance gains.)

As for SQLite and multi-threading, I figured if the core devs hadn't yet
gotten around to making it available then it probably wasn't
straightforward. I wasn't willing to tackle that.

So, I'll keep messing around. It's all just for fun
<https://www.smontanaro.net/CR> anyway.

Skip
--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
> And yet, it appears that *something* changed between Python 2 and
> Python 3 such that it *is* atomic:

I haven't looked, but something to check in the source is opcode
prediction. It's possible that after the BINARY_OP executes, opcode
prediction jumps straight to the STORE_FAST opcode, avoiding the transfer
to the top of the virtual machine loop. That would (I think) avoid checks
related to GIL release and thread switches.

I don't guarantee that's what's going on, and even if I'm correct, I don't
think you can rely on it.

Skip
--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
On Mon, 27 Feb 2023 at 10:42, Jon Ribbens via Python-list
<python-list@python.org> wrote:
>
> On 2023-02-26, Chris Angelico <rosuav@gmail.com> wrote:
> > On Sun, 26 Feb 2023 at 16:16, Jon Ribbens via Python-list
> ><python-list@python.org> wrote:
> >> On 2023-02-25, Paul Rubin <no.email@nospam.invalid> wrote:
> >> > The GIL is an evil thing, but it has been around for so long that most
> >> > of us have gotten used to it, and some user code actually relies on it.
> >> > For example, with the GIL in place, a statement like "x += 1" is always
> >> > atomic, I believe. But, I think it is better to not have any shared
> >> > mutables regardless.
> >>
> >> I think it is the case that x += 1 is atomic but foo.x += 1 is not.
> >> Any replacement for the GIL would have to keep the former at least,
> >> plus the fact that you can do hundreds of things like list.append(foo)
> >> which are all effectively atomic.
> >
> > The GIL is most assuredly *not* an evil thing. If you think it's so
> > evil, go ahead and remove it, because we'll clearly be better off
> > without it, right?
>
> If you say so. I said nothing whatsoever about the GIL being evil.

You didn't, but I was also responding to Paul's description that the
GIL "is an evil thing". Apologies if that wasn't clear.

> Yes, sure, you can make x += 1 not work even single-threaded if you
> make custom types which override basic operations. I'm talking about
> when you're dealing with simple atomic built-in types such as integers.
>
> > Here's the equivalent with just incrementing a global:
> >
> >>>> def thrd():
> > ... x += 1
> > ...
> >>>> dis.dis(thrd)
> > 1 0 RESUME 0
> >
> > 2 2 LOAD_FAST_CHECK 0 (x)
> > 4 LOAD_CONST 1 (1)
> > 6 BINARY_OP 13 (+=)
> > 10 STORE_FAST 0 (x)
> > 12 LOAD_CONST 0 (None)
> > 14 RETURN_VALUE
> >>>>
> >
> > The exact same sequence: load, add, store. Still not atomic.
>
> And yet, it appears that *something* changed between Python 2
> and Python 3 such that it *is* atomic:

I don't think that's a guarantee. You might be unable to make it
break, but that doesn't mean it's dependable.

In any case, it's not the GIL that's doing this. It might be a quirk
of the current implementation of the core evaluation loop, or it might
be something unrelated, but whatever it is, removing the GIL wouldn't
change that; and it's certainly no different whether it's a global or
an attribute of an object.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
I wanted to provide an example showing that your claimed atomicity is simply
wrong, but I found there is something different in the 3.10+ cpython
implementations.

I've tested the code at the bottom of this message using a few docker
python images, and it appears there is a difference starting in 3.10.0

python:3.8
EXPECTED 2560000000
ACTUAL 84533137
python:3.9
EXPECTED 2560000000
ACTUAL 95311773
python:3.10 (.8)
EXPECTED 2560000000
ACTUAL 2560000000

just to see if there was a specific sub-version of 3.10 that added it
python:3.10.0
EXPECTED 2560000000
ACTUAL 2560000000

nope, from the start of 3.10 this is happening

the only difference in the bytecode I see is 3.10 adds SETUP_LOOP and
POP_BLOCK around the for loop

I don't see anything different in the long c code that I would expect would
cause this.

AFAICT the inplace add is null for longs and so should revert to the
long_add that always creates a new integer in x_add

another test
python:3.11
EXPECTED 2560000000
ACTUAL 2560000000

I'm not sure where the difference is at the moment. I didn't see anything
in the release notes given a quick glance.

I do agree that you shouldn't depend on this unless you find a written
guarantee of the behavior, as it is likely an implementation quirk of some
kind

--[code]--

import threading

UPDATES = 10000000
THREADS = 256

vv = 0

def update_x_times( xx ):
    for _ in range( xx ):
        global vv
        vv += 1

def main():
    tts = []
    for _ in range( THREADS ):
        tts.append( threading.Thread( target = update_x_times,
                                      args = (UPDATES,) ) )

    for tt in tts:
        tt.start()

    for tt in tts:
        tt.join()

    print( 'EXPECTED', UPDATES * THREADS )
    print( 'ACTUAL  ', vv )

if __name__ == '__main__':
    main()

On Sun, Feb 26, 2023 at 6:35 PM Jon Ribbens via Python-list <
python-list@python.org> wrote:

> On 2023-02-26, Barry Scott <barry@barrys-emacs.org> wrote:
> > On 25/02/2023 23:45, Jon Ribbens via Python-list wrote:
> >> I think it is the case that x += 1 is atomic but foo.x += 1 is not.
> >
> > No that is not true, and has never been true.
> >
> > >>> def x(a):
> > ...     a += 1
> > ...
> > >>> dis.dis(x)
> >   1           0 RESUME                   0
> >
> >   2           2 LOAD_FAST                0 (a)
> >               4 LOAD_CONST               1 (1)
> >               6 BINARY_OP               13 (+=)
> >              10 STORE_FAST               0 (a)
> >              12 LOAD_CONST               0 (None)
> >              14 RETURN_VALUE
> > >>>
> >
> > As you can see there are 4 byte code ops executed.
> >
> > Python's eval loop can switch to another thread between any of them.
> >
> > It is not true that the GIL provides atomic operations in Python.
>
> That's oversimplifying to the point of falsehood (just as the opposite
> would be too). And: see my other reply in this thread just now - if the
> GIL isn't making "x += 1" atomic, something else is.
> --
> https://mail.python.org/mailman/listinfo/python-list
>
--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
https://stackoverflow.com/questions/69993959/python-threads-difference-for-3-10-and-others

https://github.com/python/cpython/commit/4958f5d69dd2bf86866c43491caf72f774ddec97

It's a quirk of implementation. The scheduler currently only checks if it
needs to release the gil after the POP_JUMP_IF_FALSE, POP_JUMP_IF_TRUE,
JUMP_ABSOLUTE, CALL_METHOD, CALL_FUNCTION, CALL_FUNCTION_KW, and
CALL_FUNCTION_EX opcodes.

>>> import code
>>> import dis
>>> dis.dis( code.update_x_times )
 10           0 LOAD_GLOBAL              0 (range)
              2 LOAD_FAST                0 (xx)
              4 CALL_FUNCTION            1
##### GIL CAN RELEASE HERE #####
              6 GET_ITER
        >>    8 FOR_ITER                 6 (to 22)
             10 STORE_FAST               1 (_)

 12          12 LOAD_GLOBAL              1 (vv)
             14 LOAD_CONST               1 (1)
             16 INPLACE_ADD
             18 STORE_GLOBAL             1 (vv)
             20 JUMP_ABSOLUTE            4 (to 8)
##### GIL CAN RELEASE HERE (after JUMP_ABSOLUTE points the instruction
counter back to FOR_ITER, but before the interpreter actually jumps to
FOR_ITER again) #####

 10     >>   22 LOAD_CONST               0 (None)
             24 RETURN_VALUE
>>>

Due to this, this section:

 12          12 LOAD_GLOBAL              1 (vv)
             14 LOAD_CONST               1 (1)
             16 INPLACE_ADD
             18 STORE_GLOBAL             1 (vv)

is effectively locked/atomic on post-3.10 interpreters, though this is
neither portable nor guaranteed to stay that way into the future


On Sun, Feb 26, 2023 at 10:19 PM Michael Speer <knomenet@gmail.com> wrote:

> I wanted to provide an example that your claimed atomicity is simply
> wrong, but I found there is something different in the 3.10+ cpython
> implementations.
>
> I've tested the code at the bottom of this message using a few docker
> python images, and it appears there is a difference starting in 3.10.0
>
> python3.8
> EXPECTED 2560000000
> ACTUAL 84533137
> python:3.9
> EXPECTED 2560000000
> ACTUAL 95311773
> python:3.10 (.8)
> EXPECTED 2560000000
> ACTUAL 2560000000
>
> just to see if there was a specific sub-version of 3.10 that added it
> python:3.10.0
> EXPECTED 2560000000
> ACTUAL 2560000000
>
> nope, from the start of 3.10 this is happening
>
> the only difference in the bytecode I see is 3.10 adds SETUP_LOOP and
> POP_BLOCK around the for loop
>
> I don't see anything different in the long c code that I would expect
> would cause this.
>
> AFAICT the inplace add is null for longs and so should revert to the
> long_add that always creates a new integer in x_add
>
> another test
> python:3.11
> EXPECTED 2560000000
> ACTUAL 2560000000
>
> I'm not sure where the difference is at the moment. I didn't see anything
> in the release notes given a quick glance.
>
> I do agree that you shouldn't depend on this unless you find a written
> guarantee of the behavior, as it is likely an implementation quirk of some
> kind
>
> --[code]--
>
> import threading
>
> UPDATES = 10000000
> THREADS = 256
>
> vv = 0
>
> def update_x_times( xx ):
>     for _ in range( xx ):
>         global vv
>         vv += 1
>
> def main():
>     tts = []
>     for _ in range( THREADS ):
>         tts.append( threading.Thread( target = update_x_times,
>                                       args = (UPDATES,) ) )
>
>     for tt in tts:
>         tt.start()
>
>     for tt in tts:
>         tt.join()
>
>     print( 'EXPECTED', UPDATES * THREADS )
>     print( 'ACTUAL  ', vv )
>
> if __name__ == '__main__':
>     main()
>
> On Sun, Feb 26, 2023 at 6:35 PM Jon Ribbens via Python-list <
> python-list@python.org> wrote:
>
>> On 2023-02-26, Barry Scott <barry@barrys-emacs.org> wrote:
>> > On 25/02/2023 23:45, Jon Ribbens via Python-list wrote:
>> >> I think it is the case that x += 1 is atomic but foo.x += 1 is not.
>> >
>> > No that is not true, and has never been true.
>> >
>> > >>> def x(a):
>> > ...     a += 1
>> > ...
>> > >>> dis.dis(x)
>> >   1           0 RESUME                   0
>> >
>> >   2           2 LOAD_FAST                0 (a)
>> >               4 LOAD_CONST               1 (1)
>> >               6 BINARY_OP               13 (+=)
>> >              10 STORE_FAST               0 (a)
>> >              12 LOAD_CONST               0 (None)
>> >              14 RETURN_VALUE
>> > >>>
>> >
>> > As you can see there are 4 byte code ops executed.
>> >
>> > Python's eval loop can switch to another thread between any of them.
>> >
>> > It is not true that the GIL provides atomic operations in Python.
>>
>> That's oversimplifying to the point of falsehood (just as the opposite
>> would be too). And: see my other reply in this thread just now - if the
>> GIL isn't making "x += 1" atomic, something else is.
>> --
>> https://mail.python.org/mailman/listinfo/python-list
>>
>
--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
On Mon, 27 Feb 2023 at 17:28, Michael Speer <knomenet@gmail.com> wrote:
>
> https://github.com/python/cpython/commit/4958f5d69dd2bf86866c43491caf72f774ddec97
>
> it's a quirk of implementation. the scheduler currently only checks if it
> needs to release the gil after the POP_JUMP_IF_FALSE, POP_JUMP_IF_TRUE,
> JUMP_ABSOLUTE, CALL_METHOD, CALL_FUNCTION, CALL_FUNCTION_KW, and
> CALL_FUNCTION_EX opcodes.
>

Oh now that is VERY interesting. It's a quirk of implementation, yes,
but there's a reason for it; a bug being solved. The underlying
guarantee about __exit__ should be considered to be defined behaviour,
meaning that the precise quirk might not be relevant even though the
bug has to remain fixed in all future versions. But I'd also note here
that, if it can be absolutely 100% guaranteed that the GIL will be
released and signals checked on a reasonable interval, there's no
particular reason to state that signals are checked after every single
Python bytecode. (See the removed comment about empty loops, which
would have been a serious issue and is probably why the backward jump
rule exists.)

So it wouldn't be too hard for a future release of Python to mandate
atomicity of certain specific operations. Obviously it'd require
buy-in from other implementations, but it would be rather convenient
if, subject to some very tight rules like "only when adding integers
onto core data types" etc, a simple statement like "x.y += 1" could
actually be guaranteed to take place atomically.

Though it's still probably not as useful as you might hope. In C, if I
can do "int id = counter++;" atomically, it would guarantee me a new
ID that no other thread could ever have. But in Python, that increment
operation doesn't give you the result, so all it's really useful for
is statistics on operations done. Still, that in itself could be of
value in quite a few situations.
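
If you actually need unique IDs handed out across threads in Python, an
explicit lock does the job; a minimal sketch:

import threading

class IdGenerator:
    def __init__(self):
        self._lock = threading.Lock()
        self._next = 0

    def next_id(self):
        # The lock makes read-increment-return one indivisible step,
        # so no two threads can ever receive the same ID.
        with self._lock:
            value = self._next
            self._next += 1
            return value

gen = IdGenerator()
print(gen.next_id())  # 0
print(gen.next_id())  # 1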

In any case, though, this isn't something to depend upon at the moment.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
> Though it's still probably not as useful as you might hope. In C, if I
> can do "int id = counter++;" atomically, it would guarantee me a new
> ID that no other thread could ever have.

C does not have to do that atomically. In fact it is free to use lots of instructions to build the int value, and some compilers indeed do; the Linux kernel folks see this in gcc-generated code.

I understand you have to use the new atomics features.

Barry


--
https://mail.python.org/mailman/listinfo/python-list
Re: Is there a more efficient threading lock?
On Wed, 1 Mar 2023 at 10:04, Barry <barry@barrys-emacs.org> wrote:
>
> > Though it's still probably not as useful as you might hope. In C, if I
> > can do "int id = counter++;" atomically, it would guarantee me a new
> > ID that no other thread could ever have.
>
> C does not have to do that atomically. In fact it is free to use lots of instructions to build the int value. And some compilers indeed do, the linux kernel folks see this in gcc generated code.
>
> I understand you have to use the new atomics features.
>

Yeah, I didn't have a good analogy so I went with a hypothetical. The
atomicity would be more useful in that context as it would give
lock-free ID generation, which doesn't work in Python.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list