Mailing List Archive: [ python-Bugs-1731717 ] race condition in subprocess module

Bugs item #1731717, was opened at 2007-06-06 08:19
Message generated for change (Comment added) made by abo
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1731717&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: dsagal (dsagal)
Assigned to: Peter Ã…strand (astrand)
Summary: race condition in subprocess module

Initial Comment:
Python's subprocess module has a race condition: Popen() constructor has a call to global "_cleanup()" function on whenever a Popen object gets created, and that call causes a check for all pending Popen objects whether their subprocess has exited - i.e. the poll() method is called for every active Popen object.

Now, if I create Popen object "foo" in one thread, and then a.wait(), and meanwhile I create another Popen object "bar" in another thread, then a.poll() might get called by _clean() right at the time when my first thread is running a.wait(). But those are not synchronized - each calls os.waitpid() if returncode is None, but the section which checks returncode and calls os.waitpid() is not protected as a critical section should be.

The following code illustrates the problem, and is known to break reasonably consistenly with Python2.4. Changes to subprocess in Python2.5 seems to address a somewhat related problem, but not this one.

import sys, os, threading, subprocess, time

class X(threading.Thread):
def __init__(self, *args, **kwargs):
super(X, self).__init__(*args, **kwargs)
self.start()

def tt():
s = subprocess.Popen(("/bin/ls", "-a", "/tmp"), stdout=subprocess.PIPE,
universal_newlines=True)
# time.sleep(1)
s.communicate()[0]

for i in xrange(1000):
X(target = tt)

This typically gives a few dozen errors like these:
Exception in thread Thread-795:
Traceback (most recent call last):
File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap
self.run()
File "/usr/lib/python2.4/threading.py", line 422, in run
self.__target(*self.__args, **self.__kwargs)
File "z.py", line 21, in tt
s.communicate()[0]
File "/usr/lib/python2.4/subprocess.py", line 1083, in communicate
self.wait()
File "/usr/lib/python2.4/subprocess.py", line 1007, in wait
pid, sts = os.waitpid(self.pid, 0)
OSError: [Errno 10] No child processes

Note that uncommenting time.sleep(1) fixes the problem. So does wrapping subprocess.poll() and wait() with a lock. So does removing the call to global _cleanup() in Popen constructor.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-04 05:14

Message:
Logged In: YES
user_id=10273
Originator: NO

I just looked at popen2 and it has the same bug. I don't think __del__()
related changes have anything to do with this. Popen.poll() and _cleanup()
are non-threadsafe. When __init__() is called to create subprocesses
simultaniously in two different threads, they both call _cleanup() and
trample all over each other.

Removing _cleanup() will not leave zombies for popened processes that are
correctly handled to completion. Using methods like communicate() and
wait() will handle the process to completion and reap the child.

Unfortunately, I think a fairly common use-case is people using popen to
fire-and-forget subprocesses without using communicate() or wait(). These
will not get reaped, and will end up as zombies.

I cannot think of an easy way to reap abandoned popen instances that is
thread safe without using locks. At times like this I wish that the GIL was
exposed as a recursive lock... we could have __cleanup__() acquire/release
the GIL and make it atomic... no more race condition :-)

----------------------------------------------------------------------

Comment By: Peter Ã…strand (astrand)
Date: 2007-08-02 19:15

Message:
Logged In: YES
user_id=344921
Originator: NO

Suddenly starting to leave Zombies is NOT an option for me. To prevent
zombies, all applications using the subprocess module would need to be
changed. This is a major breakage of backwards compatibility, IMHO.
subprocess (as well as the popen2 module) has prevented zombies
automatically from the beginning and I believe they should continue to do
so (at least by default).

A little bit of history: When I wrote the subprocess module, I stole the
idea of calling _cleanup whenever a new instance is created from the popen2
module, since I thought it was nice, lightweight and avoided having a
__del__ method (which I have some bad experience with, GC-related). This
worked great for a long time. At 2006-04-10, however, martin.v.loewis
committed patch 1467770 (revision r45234), to solve bug 1460493. It
introduces a __del__ method and some more complexity. I must admit that I
didn't review this patch in detail, but it looked like thread safeness was
being taken care of. But apparently it isn't.

Perhaps reverting to the original popen2 approach would solve the problem?
Or does the popen2 module suffer from this bug as well?

I think that we need to fix this once and for all, this time. We should
probably also consider adding the option of not having subprocess auto-wait
while we are at it, for those what would like to wait() themselves (there
should be a tracker item for this, but I can't find it right now).

Are we tight on time for fixing this bug? I'm out in the forest right
now...

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2007-08-02 10:45

Message:
Logged In: YES
user_id=6380
Originator: NO

I like #2. I don't see any use for threadsafe Popen instances, and I
think that any self-respecting long-running server using Popen should
properly manage its subprocesses anyway. (And for short-running processes
it doesn't really matter if you have a few zombies.)

One could add a __del__ method that registers zombies to be reaped later,
but I don't think it's worth it, and __del__ has some serious issues of its
own. (If you really want to do this, use a weak reference callback instead
of __del__ to do the zombie registration.)

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-02 03:37

Message:
Logged In: YES
user_id=10273
Originator: NO

Having just gone through that waffly description of the problems and
various levels of fix, I think there are really only two fixes worth
considering;

1) Make Popen instances fully threadsafe. Give them a recursive lock
attribute and have every method acquire the lock at the start, and release
it at the end.

2) Decide the "try to reap abandoned children at each Popen" idea was an
ugly hack and abandon it. Remove _active and _cleanup(), and document that
any child process not explicitly handled to completion will result in
zombie child processes.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-02 03:05

Message:
Logged In: YES
user_id=10273
Originator: NO

It appears that subprocess calls a module global "_cleanup()" whenever
opening a new subprocess. This method is meant to reap any child processes
that have terminated and have not explicitly cleaned up. These are
processes you would expect to be cleaned up by GC, however, subprocess
keeps a list of of all spawned subprocesses in _active until they are
reaped explicitly so it can cleanup any that nolonger referenced anywhere
else.

The problem is lots of methods, including poll() and wait(), check
self.returncode and then modify it. Any non-atomic read/modify action is
inherently non-threadsafe. And _cleanup() calls poll() on all un-reaped
child processes. If two threads happen to try and spawn subprocesses at
once, these _cleanup() calls collide..

The way to fix this depends on how thread-safe you want to make it. If you
want to share popen objects between threads to wait()/poll() with impunity
from any thread, you should add a recursive lock attribute to the Popen
instance and have it lock/release it at the start/end of every method
call.

If you only care about using popen objects in one thread at a time, then
all you need to fix is the nasty "every popen created calls poll() on every
other living popen object regardless of what thread started them,
and poll() is not threadsafe" behaviour.

Removing _cleanup() is one way, but it will then not reap child processes
that you del'ed all references to (except the one in subprocess._active)
before you checked they were done.

Probably another good idea is to not append and remove popen objects to
_active directly, instead and add a popen.__del__() method that defers
GC'ing of non-finished popen objects by adding them to _active. This
way, _active only contains un-reaped child processes that were due to be
GC'ed. _cleanup() will then be responsible for polling and removing these
popen objects from _active when they are done.

However, this alone will not fix things because you are still calling
_cleanup() from different threads, it is still calling poll() on all these
un-reaped processes, and poll() is not threadsafe. So... you either have to
make poll() threadsafe (lock/unlock at the beginning/end of the method), or
make _cleanup() threadsafe. The reason you can get away with making only
_cleanup() threadsafe this way is _active will contain a list of processes
that are not referenced anywhere else, so you know the only thing that will
call poll() on them is the _cleanup() method.

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2007-06-07 15:26

Message:
Logged In: YES
user_id=33168
Originator: NO

Peter, could you take a look at this? Thanks.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1731717&group_id=5470
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[ python-Bugs-1731717 ] race condition in subprocess module [ In reply to ]

noreply at sourceforge

Aug 3, 2007, 12:26 PM

Post #6 of 10 (392 views)

Permalink

Bugs item #1731717, was opened at 2007-06-06 08:19
Message generated for change (Comment added) made by abo
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1731717&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: dsagal (dsagal)
Assigned to: Peter Ã…strand (astrand)
Summary: race condition in subprocess module

Initial Comment:
Python's subprocess module has a race condition: Popen() constructor has a call to global "_cleanup()" function on whenever a Popen object gets created, and that call causes a check for all pending Popen objects whether their subprocess has exited - i.e. the poll() method is called for every active Popen object.

Now, if I create Popen object "foo" in one thread, and then a.wait(), and meanwhile I create another Popen object "bar" in another thread, then a.poll() might get called by _clean() right at the time when my first thread is running a.wait(). But those are not synchronized - each calls os.waitpid() if returncode is None, but the section which checks returncode and calls os.waitpid() is not protected as a critical section should be.

The following code illustrates the problem, and is known to break reasonably consistenly with Python2.4. Changes to subprocess in Python2.5 seems to address a somewhat related problem, but not this one.

import sys, os, threading, subprocess, time

class X(threading.Thread):
def __init__(self, *args, **kwargs):
super(X, self).__init__(*args, **kwargs)
self.start()

def tt():
s = subprocess.Popen(("/bin/ls", "-a", "/tmp"), stdout=subprocess.PIPE,
universal_newlines=True)
# time.sleep(1)
s.communicate()[0]

for i in xrange(1000):
X(target = tt)

This typically gives a few dozen errors like these:
Exception in thread Thread-795:
Traceback (most recent call last):
File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap
self.run()
File "/usr/lib/python2.4/threading.py", line 422, in run
self.__target(*self.__args, **self.__kwargs)
File "z.py", line 21, in tt
s.communicate()[0]
File "/usr/lib/python2.4/subprocess.py", line 1083, in communicate
self.wait()
File "/usr/lib/python2.4/subprocess.py", line 1007, in wait
pid, sts = os.waitpid(self.pid, 0)
OSError: [Errno 10] No child processes

Note that uncommenting time.sleep(1) fixes the problem. So does wrapping subprocess.poll() and wait() with a lock. So does removing the call to global _cleanup() in Popen constructor.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-04 05:26

Message:
Logged In: YES
user_id=10273
Originator: NO

Actually, after thinking about it, there may be a way to make _cleanup()
threadsafe without using explicit locks by leveraging off the GIL. If
_active.pop() and _active.append() are atomic because of the GIL, then
_cleanup() can be made threadsafe. To be truely threadsafe it also needs to
make sure that _active contains only abandoned popen instances so that
_cleanup() doesn't have it's inst.poll() calls collide with other threads
that are still using those popen instances. This can be done by havin the
popen instances added to _active only by Popen.__del__(), and only removed
by _cleanup() when they are done.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-04 05:14

Message:
Logged In: YES
user_id=10273
Originator: NO

I just looked at popen2 and it has the same bug. I don't think __del__()
related changes have anything to do with this. Popen.poll() and _cleanup()
are non-threadsafe. When __init__() is called to create subprocesses
simultaniously in two different threads, they both call _cleanup() and
trample all over each other.

Removing _cleanup() will not leave zombies for popened processes that are
correctly handled to completion. Using methods like communicate() and
wait() will handle the process to completion and reap the child.

Unfortunately, I think a fairly common use-case is people using popen to
fire-and-forget subprocesses without using communicate() or wait(). These
will not get reaped, and will end up as zombies.

I cannot think of an easy way to reap abandoned popen instances that is
thread safe without using locks. At times like this I wish that the GIL was
exposed as a recursive lock... we could have __cleanup__() acquire/release
the GIL and make it atomic... no more race condition :-)

----------------------------------------------------------------------

Comment By: Peter Ã…strand (astrand)
Date: 2007-08-02 19:15

Message:
Logged In: YES
user_id=344921
Originator: NO

Suddenly starting to leave Zombies is NOT an option for me. To prevent
zombies, all applications using the subprocess module would need to be
changed. This is a major breakage of backwards compatibility, IMHO.
subprocess (as well as the popen2 module) has prevented zombies
automatically from the beginning and I believe they should continue to do
so (at least by default).

A little bit of history: When I wrote the subprocess module, I stole the
idea of calling _cleanup whenever a new instance is created from the popen2
module, since I thought it was nice, lightweight and avoided having a
__del__ method (which I have some bad experience with, GC-related). This
worked great for a long time. At 2006-04-10, however, martin.v.loewis
committed patch 1467770 (revision r45234), to solve bug 1460493. It
introduces a __del__ method and some more complexity. I must admit that I
didn't review this patch in detail, but it looked like thread safeness was
being taken care of. But apparently it isn't.

Perhaps reverting to the original popen2 approach would solve the problem?
Or does the popen2 module suffer from this bug as well?

I think that we need to fix this once and for all, this time. We should
probably also consider adding the option of not having subprocess auto-wait
while we are at it, for those what would like to wait() themselves (there
should be a tracker item for this, but I can't find it right now).

Are we tight on time for fixing this bug? I'm out in the forest right
now...

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2007-08-02 10:45

Message:
Logged In: YES
user_id=6380
Originator: NO

I like #2. I don't see any use for threadsafe Popen instances, and I
think that any self-respecting long-running server using Popen should
properly manage its subprocesses anyway. (And for short-running processes
it doesn't really matter if you have a few zombies.)

One could add a __del__ method that registers zombies to be reaped later,
but I don't think it's worth it, and __del__ has some serious issues of its
own. (If you really want to do this, use a weak reference callback instead
of __del__ to do the zombie registration.)

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-02 03:37

Message:
Logged In: YES
user_id=10273
Originator: NO

Having just gone through that waffly description of the problems and
various levels of fix, I think there are really only two fixes worth
considering;

1) Make Popen instances fully threadsafe. Give them a recursive lock
attribute and have every method acquire the lock at the start, and release
it at the end.

2) Decide the "try to reap abandoned children at each Popen" idea was an
ugly hack and abandon it. Remove _active and _cleanup(), and document that
any child process not explicitly handled to completion will result in
zombie child processes.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-02 03:05

Message:
Logged In: YES
user_id=10273
Originator: NO

It appears that subprocess calls a module global "_cleanup()" whenever
opening a new subprocess. This method is meant to reap any child processes
that have terminated and have not explicitly cleaned up. These are
processes you would expect to be cleaned up by GC, however, subprocess
keeps a list of of all spawned subprocesses in _active until they are
reaped explicitly so it can cleanup any that nolonger referenced anywhere
else.

The problem is lots of methods, including poll() and wait(), check
self.returncode and then modify it. Any non-atomic read/modify action is
inherently non-threadsafe. And _cleanup() calls poll() on all un-reaped
child processes. If two threads happen to try and spawn subprocesses at
once, these _cleanup() calls collide..

The way to fix this depends on how thread-safe you want to make it. If you
want to share popen objects between threads to wait()/poll() with impunity
from any thread, you should add a recursive lock attribute to the Popen
instance and have it lock/release it at the start/end of every method
call.

If you only care about using popen objects in one thread at a time, then
all you need to fix is the nasty "every popen created calls poll() on every
other living popen object regardless of what thread started them,
and poll() is not threadsafe" behaviour.

Removing _cleanup() is one way, but it will then not reap child processes
that you del'ed all references to (except the one in subprocess._active)
before you checked they were done.

Probably another good idea is to not append and remove popen objects to
_active directly, instead and add a popen.__del__() method that defers
GC'ing of non-finished popen objects by adding them to _active. This
way, _active only contains un-reaped child processes that were due to be
GC'ed. _cleanup() will then be responsible for polling and removing these
popen objects from _active when they are done.

However, this alone will not fix things because you are still calling
_cleanup() from different threads, it is still calling poll() on all these
un-reaped processes, and poll() is not threadsafe. So... you either have to
make poll() threadsafe (lock/unlock at the beginning/end of the method), or
make _cleanup() threadsafe. The reason you can get away with making only
_cleanup() threadsafe this way is _active will contain a list of processes
that are not referenced anywhere else, so you know the only thing that will
call poll() on them is the _cleanup() method.

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2007-06-07 15:26

Message:
Logged In: YES
user_id=33168
Originator: NO

Peter, could you take a look at this? Thanks.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1731717&group_id=5470
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[ python-Bugs-1731717 ] race condition in subprocess module [ In reply to ]

noreply at sourceforge

Aug 3, 2007, 1:11 PM

Post #7 of 10 (393 views)

Permalink

Bugs item #1731717, was opened at 2007-06-05 18:19
Message generated for change (Comment added) made by gvanrossum
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1731717&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: dsagal (dsagal)
Assigned to: Peter Ã…strand (astrand)
Summary: race condition in subprocess module

Initial Comment:
Python's subprocess module has a race condition: Popen() constructor has a call to global "_cleanup()" function on whenever a Popen object gets created, and that call causes a check for all pending Popen objects whether their subprocess has exited - i.e. the poll() method is called for every active Popen object.

Now, if I create Popen object "foo" in one thread, and then a.wait(), and meanwhile I create another Popen object "bar" in another thread, then a.poll() might get called by _clean() right at the time when my first thread is running a.wait(). But those are not synchronized - each calls os.waitpid() if returncode is None, but the section which checks returncode and calls os.waitpid() is not protected as a critical section should be.

The following code illustrates the problem, and is known to break reasonably consistenly with Python2.4. Changes to subprocess in Python2.5 seems to address a somewhat related problem, but not this one.

import sys, os, threading, subprocess, time

class X(threading.Thread):
def __init__(self, *args, **kwargs):
super(X, self).__init__(*args, **kwargs)
self.start()

def tt():
s = subprocess.Popen(("/bin/ls", "-a", "/tmp"), stdout=subprocess.PIPE,
universal_newlines=True)
# time.sleep(1)
s.communicate()[0]

for i in xrange(1000):
X(target = tt)

This typically gives a few dozen errors like these:
Exception in thread Thread-795:
Traceback (most recent call last):
File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap
self.run()
File "/usr/lib/python2.4/threading.py", line 422, in run
self.__target(*self.__args, **self.__kwargs)
File "z.py", line 21, in tt
s.communicate()[0]
File "/usr/lib/python2.4/subprocess.py", line 1083, in communicate
self.wait()
File "/usr/lib/python2.4/subprocess.py", line 1007, in wait
pid, sts = os.waitpid(self.pid, 0)
OSError: [Errno 10] No child processes

Note that uncommenting time.sleep(1) fixes the problem. So does wrapping subprocess.poll() and wait() with a lock. So does removing the call to global _cleanup() in Popen constructor.

----------------------------------------------------------------------

>Comment By: Guido van Rossum (gvanrossum)
Date: 2007-08-03 16:11

Message:
Logged In: YES
user_id=6380
Originator: NO

If you want this fixed in 2.5.x, a new release is just around the corner,
and time is very tight. Otherwise, 2.6a1 is half a year off.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-03 15:26

Message:
Logged In: YES
user_id=10273
Originator: NO

Actually, after thinking about it, there may be a way to make _cleanup()
threadsafe without using explicit locks by leveraging off the GIL. If
_active.pop() and _active.append() are atomic because of the GIL, then
_cleanup() can be made threadsafe. To be truely threadsafe it also needs to
make sure that _active contains only abandoned popen instances so that
_cleanup() doesn't have it's inst.poll() calls collide with other threads
that are still using those popen instances. This can be done by havin the
popen instances added to _active only by Popen.__del__(), and only removed
by _cleanup() when they are done.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-03 15:14

Message:
Logged In: YES
user_id=10273
Originator: NO

I just looked at popen2 and it has the same bug. I don't think __del__()
related changes have anything to do with this. Popen.poll() and _cleanup()
are non-threadsafe. When __init__() is called to create subprocesses
simultaniously in two different threads, they both call _cleanup() and
trample all over each other.

Removing _cleanup() will not leave zombies for popened processes that are
correctly handled to completion. Using methods like communicate() and
wait() will handle the process to completion and reap the child.

Unfortunately, I think a fairly common use-case is people using popen to
fire-and-forget subprocesses without using communicate() or wait(). These
will not get reaped, and will end up as zombies.

I cannot think of an easy way to reap abandoned popen instances that is
thread safe without using locks. At times like this I wish that the GIL was
exposed as a recursive lock... we could have __cleanup__() acquire/release
the GIL and make it atomic... no more race condition :-)

----------------------------------------------------------------------

Comment By: Peter Ã…strand (astrand)
Date: 2007-08-02 05:15

Message:
Logged In: YES
user_id=344921
Originator: NO

Suddenly starting to leave Zombies is NOT an option for me. To prevent
zombies, all applications using the subprocess module would need to be
changed. This is a major breakage of backwards compatibility, IMHO.
subprocess (as well as the popen2 module) has prevented zombies
automatically from the beginning and I believe they should continue to do
so (at least by default).

A little bit of history: When I wrote the subprocess module, I stole the
idea of calling _cleanup whenever a new instance is created from the popen2
module, since I thought it was nice, lightweight and avoided having a
__del__ method (which I have some bad experience with, GC-related). This
worked great for a long time. At 2006-04-10, however, martin.v.loewis
committed patch 1467770 (revision r45234), to solve bug 1460493. It
introduces a __del__ method and some more complexity. I must admit that I
didn't review this patch in detail, but it looked like thread safeness was
being taken care of. But apparently it isn't.

Perhaps reverting to the original popen2 approach would solve the problem?
Or does the popen2 module suffer from this bug as well?

I think that we need to fix this once and for all, this time. We should
probably also consider adding the option of not having subprocess auto-wait
while we are at it, for those what would like to wait() themselves (there
should be a tracker item for this, but I can't find it right now).

Are we tight on time for fixing this bug? I'm out in the forest right
now...

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2007-08-01 20:45

Message:
Logged In: YES
user_id=6380
Originator: NO

I like #2. I don't see any use for threadsafe Popen instances, and I
think that any self-respecting long-running server using Popen should
properly manage its subprocesses anyway. (And for short-running processes
it doesn't really matter if you have a few zombies.)

One could add a __del__ method that registers zombies to be reaped later,
but I don't think it's worth it, and __del__ has some serious issues of its
own. (If you really want to do this, use a weak reference callback instead
of __del__ to do the zombie registration.)

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-01 13:37

Message:
Logged In: YES
user_id=10273
Originator: NO

Having just gone through that waffly description of the problems and
various levels of fix, I think there are really only two fixes worth
considering;

1) Make Popen instances fully threadsafe. Give them a recursive lock
attribute and have every method acquire the lock at the start, and release
it at the end.

2) Decide the "try to reap abandoned children at each Popen" idea was an
ugly hack and abandon it. Remove _active and _cleanup(), and document that
any child process not explicitly handled to completion will result in
zombie child processes.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-01 13:05

Message:
Logged In: YES
user_id=10273
Originator: NO

It appears that subprocess calls a module global "_cleanup()" whenever
opening a new subprocess. This method is meant to reap any child processes
that have terminated and have not explicitly cleaned up. These are
processes you would expect to be cleaned up by GC, however, subprocess
keeps a list of of all spawned subprocesses in _active until they are
reaped explicitly so it can cleanup any that nolonger referenced anywhere
else.

The problem is lots of methods, including poll() and wait(), check
self.returncode and then modify it. Any non-atomic read/modify action is
inherently non-threadsafe. And _cleanup() calls poll() on all un-reaped
child processes. If two threads happen to try and spawn subprocesses at
once, these _cleanup() calls collide..

The way to fix this depends on how thread-safe you want to make it. If you
want to share popen objects between threads to wait()/poll() with impunity
from any thread, you should add a recursive lock attribute to the Popen
instance and have it lock/release it at the start/end of every method
call.

If you only care about using popen objects in one thread at a time, then
all you need to fix is the nasty "every popen created calls poll() on every
other living popen object regardless of what thread started them,
and poll() is not threadsafe" behaviour.

Removing _cleanup() is one way, but it will then not reap child processes
that you del'ed all references to (except the one in subprocess._active)
before you checked they were done.

Probably another good idea is to not append and remove popen objects to
_active directly, instead and add a popen.__del__() method that defers
GC'ing of non-finished popen objects by adding them to _active. This
way, _active only contains un-reaped child processes that were due to be
GC'ed. _cleanup() will then be responsible for polling and removing these
popen objects from _active when they are done.

However, this alone will not fix things because you are still calling
_cleanup() from different threads, it is still calling poll() on all these
un-reaped processes, and poll() is not threadsafe. So... you either have to
make poll() threadsafe (lock/unlock at the beginning/end of the method), or
make _cleanup() threadsafe. The reason you can get away with making only
_cleanup() threadsafe this way is _active will contain a list of processes
that are not referenced anywhere else, so you know the only thing that will
call poll() on them is the _cleanup() method.

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2007-06-07 01:26

Message:
Logged In: YES
user_id=33168
Originator: NO

Peter, could you take a look at this? Thanks.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1731717&group_id=5470
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[ python-Bugs-1731717 ] race condition in subprocess module [ In reply to ]

noreply at sourceforge

Aug 3, 2007, 2:29 PM

Post #8 of 10 (396 views)

Permalink

Bugs item #1731717, was opened at 2007-06-06 08:19
Message generated for change (Comment added) made by abo
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1731717&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: dsagal (dsagal)
Assigned to: Peter Ã…strand (astrand)
Summary: race condition in subprocess module

Initial Comment:
Python's subprocess module has a race condition: Popen() constructor has a call to global "_cleanup()" function on whenever a Popen object gets created, and that call causes a check for all pending Popen objects whether their subprocess has exited - i.e. the poll() method is called for every active Popen object.

Now, if I create Popen object "foo" in one thread, and then a.wait(), and meanwhile I create another Popen object "bar" in another thread, then a.poll() might get called by _clean() right at the time when my first thread is running a.wait(). But those are not synchronized - each calls os.waitpid() if returncode is None, but the section which checks returncode and calls os.waitpid() is not protected as a critical section should be.

The following code illustrates the problem, and is known to break reasonably consistenly with Python2.4. Changes to subprocess in Python2.5 seems to address a somewhat related problem, but not this one.

import sys, os, threading, subprocess, time

class X(threading.Thread):
def __init__(self, *args, **kwargs):
super(X, self).__init__(*args, **kwargs)
self.start()

def tt():
s = subprocess.Popen(("/bin/ls", "-a", "/tmp"), stdout=subprocess.PIPE,
universal_newlines=True)
# time.sleep(1)
s.communicate()[0]

for i in xrange(1000):
X(target = tt)

This typically gives a few dozen errors like these:
Exception in thread Thread-795:
Traceback (most recent call last):
File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap
self.run()
File "/usr/lib/python2.4/threading.py", line 422, in run
self.__target(*self.__args, **self.__kwargs)
File "z.py", line 21, in tt
s.communicate()[0]
File "/usr/lib/python2.4/subprocess.py", line 1083, in communicate
self.wait()
File "/usr/lib/python2.4/subprocess.py", line 1007, in wait
pid, sts = os.waitpid(self.pid, 0)
OSError: [Errno 10] No child processes

Note that uncommenting time.sleep(1) fixes the problem. So does wrapping subprocess.poll() and wait() with a lock. So does removing the call to global _cleanup() in Popen constructor.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-04 07:29

Message:
Logged In: YES
user_id=10273
Originator: NO

I just looked at subprocess in svs trunk and it looks like it might
already be fixed in both subprocess.py and popen2.py. The part where
__del__() adds abandoned Popen instances to _active and _cleanup() removes
them is already there. _cleanup() has been made more thread resistant by
catching and ignoring exceptions that arise when two _cleanups() try to
remove the same instances. Popen.poll() has been made more robust by having
it set self.returncode to an optional _deadstate argument value when poll
gets an os.error from os.pidwait() on a child that gets cleaned up by
another thread. I think there are still a few minor race conditions around
Popen.poll(), but it will only affect what non-zero returncode you get...
it's not so much "thread safe" as "thread robust".

I think the _deadstate argument approach to try and make Popen.poll()
thread-robust is a bit hacky. I would rather see _cleanup() be properly
threadsafe by having it remove the inst from _active before processing it
and re-adding it back if it is still not finished. However, as it is now it
looks like it is fixed...

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2007-08-04 06:11

Message:
Logged In: YES
user_id=6380
Originator: NO

If you want this fixed in 2.5.x, a new release is just around the corner,
and time is very tight. Otherwise, 2.6a1 is half a year off.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-04 05:26

Message:
Logged In: YES
user_id=10273
Originator: NO

Actually, after thinking about it, there may be a way to make _cleanup()
threadsafe without using explicit locks by leveraging off the GIL. If
_active.pop() and _active.append() are atomic because of the GIL, then
_cleanup() can be made threadsafe. To be truely threadsafe it also needs to
make sure that _active contains only abandoned popen instances so that
_cleanup() doesn't have it's inst.poll() calls collide with other threads
that are still using those popen instances. This can be done by havin the
popen instances added to _active only by Popen.__del__(), and only removed
by _cleanup() when they are done.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-04 05:14

Message:
Logged In: YES
user_id=10273
Originator: NO

I just looked at popen2 and it has the same bug. I don't think __del__()
related changes have anything to do with this. Popen.poll() and _cleanup()
are non-threadsafe. When __init__() is called to create subprocesses
simultaniously in two different threads, they both call _cleanup() and
trample all over each other.

Removing _cleanup() will not leave zombies for popened processes that are
correctly handled to completion. Using methods like communicate() and
wait() will handle the process to completion and reap the child.

Unfortunately, I think a fairly common use-case is people using popen to
fire-and-forget subprocesses without using communicate() or wait(). These
will not get reaped, and will end up as zombies.

I cannot think of an easy way to reap abandoned popen instances that is
thread safe without using locks. At times like this I wish that the GIL was
exposed as a recursive lock... we could have __cleanup__() acquire/release
the GIL and make it atomic... no more race condition :-)

----------------------------------------------------------------------

Comment By: Peter Ã…strand (astrand)
Date: 2007-08-02 19:15

Message:
Logged In: YES
user_id=344921
Originator: NO

Suddenly starting to leave Zombies is NOT an option for me. To prevent
zombies, all applications using the subprocess module would need to be
changed. This is a major breakage of backwards compatibility, IMHO.
subprocess (as well as the popen2 module) has prevented zombies
automatically from the beginning and I believe they should continue to do
so (at least by default).

A little bit of history: When I wrote the subprocess module, I stole the
idea of calling _cleanup whenever a new instance is created from the popen2
module, since I thought it was nice, lightweight and avoided having a
__del__ method (which I have some bad experience with, GC-related). This
worked great for a long time. At 2006-04-10, however, martin.v.loewis
committed patch 1467770 (revision r45234), to solve bug 1460493. It
introduces a __del__ method and some more complexity. I must admit that I
didn't review this patch in detail, but it looked like thread safeness was
being taken care of. But apparently it isn't.

Perhaps reverting to the original popen2 approach would solve the problem?
Or does the popen2 module suffer from this bug as well?

I think that we need to fix this once and for all, this time. We should
probably also consider adding the option of not having subprocess auto-wait
while we are at it, for those what would like to wait() themselves (there
should be a tracker item for this, but I can't find it right now).

Are we tight on time for fixing this bug? I'm out in the forest right
now...

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2007-08-02 10:45

Message:
Logged In: YES
user_id=6380
Originator: NO

I like #2. I don't see any use for threadsafe Popen instances, and I
think that any self-respecting long-running server using Popen should
properly manage its subprocesses anyway. (And for short-running processes
it doesn't really matter if you have a few zombies.)

One could add a __del__ method that registers zombies to be reaped later,
but I don't think it's worth it, and __del__ has some serious issues of its
own. (If you really want to do this, use a weak reference callback instead
of __del__ to do the zombie registration.)

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-02 03:37

Message:
Logged In: YES
user_id=10273
Originator: NO

Having just gone through that waffly description of the problems and
various levels of fix, I think there are really only two fixes worth
considering;

1) Make Popen instances fully threadsafe. Give them a recursive lock
attribute and have every method acquire the lock at the start, and release
it at the end.

2) Decide the "try to reap abandoned children at each Popen" idea was an
ugly hack and abandon it. Remove _active and _cleanup(), and document that
any child process not explicitly handled to completion will result in
zombie child processes.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-02 03:05

Message:
Logged In: YES
user_id=10273
Originator: NO

It appears that subprocess calls a module global "_cleanup()" whenever
opening a new subprocess. This method is meant to reap any child processes
that have terminated and have not explicitly cleaned up. These are
processes you would expect to be cleaned up by GC, however, subprocess
keeps a list of of all spawned subprocesses in _active until they are
reaped explicitly so it can cleanup any that nolonger referenced anywhere
else.

The problem is lots of methods, including poll() and wait(), check
self.returncode and then modify it. Any non-atomic read/modify action is
inherently non-threadsafe. And _cleanup() calls poll() on all un-reaped
child processes. If two threads happen to try and spawn subprocesses at
once, these _cleanup() calls collide..

The way to fix this depends on how thread-safe you want to make it. If you
want to share popen objects between threads to wait()/poll() with impunity
from any thread, you should add a recursive lock attribute to the Popen
instance and have it lock/release it at the start/end of every method
call.

If you only care about using popen objects in one thread at a time, then
all you need to fix is the nasty "every popen created calls poll() on every
other living popen object regardless of what thread started them,
and poll() is not threadsafe" behaviour.

Removing _cleanup() is one way, but it will then not reap child processes
that you del'ed all references to (except the one in subprocess._active)
before you checked they were done.

Probably another good idea is to not append and remove popen objects to
_active directly, instead and add a popen.__del__() method that defers
GC'ing of non-finished popen objects by adding them to _active. This
way, _active only contains un-reaped child processes that were due to be
GC'ed. _cleanup() will then be responsible for polling and removing these
popen objects from _active when they are done.

However, this alone will not fix things because you are still calling
_cleanup() from different threads, it is still calling poll() on all these
un-reaped processes, and poll() is not threadsafe. So... you either have to
make poll() threadsafe (lock/unlock at the beginning/end of the method), or
make _cleanup() threadsafe. The reason you can get away with making only
_cleanup() threadsafe this way is _active will contain a list of processes
that are not referenced anywhere else, so you know the only thing that will
call poll() on them is the _cleanup() method.

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2007-06-07 15:26

Message:
Logged In: YES
user_id=33168
Originator: NO

Peter, could you take a look at this? Thanks.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1731717&group_id=5470
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[ python-Bugs-1731717 ] race condition in subprocess module [ In reply to ]

noreply at sourceforge

Aug 6, 2007, 12:45 PM

Post #9 of 10 (391 views)

Permalink

Bugs item #1731717, was opened at 2007-06-06 08:19
Message generated for change (Comment added) made by abo
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1731717&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: dsagal (dsagal)
Assigned to: Peter Ã…strand (astrand)
Summary: race condition in subprocess module

Initial Comment:
Python's subprocess module has a race condition: Popen() constructor has a call to global "_cleanup()" function on whenever a Popen object gets created, and that call causes a check for all pending Popen objects whether their subprocess has exited - i.e. the poll() method is called for every active Popen object.

Now, if I create Popen object "foo" in one thread, and then a.wait(), and meanwhile I create another Popen object "bar" in another thread, then a.poll() might get called by _clean() right at the time when my first thread is running a.wait(). But those are not synchronized - each calls os.waitpid() if returncode is None, but the section which checks returncode and calls os.waitpid() is not protected as a critical section should be.

The following code illustrates the problem, and is known to break reasonably consistenly with Python2.4. Changes to subprocess in Python2.5 seems to address a somewhat related problem, but not this one.

import sys, os, threading, subprocess, time

class X(threading.Thread):
def __init__(self, *args, **kwargs):
super(X, self).__init__(*args, **kwargs)
self.start()

def tt():
s = subprocess.Popen(("/bin/ls", "-a", "/tmp"), stdout=subprocess.PIPE,
universal_newlines=True)
# time.sleep(1)
s.communicate()[0]

for i in xrange(1000):
X(target = tt)

This typically gives a few dozen errors like these:
Exception in thread Thread-795:
Traceback (most recent call last):
File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap
self.run()
File "/usr/lib/python2.4/threading.py", line 422, in run
self.__target(*self.__args, **self.__kwargs)
File "z.py", line 21, in tt
s.communicate()[0]
File "/usr/lib/python2.4/subprocess.py", line 1083, in communicate
self.wait()
File "/usr/lib/python2.4/subprocess.py", line 1007, in wait
pid, sts = os.waitpid(self.pid, 0)
OSError: [Errno 10] No child processes

Note that uncommenting time.sleep(1) fixes the problem. So does wrapping subprocess.poll() and wait() with a lock. So does removing the call to global _cleanup() in Popen constructor.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-07 05:45

Message:
Logged In: YES
user_id=10273
Originator: NO

I can create a patch to make current head a bit cleaner, if anyone is
interested...

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-04 07:29

Message:
Logged In: YES
user_id=10273
Originator: NO

I just looked at subprocess in svs trunk and it looks like it might
already be fixed in both subprocess.py and popen2.py. The part where
__del__() adds abandoned Popen instances to _active and _cleanup() removes
them is already there. _cleanup() has been made more thread resistant by
catching and ignoring exceptions that arise when two _cleanups() try to
remove the same instances. Popen.poll() has been made more robust by having
it set self.returncode to an optional _deadstate argument value when poll
gets an os.error from os.pidwait() on a child that gets cleaned up by
another thread. I think there are still a few minor race conditions around
Popen.poll(), but it will only affect what non-zero returncode you get...
it's not so much "thread safe" as "thread robust".

I think the _deadstate argument approach to try and make Popen.poll()
thread-robust is a bit hacky. I would rather see _cleanup() be properly
threadsafe by having it remove the inst from _active before processing it
and re-adding it back if it is still not finished. However, as it is now it
looks like it is fixed...

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2007-08-04 06:11

Message:
Logged In: YES
user_id=6380
Originator: NO

If you want this fixed in 2.5.x, a new release is just around the corner,
and time is very tight. Otherwise, 2.6a1 is half a year off.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-04 05:26

Message:
Logged In: YES
user_id=10273
Originator: NO

Actually, after thinking about it, there may be a way to make _cleanup()
threadsafe without using explicit locks by leveraging off the GIL. If
_active.pop() and _active.append() are atomic because of the GIL, then
_cleanup() can be made threadsafe. To be truely threadsafe it also needs to
make sure that _active contains only abandoned popen instances so that
_cleanup() doesn't have it's inst.poll() calls collide with other threads
that are still using those popen instances. This can be done by havin the
popen instances added to _active only by Popen.__del__(), and only removed
by _cleanup() when they are done.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-04 05:14

Message:
Logged In: YES
user_id=10273
Originator: NO

I just looked at popen2 and it has the same bug. I don't think __del__()
related changes have anything to do with this. Popen.poll() and _cleanup()
are non-threadsafe. When __init__() is called to create subprocesses
simultaniously in two different threads, they both call _cleanup() and
trample all over each other.

Removing _cleanup() will not leave zombies for popened processes that are
correctly handled to completion. Using methods like communicate() and
wait() will handle the process to completion and reap the child.

Unfortunately, I think a fairly common use-case is people using popen to
fire-and-forget subprocesses without using communicate() or wait(). These
will not get reaped, and will end up as zombies.

I cannot think of an easy way to reap abandoned popen instances that is
thread safe without using locks. At times like this I wish that the GIL was
exposed as a recursive lock... we could have __cleanup__() acquire/release
the GIL and make it atomic... no more race condition :-)

----------------------------------------------------------------------

Comment By: Peter Ã…strand (astrand)
Date: 2007-08-02 19:15

Message:
Logged In: YES
user_id=344921
Originator: NO

Suddenly starting to leave Zombies is NOT an option for me. To prevent
zombies, all applications using the subprocess module would need to be
changed. This is a major breakage of backwards compatibility, IMHO.
subprocess (as well as the popen2 module) has prevented zombies
automatically from the beginning and I believe they should continue to do
so (at least by default).

A little bit of history: When I wrote the subprocess module, I stole the
idea of calling _cleanup whenever a new instance is created from the popen2
module, since I thought it was nice, lightweight and avoided having a
__del__ method (which I have some bad experience with, GC-related). This
worked great for a long time. At 2006-04-10, however, martin.v.loewis
committed patch 1467770 (revision r45234), to solve bug 1460493. It
introduces a __del__ method and some more complexity. I must admit that I
didn't review this patch in detail, but it looked like thread safeness was
being taken care of. But apparently it isn't.

Perhaps reverting to the original popen2 approach would solve the problem?
Or does the popen2 module suffer from this bug as well?

I think that we need to fix this once and for all, this time. We should
probably also consider adding the option of not having subprocess auto-wait
while we are at it, for those what would like to wait() themselves (there
should be a tracker item for this, but I can't find it right now).

Are we tight on time for fixing this bug? I'm out in the forest right
now...

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2007-08-02 10:45

Message:
Logged In: YES
user_id=6380
Originator: NO

I like #2. I don't see any use for threadsafe Popen instances, and I
think that any self-respecting long-running server using Popen should
properly manage its subprocesses anyway. (And for short-running processes
it doesn't really matter if you have a few zombies.)

One could add a __del__ method that registers zombies to be reaped later,
but I don't think it's worth it, and __del__ has some serious issues of its
own. (If you really want to do this, use a weak reference callback instead
of __del__ to do the zombie registration.)

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-02 03:37

Message:
Logged In: YES
user_id=10273
Originator: NO

Having just gone through that waffly description of the problems and
various levels of fix, I think there are really only two fixes worth
considering;

1) Make Popen instances fully threadsafe. Give them a recursive lock
attribute and have every method acquire the lock at the start, and release
it at the end.

2) Decide the "try to reap abandoned children at each Popen" idea was an
ugly hack and abandon it. Remove _active and _cleanup(), and document that
any child process not explicitly handled to completion will result in
zombie child processes.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-02 03:05

Message:
Logged In: YES
user_id=10273
Originator: NO

It appears that subprocess calls a module global "_cleanup()" whenever
opening a new subprocess. This method is meant to reap any child processes
that have terminated and have not explicitly cleaned up. These are
processes you would expect to be cleaned up by GC, however, subprocess
keeps a list of of all spawned subprocesses in _active until they are
reaped explicitly so it can cleanup any that nolonger referenced anywhere
else.

The problem is lots of methods, including poll() and wait(), check
self.returncode and then modify it. Any non-atomic read/modify action is
inherently non-threadsafe. And _cleanup() calls poll() on all un-reaped
child processes. If two threads happen to try and spawn subprocesses at
once, these _cleanup() calls collide..

The way to fix this depends on how thread-safe you want to make it. If you
want to share popen objects between threads to wait()/poll() with impunity
from any thread, you should add a recursive lock attribute to the Popen
instance and have it lock/release it at the start/end of every method
call.

If you only care about using popen objects in one thread at a time, then
all you need to fix is the nasty "every popen created calls poll() on every
other living popen object regardless of what thread started them,
and poll() is not threadsafe" behaviour.

Removing _cleanup() is one way, but it will then not reap child processes
that you del'ed all references to (except the one in subprocess._active)
before you checked they were done.

Probably another good idea is to not append and remove popen objects to
_active directly, instead and add a popen.__del__() method that defers
GC'ing of non-finished popen objects by adding them to _active. This
way, _active only contains un-reaped child processes that were due to be
GC'ed. _cleanup() will then be responsible for polling and removing these
popen objects from _active when they are done.

However, this alone will not fix things because you are still calling
_cleanup() from different threads, it is still calling poll() on all these
un-reaped processes, and poll() is not threadsafe. So... you either have to
make poll() threadsafe (lock/unlock at the beginning/end of the method), or
make _cleanup() threadsafe. The reason you can get away with making only
_cleanup() threadsafe this way is _active will contain a list of processes
that are not referenced anywhere else, so you know the only thing that will
call poll() on them is the _cleanup() method.

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2007-06-07 15:26

Message:
Logged In: YES
user_id=33168
Originator: NO

Peter, could you take a look at this? Thanks.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1731717&group_id=5470
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com

[ python-Bugs-1731717 ] race condition in subprocess module [ In reply to ]

noreply at sourceforge

Aug 6, 2007, 1:02 PM

Post #10 of 10 (383 views)

Permalink

Bugs item #1731717, was opened at 2007-06-06 00:19
Message generated for change (Comment added) made by astrand
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1731717&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.4
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: dsagal (dsagal)
Assigned to: Peter Ã…strand (astrand)
Summary: race condition in subprocess module

Initial Comment:
Python's subprocess module has a race condition: Popen() constructor has a call to global "_cleanup()" function on whenever a Popen object gets created, and that call causes a check for all pending Popen objects whether their subprocess has exited - i.e. the poll() method is called for every active Popen object.

Now, if I create Popen object "foo" in one thread, and then a.wait(), and meanwhile I create another Popen object "bar" in another thread, then a.poll() might get called by _clean() right at the time when my first thread is running a.wait(). But those are not synchronized - each calls os.waitpid() if returncode is None, but the section which checks returncode and calls os.waitpid() is not protected as a critical section should be.

The following code illustrates the problem, and is known to break reasonably consistenly with Python2.4. Changes to subprocess in Python2.5 seems to address a somewhat related problem, but not this one.

import sys, os, threading, subprocess, time

class X(threading.Thread):
def __init__(self, *args, **kwargs):
super(X, self).__init__(*args, **kwargs)
self.start()

def tt():
s = subprocess.Popen(("/bin/ls", "-a", "/tmp"), stdout=subprocess.PIPE,
universal_newlines=True)
# time.sleep(1)
s.communicate()[0]

for i in xrange(1000):
X(target = tt)

This typically gives a few dozen errors like these:
Exception in thread Thread-795:
Traceback (most recent call last):
File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap
self.run()
File "/usr/lib/python2.4/threading.py", line 422, in run
self.__target(*self.__args, **self.__kwargs)
File "z.py", line 21, in tt
s.communicate()[0]
File "/usr/lib/python2.4/subprocess.py", line 1083, in communicate
self.wait()
File "/usr/lib/python2.4/subprocess.py", line 1007, in wait
pid, sts = os.waitpid(self.pid, 0)
OSError: [Errno 10] No child processes

Note that uncommenting time.sleep(1) fixes the problem. So does wrapping subprocess.poll() and wait() with a lock. So does removing the call to global _cleanup() in Popen constructor.

----------------------------------------------------------------------

>Comment By: Peter Ã…strand (astrand)
Date: 2007-08-06 22:02

Message:
Logged In: YES
user_id=344921
Originator: NO

>I can create a patch to make current head a bit cleaner, if anyone is
>interested...

Sure, this would be great.

I would also love a test case that reproduces this problem.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-06 21:45

Message:
Logged In: YES
user_id=10273
Originator: NO

I can create a patch to make current head a bit cleaner, if anyone is
interested...

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-03 23:29

Message:
Logged In: YES
user_id=10273
Originator: NO

I just looked at subprocess in svs trunk and it looks like it might
already be fixed in both subprocess.py and popen2.py. The part where
__del__() adds abandoned Popen instances to _active and _cleanup() removes
them is already there. _cleanup() has been made more thread resistant by
catching and ignoring exceptions that arise when two _cleanups() try to
remove the same instances. Popen.poll() has been made more robust by having
it set self.returncode to an optional _deadstate argument value when poll
gets an os.error from os.pidwait() on a child that gets cleaned up by
another thread. I think there are still a few minor race conditions around
Popen.poll(), but it will only affect what non-zero returncode you get...
it's not so much "thread safe" as "thread robust".

I think the _deadstate argument approach to try and make Popen.poll()
thread-robust is a bit hacky. I would rather see _cleanup() be properly
threadsafe by having it remove the inst from _active before processing it
and re-adding it back if it is still not finished. However, as it is now it
looks like it is fixed...

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2007-08-03 22:11

Message:
Logged In: YES
user_id=6380
Originator: NO

If you want this fixed in 2.5.x, a new release is just around the corner,
and time is very tight. Otherwise, 2.6a1 is half a year off.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-03 21:26

Message:
Logged In: YES
user_id=10273
Originator: NO

Actually, after thinking about it, there may be a way to make _cleanup()
threadsafe without using explicit locks by leveraging off the GIL. If
_active.pop() and _active.append() are atomic because of the GIL, then
_cleanup() can be made threadsafe. To be truely threadsafe it also needs to
make sure that _active contains only abandoned popen instances so that
_cleanup() doesn't have it's inst.poll() calls collide with other threads
that are still using those popen instances. This can be done by havin the
popen instances added to _active only by Popen.__del__(), and only removed
by _cleanup() when they are done.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-03 21:14

Message:
Logged In: YES
user_id=10273
Originator: NO

I just looked at popen2 and it has the same bug. I don't think __del__()
related changes have anything to do with this. Popen.poll() and _cleanup()
are non-threadsafe. When __init__() is called to create subprocesses
simultaniously in two different threads, they both call _cleanup() and
trample all over each other.

Removing _cleanup() will not leave zombies for popened processes that are
correctly handled to completion. Using methods like communicate() and
wait() will handle the process to completion and reap the child.

Unfortunately, I think a fairly common use-case is people using popen to
fire-and-forget subprocesses without using communicate() or wait(). These
will not get reaped, and will end up as zombies.

I cannot think of an easy way to reap abandoned popen instances that is
thread safe without using locks. At times like this I wish that the GIL was
exposed as a recursive lock... we could have __cleanup__() acquire/release
the GIL and make it atomic... no more race condition :-)

----------------------------------------------------------------------

Comment By: Peter Ã…strand (astrand)
Date: 2007-08-02 11:15

Message:
Logged In: YES
user_id=344921
Originator: NO

Suddenly starting to leave Zombies is NOT an option for me. To prevent
zombies, all applications using the subprocess module would need to be
changed. This is a major breakage of backwards compatibility, IMHO.
subprocess (as well as the popen2 module) has prevented zombies
automatically from the beginning and I believe they should continue to do
so (at least by default).

A little bit of history: When I wrote the subprocess module, I stole the
idea of calling _cleanup whenever a new instance is created from the popen2
module, since I thought it was nice, lightweight and avoided having a
__del__ method (which I have some bad experience with, GC-related). This
worked great for a long time. At 2006-04-10, however, martin.v.loewis
committed patch 1467770 (revision r45234), to solve bug 1460493. It
introduces a __del__ method and some more complexity. I must admit that I
didn't review this patch in detail, but it looked like thread safeness was
being taken care of. But apparently it isn't.

Perhaps reverting to the original popen2 approach would solve the problem?
Or does the popen2 module suffer from this bug as well?

I think that we need to fix this once and for all, this time. We should
probably also consider adding the option of not having subprocess auto-wait
while we are at it, for those what would like to wait() themselves (there
should be a tracker item for this, but I can't find it right now).

Are we tight on time for fixing this bug? I'm out in the forest right
now...

----------------------------------------------------------------------

Comment By: Guido van Rossum (gvanrossum)
Date: 2007-08-02 02:45

Message:
Logged In: YES
user_id=6380
Originator: NO

I like #2. I don't see any use for threadsafe Popen instances, and I
think that any self-respecting long-running server using Popen should
properly manage its subprocesses anyway. (And for short-running processes
it doesn't really matter if you have a few zombies.)

One could add a __del__ method that registers zombies to be reaped later,
but I don't think it's worth it, and __del__ has some serious issues of its
own. (If you really want to do this, use a weak reference callback instead
of __del__ to do the zombie registration.)

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-01 19:37

Message:
Logged In: YES
user_id=10273
Originator: NO

Having just gone through that waffly description of the problems and
various levels of fix, I think there are really only two fixes worth
considering;

1) Make Popen instances fully threadsafe. Give them a recursive lock
attribute and have every method acquire the lock at the start, and release
it at the end.

2) Decide the "try to reap abandoned children at each Popen" idea was an
ugly hack and abandon it. Remove _active and _cleanup(), and document that
any child process not explicitly handled to completion will result in
zombie child processes.

----------------------------------------------------------------------

Comment By: Donovan Baarda (abo)
Date: 2007-08-01 19:05

Message:
Logged In: YES
user_id=10273
Originator: NO

It appears that subprocess calls a module global "_cleanup()" whenever
opening a new subprocess. This method is meant to reap any child processes
that have terminated and have not explicitly cleaned up. These are
processes you would expect to be cleaned up by GC, however, subprocess
keeps a list of of all spawned subprocesses in _active until they are
reaped explicitly so it can cleanup any that nolonger referenced anywhere
else.

The problem is lots of methods, including poll() and wait(), check
self.returncode and then modify it. Any non-atomic read/modify action is
inherently non-threadsafe. And _cleanup() calls poll() on all un-reaped
child processes. If two threads happen to try and spawn subprocesses at
once, these _cleanup() calls collide..

The way to fix this depends on how thread-safe you want to make it. If you
want to share popen objects between threads to wait()/poll() with impunity
from any thread, you should add a recursive lock attribute to the Popen
instance and have it lock/release it at the start/end of every method
call.

If you only care about using popen objects in one thread at a time, then
all you need to fix is the nasty "every popen created calls poll() on every
other living popen object regardless of what thread started them,
and poll() is not threadsafe" behaviour.

Removing _cleanup() is one way, but it will then not reap child processes
that you del'ed all references to (except the one in subprocess._active)
before you checked they were done.

Probably another good idea is to not append and remove popen objects to
_active directly, instead and add a popen.__del__() method that defers
GC'ing of non-finished popen objects by adding them to _active. This
way, _active only contains un-reaped child processes that were due to be
GC'ed. _cleanup() will then be responsible for polling and removing these
popen objects from _active when they are done.

However, this alone will not fix things because you are still calling
_cleanup() from different threads, it is still calling poll() on all these
un-reaped processes, and poll() is not threadsafe. So... you either have to
make poll() threadsafe (lock/unlock at the beginning/end of the method), or
make _cleanup() threadsafe. The reason you can get away with making only
_cleanup() threadsafe this way is _active will contain a list of processes
that are not referenced anywhere else, so you know the only thing that will
call poll() on them is the _cleanup() method.

----------------------------------------------------------------------

Comment By: Neal Norwitz (nnorwitz)
Date: 2007-06-07 07:26

Message:
Logged In: YES
user_id=33168
Originator: NO

Peter, could you take a look at this? Thanks.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1731717&group_id=5470
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com