Mailing List Archive

[PATCH rc1-mm] de_thread: fix deadlockable process addition
On top of
task-make-task-list-manipulations-rcu-safe.patch

This patch
pidhash-kill-switch_exec_pids.patch

changed de_thread() so that it doesn't remove 'leader' from it's thread group.
However de_thread() still adds current to init_task.tasks without removing
'leader' from this list. What if another CLONE_VM task starts do_coredump()
after de_thread() drops tasklist_lock but before it calls release_task(leader) ?

do_coredump()->zap_threads() will find this thread group twice on init_task.tasks
list. And it will increment mm->core_waiters twice for the new leader (current in
de_thread). So, exit_mm()->complete(mm->core_startup_done) doesn't happen, and we
have a deadlock.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>

--- MM/fs/exec.c~0_DET 2006-04-06 22:37:33.000000000 +0400
+++ MM/fs/exec.c 2006-04-06 22:51:51.000000000 +0400
@@ -713,7 +713,7 @@ static int de_thread(struct task_struct
attach_pid(current, PIDTYPE_PID, current->pid);
attach_pid(current, PIDTYPE_PGID, current->signal->pgrp);
attach_pid(current, PIDTYPE_SID, current->signal->session);
- list_add_tail_rcu(&current->tasks, &init_task.tasks);
+ list_replace_rcu(&leader->tasks, &current->tasks);

current->parent = current->real_parent = leader->real_parent;
leader->parent = leader->real_parent = child_reaper;
--- MM/kernel/exit.c~0_DET 2006-03-23 23:02:53.000000000 +0300
+++ MM/kernel/exit.c 2006-04-06 23:01:37.000000000 +0400
@@ -55,7 +55,9 @@ static void __unhash_process(struct task
detach_pid(p, PIDTYPE_PGID);
detach_pid(p, PIDTYPE_SID);

- list_del_rcu(&p->tasks);
+ /* see de_thread()->list_replace_rcu() */
+ if (likely(p->tasks.prev != LIST_POISON2))
+ list_del_rcu(&p->tasks);
__get_cpu_var(process_counts)--;
}
list_del_rcu(&p->thread_group);

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH rc1-mm] de_thread: fix deadlockable process addition [ In reply to ]
Oleg Nesterov <oleg@tv-sign.ru> writes:

> On top of
> task-make-task-list-manipulations-rcu-safe.patch
>
> This patch
> pidhash-kill-switch_exec_pids.patch
>
> changed de_thread() so that it doesn't remove 'leader' from it's thread group.
> However de_thread() still adds current to init_task.tasks without removing
> 'leader' from this list. What if another CLONE_VM task starts do_coredump()
> after de_thread() drops tasklist_lock but before it calls release_task(leader) ?
>
> do_coredump()->zap_threads() will find this thread group twice on
> init_task.tasks
> list. And it will increment mm->core_waiters twice for the new leader (current
> in
> de_thread). So, exit_mm()->complete(mm->core_startup_done) doesn't happen, and
> we
> have a deadlock.

Ack. The evils of de_thread!

We need this to keep from seeing the same task twice in
do_each_thread.

This bug is in 2.6.17-rc1 so this code needs to go to Linus
sometime soon.

Eric


> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
>
> --- MM/fs/exec.c~0_DET 2006-04-06 22:37:33.000000000 +0400
> +++ MM/fs/exec.c 2006-04-06 22:51:51.000000000 +0400
> @@ -713,7 +713,7 @@ static int de_thread(struct task_struct
> attach_pid(current, PIDTYPE_PID, current->pid);
> attach_pid(current, PIDTYPE_PGID, current->signal->pgrp);
> attach_pid(current, PIDTYPE_SID, current->signal->session);
> - list_add_tail_rcu(&current->tasks, &init_task.tasks);
> + list_replace_rcu(&leader->tasks, &current->tasks);
>
> current->parent = current->real_parent = leader->real_parent;
> leader->parent = leader->real_parent = child_reaper;
> --- MM/kernel/exit.c~0_DET 2006-03-23 23:02:53.000000000 +0300
> +++ MM/kernel/exit.c 2006-04-06 23:01:37.000000000 +0400
> @@ -55,7 +55,9 @@ static void __unhash_process(struct task
> detach_pid(p, PIDTYPE_PGID);
> detach_pid(p, PIDTYPE_SID);
>
> - list_del_rcu(&p->tasks);
> + /* see de_thread()->list_replace_rcu() */
> + if (likely(p->tasks.prev != LIST_POISON2))
> + list_del_rcu(&p->tasks);
> __get_cpu_var(process_counts)--;
> }
> list_del_rcu(&p->thread_group);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH rc1-mm] de_thread: fix deadlockable process addition [ In reply to ]
On 04/06, Eric W. Biederman wrote:
>
> Ack. The evils of de_thread!

Yes, just noticed another thing,

[PATCH] task-make-task-list-manipulations-rcu-safe-fix-fix

We shoudn't decrement process_counts if de_thread() unhashed the task.

Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>

--- MM/kernel/exit.c~ 2006-04-06 23:01:37.000000000 +0400
+++ MM/kernel/exit.c 2006-04-08 03:35:24.000000000 +0400
@@ -56,9 +56,10 @@ static void __unhash_process(struct task
detach_pid(p, PIDTYPE_SID);

/* see de_thread()->list_replace_rcu() */
- if (likely(p->tasks.prev != LIST_POISON2))
+ if (likely(p->tasks.prev != LIST_POISON2)) {
list_del_rcu(&p->tasks);
- __get_cpu_var(process_counts)--;
+ __get_cpu_var(process_counts)--;
+ }
}
list_del_rcu(&p->thread_group);
remove_parent(p);

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH rc1-mm] de_thread: fix deadlockable process addition [ In reply to ]
Oleg Nesterov <oleg@tv-sign.ru> wrote:
>
> - if (likely(p->tasks.prev != LIST_POISON2))
> + if (likely(p->tasks.prev != LIST_POISON2)) {

argh.

c'mon guys, we can't put a dependency on list_head poisoning into generic
code.

We have one in detach_timer() but that's just debugging.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH rc1-mm] de_thread: fix deadlockable process addition [ In reply to ]
Andrew Morton <akpm@osdl.org> wrote:
>
> Oleg Nesterov <oleg@tv-sign.ru> wrote:
> >
> > - if (likely(p->tasks.prev != LIST_POISON2))
> > + if (likely(p->tasks.prev != LIST_POISON2)) {
>
> argh.
>
> c'mon guys, we can't put a dependency on list_head poisoning into generic
> code.
>

A suitable fix might be to add a new list_del_poison() (or
list_del_rcu_something()?) and use that everywhere.

But it should use a different poisoning pattern, so we know that the kernel
will still work correctly when someone removes the list_head debugging.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH rc1-mm] de_thread: fix deadlockable process addition [ In reply to ]
Andrew Morton <akpm@osdl.org> writes:

> Andrew Morton <akpm@osdl.org> wrote:
>>
>> Oleg Nesterov <oleg@tv-sign.ru> wrote:
>> >
>> > - if (likely(p->tasks.prev != LIST_POISON2))
>> > + if (likely(p->tasks.prev != LIST_POISON2)) {
>>
>> argh.
>>
>> c'mon guys, we can't put a dependency on list_head poisoning into generic
>> code.
>>
>
> A suitable fix might be to add a new list_del_poison() (or
> list_del_rcu_something()?) and use that everywhere.
>
> But it should use a different poisoning pattern, so we know that the kernel
> will still work correctly when someone removes the list_head debugging.

Agreed. That is ugly. I would recommend some new functions
list functions but in thinking about this I believe I see
how we can avoid this case completely.

The first step is to optimize thread_group_leader to be
defined in terms of tasks and not process ids. Which
is one less pointer dereference.

The second step is to modify de_thread to reduce the
old thread group leader to a thread. This requires changing the
leaders parents, changing the leaders thread_group leader, unhashing
the leader from the process group and session, and removing
the leader from the task list.

With those two changes exit.c should not need to account for
the de_thread case.

Oleg please take a hard look and see if you can find anything
that will break with the patch below.

I believe that is all that is needed to cleanly keep do_each_thread
from seeing a single thread multiple times.

Eric

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 541f482..2964a2c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1203,7 +1203,7 @@ extern void wait_task_inactive(task_t *
#define while_each_thread(g, t) \
while ((t = next_thread(t)) != g)

-#define thread_group_leader(p) (p->pid == p->tgid)
+#define thread_group_leader(p) (p == p->group_leader)

static inline task_t *next_thread(task_t *p)
{


diff --git a/fs/exec.c b/fs/exec.c
index 0291a68..9b0f9c4 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -721,9 +721,14 @@ static int de_thread(struct task_struct
list_add_tail(&current->tasks, &init_task.tasks);

current->parent = current->real_parent = leader->real_parent;
- leader->parent = leader->real_parent = child_reaper;
+ leader->parent = leader->real_parent = current;
current->group_leader = current;
- leader->group_leader = leader;
+ leader->group_leader = current;
+
+ /* Reduce leader to a thread */
+ detach_pid(leader, PIDTYPE_PGID, current->signal->pgrp);
+ detach_pid(leader, PIDTYPE_SID current->signal->session);
+ list_del_init(&leader->tasks);

add_parent(current);
add_parent(leader);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH rc1-mm] de_thread: fix deadlockable process addition [ In reply to ]
On 04/08, Eric W. Biederman wrote:
>
> Agreed. That is ugly.

Yes, I agree also.

>
> -#define thread_group_leader(p) (p->pid == p->tgid)
> +#define thread_group_leader(p) (p == p->group_leader)
>
> ...
>
> - leader->group_leader = leader;
> + leader->group_leader = current;

I thought about similar change too, but I am unsure about
release_task(old_leader)->proc_flush_task() path (because
I don't understand this code).

This change can confuse next_tid(), but this is minor.
I don't see other problems.

However, I think we can do something better instead of
attach_pid(current)/detach_pid(leader):

void exec_pid(task_t *old, task_t * new, enum pid_type type)
{
new->pids[type].pid = old->pids[type].pid;
hlist_replace_rcu(&old->pids[type].node, &new->pids[type].node);
old->pids[type].pid = NULL;
}

So de_thread() can do

exec_pid(leader, current, PIDTYPE_PGID);
exec_pid(leader, current, PIDTYPE_SID);

This allows us to iterate over pgrp/session lockless without
seeing the same task twice, btw. But may be it is just unneeded
complication.

> This requires changing the leaders parents
>
> current->parent = current->real_parent = leader->real_parent;
> - leader->parent = leader->real_parent = child_reaper;
> + leader->parent = leader->real_parent = current;
> current->group_leader = current;

I don't understand why do we need this change.

Actually, I think leader doesn't need reparenting at all.
ptrace_unlink(leader) already restored leader->parent = ->real_parent
and ->sibling. So I think we can do (for review only, should go in a
separate patch) this:

--- MM/fs/exec.c~ 2006-04-08 02:19:15.000000000 +0400
+++ MM/fs/exec.c 2006-04-08 18:50:35.000000000 +0400
@@ -704,7 +704,6 @@ static int de_thread(struct task_struct
ptrace_unlink(current);
ptrace_unlink(leader);
remove_parent(current);
- remove_parent(leader);


/* Become a process group leader with the old leader's pid.
@@ -718,13 +717,11 @@ static int de_thread(struct task_struct
attach_pid(current, PIDTYPE_SID, current->signal->session);
list_replace_rcu(&leader->tasks, &current->tasks);

- current->parent = current->real_parent = leader->real_parent;
- leader->parent = leader->real_parent = child_reaper;
+ current->parent = current->real_parent = leader->parent;
current->group_leader = current;
leader->group_leader = leader;

add_parent(current);
- add_parent(leader);
if (ptrace) {
current->ptrace = ptrace;
__ptrace_link(current, parent);

Oleg.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH rc1-mm] de_thread: fix deadlockable process addition [ In reply to ]
Oleg Nesterov <oleg@tv-sign.ru> writes:

> On 04/08, Eric W. Biederman wrote:
>>
>> Agreed. That is ugly.
>
> Yes, I agree also.
>
>>
>> -#define thread_group_leader(p) (p->pid == p->tgid)
>> +#define thread_group_leader(p) (p == p->group_leader)
>>
>> ...
>>
>> - leader->group_leader = leader;
>> + leader->group_leader = current;
>
> I thought about similar change too, but I am unsure about
> release_task(old_leader)->proc_flush_task() path (because
> I don't understand this code).

proc_flush_task() is purely an optimization to kill the
proc dcache entries when a process exits. So we don't
plug up the dcache with old proc entries.

All it looks at currently and pid an tgid.

So proc_flush_task() should not be a non-issue.

proc_pid_unhash() in -linus is a little more
serious but it only unhashes things.

> This change can confuse next_tid(), but this is minor.
> I don't see other problems.

next_tid?

> However, I think we can do something better instead of
> attach_pid(current)/detach_pid(leader):
>
> void exec_pid(task_t *old, task_t * new, enum pid_type type)
> {
> new->pids[type].pid = old->pids[type].pid;
> hlist_replace_rcu(&old->pids[type].node, &new->pids[type].node);
> old->pids[type].pid = NULL;
> }
>
> So de_thread() can do
>
> exec_pid(leader, current, PIDTYPE_PGID);
> exec_pid(leader, current, PIDTYPE_SID);
>
> This allows us to iterate over pgrp/session lockless without
> seeing the same task twice, btw. But may be it is just unneeded
> complication.

I think it may be worthwhile. Currently we can't do any process
group or session traversal lockless so it isn't worth looking
at until we get the bug fix handled.

Ultimately I think I would like PGID and SID to live in ->signal
in which case we would not need to touch them at all.

>> This requires changing the leaders parents
>>
>> current->parent = current->real_parent = leader->real_parent;
>> - leader->parent = leader->real_parent = child_reaper;
>> + leader->parent = leader->real_parent = current;
>> current->group_leader = current;
>
> I don't understand why do we need this change.

I was just trying to come as close as I could to normal
thread semantics. The fewer special cases in de_thread,
the fewer problems it is.

> Actually, I think leader doesn't need reparenting at all.
> ptrace_unlink(leader) already restored leader->parent = ->real_parent
> and ->sibling. So I think we can do (for review only, should go in a
> separate patch) this:

Duh. All processes in a thread group share the same real_parent.
It looks like the only practical thing we accomplish
with remove_parent/add_parent is to change our place on
our parents list of children.

Since ptrace_link and ptrace_unlink already does this we don't have a
guaranteed order on that list, so skipping the order change
should definitely be safe.

This means your patch doesn't go far enough. We should be
able to kill all of the parent list manipulation in
de_thread. Doing reduces the places that assign
real_parent to just fork and exit.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH rc1-mm] de_thread: fix deadlockable process addition [ In reply to ]
On 04/08, Eric W. Biederman wrote:
> Oleg Nesterov <oleg@tv-sign.ru> writes:
>
> > This change can confuse next_tid(), but this is minor.
> > I don't see other problems.
>
> next_tid?

proc_task_readdir:

first_tid() returns old_leader

next_tid() returns new_leader

de_thread:
old_leader->group_leader = new_leader;


next_rid() returns old_leader again,
because it is not thread_group_leader()
anymore


> This means your patch doesn't go far enough. We should be
> able to kill all of the parent list manipulation in
> de_thread. Doing reduces the places that assign
> real_parent to just fork and exit.

Yes!

I think I understand why we had the reason to reparent 'leader'
in the past. We used to set leader->exit_state = EXIT_ZOMBIE,
so without reparenting current's parent could have a bogus do_wait()
result if this do_wait() happens before release_task(leader).

Now we set leader->exit_state = EXIT_DEAD, which means this task
is not visible to do_wait().

Oleg.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH rc1-mm] de_thread: fix deadlockable process addition [ In reply to ]
On 04/09, Oleg Nesterov wrote:
>
> proc_task_readdir:
>
> first_tid() returns old_leader
>
> next_tid() returns new_leader
>
> de_thread:
> old_leader->group_leader = new_leader;
>
>
> next_rid() returns old_leader again,
> because it is not thread_group_leader()
> anymore

I think something like this for next_tid() is sufficient:

- if (thread_group_leader(pos))
+ if (pos->pid == pos->tgid)

We can also do the same change in first_tgid().

Oleg.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH rc1-mm] de_thread: fix deadlockable process addition [ In reply to ]
Oleg Nesterov <oleg@tv-sign.ru> writes:

> On 04/09, Oleg Nesterov wrote:
>>
>> proc_task_readdir:
>>
>> first_tid() returns old_leader
>>
>> next_tid() returns new_leader
>>
>> de_thread:
>> old_leader->group_leader = new_leader;
>>
>>
>> next_rid() returns old_leader again,
>> because it is not thread_group_leader()
>> anymore
>
> I think something like this for next_tid() is sufficient:
>
> - if (thread_group_leader(pos))
> + if (pos->pid == pos->tgid)
>
> We can also do the same change in first_tgid().

But the loop will terminate when things settle down,
and we can always use my original test of pos == start->group_leader.

I was worried about the stable kernel case and next_tid was
such a generic name I failed to associate it with /proc.

Short of confusion (and de_thread invariably introduces that)
I don't see anything that the code will actually do wrong.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/