Mailing List Archive

[Workqueue] crash in process_one_work
Hello Tejun/Lai,

I am seeing the following crash in 3.10.49 kernel.

[ 1133.893817] Unable to handle kernel NULL pointer dereference at virtual address 00000004
[ 1133.893821] pgd = c0004000
[ 1133.893827] [00000004] *pgd=00000000
[ 1133.893834] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
[ 1133.893841] Modules linked in:
[ 1133.893849] CPU: 2 PID: 5359 Comm: kworker/u8:20 Not tainted 3.10.28-g99b6153-00006-gc32dab7 #1
[ 1133.893859] task: d8c2aa00 ti: e79a4000 task.ti: e79a4000
[ 1133.893873] PC is at process_one_work+0x18/0x448
[ 1133.893878] LR is at process_one_work+0x14/0x448
[ 1133.893887] pc : [<c0135218>] lr : [<c0135214>] psr: 400f0093
sp : e79a5ef8 ip : daf7f100 fp : 00000089
[ 1133.893891] r10: daf7f118 r9 : ee80e820 r8 : ee80e800
[ 1133.893897] r7 : c111872e r6 : ee80e800 r5 : ed7cf150 r4 : daf7f100
[ 1133.893902] r3 : ffffffe0 r2 : 00000081 r1 : ed7cf150 r0 : 00000000
[ 1133.893908] Flags: nZcv IRQs off FIQs on Mode SVC_32 ISA ARM Segment kernel
[ 1133.893914] Control: 10c5383d Table: a7dbc06a DAC: 00000015

Here is the part of process_one_work() where the crash happens:

struct pool_workqueue *pwq = get_work_pwq(work);
struct worker_pool *pool = worker->pool;
bool cpu_intensive = pwq->wq->flags & WQ_CPU_INTENSIVE;

get_work_pwq() returned NULL because the WORK_STRUCT_PWQ flag was not
set in work_struct->data, and the crash happened while dereferencing
that NULL pointer. There is no NULL check here, which implies that
this condition is never expected to happen.
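
As a rough illustration of the decoding described above, here is a userspace model, not the kernel source. The bit positions and the layout of pool_workqueue are assumptions based on a 32-bit 3.10 build with CONFIG_DEBUG_OBJECTS_WORK disabled. The names get_work_pwq_model(), pwq_wq_load_address() and struct pool_workqueue_model are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace model of the 32-bit work_struct->data encoding.
 * Assumption: values mirror include/linux/workqueue.h in 3.10 with
 * CONFIG_DEBUG_OBJECTS_WORK disabled.
 */
enum {
	WORK_STRUCT_PWQ_BIT   = 2,  /* data carries a pool_workqueue pointer */
	WORK_STRUCT_FLAG_BITS = 8,  /* low bits reserved for flags and color */
	WORK_OFFQ_POOL_SHIFT  = 5,  /* off-queue pool ID starts at this bit */
	WORK_OFFQ_POOL_BITS   = 27  /* 32 - WORK_OFFQ_POOL_SHIFT */
};

#define WORK_STRUCT_PWQ          (UINT32_C(1) << WORK_STRUCT_PWQ_BIT)
#define WORK_STRUCT_WQ_DATA_MASK (~((UINT32_C(1) << WORK_STRUCT_FLAG_BITS) - 1))
#define WORK_OFFQ_POOL_NONE      ((UINT32_C(1) << WORK_OFFQ_POOL_BITS) - 1)
#define WORK_STRUCT_NO_POOL      (WORK_OFFQ_POOL_NONE << WORK_OFFQ_POOL_SHIFT)

/* Assumed layout of the first two members of struct pool_workqueue. */
struct pool_workqueue_model {
	uint32_t pool; /* struct worker_pool * on 32-bit ARM */
	uint32_t wq;   /* struct workqueue_struct * on 32-bit ARM */
};

/* Return the pwq address stored in data, or 0 (NULL) if the PWQ flag is clear. */
uint32_t get_work_pwq_model(uint32_t data)
{
	if (data & WORK_STRUCT_PWQ)
		return data & WORK_STRUCT_WQ_DATA_MASK;
	return 0;
}

/* Address that a read of pwq->wq would touch for a given pwq address. */
uint32_t pwq_wq_load_address(uint32_t pwq)
{
	return pwq + (uint32_t)offsetof(struct pool_workqueue_model, wq);
}
```

With data = 0xffffffe0 the PWQ bit (0x4) is clear, so the model returns 0. Under these assumptions, reading the wq member at offset 4 would touch address 0x00000004, which matches the faulting address in the oops.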

The corresponding work_struct looks like this:

crash> struct work_struct ed7cf150
struct work_struct {
  data = {
    counter = 0xffffffe0
  },
  entry = {
    next = 0xed7cf154,
    prev = 0xed7cf154
  },
  func = 0xc0140ac4 <async_run_entry_fn>
}

The value of data is 0xffffffe0, which is the value left by
INIT_WORK() or WORK_DATA_INIT().
This can happen if a driver calls INIT_WORK() on the same work_struct
again after queuing it.

The work_struct details above show that the work was queued from
kernel/async.c. async_schedule() dynamically allocates the
work_struct and queues it to system_unbound_wq, so there is no
possibility of INIT_WORK() being called on the same work again.

Here is the async_entry structure from kernel/async.c, as found in the ramdump:

crash> struct async_entry ed7cf140
struct async_entry {
  domain_list = {
    next = 0xed7cf140,
    prev = 0xed7cf140
  },
  global_list = {
    next = 0xed7cf148,
    prev = 0xed7cf148
  },
  work = {
    data = {
      counter = 0xffffffe0
    },
    entry = {
      next = 0xed7cf154,
      prev = 0xed7cf154
    },
    func = 0xc0140ac4 <async_run_entry_fn>
  },
  cookie = 0x263e5,
  func = 0xc074dda0 <dapm_post_sequence_async>,
  data = 0xed48432c,
  domain = 0xe5457dec
}

func points to dapm_post_sequence_async, and you can see that
domain_list and global_list are both empty. This shows that the work
has finished executing and nothing is pending in async.
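
For readers unfamiliar with the "empty" check used above, here is a minimal userspace copy of the kernel's circular list helpers (a sketch, not the kernel header). A node counts as empty when its next pointer refers to itself, which is what the dump shows for domain_list, global_list and work.entry. Note that the check reads only the node's own memory.

```c
#include <stdbool.h>

/* Minimal userspace copy of the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

/* An initialized, unlinked node points to itself in both directions. */
void INIT_LIST_HEAD(struct list_head *list)
{
	list->next = list;
	list->prev = list;
}

/* True when the node's next pointer refers back to the node itself. */
bool list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Link node in just before head, i.e. at the tail of the list. */
void list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Unlink node from its neighbours and leave it self-pointing. */
void list_del_init(struct list_head *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	INIT_LIST_HEAD(node);
}
```

A node that is re-initialized with INIT_LIST_HEAD() while still linked will also pass list_empty(), even though the list it was on still points to it. So a self-pointing entry in a dump is consistent with either a proper unlink or a re-initialization.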

But how did this work_struct come to be on the workqueue data structures?
Is there any corner case in the workqueue code that can fail to unlink
a work_struct from the pool_workqueue after executing it?

I would really appreciate your input and pointers.
Please let me know if you need any more information from the crashed system.

Thanks,
Arun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [Workqueue] crash in process_one_work
Hello, Arun.

On Mon, Sep 29, 2014 at 09:40:50PM +0530, Arun KS wrote:
...
> The value of data is 0xffffffe0, which is basically the value after an
> INIT_WORK() or WORK_DATA_INIT().
> This can happen if a driver calls INIT_WORK on same struct work again
> after queuing it.
>
> From the above details of the work_struct shows that the work is
> queued from kernel/async.c. async_schedule dynamically allocates the
> work_struct and queues it to system_unbonded_wq. And possibility of
> calling INIT_WORK on same work is not there.
>
> After inspecting ramdump for async_entry structure in kernel/async.c
>
> crash> struct async_entry ed7cf140
> struct async_entry {
> domain_list = {
> next = 0xed7cf140,
> prev = 0xed7cf140
> },
> global_list = {
> next = 0xed7cf148,
> prev = 0xed7cf148
> },
> work = {
> data = {
> counter = 0xffffffe0
> },
> entry = {
> next = 0xed7cf154,
> prev = 0xed7cf154
> },
> func = 0xc0140ac4 <async_run_entry_fn>
> },
> cookie = 0x263e5,
> func = 0xc074dda0 <dapm_post_sequence_async>,
> data = 0xed48432c,
> domain = 0xe5457dec
> }
>
> the func points to dapm_post_sequence_async. and you can see the
> domain_list and global_list is empty. Which shows that the work has
> finished execution and there is no pending execution in async.
>
> But how come this struct work was with work queue data structures?
> Is there any corner case in work queue which can miss unlinking the
> struct_work from pool_workqueue after executing them?

I sure hope not. How reproducible is the issue? Can you try w/
CONFIG_DEBUG_OBJECTS_WORK enabled?

Thanks.

--
tejun

Re: [Workqueue] crash in process_one_work
Hello Tejun,

On Mon, Oct 6, 2014 at 9:02 PM, Tejun Heo <tj@kernel.org> wrote:
> Hello, Arun.
>
> On Mon, Sep 29, 2014 at 09:40:50PM +0530, Arun KS wrote:
> ...
>> The value of data is 0xffffffe0, which is basically the value after an
>> INIT_WORK() or WORK_DATA_INIT().
>> This can happen if a driver calls INIT_WORK on same struct work again
>> after queuing it.
>>
>> From the above details of the work_struct shows that the work is
>> queued from kernel/async.c. async_schedule dynamically allocates the
>> work_struct and queues it to system_unbonded_wq. And possibility of
>> calling INIT_WORK on same work is not there.
>>
>> After inspecting ramdump for async_entry structure in kernel/async.c
>>
>> crash> struct async_entry ed7cf140
>> struct async_entry {
>> domain_list = {
>> next = 0xed7cf140,
>> prev = 0xed7cf140
>> },
>> global_list = {
>> next = 0xed7cf148,
>> prev = 0xed7cf148
>> },
>> work = {
>> data = {
>> counter = 0xffffffe0
>> },
>> entry = {
>> next = 0xed7cf154,
>> prev = 0xed7cf154
>> },
>> func = 0xc0140ac4 <async_run_entry_fn>
>> },
>> cookie = 0x263e5,
>> func = 0xc074dda0 <dapm_post_sequence_async>,
>> data = 0xed48432c,
>> domain = 0xe5457dec
>> }
>>
>> the func points to dapm_post_sequence_async. and you can see the
>> domain_list and global_list is empty. Which shows that the work has
>> finished execution and there is no pending execution in async.
>>
>> But how come this struct work was with work queue data structures?
>> Is there any corner case in work queue which can miss unlinking the
>> struct_work from pool_workqueue after executing them?
>
> I sure hope not. How reproducible is the issue? Can you try w/
> CONFIG_DEBUG_OBJECTS_WORK enabled?

Thanks for replying.
That was a problem in one of our drivers. It was freeing the
memory (the work_struct) without flushing the workqueue.
We caught the faulty driver by adding a BUG_ON() in INIT_WORK and
looking at the func pointer in the work_struct (which points to the
faulty driver's work function).

1) The faulty driver calls queue_work() on system_unbound_wq.
2) It frees the work_struct memory while the work is still queued on
the workqueue.
3) Another driver requests memory from SLAB, gets the same memory, and
calls INIT_WORK() on it.
4) process_one_work() tries to execute the work queued by the faulty
driver, resulting in a crash.
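
The four steps above can be replayed in a small userspace model. This is a simplified sketch under stated assumptions: the flag values are taken from the dump and from the assumed 3.10 bit layout, the queue holds a single item, and the "free" in step 2 is only simulated (the memory stays valid). All names ending in _model, plus replay() and the driver_*_fn callbacks, are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PENDING_FLAG UINT32_C(0x1)        /* assumed WORK_STRUCT_PENDING */
#define PWQ_FLAG     UINT32_C(0x4)        /* assumed WORK_STRUCT_PWQ */
#define NO_POOL_INIT UINT32_C(0xffffffe0) /* data value seen in the dump */

enum result { WORK_RAN, WORK_NONE, WORK_CORRUPT };

struct work_model {
	uint32_t data;
	void (*func)(struct work_model *);
};

/* One-slot "worklist": it keeps only a pointer into the caller's memory. */
struct queue_model {
	struct work_model *pending;
};

void driver_a_fn(struct work_model *w) { (void)w; }
void driver_b_fn(struct work_model *w) { (void)w; }

/* Model of INIT_WORK(): reset data and set the callback. */
void init_work_model(struct work_model *w, void (*func)(struct work_model *))
{
	w->data = NO_POOL_INIT;
	w->func = func;
}

/* Model of queue_work(): mark the item as queued and remember its address. */
void queue_work_model(struct queue_model *q, struct work_model *w)
{
	w->data = PENDING_FLAG | PWQ_FLAG; /* pwq pointer bits omitted */
	q->pending = w;
}

/* Model of cancel_work_sync(): make sure the queue no longer refers to w. */
void cancel_work_sync_model(struct queue_model *q, struct work_model *w)
{
	if (q->pending == w)
		q->pending = NULL;
}

/* Model of process_one_work(): report instead of dereferencing NULL. */
enum result process_one_work_model(struct queue_model *q)
{
	struct work_model *w = q->pending;

	q->pending = NULL;
	if (!w)
		return WORK_NONE;
	if (!(w->data & PWQ_FLAG))
		return WORK_CORRUPT; /* the real kernel oopses here */
	w->func(w);
	return WORK_RAN;
}

/* Replay steps 1-4; cancel_before_free selects the corrected ordering. */
enum result replay(int cancel_before_free)
{
	static struct work_model slab_object; /* stands in for one SLAB object */
	struct queue_model q = { NULL };

	init_work_model(&slab_object, driver_a_fn);
	queue_work_model(&q, &slab_object);               /* step 1 */
	if (cancel_before_free)
		cancel_work_sync_model(&q, &slab_object); /* the missing call */
	/* step 2: the driver frees the object (simulated only) */
	init_work_model(&slab_object, driver_b_fn);       /* step 3 */
	return process_one_work_model(&q);                /* step 4 */
}
```

In this model replay(0) reports WORK_CORRUPT and replay(1) reports WORK_NONE. In a real driver the corresponding fix would be to call cancel_work_sync() or flush_work() on the work item before freeing the memory that contains it.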


Thanks,
Arun


Re: [Workqueue] crash in process_one_work
Hello, Arun.

On Wed, Oct 08, 2014 at 05:30:20PM +0530, Arun KS wrote:
> > I sure hope not. How reproducible is the issue? Can you try w/
> > CONFIG_DEBUG_OBJECTS_WORK enabled?
>
> Thanks for replying.
> That was a problem with one of our driver. It was freeing the
> memory(struct work) without flushing workqueue.
> We caught faulty driver by adding a BUG_ON() in INIT_WORK and looking
> at the func pointer in work_struct( which will be pointing to the
> faulty driver work function)
>
> 1) faulty driver queue_work to system_unbownded_wq
> 2) free work_struct memory, but it is still queued in the work queue.
> 3) another driver request the memory from SLAB, go the same memory, it INIT_WORK
> 4) process work try to execute the work queued by the faulty driver,
> result in a crash.

Ah, good to hear. I think bugs like the above should be detectable
with CONFIG_DEBUG_OBJECTS_WORK, so if you see something similar next
time, please try it out.
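
For reference, a kernel config fragment along the following lines should enable that checking. The option names come from lib/Kconfig.debug. CONFIG_DEBUG_OBJECTS_WORK depends on CONFIG_DEBUG_OBJECTS. Including CONFIG_DEBUG_OBJECTS_FREE is an assumption here: it is the option that reports still-active objects being freed, which is the case described above. Please verify the names against your own tree.

```
CONFIG_DEBUG_OBJECTS=y
CONFIG_DEBUG_OBJECTS_FREE=y
CONFIG_DEBUG_OBJECTS_WORK=y
```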

Thanks.

--
tejun