Mailing List Archive

[BUG v6.3-rc4+] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_sysfs.c:879 cooling_device_stats_setup+0xac/0xc0
In preparation to adding my patch that checks for some kinds of bugs in
trace events, I decided to run it on the Linus's latest branch, to see if
there's any other trace events that may cause issues. But instead I hit
this unrelated bug. Looks to be triggering an added lockdep_assert() on
boot up.

static void cooling_device_stats_setup(struct thermal_cooling_device *cdev)
{
const struct attribute_group *stats_attr_group = NULL;
struct cooling_dev_stats *stats;
/* Total number of states is highest state + 1 */
unsigned long states = cdev->max_state + 1;
int var;

lockdep_assert_held(&cdev->lock); <<<---- line 879


[ 12.379345] ------------[ cut here ]------------
[ 12.380501] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_sysfs.c:879 cooling_device_stats_setup+0xac/0xc0
[ 12.382844] Modules linked in:
[ 12.383657] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.3.0-rc4-test-00040-g077623e2804f-dirty #12
[ 12.385753] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 12.387921] RIP: 0010:cooling_device_stats_setup+0xac/0xc0
[ 12.389256] Code: ff 48 89 1d 2e 8d 07 01 5b 5d 41 5c c3 cc cc cc cc 48 8d bf 18 05 00 00 be ff ff ff ff e8 dc c1 36 00 85 c0 0f 85 72 ff ff ff <0f> 0b e9 6b ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90
[ 12.394812] RSP: 0000:ffffa4d580013c60 EFLAGS: 00010246
[ 12.396076] RAX: 0000000000000000 RBX: ffff99cc82e4e000 RCX: 0000000000000001
[ 12.397747] RDX: 0000000000000000 RSI: ffffffff9b2e7401 RDI: ffffffff9b316e4c
[ 12.399398] RBP: 0000000000000001 R08: 0000000000000001 R09: 0000000000000001
[ 12.401057] R10: ffff99cc80262340 R11: ffff99cc80263138 R12: ffff99cc82e4e000
[ 12.402698] R13: ffff99cc82e4e018 R14: ffff99cc80bc7800 R15: 0000000000000000
[ 12.404337] FS: 0000000000000000(0000) GS:ffff99ccfbc00000(0000) knlGS:0000000000000000
[ 12.406266] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 12.407632] CR2: ffff99ccb7401000 CR3: 0000000134a64001 CR4: 0000000000170ef0
[ 12.409313] Call Trace:
[ 12.410001] <TASK>
[ 12.410618] thermal_cooling_device_setup_sysfs+0x12/0x30
[ 12.411910] __thermal_cooling_device_register.part.0+0x13c/0x3d0
[ 12.413372] acpi_processor_thermal_init+0x26/0x100
[ 12.414557] __acpi_processor_start+0x7f/0xf0
[ 12.415628] acpi_processor_start+0x30/0x50
[ 12.416684] really_probe+0x1a2/0x400
[ 12.417635] ? __pfx___driver_attach+0x10/0x10
[ 12.418726] __driver_probe_device+0x78/0x170
[ 12.419800] driver_probe_device+0x1f/0x90
[ 12.423242] __driver_attach+0xd6/0x1d0
[ 12.424220] bus_for_each_dev+0x8f/0xe0
[ 12.425210] bus_add_driver+0x119/0x220
[ 12.426174] driver_register+0x59/0x100
[ 12.427139] ? __pfx_acpi_processor_driver_init+0x10/0x10
[ 12.428427] acpi_processor_driver_init+0x3f/0xc0
[ 12.429582] ? __pfx_acpi_processor_driver_init+0x10/0x10
[ 12.430865] do_one_initcall+0x74/0x290
[ 12.431839] kernel_init_freeable+0x3cc/0x580
[ 12.432938] ? __pfx_kernel_init+0x10/0x10
[ 12.433957] kernel_init+0x1a/0x1c0
[ 12.434847] ret_from_fork+0x2c/0x50
[ 12.435772] </TASK>
[ 12.436402] irq event stamp: 10911001
[ 12.437356] hardirqs last enabled at (10911013): [<ffffffff99e6513e>] __up_console_sem+0x5e/0x70
[ 12.439440] hardirqs last disabled at (10911022): [<ffffffff99e65123>] __up_console_sem+0x43/0x70
[ 12.441531] softirqs last enabled at (10910744): [<ffffffff9abd17a2>] __do_softirq+0x322/0x429
[ 12.443577] softirqs last disabled at (10910739): [<ffffffff99dc27e7>] __irq_exit_rcu+0xc7/0x150
[ 12.445651] ---[ end trace 0000000000000000 ]---
[ 12.446965] ------------[ cut here ]------------

Config attached.

-- Steve
Re: [BUG v6.3-rc4+] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_sysfs.c:879 cooling_device_stats_setup+0xac/0xc0 [ In reply to ]
On Wed, 29 Mar 2023 16:58:19 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> In preparation to adding my patch that checks for some kinds of bugs in
> trace events, I decided to run it on the Linus's latest branch, to see if
> there's any other trace events that may cause issues. But instead I hit
> this unrelated bug. Looks to be triggering an added lockdep_assert() on
> boot up.

The bug happened on this commit, from Linus's tree:

fcd476ea6a888ef6e6627f4c21a2ea8cca3e9312

In case it was fixed after that.

-- Steve
Re: [BUG v6.3-rc4+] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_sysfs.c:879 cooling_device_stats_setup+0xac/0xc0 [ In reply to ]
On Wed, 29 Mar 2023 16:59:44 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> The bug happened on this commit, from Linus's tree:
>
> fcd476ea6a888ef6e6627f4c21a2ea8cca3e9312

Heh, and I forgot to update Linus's latest for today.

ffe78bbd512166e0ef1cc4858010b128c510ed7d

which has the other fix I was going to test.

But I still don't see anything that would prevent the bug I reported.

-- Steve
Re: [BUG v6.3-rc4+] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_sysfs.c:879 cooling_device_stats_setup+0xac/0xc0 [ In reply to ]
On Wed, Mar 29, 2023 at 1:58?PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> In preparation to adding my patch that checks for some kinds of bugs in
> trace events, I decided to run it on the Linus's latest branch, to see if
> there's any other trace events that may cause issues. But instead I hit
> this unrelated bug. Looks to be triggering an added lockdep_assert() on
> boot up.

So I think that lockdep assert is likely bogus.

It was added in commit 790930f44289 ("thermal: core: Introduce
thermal_cooling_device_update()") but the reason I say it's bogus is
that I don't think it has ever been tested:

> static void cooling_device_stats_setup(struct thermal_cooling_device *cdev)
> {
> lockdep_assert_held(&cdev->lock); <<<---- line 879

Yeah, so cooling_device_stats_setup() is called from two places:

- thermal_cooling_device_setup_sysfs()

- thermal_cooling_device_stats_reinit()

and that first place is when that cdev is created, before it's
registered anywhere. It's not locked in that case, and yes, the
lockdep_assert_held() will trigger.

As far as I can tell it will always trigger, and this lockdep_assert()
has thus never been tested with lockdep enabled.

The "stats_reinit" case seems to also be called from only one place
(thermal_cooling_device_update()), and that path does indeed hold the
cdev->lock.

That lockdep could be made happy by having
thermal_cooling_device_setup_sysfs() create that device with the cdev
lock held. I guess that's easy enough, although somewhat annoyingly
there is no "mutex_init_locked()", you have to actually do
"mutex_init()" followed by a "mutex_lock()". And obviously unlock it
after doing the setup_sysfs().

But I question whether the lockdep test should be done at all. I find
it distasteful that it was added with absolutely zero testing.

Linus
Re: [BUG v6.3-rc4+] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_sysfs.c:879 cooling_device_stats_setup+0xac/0xc0 [ In reply to ]
Hi, Linus,

On Wed, 2023-03-29 at 15:52 -0700, Linus Torvalds wrote:
> On Wed, Mar 29, 2023 at 1:58?PM Steven Rostedt <rostedt@goodmis.org>
> wrote:
> > In preparation to adding my patch that checks for some kinds of
> > bugs in
> > trace events, I decided to run it on the Linus's latest branch, to
> > see if
> > there's any other trace events that may cause issues. But instead I
> > hit
> > this unrelated bug. Looks to be triggering an added
> > lockdep_assert() on
> > boot up.
>
> So I think that lockdep assert is likely bogus.
>
> It was added in commit 790930f44289 ("thermal: core: Introduce
> thermal_cooling_device_update()") but the reason I say it's bogus is
> that I don't think it has ever been tested:
>
> > static void cooling_device_stats_setup(struct
> > thermal_cooling_device *cdev)
> > {
> > lockdep_assert_held(&cdev->lock); <<<---- line 879
>
> Yeah, so cooling_device_stats_setup() is called from two places:
>
> - thermal_cooling_device_setup_sysfs()
>
> - thermal_cooling_device_stats_reinit()
>
> and that first place is when that cdev is created, before it's
> registered anywhere. It's not locked in that case, and yes, the
> lockdep_assert_held() will trigger.
>
> As far as I can tell it will always trigger, and this
> lockdep_assert()
> has thus never been tested with lockdep enabled.
>
> The "stats_reinit" case seems to also be called from only one place
> (thermal_cooling_device_update()), and that path does indeed hold the
> cdev->lock.
>
> That lockdep could be made happy by having
> thermal_cooling_device_setup_sysfs() create that device with the cdev
> lock held. I guess that's easy enough, although somewhat annoyingly
> there is no "mutex_init_locked()", you have to actually do
> "mutex_init()" followed by a "mutex_lock()". And obviously unlock it
> after doing the setup_sysfs().
>
> But I question whether the lockdep test should be done at all. I find
> it distasteful that it was added with absolutely zero testing.
>
>
I just realized why I cannot reproduce this problem on my testbox.

In order to test the original patch with ACPI passive cooling enabled,
I rebuild the kernel with customized DSDT.
This taints the kernel, and clears the debug_locks, thus I didn't get
any lockdep warnings...

thanks,
rui
Re: [BUG v6.3-rc4+] WARNING: CPU: 0 PID: 1 at drivers/thermal/thermal_sysfs.c:879 cooling_device_stats_setup+0xac/0xc0 [ In reply to ]
On 3/30/2023 11:57 AM, Zhang, Rui wrote:
> Hi, Linus,
>
> On Wed, 2023-03-29 at 15:52 -0700, Linus Torvalds wrote:
>> On Wed, Mar 29, 2023 at 1:58?PM Steven Rostedt <rostedt@goodmis.org>
>> wrote:
>>> In preparation to adding my patch that checks for some kinds of
>>> bugs in
>>> trace events, I decided to run it on the Linus's latest branch, to
>>> see if
>>> there's any other trace events that may cause issues. But instead I
>>> hit
>>> this unrelated bug. Looks to be triggering an added
>>> lockdep_assert() on
>>> boot up.
>> So I think that lockdep assert is likely bogus.

It is, sorry about that.

I have a fix queued up,
https://patchwork.kernel.org/project/linux-pm/patch/2681615.mvXUDI8C0e@kreacher


>> It was added in commit 790930f44289 ("thermal: core: Introduce
>> thermal_cooling_device_update()") but the reason I say it's bogus is
>> that I don't think it has ever been tested:
>>
>>> static void cooling_device_stats_setup(struct
>>> thermal_cooling_device *cdev)
>>> {
>>> lockdep_assert_held(&cdev->lock); <<<---- line 879
>> Yeah, so cooling_device_stats_setup() is called from two places:
>>
>> - thermal_cooling_device_setup_sysfs()
>>
>> - thermal_cooling_device_stats_reinit()
>>
>> and that first place is when that cdev is created, before it's
>> registered anywhere. It's not locked in that case, and yes, the
>> lockdep_assert_held() will trigger.
>>
>> As far as I can tell it will always trigger, and this
>> lockdep_assert()
>> has thus never been tested with lockdep enabled.
>>
>> The "stats_reinit" case seems to also be called from only one place
>> (thermal_cooling_device_update()), and that path does indeed hold the
>> cdev->lock.
>>
>> That lockdep could be made happy by having
>> thermal_cooling_device_setup_sysfs() create that device with the cdev
>> lock held. I guess that's easy enough, although somewhat annoyingly
>> there is no "mutex_init_locked()", you have to actually do
>> "mutex_init()" followed by a "mutex_lock()". And obviously unlock it
>> after doing the setup_sysfs().
>>
>> But I question whether the lockdep test should be done at all. I find
>> it distasteful that it was added with absolutely zero testing.
>>
>>
> I just realized why I cannot reproduce this problem on my testbox.
>
> In order to test the original patch with ACPI passive cooling enabled,
> I rebuild the kernel with customized DSDT.
> This taints the kernel, and clears the debug_locks, thus I didn't get
> any lockdep warnings...

So I was happy when I got a Tested-by from Rui on this and I didn't
double check.  Turns out this was a mistake.

Cheers,

Rafael