Mailing List Archive

[PATCH (REPOST)] profiling: initialize prof_cpu_mask from profile_online_cpu()
syzbot is reporting uninit-value at profile_hits(), for commit acd895795d35
("profiling: fix broken profiling regression") by error initialized
prof_cpu_mask too early.

do_profile_hits() is called from profile_tick() from timer interrupt
only if cpumask_test_cpu(smp_processor_id(), prof_cpu_mask) is true and
prof_buffer is not NULL. But the syzbot's report says that profile_hits()
was called while current thread is still doing vzalloc(buffer_bytes)
where prof_buffer is NULL at this moment. This indicates two things.

One is that cpumask_set_cpu(cpu, prof_cpu_mask) should have been called
from profile_online_cpu() from cpuhp_setup_state() only after
profile_init() completed. Fix this by explicitly calling cpumask_copy()
from create_proc_profile() on only UP kernels.

The other is that multiple threads concurrently tried to write to
/sys/kernel/profiling interface, which caused that somebody else tried
to re-initialize prof_buffer despite somebody has already initialized
prof_buffer. Fix this by using serialization.

Reported-by: syzbot+b1a83ab2a9eb9321fbdd@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b1a83ab2a9eb9321fbdd
Fixes: acd895795d35 ("profiling: fix broken profiling regression")
Tested-by: syzbot+b1a83ab2a9eb9321fbdd@syzkaller.appspotmail.com
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
Can somebody test this patch? I don't know how not using
cpu_possible_mask affects expected profile data collection
(especially when written to /sys/kernel/profiling interface
when some of CPUs are offline).

kernel/ksysfs.c | 27 ++++++++++++++++++++++-----
kernel/profile.c | 6 +++---
2 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index 495b69a71a5d..0540483ae7f0 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -91,10 +91,23 @@ static ssize_t profiling_store(struct kobject *kobj,
struct kobj_attribute *attr,
const char *buf, size_t count)
{
+ static DEFINE_MUTEX(lock);
int ret;

- if (prof_on)
- return -EEXIST;
+ /*
+ * We need serialization, for profile_setup() initializes prof_on
+ * value. Also, use killable wait in case memory allocation from
+ * profile_init() triggered the OOM killer and chose current thread
+ * blocked here.
+ */
+ if (mutex_lock_killable(&lock))
+ return -EINTR;
+
+ if (prof_on) {
+ count = -EEXIST;
+ goto out;
+ }
+
/*
* This eventually calls into get_option() which
* has a ton of callers and is not const. It is
@@ -102,11 +115,15 @@ static ssize_t profiling_store(struct kobject *kobj,
*/
profile_setup((char *)buf);
ret = profile_init();
- if (ret)
- return ret;
+ if (ret) {
+ count = ret;
+ goto out;
+ }
ret = create_proc_profile();
if (ret)
- return ret;
+ count = ret;
+out:
+ mutex_unlock(&lock);
return count;
}
KERNEL_ATTR_RW(profiling);
diff --git a/kernel/profile.c b/kernel/profile.c
index 2b775cc5c28f..1a6c1cf98485 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -114,11 +114,9 @@ int __ref profile_init(void)

buffer_bytes = prof_len*sizeof(atomic_t);

- if (!alloc_cpumask_var(&prof_cpu_mask, GFP_KERNEL))
+ if (!zalloc_cpumask_var(&prof_cpu_mask, GFP_KERNEL))
return -ENOMEM;

- cpumask_copy(prof_cpu_mask, cpu_possible_mask);
-
prof_buffer = kzalloc(buffer_bytes, GFP_KERNEL|__GFP_NOWARN);
if (prof_buffer)
return 0;
@@ -438,6 +436,8 @@ int __ref create_proc_profile(void)
goto err_state_prep;
online_state = err;
err = 0;
+#else
+ cpumask_copy(prof_cpu_mask, cpu_possible_mask);
#endif
entry = proc_create("profile", S_IWUSR | S_IRUGO,
NULL, &profile_proc_ops);
--
2.18.4
Re: [PATCH (REPOST)] profiling: initialize prof_cpu_mask from profile_online_cpu() [ In reply to ]
On 2024/04/27 15:51, Tetsuo Handa wrote:
> Can somebody test this patch? I don't know how not using
> cpu_possible_mask affects expected profile data collection
> (especially when written to /sys/kernel/profiling interface
> when some of CPUs are offline).

I confirmed that writing to /sys/kernel/profiling while some of
/sys/devices/system/cpu/cpu$num/online are 0 will not affect profile data
collection, for profile_online_cpu() is called (and corresponding bit is set)
when /sys/devices/system/cpu/cpu$num/online becomes 1. Thus, I consider that
this patch itself is correct and safe.

But I found a different problem with current code. Just running

# echo -n sleep > /sys/kernel/profiling

after boot causes the system to lock up. ;-)



[ 102.002703] kernel profiling enabled schedstats, disable via kernel.sched_schedstats.
[ 102.013462] kernel sleep profiling enabled (shift: 0)
[ 102.019515]
[ 102.021862] ======================================================
[ 102.028037] WARNING: possible circular locking dependency detected
[ 102.034940] 6.9.0-rc6+ #64 Not tainted
[ 102.039178] ------------------------------------------------------
[ 102.045236] swapper/1/0 is trying to acquire lock:
[ 102.050140] ffff9a3d809a0e30 (&p->pi_lock){-.-.}-{2:2}, at: get_wchan+0x33/0x80
[ 102.057446]
[ 102.057446] but task is already holding lock:
[ 102.062449] ffff9a3e1a3f79d8 (&rq->__lock){-.-.}-{2:2}, at: _raw_spin_rq_lock_irqsave+0x27/0x50
[ 102.071546]
[ 102.071546] which lock already depends on the new lock.
[ 102.071546]
[ 102.079435]
[ 102.079435] the existing dependency chain (in reverse order) is:
[ 102.085407]
[ 102.085407] -> #1 (&rq->__lock){-.-.}-{2:2}:
[ 102.090805] lock_acquire+0xca/0x2d0
[ 102.095400] _raw_spin_lock_nested+0x32/0x80
[ 102.100616] raw_spin_rq_lock_nested+0x15/0x30
[ 102.104595] task_fork_fair+0x3a/0xe0
[ 102.109176] sched_cgroup_fork+0x11e/0x160
[ 102.113277] copy_process+0x15b5/0x2140
[ 102.117990] kernel_clone+0x9e/0x6d0
[ 102.122376] user_mode_thread+0x5f/0x90
[ 102.127207] rest_init+0x1e/0x1d0
[ 102.131843] start_kernel+0x64a/0x730
[ 102.136213] x86_64_start_reservations+0x18/0x30
[ 102.141857] x86_64_start_kernel+0x91/0xa0
[ 102.147018] common_startup_64+0x13e/0x141
[ 102.152186]
[ 102.152186] -> #0 (&p->pi_lock){-.-.}-{2:2}:
[ 102.158338] check_prevs_add+0x1e2/0xc60
[ 102.162335] __lock_acquire+0x10d4/0x12b0
[ 102.167147] lock_acquire+0xca/0x2d0
[ 102.171738] _raw_spin_lock_irq+0x48/0x90
[ 102.176284] get_wchan+0x33/0x80
[ 102.180519] __update_stats_enqueue_sleeper+0x156/0x390
[ 102.185878] enqueue_entity+0x34c/0x3e0
[ 102.190237] enqueue_task_fair+0xa3/0x350
[ 102.195030] activate_task+0x4c/0xe0
[ 102.199643] ttwu_do_activate+0x5d/0x1f0
[ 102.203992] sched_ttwu_pending+0xe2/0x1a0
[ 102.209097] __flush_smp_call_function_queue+0x1e8/0x6e0
[ 102.214120] __sysvec_call_function_single+0x36/0x150
[ 102.220072] sysvec_call_function_single+0x65/0x80
[ 102.224282] asm_sysvec_call_function_single+0x1a/0x20
[ 102.229298] pv_native_safe_halt+0xf/0x20
[ 102.233969] acpi_safe_halt+0x14/0x20
[ 102.238649] acpi_idle_enter+0x7f/0xd0
[ 102.242926] cpuidle_enter_state+0x93/0x570
[ 102.248198] cpuidle_enter+0x2d/0x40
[ 102.252902] do_idle+0x213/0x270
[ 102.257070] cpu_startup_entry+0x29/0x30
[ 102.262396] start_secondary+0x123/0x140
[ 102.267423] common_startup_64+0x13e/0x141
[ 102.272129]
[ 102.272129] other info that might help us debug this:
[ 102.272129]
[ 102.280775] Possible unsafe locking scenario:
[ 102.280775]
[ 102.285824] CPU0 CPU1
[ 102.289461] ---- ----
[ 102.294154] lock(&rq->__lock);
[ 102.297868] lock(&p->pi_lock);
[ 102.303641] lock(&rq->__lock);
[ 102.310000] lock(&p->pi_lock);
[ 102.311916]
[ 102.311916] *** DEADLOCK ***
[ 102.311916]
[ 102.318184] 1 lock held by swapper/1/0:
[ 102.321543] #0: ffff9a3e1a3f79d8 (&rq->__lock){-.-.}-{2:2}, at: _raw_spin_rq_lock_irqsave+0x27/0x50
[ 102.331290]
[ 102.331290] stack backtrace:
[ 102.336159] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.9.0-rc6+ #64
[ 102.342814] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 102.354177] Call Trace:
[ 102.356981] <IRQ>
[ 102.359373] dump_stack_lvl+0x77/0xb0
[ 102.363183] check_noncircular+0x131/0x150
[ 102.367844] check_prevs_add+0x1e2/0xc60
[ 102.372341] __lock_acquire+0x10d4/0x12b0
[ 102.376757] lock_acquire+0xca/0x2d0
[ 102.380836] ? get_wchan+0x33/0x80
[ 102.383514] ? lock_is_held_type+0xac/0x120
[ 102.387749] _raw_spin_lock_irq+0x48/0x90
[ 102.391997] ? get_wchan+0x33/0x80
[ 102.395911] get_wchan+0x33/0x80
[ 102.399713] __update_stats_enqueue_sleeper+0x156/0x390
[ 102.405394] enqueue_entity+0x34c/0x3e0
[ 102.409721] enqueue_task_fair+0xa3/0x350
[ 102.413522] activate_task+0x4c/0xe0
[ 102.417455] ttwu_do_activate+0x5d/0x1f0
[ 102.421393] sched_ttwu_pending+0xe2/0x1a0
[ 102.424410] __flush_smp_call_function_queue+0x1e8/0x6e0
[ 102.430129] __sysvec_call_function_single+0x36/0x150
[ 102.435543] sysvec_call_function_single+0x65/0x80
[ 102.441465] </IRQ>
[ 102.443080] <TASK>
[ 102.445524] asm_sysvec_call_function_single+0x1a/0x20
[ 102.451011] RIP: 0010:pv_native_safe_halt+0xf/0x20
[ 102.456052] Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 55 af 26 00 fb f4 <c3> cc cc cc cc 66 90 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
[ 102.474565] RSP: 0018:ffffbba0000dfe60 EFLAGS: 00000246
[ 102.481811] RAX: 0000000000004000 RBX: 0000000000000001 RCX: ffff9a3e1a200000
[ 102.488391] RDX: 00000000001f37e8 RSI: ffffffff88961bc0 RDI: ffff9a3e39030464
[ 102.496168] RBP: ffff9a3e39030464 R08: ffff9a3e39030400 R09: 0000000000000001
[ 102.503844] R10: 0000000000000001 R11: ffffffff888d2648 R12: ffff9a3e238ac000
[ 102.510614] R13: ffffffff88961bc0 R14: 0000000000000001 R15: 0000000000000000
[ 102.517439] acpi_safe_halt+0x14/0x20
[ 102.522052] acpi_idle_enter+0x7f/0xd0
[ 102.525629] cpuidle_enter_state+0x93/0x570
[ 102.530429] cpuidle_enter+0x2d/0x40
[ 102.534386] do_idle+0x213/0x270
[ 102.536860] cpu_startup_entry+0x29/0x30
[ 102.540286] start_secondary+0x123/0x140
[ 102.542516] common_startup_64+0x13e/0x141
[ 102.545389] </TASK>



[ 93.099702] kernel profiling enabled schedstats, disable via kernel.sched_schedstats.
[ 93.110562] kernel sleep profiling enabled (shift: 0)
[ 93.118228]
[ 93.121032] ============================================
[ 93.127279] WARNING: possible recursive locking detected
[ 93.133167] 6.9.0-rc6+ #65 Not tainted
[ 93.136486] --------------------------------------------
[ 93.142349] swapper/8/0 is trying to acquire lock:
[ 93.147140] ffff9227c0998e30 (&p->pi_lock){-.-.}-{2:2}, at: get_wchan+0x33/0x80
[ 93.154036]
[ 93.154036] but task is already holding lock:
[ 93.160368] ffff9227c0998e30 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0x56/0x580
[ 93.166995]
[ 93.166995] other info that might help us debug this:
[ 93.173123] Possible unsafe locking scenario:
[ 93.173123]
[ 93.179771] CPU0
[ 93.181941] ----
[ 93.184727] lock(&p->pi_lock);
[ 93.188471] lock(&p->pi_lock);
[ 93.192588]
[ 93.192588] *** DEADLOCK ***
[ 93.192588]
[ 93.198248] May be due to missing lock nesting notation
[ 93.198248]
[ 93.205640] 3 locks held by swapper/8/0:
[ 93.210208] #0: ffffb17f40674e98 ((&timer.timer)){+.-.}-{0:0}, at: call_timer_fn+0x80/0x260
[ 93.216945] #1: ffff9227c0998e30 (&p->pi_lock){-.-.}-{2:2}, at: try_to_wake_up+0x56/0x580
[ 93.225435] #2: ffff92285bff79d8 (&rq->__lock){-.-.}-{2:2}, at: try_to_wake_up+0x1ab/0x580
[ 93.234238]
[ 93.234238] stack backtrace:
[ 93.238342] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 6.9.0-rc6+ #65
[ 93.245220] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[ 93.254093] Call Trace:
[ 93.256901] <IRQ>
[ 93.258420] dump_stack_lvl+0x77/0xb0
[ 93.261825] __lock_acquire+0x765/0x12b0
[ 93.289798] lock_acquire+0xca/0x2d0
[ 93.293381] ? get_wchan+0x33/0x80
[ 93.297140] ? lock_is_held_type+0xac/0x120
[ 93.301057] _raw_spin_lock_irq+0x48/0x90
[ 93.307264] ? get_wchan+0x33/0x80
[ 93.310410] get_wchan+0x33/0x80
[ 93.314484] __update_stats_enqueue_sleeper+0x156/0x390
[ 93.318493] enqueue_entity+0x34c/0x3e0
[ 93.322788] enqueue_task_fair+0xa3/0x350
[ 93.326960] activate_task+0x4c/0xe0
[ 93.330226] ttwu_do_activate+0x5d/0x1f0
[ 93.347525] try_to_wake_up+0x1da/0x580
[ 93.352071] ? __pfx_process_timeout+0x10/0x10
[ 93.357023] ? __pfx_process_timeout+0x10/0x10
[ 93.361356] call_timer_fn+0xae/0x260
[ 93.365165] __run_timer_base.part.35+0x26d/0x2c0
[ 93.371324] ? ktime_get+0x6b/0x130
[ 93.375678] ? ktime_get+0x8b/0x130
[ 93.379055] ? native_apic_msr_write+0x30/0x40
[ 93.383911] run_timer_softirq+0x51/0x90
[ 93.397918] __do_softirq+0xb6/0x3ec
[ 93.402236] irq_exit_rcu+0xba/0xe0
[ 93.406044] sysvec_apic_timer_interrupt+0x6a/0x80
[ 93.410164] </IRQ>
[ 93.411967] <TASK>
[ 93.413257] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 93.418952] RIP: 0010:pv_native_safe_halt+0xf/0x20
[ 93.423460] Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 55 af 26 00 fb f4 <c3> cc cc cc cc 66 90 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
[ 93.459197] RSP: 0018:ffffb17f40117e60 EFLAGS: 00000246
[ 93.465006] RAX: 0000000000004000 RBX: 0000000000000001 RCX: ffff92285be00000
[ 93.471498] RDX: 00000000001f37e8 RSI: ffffffffafd61b00 RDI: ffff92286a432c64
[ 93.479505] RBP: ffff92286a432c64 R08: ffff92286a432c00 R09: 0000000000000001
[ 93.503897] R10: 0000000000000001 R11: ffff92285bfe6258 R12: ffff92286646a400
[ 93.512211] R13: ffffffffafd61b00 R14: 0000000000000001 R15: 0000000000000000
[ 93.518061] acpi_safe_halt+0x14/0x20
[ 93.522244] acpi_idle_enter+0x7f/0xd0
[ 93.526335] cpuidle_enter_state+0x93/0x570
[ 93.530482] cpuidle_enter+0x2d/0x40
[ 93.533615] do_idle+0x213/0x270
[ 93.545854] cpu_startup_entry+0x29/0x30
[ 93.549904] start_secondary+0x123/0x140
[ 93.553957] common_startup_64+0x13e/0x141
[ 93.558172] </TASK>