Mailing List Archive

[PATCH] x86/resctrl: Fix mbm_setup_overflow_handler() when last CPU goes offline
Don't bother looking for another CPU to take over MBM overflow duties
when the last CPU in a domain goes offline. Doing so results in this
Oops:

[ 97.166136] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 97.173118] #PF: supervisor read access in kernel mode
[ 97.178263] #PF: error_code(0x0000) - not-present page
[ 97.183410] PGD 0
[ 97.185438] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 97.189805] CPU: 36 PID: 235 Comm: cpuhp/36 Tainted: G T 6.9.0-rc1 #356
[ 97.208322] RIP: 0010:__find_nth_andnot_bit+0x66/0x110

Fixes: 978fcca954cb ("x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but CPU")
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 757d475158a3..4d9987acffd6 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -929,6 +929,10 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms,
unsigned long delay = msecs_to_jiffies(delay_ms);
int cpu;

+ /* Nothing to do if this is the last CPU in a domain going offline */
+ if (!delay_ms && bitmap_weight(cpumask_bits(&dom->cpu_mask), nr_cpu_ids) == 1)
+ return;
+
/*
* When a domain comes online there is no guarantee the filesystem is
* mounted. If not, there is no need to catch counter overflow.
--
2.44.0
Re: [PATCH] x86/resctrl: Fix mbm_setup_overflow_handler() when last CPU goes offline
Hi Tony,

Thank you very much for taking a closer look at this.

On 3/27/2024 11:46 AM, Tony Luck wrote:
> Don't bother looking for another CPU to take over MBM overflow duties
> when the last CPU in a domain goes offline. Doing so results in this
> Oops:
>
> [ 97.166136] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 97.173118] #PF: supervisor read access in kernel mode
> [ 97.178263] #PF: error_code(0x0000) - not-present page
> [ 97.183410] PGD 0
> [ 97.185438] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [ 97.189805] CPU: 36 PID: 235 Comm: cpuhp/36 Tainted: G T 6.9.0-rc1 #356
> [ 97.208322] RIP: 0010:__find_nth_andnot_bit+0x66/0x110
>
> Fixes: 978fcca954cb ("x86/resctrl: Allow overflow/limbo handlers to be scheduled on any-but CPU")
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
> arch/x86/kernel/cpu/resctrl/monitor.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 757d475158a3..4d9987acffd6 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -929,6 +929,10 @@ void mbm_setup_overflow_handler(struct rdt_domain *dom, unsigned long delay_ms,
> unsigned long delay = msecs_to_jiffies(delay_ms);
> int cpu;
>
> + /* Nothing to do if this is the last CPU in a domain going offline */
> + if (!delay_ms && bitmap_weight(cpumask_bits(&dom->cpu_mask), nr_cpu_ids) == 1)
> + return;
> +
> /*
> * When a domain comes online there is no guarantee the filesystem is
> * mounted. If not, there is no need to catch counter overflow.

While this addresses the scenario you tested, I do not think it solves the underlying
problem, and thus I believe there remain other scenarios in which this same Oops can
be encountered.

For example, I think you will encounter the same Oops a few lines later within
cqm_setup_limbo_handler() if the system happens to have some busy RMIDs. Another
example would be if tick_nohz_full_mask contains all but the excluded CPU.
In that scenario a bitmap_weight() test is not sufficient, since it gives no
insight into how many CPUs remain after taking tick_nohz_full_mask into account.

There seem to be two issues here (although I am not familiar with these flows). First,
it seems that tick_nohz_full_mask is not actually allocated unless the user boots
with "nohz_full=". This means that any attempt to access bits within tick_nohz_full_mask
will cause this Oops. Second, if it is allocated, the buried __ffs() call requires
a non-zero argument, and that check is not done.

To me it seems most appropriate to fix this in a central place to ensure all scenarios
are handled, instead of scattering checks.

To that end, what do you think of something like below? It uses a tick_nohz_full_enabled()
check to ensure that tick_nohz_full_mask is actually allocated, while the other changes
aim to avoid calling __ffs() on 0.

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 8f40fb35db78..61337f32830c 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -72,6 +72,7 @@ static inline unsigned int
cpumask_any_housekeeping(const struct cpumask *mask, int exclude_cpu)
{
unsigned int cpu, hk_cpu;
+ cpumask_var_t cpu_remain;

if (exclude_cpu == RESCTRL_PICK_ANY_CPU)
cpu = cpumask_any(mask);
@@ -85,14 +86,26 @@ cpumask_any_housekeeping(const struct cpumask *mask, int exclude_cpu)
if (cpu < nr_cpu_ids && !tick_nohz_full_cpu(cpu))
return cpu;

- /* Try to find a CPU that isn't nohz_full to use in preference */
- hk_cpu = cpumask_nth_andnot(0, mask, tick_nohz_full_mask);
- if (hk_cpu == exclude_cpu)
- hk_cpu = cpumask_nth_andnot(1, mask, tick_nohz_full_mask);
+ /* Do not try to access tick_nohz_full_mask if it has not been allocated. */
+ if (!tick_nohz_full_enabled())
+ return cpu;
+
+ if (!zalloc_cpumask_var(&cpu_remain, GFP_KERNEL))
+ return cpu;

+ if (!cpumask_andnot(cpu_remain, mask, tick_nohz_full_mask)) {
+ free_cpumask_var(cpu_remain);
+ return cpu;
+ }
+
+ cpumask_clear_cpu(exclude_cpu, cpu_remain);
+
+ hk_cpu = cpumask_any(cpu_remain);
if (hk_cpu < nr_cpu_ids)
cpu = hk_cpu;

+ free_cpumask_var(cpu_remain);
+
return cpu;
}


Reinette
RE: [PATCH] x86/resctrl: Fix mbm_setup_overflow_handler() when last CPU goes offline
> There seem to be two issues here (although I am not familiar with these flows). First,
> it seems that tick_nohz_full_mask is not actually allocated unless the user boots
> with "nohz_full=". This means that any attempt to access bits within tick_nohz_full_mask
> will cause this Oops. Second, if it is allocated, the buried __ffs() call requires
> a non-zero argument, and that check is not done.
>
> To me it seems most appropriate to fix this in a central place to ensure all scenarios
> are handled, instead of scattering checks.

Good analysis.

> To that end, what do you think of something like below? It uses a tick_nohz_full_enabled()
> check to ensure that tick_nohz_full_mask is actually allocated, while the other changes
> aim to avoid calling __ffs() on 0.

Looks good.

Tested-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>

-Tony
Re: [PATCH] x86/resctrl: Fix mbm_setup_overflow_handler() when last CPU goes offline
Hi Tony,

On 3/27/2024 4:01 PM, Luck, Tony wrote:
>> There seem to be two issues here (although I am not familiar with these flows). First,
>> it seems that tick_nohz_full_mask is not actually allocated unless the user boots
>> with "nohz_full=". This means that any attempt to access bits within tick_nohz_full_mask
>> will cause this Oops. Second, if it is allocated, the buried __ffs() call requires
>> a non-zero argument, and that check is not done.
>>
>> To me it seems most appropriate to fix this in a central place to ensure all scenarios
>> are handled, instead of scattering checks.
>
> Good analysis.
>
>> To that end, what do you think of something like below? It uses a tick_nohz_full_enabled()
>> check to ensure that tick_nohz_full_mask is actually allocated, while the other changes
>> aim to avoid calling __ffs() on 0.

I studied the flows some more and I no longer believe that there is a risk of __ffs() being
called on 0. Looking at the flows starting with cpumask_nth_andnot(), the value provided to
__ffs() is ensured to be non-zero via either an explicit check or hweight_long().

To confirm this I tested a CONFIG_NO_HZ_FULL kernel by booting with "nohz_full=" set to the
highest numbered CPU in the domain. When all the CPUs are offlined, the behavior for the
"second to last" and "last" CPU is most interesting, and it worked as expected when using
cpumask_nth_andnot().

>
> Looks good.
>
> Tested-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Tony Luck <tony.luck@intel.com>

Considering the motivation above I will submit a change that only adds the
tick_nohz_full_enabled() check. Since that is such a big change from the snippet you
reviewed and tested here, I have dropped your tags from it and hope you can add them
back after considering the information above.

Reinette