Mailing List Archive

[PATCH v3 09/19] x86/resctrl: Queue mon_event_read() instead of sending an IPI
x86 is blessed with an abundance of monitors, one per RMID, that can be
read from any CPU in the domain. MPAM's monitors reside in the MMIO MSC,
and the number implemented is up to the manufacturer. This means that when
there are fewer monitors than needed, they need to be allocated and freed.

Worse, the domain may be broken up into slices, and the MMIO accesses
for each slice may need performing from different CPUs.

These two details mean MPAM's monitor code needs to be able to sleep, and
to IPI another CPU in the domain to read from a resource that has been sliced.

mon_event_read() already invokes mon_event_count() via IPI, which means
this isn't possible. On systems using nohz-full, some CPUs need to be
interrupted to run kernel work as they otherwise stay in user-space
running realtime workloads. Interrupting these CPUs should be avoided,
and scheduling work on them may never complete.

Change mon_event_read() to pick a housekeeping CPU (one that is not using
nohz_full), schedule mon_event_count() on it, and wait. If all the CPUs
in a domain are using nohz-full, then an IPI is used as the fallback.

This function is only used in response to a user-space filesystem request
(not the timing sensitive overflow code).

This allows MPAM to hide the slice behaviour from resctrl, and to keep
the monitor-allocation in monitor.c. When the IPI fallback is used on
machines where MPAM needs to make an access on multiple CPUs, the counter
read will always fail.

Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
Changes since v2:
* Use cpumask_any_housekeeping() and fallback to an IPI if needed
---
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 19 +++++++++++++++++--
arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 6 ++++--
3 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index eb07d4435391..b06e86839d00 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -19,6 +19,7 @@
#include <linux/kernfs.h>
#include <linux/seq_file.h>
#include <linux/slab.h>
+#include <linux/tick.h>
#include "internal.h"

/*
@@ -527,8 +528,13 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_domain *d, struct rdtgroup *rdtgrp,
int evtid, int first)
{
+ int cpu;
+
+ /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
+ lockdep_assert_held(&rdtgroup_mutex);
+
/*
- * setup the parameters to send to the IPI to read the data.
+ * setup the parameters to pass to mon_event_count() to read the data.
*/
rr->rgrp = rdtgrp;
rr->evtid = evtid;
@@ -537,7 +543,16 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
rr->val = 0;
rr->first = first;

- smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+ cpu = get_cpu();
+ if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
+ mon_event_count(rr);
+ put_cpu();
+ } else {
+ put_cpu();
+
+ cpu = cpumask_any_housekeeping(&d->cpu_mask);
+ smp_call_on_cpu(cpu, mon_event_count, rr, false);
+ }
}

int rdtgroup_mondata_show(struct seq_file *m, void *arg)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index 0b5fd5a0cda2..a07557390895 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -563,7 +563,7 @@ int alloc_rmid(u32 closid);
void free_rmid(u32 closid, u32 rmid);
int rdt_get_mon_l3_config(struct rdt_resource *r);
bool __init rdt_cpu_has(int flag);
-void mon_event_count(void *info);
+int mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_domain *d, struct rdtgroup *rdtgrp,
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 3bec5c59ca0e..5e9e876c3409 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -550,10 +550,10 @@ static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
}

/*
- * This is called via IPI to read the CQM/MBM counters
+ * This is scheduled by mon_event_read() to read the CQM/MBM counters
* on a domain.
*/
-void mon_event_count(void *info)
+int mon_event_count(void *info)
{
struct rdtgroup *rdtgrp, *entry;
struct rmid_read *rr = info;
@@ -586,6 +586,8 @@ void mon_event_count(void *info)
*/
if (ret == 0)
rr->err = 0;
+
+ return 0;
}

/*
--
2.39.2
Re: [PATCH v3 09/19] x86/resctrl: Queue mon_event_read() instead of sending an IPI
Hi James,

On Mon, Mar 20, 2023 at 6:27 PM James Morse <james.morse@arm.com> wrote:
>
> x86 is blessed with an abundance of monitors, one per RMID, that can be

As I explained earlier, this is not the case on AMD.

> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
> the number implemented is up to the manufacturer. This means when there are
> fewer monitors than needed, they need to be allocated and freed.
>
> Worse, the domain may be broken up into slices, and the MMIO accesses
> for each slice may need performing from different CPUs.
>
> These two details mean MPAMs monitor code needs to be able to sleep, and
> IPI another CPU in the domain to read from a resource that has been sliced.

This doesn't sound very convincing. Could mon_event_read() IPI all the
CPUs in the domain? (after waiting to allocate and install monitors
when necessary?)


>
> mon_event_read() already invokes mon_event_count() via IPI, which means
> this isn't possible. On systems using nohz-full, some CPUs need to be
> interrupted to run kernel work as they otherwise stay in user-space
> running realtime workloads. Interrupting these CPUs should be avoided,
> and scheduling work on them may never complete.
>
> Change mon_event_read() to pick a housekeeping CPU, (one that is not using
> nohz_full) and schedule mon_event_count() and wait. If all the CPUs
> in a domain are using nohz-full, then an IPI is used as the fallback.
>
> This function is only used in response to a user-space filesystem request
> (not the timing sensitive overflow code).
>
> This allows MPAM to hide the slice behaviour from resctrl, and to keep
> the monitor-allocation in monitor.c.

This goal sounds more likely.

If it makes the initial enablement smoother, then I'm all for it.

Reviewed-by: Peter Newman <peternewman@google.com>

These changes worked fine for me on tip/master, though there were merge
conflicts to resolve.

Tested-by: Peter Newman <peternewman@google.com>

Thanks!

-Peter
Re: [PATCH v3 09/19] x86/resctrl: Queue mon_event_read() instead of sending an IPI
On Wed, Mar 22, 2023 at 3:07 PM Peter Newman <peternewman@google.com> wrote:
> On Mon, Mar 20, 2023 at 6:27?PM James Morse <james.morse@arm.com> wrote:
> >
> > x86 is blessed with an abundance of monitors, one per RMID, that can be
>
> As I explained earlier, this is not the case on AMD.
>
> > read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
> > the number implemented is up to the manufacturer. This means when there are
> > fewer monitors than needed, they need to be allocated and freed.
> >
> > Worse, the domain may be broken up into slices, and the MMIO accesses
> > for each slice may need performing from different CPUs.
> >
> > These two details mean MPAMs monitor code needs to be able to sleep, and
> > IPI another CPU in the domain to read from a resource that has been sliced.
>
> This doesn't sound very convincing. Could mon_event_read() IPI all the
> CPUs in the domain? (after waiting to allocate and install monitors
> when necessary?)

No wait, I know that isn't correct.

As you explained it, the remote CPU needs to sleep because it may need
to atomically acquire, install, and read a CSU monitor.

It still seems possible for the mon_event_read() thread to do all the
waiting (tell the remote CPU to program the CSU monitor, wait, then tell
the same remote CPU to read the monitor), but that sounds like more work
than I see much benefit in doing today.

Can you update the changelog to just say the remote CPU needs to block
when installing a CSU monitor?

Thanks!
-Peter
Re: [PATCH v3 09/19] x86/resctrl: Queue mon_event_read() instead of sending an IPI
Hi James,

On 3/20/2023 10:26 AM, James Morse wrote:
> x86 is blessed with an abundance of monitors, one per RMID, that can be
> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
> the number implemented is up to the manufacturer. This means when there are
> fewer monitors than needed, they need to be allocated and freed.
>
> Worse, the domain may be broken up into slices, and the MMIO accesses
> for each slice may need performing from different CPUs.
>
> These two details mean MPAMs monitor code needs to be able to sleep, and
> IPI another CPU in the domain to read from a resource that has been sliced.
>
> mon_event_read() already invokes mon_event_count() via IPI, which means
> this isn't possible. On systems using nohz-full, some CPUs need to be
> interrupted to run kernel work as they otherwise stay in user-space
> running realtime workloads. Interrupting these CPUs should be avoided,
> and scheduling work on them may never complete.
>
> Change mon_event_read() to pick a housekeeping CPU, (one that is not using
> nohz_full) and schedule mon_event_count() and wait. If all the CPUs
> in a domain are using nohz-full, then an IPI is used as the fallback.

It is not clear to me where in this solution an IPI is used as fallback ...
(see below)

> + int cpu;
> +
> + /* When picking a CPU from cpu_mask, ensure it can't race with cpuhp */
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> /*
> - * setup the parameters to send to the IPI to read the data.
> + * setup the parameters to pass to mon_event_count() to read the data.
> */
> rr->rgrp = rdtgrp;
> rr->evtid = evtid;
> @@ -537,7 +543,16 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> rr->val = 0;
> rr->first = first;
>
> - smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
> + cpu = get_cpu();
> + if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
> + mon_event_count(rr);
> + put_cpu();
> + } else {
> + put_cpu();
> +
> + cpu = cpumask_any_housekeeping(&d->cpu_mask);
> + smp_call_on_cpu(cpu, mon_event_count, rr, false);
> + }
> }
>

... from what I can tell there is no IPI fallback here. As per previous
patch I understand cpumask_any_housekeeping() could still return
a nohz_full CPU and calling smp_call_on_cpu() on it would not send
an IPI but instead queue the work to it. What did I miss?

Reinette
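Reinette's reading of the v2 helper can be illustrated with a compilable sketch (hypothetical names and toy 8-bit masks, not the kernel implementation): when every CPU in the domain is in the nohz_full mask, the helper has nothing better to return than a nohz_full CPU, so smp_call_on_cpu() would queue work to it rather than send an IPI.

```c
#include <stdint.h>

/* Toy 8-CPU masks; in the kernel these are struct cpumask. */
typedef uint8_t cpumask_t;

#define NR_CPUS 8

/* CPUs the system has set aside for nohz_full (illustrative). */
static cpumask_t tick_nohz_full_mask;

static int cpumask_first(cpumask_t mask)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (mask & (1u << cpu))
			return cpu;
	return NR_CPUS;	/* empty mask: >= nr_cpu_ids in the kernel */
}

/*
 * Prefer a CPU that is not nohz_full, but if every CPU in the mask is
 * nohz_full, fall back to returning one of them anyway -- which is why
 * the caller still needs an IPI fallback for that case.
 */
static int cpumask_any_housekeeping(cpumask_t mask)
{
	cpumask_t housekeeping = mask & ~tick_nohz_full_mask;

	if (housekeeping)
		return cpumask_first(housekeeping);
	return cpumask_first(mask);
}
```

A domain containing a housekeeping CPU gets that CPU back; a domain made up entirely of nohz_full CPUs gets one of them, which is exactly the case the caller has to detect and handle.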
Re: [PATCH v3 09/19] x86/resctrl: Queue mon_event_read() instead of sending an IPI
Hi Peter,

On 23/03/2023 09:09, Peter Newman wrote:
On Wed, Mar 22, 2023 at 3:07 PM Peter Newman <peternewman@google.com> wrote:
>> On Mon, Mar 20, 2023 at 6:27 PM James Morse <james.morse@arm.com> wrote:
>>>
>>> x86 is blessed with an abundance of monitors, one per RMID, that can be
>>
>> As I explained earlier, this is not the case on AMD.
>>
>>> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
>>> the number implemented is up to the manufacturer. This means when there are
>>> fewer monitors than needed, they need to be allocated and freed.
>>>
>>> Worse, the domain may be broken up into slices, and the MMIO accesses
>>> for each slice may need performing from different CPUs.
>>>
>>> These two details mean MPAMs monitor code needs to be able to sleep, and
>>> IPI another CPU in the domain to read from a resource that has been sliced.
>>
>> This doesn't sound very convincing. Could mon_event_read() IPI all the
>> CPUs in the domain? (after waiting to allocate and install monitors
>> when necessary?)
>
> No wait, I know that isn't correct.
>
> As you explained it, the remote CPU needs to sleep because it may need
> to atomically acquire, install, and read a CSU monitor.
>
> It still seems possible for the mon_event_read() thread to do all the
> waiting (tell remote CPU to program CSU monitor, wait, tell same remote
> CPU to read monitor), but that sounds like more work that I don't see a
> lot of benefit to doing today.
>
> Can you update the changelog to just say the remote CPU needs to block
> when installing a CSU monitor?

Sure, I've added this after the first paragraph:
-------%<-------
MPAM's CSU monitors are used to back the 'llc_occupancy' monitor file. The
CSU counter is allowed to return 'not ready' for a small number of
micro-seconds after programming. To allow one CSU hardware monitor to be
used for multiple control or monitor groups, the CPU accessing the
monitor needs to be able to block when configuring and reading the
counter.
-------%<-------


Thanks,

James
Re: [PATCH v3 09/19] x86/resctrl: Queue mon_event_read() instead of sending an IPI
Hi Peter,

On 22/03/2023 14:07, Peter Newman wrote:
On Mon, Mar 20, 2023 at 6:27 PM James Morse <james.morse@arm.com> wrote:
>>
>> x86 is blessed with an abundance of monitors, one per RMID, that can be
>
> As I explained earlier, this is not the case on AMD.

I'll change it so say Intel.


>> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
>> the number implemented is up to the manufacturer. This means when there are
>> fewer monitors than needed, they need to be allocated and freed.
>>
>> Worse, the domain may be broken up into slices, and the MMIO accesses
>> for each slice may need performing from different CPUs.
>>
>> These two details mean MPAMs monitor code needs to be able to sleep, and
>> IPI another CPU in the domain to read from a resource that has been sliced.
>
> This doesn't sound very convincing. Could mon_event_read() IPI all the
> CPUs in the domain? (after waiting to allocate and install monitors
> when necessary?)

On the majority of platforms this would be a waste of time as the IPI only needs sending
to one. I'd like to keep the cost of being strange limited to the strange platforms.

I don't think exposing a 'sub domain' cpumask to resctrl is helpful: this needs to be
hidden in the architecture specific code.

The IPI is because of SoC components being implemented as slices which are private to that
slice.


The sleeping is because the CSU counters are allowed to be 'not ready' immediately after
programming. The time is short, and to allow platforms that have too few CSU monitors to
support the same user-interface as x86^W Intel, the MPAM driver needs to be able to
multiplex a single CSU monitor between multiple control/monitor groups. Allowing it to
sleep for the advertised not-ready period is the simplest way of doing this.
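The multiplexing James describes can be sketched in compilable userspace C. The names (csu_program(), csu_read(), CSU_NOT_READY) and the simulated warm-up behaviour are illustrative assumptions, not the MPAM driver's API; the point is only the shape of the program-sleep-read loop:

```c
#include <stdint.h>
#include <unistd.h>

#define CSU_NOT_READY	UINT64_MAX	/* illustrative 'not ready' marker */

/* Simulated hardware: after (re)programming, the counter reports
 * 'not ready' for a couple of polls before returning a value. */
static int csu_warmup_polls;

static void csu_program(int group)
{
	(void)group;
	csu_warmup_polls = 2;	/* counter needs time after programming */
}

static uint64_t csu_read(void)
{
	if (csu_warmup_polls > 0) {
		csu_warmup_polls--;
		return CSU_NOT_READY;
	}
	return 4096;		/* pretend llc_occupancy value */
}

/*
 * Multiplex a single CSU monitor between control/monitor groups:
 * program it for the requested group, then sleep until the hardware
 * stops reporting 'not ready'. The sleep is why this must run in a
 * context that may block, not from an IPI handler.
 */
static uint64_t csu_read_for_group(int group)
{
	uint64_t val;

	csu_program(group);
	while ((val = csu_read()) == CSU_NOT_READY)
		usleep(10);	/* advertised not-ready period is short */
	return val;
}
```

Because the loop sleeps, it only works from a context that may block, which is what moving mon_event_read() from an IPI to smp_call_on_cpu() provides.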


>> mon_event_read() already invokes mon_event_count() via IPI, which means
>> this isn't possible. On systems using nohz-full, some CPUs need to be
>> interrupted to run kernel work as they otherwise stay in user-space
>> running realtime workloads. Interrupting these CPUs should be avoided,
>> and scheduling work on them may never complete.
>>
>> Change mon_event_read() to pick a housekeeping CPU, (one that is not using
>> nohz_full) and schedule mon_event_count() and wait. If all the CPUs
>> in a domain are using nohz-full, then an IPI is used as the fallback.
>>
>> This function is only used in response to a user-space filesystem request
>> (not the timing sensitive overflow code).
>>
>> This allows MPAM to hide the slice behaviour from resctrl, and to keep
>> the monitor-allocation in monitor.c.
>
> This goal sounds more likely.
>
> If it makes the initial enablement smoother, then I'm all for it.

> Reviewed-by: Peter Newman <peternewman@google.com>
>
> These changes worked fine for me on tip/master, though there were merge
> conflicts to resolve.
>
> Tested-by: Peter Newman <peternewman@google.com>

Thanks!


James
Re: [PATCH v3 09/19] x86/resctrl: Queue mon_event_read() instead of sending an IPI
Hi Reinette,

On 01/04/2023 00:25, Reinette Chatre wrote:
> On 3/20/2023 10:26 AM, James Morse wrote:
>> x86 is blessed with an abundance of monitors, one per RMID, that can be
>> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
>> the number implemented is up to the manufacturer. This means when there are
>> fewer monitors than needed, they need to be allocated and freed.
>>
>> Worse, the domain may be broken up into slices, and the MMIO accesses
>> for each slice may need performing from different CPUs.
>>
>> These two details mean MPAMs monitor code needs to be able to sleep, and
>> IPI another CPU in the domain to read from a resource that has been sliced.
>>
>> mon_event_read() already invokes mon_event_count() via IPI, which means
>> this isn't possible. On systems using nohz-full, some CPUs need to be
>> interrupted to run kernel work as they otherwise stay in user-space
>> running realtime workloads. Interrupting these CPUs should be avoided,
>> and scheduling work on them may never complete.
>>
>> Change mon_event_read() to pick a housekeeping CPU, (one that is not using
>> nohz_full) and schedule mon_event_count() and wait. If all the CPUs
>> in a domain are using nohz-full, then an IPI is used as the fallback.
>
> It is not clear to me where in this solution an IPI is used as fallback ...
> (see below)

>> @@ -537,7 +543,16 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>> rr->val = 0;
>> rr->first = first;
>>
>> - smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
>> + cpu = get_cpu();
>> + if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
>> + mon_event_count(rr);
>> + put_cpu();
>> + } else {
>> + put_cpu();
>> +
>> + cpu = cpumask_any_housekeeping(&d->cpu_mask);
>> + smp_call_on_cpu(cpu, mon_event_count, rr, false);
>> + }
>> }
>>
>
> ... from what I can tell there is no IPI fallback here. As per previous
> patch I understand cpumask_any_housekeeping() could still return
> a nohz_full CPU and calling smp_call_on_cpu() on it would not send
> an IPI but instead queue the work to it. What did I miss?

Huh, looks like it's still in my git-stash. Sorry about that. The combined hunk looks like
this:
----------------------%<----------------------
@@ -537,7 +550,26 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
rr->val = 0;
rr->first = first;

- smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+ cpu = get_cpu();
+ if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
+ mon_event_count(rr);
+ put_cpu();
+ } else {
+ put_cpu();
+
+ cpu = cpumask_any_housekeeping(&d->cpu_mask);
+
+ /*
+ * cpumask_any_housekeeping() prefers housekeeping CPUs, but
+ * are all the CPUs nohz_full? If yes, pick a CPU to IPI.
+ * MPAM's resctrl_arch_rmid_read() is unable to read the
+ * counters on some platforms if its called in irq context.
+ */
+ if (tick_nohz_full_cpu(cpu))
+ smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+ else
+ smp_call_on_cpu(cpu, smp_mon_event_count, rr, false);
+ }
}

----------------------%<----------------------

Where smp_mon_event_count() is a static wrapper to make the types work.
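The wrapper presumably looks something like the sketch below: smp_call_on_cpu() takes an int (*)(void *) that can report failure, while the smp_call_function_any() IPI path needs mon_event_count() to keep its void (*)(void *) signature, so a trivial adapter bridges the two. The stub body here is only so the sketch compiles outside the kernel:

```c
/* Stub standing in for the kernel's void mon_event_count(void *info),
 * which reads the CQM/MBM counters for a domain. */
static int events_counted;

static void mon_event_count(void *info)
{
	(void)info;
	events_counted++;
}

/*
 * smp_call_on_cpu() wants an int (*)(void *); mon_event_count() keeps
 * the void (*)(void *) signature required by smp_call_function_any().
 * This wrapper adapts the types for the housekeeping-CPU path.
 */
static int smp_mon_event_count(void *arg)
{
	mon_event_count(arg);
	return 0;
}
```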


Thanks,

James