Mailing List Archive

[RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock
Currently, when sleeping is not allowed during rstat flushing, we hold
the global rstat lock with interrupts disabled throughout the entire
flush operation. Flushing in an O(# cgroups * # cpus) operation, and
having interrupts disabled throughout is dangerous.

For some contexts, we may not want to sleep, but can be interrupted
(e.g. while holding a spinlock or RCU read lock). As such, do not
disable interrupts throughout rstat flushing, only when holding the
percpu lock. This breaks down the O(# cgroups * # cpus) duration with
interrupts disabled to a series of O(# cgroups) durations.

Furthermore, if a cpu spinning waiting for the global rstat lock, it
doesn't need to spin with interrupts disabled anymore.

If the caller of rstat flushing needs interrupts to be disabled, it's up
to them to decide that, and it should be fine to hold the global rstat
lock with interrupts disabled. There is currently a single context that
may invoke rstat flushing with interrupts disabled, the
mem_cgroup_flush_stats() call in mem_cgroup_usage(), if called from
mem_cgroup_threshold().

To make it safe to hold the global rstat lock with interrupts enabled,
make sure we only flush from in_task() contexts. The side effect of that
we read stale stats in interrupt context, but this should be okay, as
flushing in interrupt context is dangerous anyway as it is an expensive
operation, so reading stale stats is safer.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
kernel/cgroup/rstat.c | 40 +++++++++++++++++++++++++++++-----------
1 file changed, 29 insertions(+), 11 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 831f1f472bb8..af11e28bd055 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -7,6 +7,7 @@
#include <linux/btf.h>
#include <linux/btf_ids.h>

+/* This lock should only be held from task context */
static DEFINE_SPINLOCK(cgroup_rstat_lock);
static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);

@@ -210,14 +211,24 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
/* if @may_sleep, play nice and yield if necessary */
if (may_sleep && (need_resched() ||
spin_needbreak(&cgroup_rstat_lock))) {
- spin_unlock_irq(&cgroup_rstat_lock);
+ spin_unlock(&cgroup_rstat_lock);
if (!cond_resched())
cpu_relax();
- spin_lock_irq(&cgroup_rstat_lock);
+ spin_lock(&cgroup_rstat_lock);
}
}
}

+static bool should_skip_flush(void)
+{
+ /*
+ * We acquire cgroup_rstat_lock without disabling interrupts, so we
+ * should not try to acquire from interrupt contexts to avoid deadlocks.
+ * It can be expensive to flush stats in interrupt context anyway.
+ */
+ return !in_task();
+}
+
/**
* cgroup_rstat_flush - flush stats in @cgrp's subtree
* @cgrp: target cgroup
@@ -229,15 +240,18 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
* This also gets all cgroups in the subtree including @cgrp off the
* ->updated_children lists.
*
- * This function may block.
+ * This function is safe to call from contexts that disable interrupts, but
+ * @may_sleep must be set to false, otherwise the function may block.
*/
__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
{
- might_sleep();
+ if (should_skip_flush())
+ return;

- spin_lock_irq(&cgroup_rstat_lock);
+ might_sleep();
+ spin_lock(&cgroup_rstat_lock);
cgroup_rstat_flush_locked(cgrp, true);
- spin_unlock_irq(&cgroup_rstat_lock);
+ spin_unlock(&cgroup_rstat_lock);
}

/**
@@ -248,11 +262,12 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
*/
void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp)
{
- unsigned long flags;
+ if (should_skip_flush())
+ return;

- spin_lock_irqsave(&cgroup_rstat_lock, flags);
+ spin_lock(&cgroup_rstat_lock);
cgroup_rstat_flush_locked(cgrp, false);
- spin_unlock_irqrestore(&cgroup_rstat_lock, flags);
+ spin_unlock(&cgroup_rstat_lock);
}

/**
@@ -267,8 +282,11 @@ void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp)
void cgroup_rstat_flush_hold(struct cgroup *cgrp)
__acquires(&cgroup_rstat_lock)
{
+ if (should_skip_flush())
+ return;
+
might_sleep();
- spin_lock_irq(&cgroup_rstat_lock);
+ spin_lock(&cgroup_rstat_lock);
cgroup_rstat_flush_locked(cgrp, true);
}

@@ -278,7 +296,7 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
void cgroup_rstat_flush_release(void)
__releases(&cgroup_rstat_lock)
{
- spin_unlock_irq(&cgroup_rstat_lock);
+ spin_unlock(&cgroup_rstat_lock);
}

int cgroup_rstat_init(struct cgroup *cgrp)
--
2.40.0.rc1.284.g88254d51c5-goog
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Wed, Mar 22, 2023 at 9:00?PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Currently, when sleeping is not allowed during rstat flushing, we hold
> the global rstat lock with interrupts disabled throughout the entire
> flush operation. Flushing in an O(# cgroups * # cpus) operation, and
> having interrupts disabled throughout is dangerous.
>
> For some contexts, we may not want to sleep, but can be interrupted
> (e.g. while holding a spinlock or RCU read lock). As such, do not
> disable interrupts throughout rstat flushing, only when holding the
> percpu lock. This breaks down the O(# cgroups * # cpus) duration with
> interrupts disabled to a series of O(# cgroups) durations.
>
> Furthermore, if a cpu spinning waiting for the global rstat lock, it
> doesn't need to spin with interrupts disabled anymore.
>
> If the caller of rstat flushing needs interrupts to be disabled, it's up
> to them to decide that, and it should be fine to hold the global rstat
> lock with interrupts disabled. There is currently a single context that
> may invoke rstat flushing with interrupts disabled, the
> mem_cgroup_flush_stats() call in mem_cgroup_usage(), if called from
> mem_cgroup_threshold().
>
> To make it safe to hold the global rstat lock with interrupts enabled,
> make sure we only flush from in_task() contexts. The side effect of that
> we read stale stats in interrupt context, but this should be okay, as
> flushing in interrupt context is dangerous anyway as it is an expensive
> operation, so reading stale stats is safer.
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Couple of questions:

1. What exactly is cgroup_rstat_lock protecting? Can we just remove it
altogether?
2. Are we really calling rstat flush in irq context?
3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
done for root memcg. Why is mem_cgroup_threshold() interested in root
memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Wed, Mar 22, 2023 at 9:29?PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Wed, Mar 22, 2023 at 9:00?PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > Currently, when sleeping is not allowed during rstat flushing, we hold
> > the global rstat lock with interrupts disabled throughout the entire
> > flush operation. Flushing in an O(# cgroups * # cpus) operation, and
> > having interrupts disabled throughout is dangerous.
> >
> > For some contexts, we may not want to sleep, but can be interrupted
> > (e.g. while holding a spinlock or RCU read lock). As such, do not
> > disable interrupts throughout rstat flushing, only when holding the
> > percpu lock. This breaks down the O(# cgroups * # cpus) duration with
> > interrupts disabled to a series of O(# cgroups) durations.
> >
> > Furthermore, if a cpu spinning waiting for the global rstat lock, it
> > doesn't need to spin with interrupts disabled anymore.
> >
> > If the caller of rstat flushing needs interrupts to be disabled, it's up
> > to them to decide that, and it should be fine to hold the global rstat
> > lock with interrupts disabled. There is currently a single context that
> > may invoke rstat flushing with interrupts disabled, the
> > mem_cgroup_flush_stats() call in mem_cgroup_usage(), if called from
> > mem_cgroup_threshold().
> >
> > To make it safe to hold the global rstat lock with interrupts enabled,
> > make sure we only flush from in_task() contexts. The side effect of that
> > we read stale stats in interrupt context, but this should be okay, as
> > flushing in interrupt context is dangerous anyway as it is an expensive
> > operation, so reading stale stats is safer.
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> Couple of questions:
>
> 1. What exactly is cgroup_rstat_lock protecting? Can we just remove it
> altogether?

I believe it protects the global state variables that we flush into.
For example, for memcg, it protects mem_cgroup->vmstats.

I tried removing the lock and allowing concurrent flushing on
different cpus, by changing mem_cgroup->vmstats to use atomics
instead, but that turned out to be a little expensive. Also,
cgroup_rstat_lock is already contended by different flushers
(mitigated by stats_flush_lock on the memcg side). If we remove it,
concurrent flushers contend on every single percpu lock instead, which
also seems to be expensive.

> 2. Are we really calling rstat flush in irq context?

I think it is possible through the charge/uncharge path:
memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
added the protection against flushing in an interrupt context for
future callers as well, as it may cause a deadlock if we don't disable
interrupts when acquiring cgroup_rstat_lock.

> 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> done for root memcg. Why is mem_cgroup_threshold() interested in root
> memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?

I am not sure, but the code looks like event notifications may be set
up on root memcg, which is why we need to check thresholds.

Even if mem_cgroup_threshold() does not flush memcg stats, the purpose
of this patch is to make sure the rstat flushing code itself is not
disabling interrupts; which it currently does for any unsleepable
context, even if it is interruptible.
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Wed, Mar 22, 2023 at 10:15?PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
> > Couple of questions:
> >
> > 1. What exactly is cgroup_rstat_lock protecting? Can we just remove it
> > altogether?
>
> I believe it protects the global state variables that we flush into.
> For example, for memcg, it protects mem_cgroup->vmstats.
>
> I tried removing the lock and allowing concurrent flushing on
> different cpus, by changing mem_cgroup->vmstats to use atomics
> instead, but that turned out to be a little expensive. Also,
> cgroup_rstat_lock is already contended by different flushers
> (mitigated by stats_flush_lock on the memcg side). If we remove it,
> concurrent flushers contend on every single percpu lock instead, which
> also seems to be expensive.

We should add a comment on what it is protecting. I think block rstat
are fine but memcg and bpf would need this.

>
> > 2. Are we really calling rstat flush in irq context?
>
> I think it is possible through the charge/uncharge path:
> memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> added the protection against flushing in an interrupt context for
> future callers as well, as it may cause a deadlock if we don't disable
> interrupts when acquiring cgroup_rstat_lock.
>
> > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
>
> I am not sure, but the code looks like event notifications may be set
> up on root memcg, which is why we need to check thresholds.

This is something we should deprecate as root memcg's usage is ill defined.

>
> Even if mem_cgroup_threshold() does not flush memcg stats, the purpose
> of this patch is to make sure the rstat flushing code itself is not
> disabling interrupts; which it currently does for any unsleepable
> context, even if it is interruptible.

Basically I am saying we should aim for VM_BUG_ON(!in_task()) in the
flush function rather than adding should_skip_flush() which does not
stop potential new irq flushers.
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Wed, Mar 22, 2023 at 11:33?PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Wed, Mar 22, 2023 at 10:15?PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> [...]
> > > Couple of questions:
> > >
> > > 1. What exactly is cgroup_rstat_lock protecting? Can we just remove it
> > > altogether?
> >
> > I believe it protects the global state variables that we flush into.
> > For example, for memcg, it protects mem_cgroup->vmstats.
> >
> > I tried removing the lock and allowing concurrent flushing on
> > different cpus, by changing mem_cgroup->vmstats to use atomics
> > instead, but that turned out to be a little expensive. Also,
> > cgroup_rstat_lock is already contended by different flushers
> > (mitigated by stats_flush_lock on the memcg side). If we remove it,
> > concurrent flushers contend on every single percpu lock instead, which
> > also seems to be expensive.
>
> We should add a comment on what it is protecting. I think block rstat
> are fine but memcg and bpf would need this.

I think it also protects the cpu base stats flushed by cgroup_base_stat_flush().

I will add a comment in the next version.

>
> >
> > > 2. Are we really calling rstat flush in irq context?
> >
> > I think it is possible through the charge/uncharge path:
> > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > added the protection against flushing in an interrupt context for
> > future callers as well, as it may cause a deadlock if we don't disable
> > interrupts when acquiring cgroup_rstat_lock.
> >
> > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> >
> > I am not sure, but the code looks like event notifications may be set
> > up on root memcg, which is why we need to check thresholds.
>
> This is something we should deprecate as root memcg's usage is ill defined.

Right, but I think this would be orthogonal to this patch series.

>
> >
> > Even if mem_cgroup_threshold() does not flush memcg stats, the purpose
> > of this patch is to make sure the rstat flushing code itself is not
> > disabling interrupts; which it currently does for any unsleepable
> > context, even if it is interruptible.
>
> Basically I am saying we should aim for VM_BUG_ON(!in_task()) in the
> flush function rather than adding should_skip_flush() which does not
> stop potential new irq flushers.

I wanted to start with VM_BUG_ON(!in_task()) but I wasn't sure that
all contexts that call rstat flushing are not in irq contexts. I added
should_skip_flush() so that if there are existing flushers in irq
context, or new flushers are added, we are protected against a
deadlock.

We can change should_skip_flush() to have a WARN_ON_ONCE(!in_task())
to alert in this case. If you prefer removing should_skip_flush() and
just adding VM_BUG_ON(!in_task()) we can do that, but personally I was
not confident enough that we have no code paths today that may attempt
flushing from irq context.
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Thu, Mar 23, 2023 at 6:36?AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
> > >
> > > > 2. Are we really calling rstat flush in irq context?
> > >
> > > I think it is possible through the charge/uncharge path:
> > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > added the protection against flushing in an interrupt context for
> > > future callers as well, as it may cause a deadlock if we don't disable
> > > interrupts when acquiring cgroup_rstat_lock.
> > >
> > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > >
> > > I am not sure, but the code looks like event notifications may be set
> > > up on root memcg, which is why we need to check thresholds.
> >
> > This is something we should deprecate as root memcg's usage is ill defined.
>
> Right, but I think this would be orthogonal to this patch series.
>

I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
without either breaking a link between mem_cgroup_threshold and
cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
irqs.

So, this patch can not be applied before either of those two tasks are
done (and we may find more such scenarios).
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Thu, Mar 23, 2023 at 8:40?AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Mar 23, 2023 at 6:36?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> [...]
> > > >
> > > > > 2. Are we really calling rstat flush in irq context?
> > > >
> > > > I think it is possible through the charge/uncharge path:
> > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > added the protection against flushing in an interrupt context for
> > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > interrupts when acquiring cgroup_rstat_lock.
> > > >
> > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > >
> > > > I am not sure, but the code looks like event notifications may be set
> > > > up on root memcg, which is why we need to check thresholds.
> > >
> > > This is something we should deprecate as root memcg's usage is ill defined.
> >
> > Right, but I think this would be orthogonal to this patch series.
> >
>
> I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> without either breaking a link between mem_cgroup_threshold and
> cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> irqs.
>
> So, this patch can not be applied before either of those two tasks are
> done (and we may find more such scenarios).


Could you elaborate why?

My understanding is that with an in_task() check to make sure we only
acquire cgroup_rstat_lock from non-irq context it should be fine to
acquire cgroup_rstat_lock without disabling interrupts.
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Thu, Mar 23, 2023 at 8:43?AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Mar 23, 2023 at 8:40?AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Thu, Mar 23, 2023 at 6:36?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > [...]
> > > > >
> > > > > > 2. Are we really calling rstat flush in irq context?
> > > > >
> > > > > I think it is possible through the charge/uncharge path:
> > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > added the protection against flushing in an interrupt context for
> > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > >
> > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > >
> > > > > I am not sure, but the code looks like event notifications may be set
> > > > > up on root memcg, which is why we need to check thresholds.
> > > >
> > > > This is something we should deprecate as root memcg's usage is ill defined.
> > >
> > > Right, but I think this would be orthogonal to this patch series.
> > >
> >
> > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > without either breaking a link between mem_cgroup_threshold and
> > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > irqs.
> >
> > So, this patch can not be applied before either of those two tasks are
> > done (and we may find more such scenarios).
>
>
> Could you elaborate why?
>
> My understanding is that with an in_task() check to make sure we only
> acquire cgroup_rstat_lock from non-irq context it should be fine to
> acquire cgroup_rstat_lock without disabling interrupts.

From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
with irq disabled while other code paths will take cgroup_rstat_lock
with irq enabled. This is a potential deadlock hazard unless
cgroup_rstat_lock is always taken with irq disabled.
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Thu, Mar 23, 2023 at 8:46?AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Mar 23, 2023 at 8:43?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Mar 23, 2023 at 8:40?AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 6:36?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > [...]
> > > > > >
> > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > >
> > > > > > I think it is possible through the charge/uncharge path:
> > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > added the protection against flushing in an interrupt context for
> > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > >
> > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > >
> > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > up on root memcg, which is why we need to check thresholds.
> > > > >
> > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > >
> > > > Right, but I think this would be orthogonal to this patch series.
> > > >
> > >
> > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > without either breaking a link between mem_cgroup_threshold and
> > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > irqs.
> > >
> > > So, this patch can not be applied before either of those two tasks are
> > > done (and we may find more such scenarios).
> >
> >
> > Could you elaborate why?
> >
> > My understanding is that with an in_task() check to make sure we only
> > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > acquire cgroup_rstat_lock without disabling interrupts.
>
> From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> with irq disabled while other code paths will take cgroup_rstat_lock
> with irq enabled. This is a potential deadlock hazard unless
> cgroup_rstat_lock is always taken with irq disabled.

Oh you are making sure it is not taken in the irq context through
should_skip_flush(). Hmm seems like a hack. Normally it is recommended
to actually remove all such users instead of silently
ignoring/bypassing the functionality.

So, how about removing mem_cgroup_flush_stats() from
mem_cgroup_usage(). It will break the known chain which is taking
cgroup_rstat_lock with irq disabled and you can add
WARN_ON_ONCE(!in_task()).
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Thu, Mar 23, 2023 at 9:10?AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Mar 23, 2023 at 8:46?AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Thu, Mar 23, 2023 at 8:43?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 8:40?AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 6:36?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > [...]
> > > > > > >
> > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > >
> > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > >
> > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > >
> > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > >
> > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > >
> > > > > Right, but I think this would be orthogonal to this patch series.
> > > > >
> > > >
> > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > without either breaking a link between mem_cgroup_threshold and
> > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > irqs.
> > > >
> > > > So, this patch can not be applied before either of those two tasks are
> > > > done (and we may find more such scenarios).
> > >
> > >
> > > Could you elaborate why?
> > >
> > > My understanding is that with an in_task() check to make sure we only
> > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > acquire cgroup_rstat_lock without disabling interrupts.
> >
> > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > with irq disabled while other code paths will take cgroup_rstat_lock
> > with irq enabled. This is a potential deadlock hazard unless
> > cgroup_rstat_lock is always taken with irq disabled.
>
> Oh you are making sure it is not taken in the irq context through
> should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> to actually remove all such users instead of silently
> ignoring/bypassing the functionality.

It is a workaround, we simply accept to read stale stats in irq
context instead of the expensive flush operation.

>
> So, how about removing mem_cgroup_flush_stats() from
> mem_cgroup_usage(). It will break the known chain which is taking
> cgroup_rstat_lock with irq disabled and you can add
> WARN_ON_ONCE(!in_task()).

This changes the behavior in a more obvious way because:
1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
path is also exercised in a lot of paths outside irq context, this
will change the behavior for any event thresholds on the root memcg.
With proposed skipped flushing in irq context we only change the
behavior in a small subset of cases.

I think we can skip flushing in irq context for now, and separately
deprecate threshold events for the root memcg. When that is done we
can come back and remove should_skip_flush() and add a VM_BUG_ON or
WARN_ON_ONCE instead. WDYT?

2. mem_cgroup_usage() is also used when reading usage from userspace.
This should be an easy workaround though.
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Thu, Mar 23, 2023 at 9:18?AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Mar 23, 2023 at 9:10?AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Thu, Mar 23, 2023 at 8:46?AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 8:43?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 8:40?AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 6:36?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > [...]
> > > > > > > >
> > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > >
> > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > >
> > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > >
> > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > >
> > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > >
> > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > >
> > > > >
> > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > irqs.
> > > > >
> > > > > So, this patch can not be applied before either of those two tasks are
> > > > > done (and we may find more such scenarios).
> > > >
> > > >
> > > > Could you elaborate why?
> > > >
> > > > My understanding is that with an in_task() check to make sure we only
> > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > acquire cgroup_rstat_lock without disabling interrupts.
> > >
> > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > with irq enabled. This is a potential deadlock hazard unless
> > > cgroup_rstat_lock is always taken with irq disabled.
> >
> > Oh you are making sure it is not taken in the irq context through
> > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > to actually remove all such users instead of silently
> > ignoring/bypassing the functionality.
>
> It is a workaround, we simply accept to read stale stats in irq
> context instead of the expensive flush operation.
>
> >
> > So, how about removing mem_cgroup_flush_stats() from
> > mem_cgroup_usage(). It will break the known chain which is taking
> > cgroup_rstat_lock with irq disabled and you can add
> > WARN_ON_ONCE(!in_task()).
>
> This changes the behavior in a more obvious way because:
> 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> path is also exercised in a lot of paths outside irq context, this
> will change the behavior for any event thresholds on the root memcg.
> With proposed skipped flushing in irq context we only change the
> behavior in a small subset of cases.
>
> I think we can skip flushing in irq context for now, and separately
> deprecate threshold events for the root memcg. When that is done we
> can come back and remove should_skip_flush() and add a VM_BUG_ON or
> WARN_ON_ONCE instead. WDYT?
>
> 2. mem_cgroup_usage() is also used when reading usage from userspace.
> This should be an easy workaround though.

This is a cgroup v1 behavior and to me it is totally reasonable to get
the 2 second stale root's usage. Even if you want to skip flushing in
irq, do that in the memcg code and keep VM_BUG_ON/WARN_ON_ONCE in the
rstat core code. This way we will know if other subsystems are doing
the same or not.
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Thu, Mar 23, 2023 at 9:29?AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Mar 23, 2023 at 9:18?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Mar 23, 2023 at 9:10?AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 8:46?AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 8:43?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 8:40?AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > > >
> > > > > > On Thu, Mar 23, 2023 at 6:36?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > >
> > > > > > [...]
> > > > > > > > >
> > > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > > >
> > > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > > >
> > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > > >
> > > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > > >
> > > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > > >
> > > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > > >
> > > > > >
> > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > > irqs.
> > > > > >
> > > > > > So, this patch can not be applied before either of those two tasks are
> > > > > > done (and we may find more such scenarios).
> > > > >
> > > > >
> > > > > Could you elaborate why?
> > > > >
> > > > > My understanding is that with an in_task() check to make sure we only
> > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > > acquire cgroup_rstat_lock without disabling interrupts.
> > > >
> > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > > with irq enabled. This is a potential deadlock hazard unless
> > > > cgroup_rstat_lock is always taken with irq disabled.
> > >
> > > Oh you are making sure it is not taken in the irq context through
> > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > > to actually remove all such users instead of silently
> > > ignoring/bypassing the functionality.
> >
> > It is a workaround, we simply accept to read stale stats in irq
> > context instead of the expensive flush operation.
> >
> > >
> > > So, how about removing mem_cgroup_flush_stats() from
> > > mem_cgroup_usage(). It will break the known chain which is taking
> > > cgroup_rstat_lock with irq disabled and you can add
> > > WARN_ON_ONCE(!in_task()).
> >
> > This changes the behavior in a more obvious way because:
> > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> > path is also exercised in a lot of paths outside irq context, this
> > will change the behavior for any event thresholds on the root memcg.
> > With proposed skipped flushing in irq context we only change the
> > behavior in a small subset of cases.
> >
> > I think we can skip flushing in irq context for now, and separately
> > deprecate threshold events for the root memcg. When that is done we
> > can come back and remove should_skip_flush() and add a VM_BUG_ON or
> > WARN_ON_ONCE instead. WDYT?
> >
> > 2. mem_cgroup_usage() is also used when reading usage from userspace.
> > This should be an easy workaround though.
>
> This is a cgroup v1 behavior and to me it is totally reasonable to get
> the 2 second stale root's usage. Even if you want to skip flushing in
> irq, do that in the memcg code and keep VM_BUG_ON/WARN_ON_ONCE in the
> rstat core code. This way we will know if other subsystems are doing
> the same or not.

We can do that. Basically in mem_cgroup_usage() have:

/* Some useful comment */
if (in_task())
mem_cgroup_flush_stats();

and in cgroup_rstat_flush() have:
WARN_ON_ONCE(!in_task());

I am assuming VM_BUG_ON is not used outside mm code.

The only thing that worries me is that if there is another unlikely
path somewhere that flushes stats in irq context we may run into a
deadlock. I am a little bit nervous about not skipping flushing if
!in_task() in cgroup_rstat_flush().
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Thu, Mar 23, 2023 at 9:37?AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Mar 23, 2023 at 9:29?AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Thu, Mar 23, 2023 at 9:18?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 9:10?AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 8:46?AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 8:43?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > > On Thu, Mar 23, 2023 at 8:40?AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 23, 2023 at 6:36?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > >
> > > > > > > [...]
> > > > > > > > > >
> > > > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > > > >
> > > > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > > > >
> > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > > > >
> > > > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > > > >
> > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > > > >
> > > > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > > > >
> > > > > > >
> > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > > > irqs.
> > > > > > >
> > > > > > > So, this patch can not be applied before either of those two tasks are
> > > > > > > done (and we may find more such scenarios).
> > > > > >
> > > > > >
> > > > > > Could you elaborate why?
> > > > > >
> > > > > > My understanding is that with an in_task() check to make sure we only
> > > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > > > acquire cgroup_rstat_lock without disabling interrupts.
> > > > >
> > > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > > > with irq enabled. This is a potential deadlock hazard unless
> > > > > cgroup_rstat_lock is always taken with irq disabled.
> > > >
> > > > Oh you are making sure it is not taken in the irq context through
> > > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > > > to actually remove all such users instead of silently
> > > > ignoring/bypassing the functionality.
> > >
> > > It is a workaround, we simply accept to read stale stats in irq
> > > context instead of the expensive flush operation.
> > >
> > > >
> > > > So, how about removing mem_cgroup_flush_stats() from
> > > > mem_cgroup_usage(). It will break the known chain which is taking
> > > > cgroup_rstat_lock with irq disabled and you can add
> > > > WARN_ON_ONCE(!in_task()).
> > >
> > > This changes the behavior in a more obvious way because:
> > > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> > > path is also exercised in a lot of paths outside irq context, this
> > > will change the behavior for any event thresholds on the root memcg.
> > > With proposed skipped flushing in irq context we only change the
> > > behavior in a small subset of cases.
> > >
> > > I think we can skip flushing in irq context for now, and separately
> > > deprecate threshold events for the root memcg. When that is done we
> > > can come back and remove should_skip_flush() and add a VM_BUG_ON or
> > > WARN_ON_ONCE instead. WDYT?
> > >
> > > 2. mem_cgroup_usage() is also used when reading usage from userspace.
> > > This should be an easy workaround though.
> >
> > This is a cgroup v1 behavior and to me it is totally reasonable to get
> > the 2 second stale root's usage. Even if you want to skip flushing in
> > irq, do that in the memcg code and keep VM_BUG_ON/WARN_ON_ONCE in the
> > rstat core code. This way we will know if other subsystems are doing
> > the same or not.
>
> We can do that. Basically in mem_cgroup_usage() have:
>
> /* Some useful comment */
> if (in_task())
> mem_cgroup_flush_stats();
>
> and in cgroup_rstat_flush() have:
> WARN_ON_ONCE(!in_task());
>
> I am assuming VM_BUG_ON is not used outside mm code.
>
> The only thing that worries me is that if there is another unlikely
> path somewhere that flushes stats in irq context we may run into a
> deadlock. I am a little bit nervous about not skipping flushing if
> !in_task() in cgroup_rstat_flush().

I think it is a good thing. We will find such scenarios and fix those
instead of hiding them forever or keeping the door open for new such
scenarios.
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Thu, Mar 23, 2023 at 9:45?AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Mar 23, 2023 at 9:37?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Thu, Mar 23, 2023 at 9:29?AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 9:18?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 9:10?AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 8:46?AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > > >
> > > > > > On Thu, Mar 23, 2023 at 8:43?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 23, 2023 at 8:40?AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Mar 23, 2023 at 6:36?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > > >
> > > > > > > > [...]
> > > > > > > > > > >
> > > > > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > > > > >
> > > > > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > > > > >
> > > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > > > > >
> > > > > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > > > > >
> > > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > > > > >
> > > > > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > > > > irqs.
> > > > > > > >
> > > > > > > > So, this patch can not be applied before either of those two tasks are
> > > > > > > > done (and we may find more such scenarios).
> > > > > > >
> > > > > > >
> > > > > > > Could you elaborate why?
> > > > > > >
> > > > > > > My understanding is that with an in_task() check to make sure we only
> > > > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > > > > acquire cgroup_rstat_lock without disabling interrupts.
> > > > > >
> > > > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > > > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > > > > with irq enabled. This is a potential deadlock hazard unless
> > > > > > cgroup_rstat_lock is always taken with irq disabled.
> > > > >
> > > > > Oh you are making sure it is not taken in the irq context through
> > > > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > > > > to actually remove all such users instead of silently
> > > > > ignoring/bypassing the functionality.
> > > >
> > > > It is a workaround, we simply accept to read stale stats in irq
> > > > context instead of the expensive flush operation.
> > > >
> > > > >
> > > > > So, how about removing mem_cgroup_flush_stats() from
> > > > > mem_cgroup_usage(). It will break the known chain which is taking
> > > > > cgroup_rstat_lock with irq disabled and you can add
> > > > > WARN_ON_ONCE(!in_task()).
> > > >
> > > > This changes the behavior in a more obvious way because:
> > > > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> > > > path is also exercised in a lot of paths outside irq context, this
> > > > will change the behavior for any event thresholds on the root memcg.
> > > > With proposed skipped flushing in irq context we only change the
> > > > behavior in a small subset of cases.
> > > >
> > > > I think we can skip flushing in irq context for now, and separately
> > > > deprecate threshold events for the root memcg. When that is done we
> > > > can come back and remove should_skip_flush() and add a VM_BUG_ON or
> > > > WARN_ON_ONCE instead. WDYT?
> > > >
> > > > 2. mem_cgroup_usage() is also used when reading usage from userspace.
> > > > This should be an easy workaround though.
> > >
> > > This is a cgroup v1 behavior and to me it is totally reasonable to get
> > > the 2 second stale root's usage. Even if you want to skip flushing in
> > > irq, do that in the memcg code and keep VM_BUG_ON/WARN_ON_ONCE in the
> > > rstat core code. This way we will know if other subsystems are doing
> > > the same or not.
> >
> > We can do that. Basically in mem_cgroup_usage() have:
> >
> > /* Some useful comment */
> > if (in_task())
> > mem_cgroup_flush_stats();
> >
> > and in cgroup_rstat_flush() have:
> > WARN_ON_ONCE(!in_task());
> >
> > I am assuming VM_BUG_ON is not used outside mm code.
> >
> > The only thing that worries me is that if there is another unlikely
> > path somewhere that flushes stats in irq context we may run into a
> > deadlock. I am a little bit nervous about not skipping flushing if
> > !in_task() in cgroup_rstat_flush().
>
> I think it is a good thing. We will find such scenarios and fix those
> instead of hiding them forever or keeping the door open for new such
> scenarios.

Sure, I can do that in the next version. I will include a patch that
adds an in_task() check to mem_cgroup_usage() before this one. Since
BUG_ON() is discouraged and VM_BUG_ON() is mm specific, I guess we are
left with WARN_ON_ONCE() for the rstat core code, right?

Thanks Shakeel. Any other thoughts I should address for the next version?
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Thu, Mar 23, 2023 at 09:17:33AM -0700, Yosry Ahmed wrote:
> On Thu, Mar 23, 2023 at 9:10?AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Thu, Mar 23, 2023 at 8:46?AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 8:43?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 8:40?AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 6:36?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > [...]
> > > > > > > >
> > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > >
> > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > >
> > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > >
> > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > >
> > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > >
> > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > >
> > > > >
> > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > irqs.
> > > > >
> > > > > So, this patch can not be applied before either of those two tasks are
> > > > > done (and we may find more such scenarios).
> > > >
> > > >
> > > > Could you elaborate why?
> > > >
> > > > My understanding is that with an in_task() check to make sure we only
> > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > acquire cgroup_rstat_lock without disabling interrupts.
> > >
> > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > with irq enabled. This is a potential deadlock hazard unless
> > > cgroup_rstat_lock is always taken with irq disabled.
> >
> > Oh you are making sure it is not taken in the irq context through
> > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > to actually remove all such users instead of silently
> > ignoring/bypassing the functionality.

+1

It shouldn't silently skip the requested operation, rather it
shouldn't be requested from an incompatible context.

> > So, how about removing mem_cgroup_flush_stats() from
> > mem_cgroup_usage(). It will break the known chain which is taking
> > cgroup_rstat_lock with irq disabled and you can add
> > WARN_ON_ONCE(!in_task()).
>
> This changes the behavior in a more obvious way because:
> 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> path is also exercised in a lot of paths outside irq context, this
> will change the behavior for any event thresholds on the root memcg.
> With proposed skipped flushing in irq context we only change the
> behavior in a small subset of cases.

Can you do

/* Note: stale usage data when called from irq context!! */
if (in_task())
mem_cgroup_flush_stats()

directly in the callsite? Maybe even include the whole callchain in
the comment that's currently broken and needs fixing/removing.
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Thu, Mar 23, 2023 at 10:33?AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Mar 23, 2023 at 09:17:33AM -0700, Yosry Ahmed wrote:
> > On Thu, Mar 23, 2023 at 9:10?AM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 8:46?AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 8:43?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 8:40?AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > > >
> > > > > > On Thu, Mar 23, 2023 at 6:36?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > >
> > > > > > [...]
> > > > > > > > >
> > > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > > >
> > > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > > >
> > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > > >
> > > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > > >
> > > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > > >
> > > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > > >
> > > > > >
> > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > > irqs.
> > > > > >
> > > > > > So, this patch can not be applied before either of those two tasks are
> > > > > > done (and we may find more such scenarios).
> > > > >
> > > > >
> > > > > Could you elaborate why?
> > > > >
> > > > > My understanding is that with an in_task() check to make sure we only
> > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > > acquire cgroup_rstat_lock without disabling interrupts.
> > > >
> > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > > with irq enabled. This is a potential deadlock hazard unless
> > > > cgroup_rstat_lock is always taken with irq disabled.
> > >
> > > Oh you are making sure it is not taken in the irq context through
> > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > > to actually remove all such users instead of silently
> > > ignoring/bypassing the functionality.
>
> +1
>
> It shouldn't silently skip the requested operation, rather it
> shouldn't be requested from an incompatible context.
>
> > > So, how about removing mem_cgroup_flush_stats() from
> > > mem_cgroup_usage(). It will break the known chain which is taking
> > > cgroup_rstat_lock with irq disabled and you can add
> > > WARN_ON_ONCE(!in_task()).
> >
> > This changes the behavior in a more obvious way because:
> > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> > path is also exercised in a lot of paths outside irq context, this
> > will change the behavior for any event thresholds on the root memcg.
> > With proposed skipped flushing in irq context we only change the
> > behavior in a small subset of cases.
>
> Can you do
>
> /* Note: stale usage data when called from irq context!! */
> if (in_task())
> mem_cgroup_flush_stats()
>
> directly in the callsite? Maybe even include the whole callchain in
> the comment that's currently broken and needs fixing/removing.

Yeah, we can do that in mem_cgroup_usage(), which is the only context
that I am aware of that may flush from irq context. We can also add
WARN_ON_ONCE(!in_task()) in the rstat core flushing code to catch any
other code paths that we are not aware of -- which may result in a
deadlock, but hopefully if there is a violation it will be caught soon
enough.
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Thu, Mar 23, 2023 at 11:09:30AM -0700, Yosry Ahmed wrote:
> On Thu, Mar 23, 2023 at 10:33?AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Mar 23, 2023 at 09:17:33AM -0700, Yosry Ahmed wrote:
> > > On Thu, Mar 23, 2023 at 9:10?AM Shakeel Butt <shakeelb@google.com> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 8:46?AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 8:43?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > >
> > > > > > On Thu, Mar 23, 2023 at 8:40?AM Shakeel Butt <shakeelb@google.com> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 23, 2023 at 6:36?AM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > > > > > >
> > > > > > > [...]
> > > > > > > > > >
> > > > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > > > >
> > > > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > > > >
> > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > > > >
> > > > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > > > >
> > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > > > >
> > > > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > > > >
> > > > > > >
> > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > > > irqs.
> > > > > > >
> > > > > > > So, this patch can not be applied before either of those two tasks are
> > > > > > > done (and we may find more such scenarios).
> > > > > >
> > > > > >
> > > > > > Could you elaborate why?
> > > > > >
> > > > > > My understanding is that with an in_task() check to make sure we only
> > > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > > > acquire cgroup_rstat_lock without disabling interrupts.
> > > > >
> > > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > > > with irq enabled. This is a potential deadlock hazard unless
> > > > > cgroup_rstat_lock is always taken with irq disabled.
> > > >
> > > > Oh you are making sure it is not taken in the irq context through
> > > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > > > to actually remove all such users instead of silently
> > > > ignoring/bypassing the functionality.
> >
> > +1
> >
> > It shouldn't silently skip the requested operation, rather it
> > shouldn't be requested from an incompatible context.
> >
> > > > So, how about removing mem_cgroup_flush_stats() from
> > > > mem_cgroup_usage(). It will break the known chain which is taking
> > > > cgroup_rstat_lock with irq disabled and you can add
> > > > WARN_ON_ONCE(!in_task()).
> > >
> > > This changes the behavior in a more obvious way because:
> > > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> > > path is also exercised in a lot of paths outside irq context, this
> > > will change the behavior for any event thresholds on the root memcg.
> > > With proposed skipped flushing in irq context we only change the
> > > behavior in a small subset of cases.
> >
> > Can you do
> >
> > /* Note: stale usage data when called from irq context!! */
> > if (in_task())
> > mem_cgroup_flush_stats()
> >
> > directly in the callsite? Maybe even include the whole callchain in
> > the comment that's currently broken and needs fixing/removing.
>
> Yeah, we can do that in mem_cgroup_usage(), which is the only context
> that I am aware of that may flush from irq context. We can also add
> WARN_ON_ONCE(!in_task()) in the rstat core flushing code to catch any
> other code paths that we are not aware of -- which may result in a
> deadlock, but hopefully if there is a violation it will be caught soon
> enough.

That sounds good to me, thanks!
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Thu, Mar 23, 2023 at 9:52?AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
>
> Sure, I can do that in the next version. I will include a patch that
> adds an in_task() check to mem_cgroup_usage() before this one. Since
> BUG_ON() is discouraged and VM_BUG_ON() is mm specific, I guess we are
> left with WARN_ON_ONCE() for the rstat core code, right?
>
> Thanks Shakeel. Any other thoughts I should address for the next version?

This is all for now (at least for this patch).
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
Hello,

On Thu, Mar 23, 2023 at 04:00:31AM +0000, Yosry Ahmed wrote:
> Currently, when sleeping is not allowed during rstat flushing, we hold
> the global rstat lock with interrupts disabled throughout the entire
> flush operation. Flushing in an O(# cgroups * # cpus) operation, and
> having interrupts disabled throughout is dangerous.
>
> For some contexts, we may not want to sleep, but can be interrupted
> (e.g. while holding a spinlock or RCU read lock). As such, do not
> disable interrupts throughout rstat flushing, only when holding the
> percpu lock. This breaks down the O(# cgroups * # cpus) duration with
> interrupts disabled to a series of O(# cgroups) durations.
>
> Furthermore, if a cpu spinning waiting for the global rstat lock, it
> doesn't need to spin with interrupts disabled anymore.

I'm generally not a fan of big spin locks w/o irq protection. They too often
become a source of unpredictable latency spikes. As you said, the global
rstat lock can be held for quite a while. Removing _irq makes irq latency
better on the CPU but on the other hand it makes a lot more likely that the
lock is gonna be held even longer, possibly significantly so depending on
the configuration and workload which will in turn stall other CPUs waiting
for the lock. Sure, irqs are being serviced quicker but if the cost is more
and longer !irq context multi-cpu stalls, what's the point?

I don't think there's anything which requires the global lock to be held
throughout the entire flushing sequence and irq needs to be disabled when
grabbing the percpu lock anyway, so why not just release the global lock on
CPU boundaries instead? We don't really lose anything significant that way.
The durations of irq disabled sections are still about the same as in the
currently proposed solution at O(# cgroups) and we avoid the risk of holding
the global lock for too long unexpectedly from getting hit repeatedly by
irqs while holding the global lock.

Thanks.

--
tejun
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Thu, Mar 23, 2023 at 6:39?PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Mar 23, 2023 at 04:00:31AM +0000, Yosry Ahmed wrote:
> > Currently, when sleeping is not allowed during rstat flushing, we hold
> > the global rstat lock with interrupts disabled throughout the entire
> > flush operation. Flushing in an O(# cgroups * # cpus) operation, and
> > having interrupts disabled throughout is dangerous.
> >
> > For some contexts, we may not want to sleep, but can be interrupted
> > (e.g. while holding a spinlock or RCU read lock). As such, do not
> > disable interrupts throughout rstat flushing, only when holding the
> > percpu lock. This breaks down the O(# cgroups * # cpus) duration with
> > interrupts disabled to a series of O(# cgroups) durations.
> >
> > Furthermore, if a cpu spinning waiting for the global rstat lock, it
> > doesn't need to spin with interrupts disabled anymore.
>
> I'm generally not a fan of big spin locks w/o irq protection. They too often
> become a source of unpredictable latency spikes. As you said, the global
> rstat lock can be held for quite a while. Removing _irq makes irq latency
> better on the CPU but on the other hand it makes a lot more likely that the
> lock is gonna be held even longer, possibly significantly so depending on
> the configuration and workload which will in turn stall other CPUs waiting
> for the lock. Sure, irqs are being serviced quicker but if the cost is more
> and longer !irq context multi-cpu stalls, what's the point?
>
> I don't think there's anything which requires the global lock to be held
> throughout the entire flushing sequence and irq needs to be disabled when
> grabbing the percpu lock anyway, so why not just release the global lock on
> CPU boundaries instead? We don't really lose anything significant that way.
> The durations of irq disabled sections are still about the same as in the
> currently proposed solution at O(# cgroups) and we avoid the risk of holding
> the global lock for too long unexpectedly from getting hit repeatedly by
> irqs while holding the global lock.

Thanks for taking a look!

I think a problem with this approach is that we risk having to contend
for the global lock at every CPU boundary in atomic contexts. Right
now we contend for the global lock once, and once we have it we go
through all CPUs to flush, only having to contend with updates taking
the percpu locks at this point. If we unconditionally release &
reacquire the global lock at every CPU boundary then we may contend
for it much more frequently with concurrent flushers.

On the memory controller side, concurrent flushers are already held
back to avoid a thundering herd problem on the global rstat lock, but
flushers from outside the memory controller can still compete together
or with a flusher from the memory controller. In this case, we risk
contending the global lock more and concurrent flushers taking a
longer period of time, which may end up causing multi-CPU stalls
anyway, right? Also, if we keep _irq when spinning for the lock, then
concurrent flushers still need to spin with irq disabled -- another
problem that this series tries to fix.

This is particularly a problem for flushers in atomic contexts. There
is a flusher in mem_cgroup_wb_stats() that flushes while holding
another spinlock, and a flusher in mem_cgroup_usage() that flushes
with irqs disabled. If flushing takes a longer period of time due to
repeated lock contention, it affects such atomic context negatively.

I am not sure how all of this matters in practice, it depends heavily
on the workloads and the configuration like you mentioned. I am just
pointing out the potential disadvantages of reacquiring the lock at
every CPU boundary in atomic contexts.

>
> Thanks.
>
> --
> tejun
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On 3/24/23 03:22, Yosry Ahmed wrote:
> On Thu, Mar 23, 2023 at 6:39?PM Tejun Heo <tj@kernel.org> wrote:
>> Hello,
>>
>> On Thu, Mar 23, 2023 at 04:00:31AM +0000, Yosry Ahmed wrote:
>>> Currently, when sleeping is not allowed during rstat flushing, we hold
>>> the global rstat lock with interrupts disabled throughout the entire
>>> flush operation. Flushing in an O(# cgroups * # cpus) operation, and
>>> having interrupts disabled throughout is dangerous.
>>>
>>> For some contexts, we may not want to sleep, but can be interrupted
>>> (e.g. while holding a spinlock or RCU read lock). As such, do not
>>> disable interrupts throughout rstat flushing, only when holding the
>>> percpu lock. This breaks down the O(# cgroups * # cpus) duration with
>>> interrupts disabled to a series of O(# cgroups) durations.
>>>
>>> Furthermore, if a cpu spinning waiting for the global rstat lock, it
>>> doesn't need to spin with interrupts disabled anymore.
>> I'm generally not a fan of big spin locks w/o irq protection. They too often
>> become a source of unpredictable latency spikes. As you said, the global
>> rstat lock can be held for quite a while. Removing _irq makes irq latency
>> better on the CPU but on the other hand it makes a lot more likely that the
>> lock is gonna be held even longer, possibly significantly so depending on
>> the configuration and workload which will in turn stall other CPUs waiting
>> for the lock. Sure, irqs are being serviced quicker but if the cost is more
>> and longer !irq context multi-cpu stalls, what's the point?
>>
>> I don't think there's anything which requires the global lock to be held
>> throughout the entire flushing sequence and irq needs to be disabled when
>> grabbing the percpu lock anyway, so why not just release the global lock on
>> CPU boundaries instead? We don't really lose anything significant that way.
>> The durations of irq disabled sections are still about the same as in the
>> currently proposed solution at O(# cgroups) and we avoid the risk of holding
>> the global lock for too long unexpectedly from getting hit repeatedly by
>> irqs while holding the global lock.
> Thanks for taking a look!
>
> I think a problem with this approach is that we risk having to contend
> for the global lock at every CPU boundary in atomic contexts. Right
Isn't it the plan to just do a trylock in atomic contexts so that it
won't get stuck spinning for the lock for an indeterminate amount of time?
> now we contend for the global lock once, and once we have it we go
> through all CPUs to flush, only having to contend with updates taking
> the percpu locks at this point. If we unconditionally release &
> reacquire the global lock at every CPU boundary then we may contend
> for it much more frequently with concurrent flushers.

Note that with the use of qspinlock in all the major arches, the impact
of thundering herds of lockers are much less serious than before. There
are certainly some overhead in doing multiple lock acquires and
releases, but that shouldn't been too excessive.

I am all in for reducing lock hold time as much as possible as it will
improve the response time.

Cheers,
Longman
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Fri, Mar 24, 2023 at 7:12?AM Waiman Long <longman@redhat.com> wrote:
>
> On 3/24/23 03:22, Yosry Ahmed wrote:
> > On Thu, Mar 23, 2023 at 6:39?PM Tejun Heo <tj@kernel.org> wrote:
> >> Hello,
> >>
> >> On Thu, Mar 23, 2023 at 04:00:31AM +0000, Yosry Ahmed wrote:
> >>> Currently, when sleeping is not allowed during rstat flushing, we hold
> >>> the global rstat lock with interrupts disabled throughout the entire
> >>> flush operation. Flushing in an O(# cgroups * # cpus) operation, and
> >>> having interrupts disabled throughout is dangerous.
> >>>
> >>> For some contexts, we may not want to sleep, but can be interrupted
> >>> (e.g. while holding a spinlock or RCU read lock). As such, do not
> >>> disable interrupts throughout rstat flushing, only when holding the
> >>> percpu lock. This breaks down the O(# cgroups * # cpus) duration with
> >>> interrupts disabled to a series of O(# cgroups) durations.
> >>>
> >>> Furthermore, if a cpu spinning waiting for the global rstat lock, it
> >>> doesn't need to spin with interrupts disabled anymore.
> >> I'm generally not a fan of big spin locks w/o irq protection. They too often
> >> become a source of unpredictable latency spikes. As you said, the global
> >> rstat lock can be held for quite a while. Removing _irq makes irq latency
> >> better on the CPU but on the other hand it makes a lot more likely that the
> >> lock is gonna be held even longer, possibly significantly so depending on
> >> the configuration and workload which will in turn stall other CPUs waiting
> >> for the lock. Sure, irqs are being serviced quicker but if the cost is more
> >> and longer !irq context multi-cpu stalls, what's the point?
> >>
> >> I don't think there's anything which requires the global lock to be held
> >> throughout the entire flushing sequence and irq needs to be disabled when
> >> grabbing the percpu lock anyway, so why not just release the global lock on
> >> CPU boundaries instead? We don't really lose anything significant that way.
> >> The durations of irq disabled sections are still about the same as in the
> >> currently proposed solution at O(# cgroups) and we avoid the risk of holding
> >> the global lock for too long unexpectedly from getting hit repeatedly by
> >> irqs while holding the global lock.
> > Thanks for taking a look!
> >
> > I think a problem with this approach is that we risk having to contend
> > for the global lock at every CPU boundary in atomic contexts. Right
> Isn't it the plan to just do a trylock in atomic contexts so that it
> won't get stuck spinning for the lock for an indeterminate amount of time?

Not exactly. On the memory controller side, we currently only allow
one flusher at a time and force all flushers to flush the full
hierarchy, such that concurrent flushers can skip. This is done for
both atomic and non-atomic contexts.

For flushers outside the memory controller, they can still contend the
lock among themselves or with flushers in the memory controller. In
this case, instead of contending the lock once, they contend it at
each CPU boundary.

> > now we contend for the global lock once, and once we have it we go
> > through all CPUs to flush, only having to contend with updates taking
> > the percpu locks at this point. If we unconditionally release &
> > reacquire the global lock at every CPU boundary then we may contend
> > for it much more frequently with concurrent flushers.
>
> Note that with the use of qspinlock in all the major arches, the impact
> of thundering herds of lockers are much less serious than before. There
> are certainly some overhead in doing multiple lock acquires and
> releases, but that shouldn't been too excessive.

I ran some tests to measure this. Since I am using a cgroup v1
hierarchy, I cannot reproduce contention between memory controller
flushers and non-memory controller flushers, so I removed the "one
memory flusher only" restriction to have concurrent memory flushers
compete for the global rstat lock to measure the impact:

Before (only one flusher allowed to compete for the global rstat lock):
---cgroup_rstat_flush
|
--1.27%--cgroup_rstat_flush_locked
|
--0.94%--mem_cgroup_css_rstat_flush

After (concurrent flushers allowed to compete for the global rstat lock):
---cgroup_rstat_flush
|
|--4.94%--_raw_spin_lock
| |
| --4.94%--queued_spin_lock_slowpath
|
--0.92%--cgroup_rstat_flush_locked
|
--0.56%--mem_cgroup_css_rstat_flush

This was run with 20 processes trying to flush concurrently, so it may
be excessive, but it seems like in this case lock contention makes a
significant difference.

Again, this is not a regression for non-atomic flushers, as they
already compete for the lock at every CPU boundary, but for atomic
flushers that don't give up the lock at all today, it would be a
regression to start competing for the lock at every CPU boundary. This
patch series aims to minimize the number of atomic flushers (brings
them down to two, one of which is not common), so this may be fine.

My main concern is that for some flushers that this series converts
from atomic to non-atomic, we may notice a regression later and revert
it (e.g. refault path), which is why I have them in separate patches.
If we regress the atomic flushing path, it would be a larger surgery
to restore the performance for these paths -- which is why I would
rather keep the atomic path without excessive lock contention.

Thoughts?

>
> I am all in for reducing lock hold time as much as possible as it will
> improve the response time.
>
> Cheers,
> Longman
>
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
Hello,

On Fri, Mar 24, 2023 at 12:22:09AM -0700, Yosry Ahmed wrote:
> I think a problem with this approach is that we risk having to contend
> for the global lock at every CPU boundary in atomic contexts. Right
> now we contend for the global lock once, and once we have it we go
> through all CPUs to flush, only having to contend with updates taking
> the percpu locks at this point. If we unconditionally release &
> reacquire the global lock at every CPU boundary then we may contend
> for it much more frequently with concurrent flushers.
>
> On the memory controller side, concurrent flushers are already held
> back to avoid a thundering herd problem on the global rstat lock, but
> flushers from outside the memory controller can still compete together
> or with a flusher from the memory controller. In this case, we risk
> contending the global lock more and concurrent flushers taking a
> longer period of time, which may end up causing multi-CPU stalls
> anyway, right? Also, if we keep _irq when spinning for the lock, then
> concurrent flushers still need to spin with irq disabled -- another
> problem that this series tries to fix.
>
> This is particularly a problem for flushers in atomic contexts. There
> is a flusher in mem_cgroup_wb_stats() that flushes while holding
> another spinlock, and a flusher in mem_cgroup_usage() that flushes
> with irqs disabled. If flushing takes a longer period of time due to
> repeated lock contention, it affects such atomic context negatively.
>
> I am not sure how all of this matters in practice, it depends heavily
> on the workloads and the configuration like you mentioned. I am just
> pointing out the potential disadvantages of reacquiring the lock at
> every CPU boundary in atomic contexts.

So, I'm not too convinced by the arguments for a couple reasons:

* It's not very difficult to create conditions where a contented non-irq
protected spinlock is held unnecessarily long due to heavy IRQ irq load on
the holding CPU. This can easily extend the amount of time the lock is
held by multiple times if not orders of magnitude. That is likely a
significantly worse problem than the contention on the lock cacheline
which will lead to a lot more gradual degradation.

* If concurrent flushing is an actual problem, we need and can implement a
better solution. There's quite a bit of maneuvering room here given that
the flushing operations are mostly idempotent in close time proximity and
there's no real atomicity requirement across different segments of
flushing operations.

Thanks.

--
tejun
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Fri, Mar 24, 2023 at 6:54?PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Mar 24, 2023 at 12:22:09AM -0700, Yosry Ahmed wrote:
> > I think a problem with this approach is that we risk having to contend
> > for the global lock at every CPU boundary in atomic contexts. Right
> > now we contend for the global lock once, and once we have it we go
> > through all CPUs to flush, only having to contend with updates taking
> > the percpu locks at this point. If we unconditionally release &
> > reacquire the global lock at every CPU boundary then we may contend
> > for it much more frequently with concurrent flushers.
> >
> > On the memory controller side, concurrent flushers are already held
> > back to avoid a thundering herd problem on the global rstat lock, but
> > flushers from outside the memory controller can still compete together
> > or with a flusher from the memory controller. In this case, we risk
> > contending the global lock more and concurrent flushers taking a
> > longer period of time, which may end up causing multi-CPU stalls
> > anyway, right? Also, if we keep _irq when spinning for the lock, then
> > concurrent flushers still need to spin with irq disabled -- another
> > problem that this series tries to fix.
> >
> > This is particularly a problem for flushers in atomic contexts. There
> > is a flusher in mem_cgroup_wb_stats() that flushes while holding
> > another spinlock, and a flusher in mem_cgroup_usage() that flushes
> > with irqs disabled. If flushing takes a longer period of time due to
> > repeated lock contention, it affects such atomic context negatively.
> >
> > I am not sure how all of this matters in practice, it depends heavily
> > on the workloads and the configuration like you mentioned. I am just
> > pointing out the potential disadvantages of reacquiring the lock at
> > every CPU boundary in atomic contexts.
>
> So, I'm not too convinced by the arguments for a couple reasons:
>
> * It's not very difficult to create conditions where a contented non-irq
> protected spinlock is held unnecessarily long due to heavy IRQ irq load on
> the holding CPU. This can easily extend the amount of time the lock is
> held by multiple times if not orders of magnitude. That is likely a
> significantly worse problem than the contention on the lock cacheline
> which will lead to a lot more gradual degradation.

I agree that can be a problem, it depends on the specific workload and
configuration. The continuous lock contention at each CPU boundary
causes a regression (see my reply to Waiman), but I am not sure if
it's worse than the scenario you are describing.

>
> * If concurrent flushing is an actual problem, we need and can implement a
> better solution. There's quite a bit of maneuvering room here given that
> the flushing operations are mostly idempotent in close time proximity and
> there's no real atomicity requirement across different segments of
> flushing operations.

Concurrent flushing can be a problem for some workloads, especially in
the MM code we flush in the reclaim and refault paths. This is
currently mitigated by only allowing one flusher at a time from the
memcg side, but contention can still happen with flushing when a
cgroup is being freed or other flushers in other subsystems.

I tried allowing concurrent flushing by completely removing the global
rstat lock, and only depending on the percpu locks for
synchronization. For this to be correct the global stat counters need
to be atomic, this introduced a slow down for flushing in general. I
also noticed heavier lock contention on the percpu locks, since all
flushers try to acquire all locks in the same order. I even tried
implementing a simple retry scheme where we try to acquire the percpu
lock, and if we fail we queue the current cpu and move to the next
one. This ended up causing a little bit of slowness as well. Together
with the slowness introduced by atomic operations it seemed like a
significant regression in the simple flushing path.

Don't get me wrong, I am all for modifying the current approach, I
just want to make sure we are making the correct decision for *most*
workloads. Keep in mind that this series aims to minimize the number
of flushers from atomic contexts as well, and for non-atomic flushers
we allow giving up the lock at CPU boundaries anyway. The current
approach only keeps the lock held throughout for atomic flushers.

Any ideas here are welcome!

>
> Thanks.
>
> --
> tejun
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock [ In reply to ]
On Fri, Mar 24, 2023 at 9:37?PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Fri, Mar 24, 2023 at 9:31?PM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > On Fri, Mar 24, 2023 at 7:18?PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > [...]
> > > Any ideas here are welcome!
> > >
> >
> > Let's move forward. It seems like we are not going to reach an
> > agreement on making cgroup_rstat_lock a non-irq lock. However there is
> > agreement on the memcg code of not flushing in irq context and the
> > cleanup Johannes has requested. Let's proceed with those for now. We
> > can come back to cgroup_rstat_lock later if we still see issues in
> > production.
>
> Even if we do not flush from irq context, we still flush from atomic
> contexts that will currently hold the lock with irqs disabled
> throughout the entire flush sequence. A primary purpose of this reason
> is to avoid that.
>
> We can either:
> (a) Proceed with the following approach of making cgroup_rstat_lock a
> non-irq lock.
> (b) Proceed with Tejun's suggestion of always releasing and
> reacquiring the lock at CPU boundaries, even for atomic flushes (if
> the spinlock needs a break ofc).
> (c) Something else.

(d) keep the status quo regarding cgroup_rstat_lock
(e) decouple the discussion of cgroup_rstat_lock from the agreed
improvements. Send the patches for the agreed ones and continue
discussing cgroup_rstat_lock.

1 2  View All