Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock
On Fri, Mar 24, 2023 at 9:31 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Fri, Mar 24, 2023 at 7:18 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> [...]
> > Any ideas here are welcome!
> >
>
> Let's move forward. It seems like we are not going to reach an
> agreement on making cgroup_rstat_lock a non-irq lock. However, there
> is agreement on not flushing in irq context in the memcg code, and on
> the cleanup Johannes has requested. Let's proceed with those for now.
> We can come back to cgroup_rstat_lock later if we still see issues in
> production.

Even if we do not flush from irq context, we still flush from atomic
contexts that will currently hold the lock with irqs disabled
throughout the entire flush sequence. A primary purpose of this patch
is to avoid that.
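
For context, this is roughly what the current atomic flush entry point
looks like; a simplified sketch with details elided:

void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp)
{
        unsigned long flags;

        /* irqs stay disabled across the entire per-cpu flush walk */
        spin_lock_irqsave(&cgroup_rstat_lock, flags);
        cgroup_rstat_flush_locked(cgrp, false); /* may_sleep == false */
        spin_unlock_irqrestore(&cgroup_rstat_lock, flags);
}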

We can either:
(a) Proceed with the proposed approach of making cgroup_rstat_lock a
non-irq lock.
(b) Proceed with Tejun's suggestion of always releasing and
reacquiring the lock at CPU boundaries, even for atomic flushes (if
the spinlock needs a break ofc).
(c) Something else.

I am happy to proceed with any solution, but we need to address the
fact that interrupts are always disabled throughout the flush. My main
concern about Tejun's suggestion is that atomic contexts would have to
contend for cgroup_rstat_lock much more than they do now, but it's
still better than what we have today.

>
> Tejun, do you have any concerns on adding WARN_ON_ONCE(!in_task()) in
> the rstat flushing code?
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock
On Fri, Mar 24, 2023 at 7:18 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
[...]
> Any ideas here are welcome!
>

Let's move forward. It seems like we are not going to reach an
agreement on making cgroup_rstat_lock a non-irq lock. However, there
is agreement on not flushing in irq context in the memcg code, and on
the cleanup Johannes has requested. Let's proceed with those for now.
We can come back to cgroup_rstat_lock later if we still see issues in
production.

Tejun, do you have any concerns on adding WARN_ON_ONCE(!in_task()) in
the rstat flushing code?
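
For illustration, such a check might sit in the common flush path that
all variants funnel through; a sketch of a possible placement, not a
final one:

static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
{
        /* flushing should never happen from [soft]irq context */
        WARN_ON_ONCE(!in_task());
        ...
}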
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock
On Fri, Mar 24, 2023 at 9:46 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Fri, Mar 24, 2023 at 9:37 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Fri, Mar 24, 2023 at 9:31 PM Shakeel Butt <shakeelb@google.com> wrote:
> > >
> > > On Fri, Mar 24, 2023 at 7:18 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > > >
> > > [...]
> > > > Any ideas here are welcome!
> > > >
> > >
> > > Let's move forward. It seems like we are not going to reach an
> > > agreement on making cgroup_rstat_lock a non-irq lock. However, there
> > > is agreement on not flushing in irq context in the memcg code, and on
> > > the cleanup Johannes has requested. Let's proceed with those for now.
> > > We can come back to cgroup_rstat_lock later if we still see issues in
> > > production.
> >
> > Even if we do not flush from irq context, we still flush from atomic
> > contexts that will currently hold the lock with irqs disabled
> > throughout the entire flush sequence. A primary purpose of this patch
> > is to avoid that.
> >
> > We can either:
> > (a) Proceed with the proposed approach of making cgroup_rstat_lock a
> > non-irq lock.
> > (b) Proceed with Tejun's suggestion of always releasing and
> > reacquiring the lock at CPU boundaries, even for atomic flushes (if
> > the spinlock needs a break ofc).
> > (c) Something else.
>
> (d) keep the status quo regarding cgroup_rstat_lock
> (e) decouple the discussion of cgroup_rstat_lock from the agreed
> improvements. Send the patches for the agreed ones and continue
> discussing cgroup_rstat_lock.


Ah, I lost sight of the fact that the rest of the patch series does
not strictly depend on this patch. I will respin the rest of the patch
series separately. Thanks, Shakeel.

Meanwhile, it would be useful to reach an agreement here on how to stop
holding cgroup_rstat_lock for long stretches with irqs disabled in
atomic contexts.

Tejun, if having the lock be non-irq is a non-starter for you, I can
send a patch that instead gives up the lock and reacquires it at every
CPU boundary unconditionally -- or perhaps every N CPU boundaries to
avoid excessively releasing and reacquiring the lock.

Something like:

static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
{
        ...
        for_each_possible_cpu(cpu) {
                ...
                /* always yield at CPU boundaries to enable irqs */
                spin_unlock_irq(&cgroup_rstat_lock);

                /* if @may_sleep, play nice and yield if necessary */
                if (may_sleep)
                        cond_resched();

                spin_lock_irq(&cgroup_rstat_lock);
        }
}
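
or, yielding only every N CPU boundaries as suggested above; a sketch in
which the batch size and its name are made up:

static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
{
        int cpu, nr_flushed = 0;
        ...
        for_each_possible_cpu(cpu) {
                ...
                /* yield the lock only every CGROUP_RSTAT_FLUSH_BATCH cpus */
                if (++nr_flushed % CGROUP_RSTAT_FLUSH_BATCH)
                        continue;

                spin_unlock_irq(&cgroup_rstat_lock);
                if (may_sleep)
                        cond_resched();
                spin_lock_irq(&cgroup_rstat_lock);
        }
}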

If you have other ideas to avoid disabling irqs for the entire flush
sequence, I am also open to that.

Thanks!
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock
Hello, Yosry.

On Mon, Mar 27, 2023 at 04:23:13PM -0700, Yosry Ahmed wrote:
> Tejun, if having the lock be non-irq is a non-starter for you, I can

This is an actual hazard. We see in prod these unprotected locks causing
very big spikes in tail latencies, and they can be tricky to root cause
too. Given the way the rstat lock is used, it's highly likely to be
involved in those scenarios with the proposed change, so it's gonna be a
nack from my end.

> send a patch that instead gives up the lock and reacquires it at every
> CPU boundary unconditionally -- or perhaps every N CPU boundaries to
> avoid excessively releasing and reacquiring the lock.

I'd just do the simple thing and see whether there's any perf penalty before
making it complicated. I'd be pretty surprised if unlocking and relocking
the same spinlock adds any noticeable overhead here.

Thanks.

--
tejun
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock
Hi Tejun,

On Wed, 29 Mar 2023, Tejun Heo wrote:
> On Mon, Mar 27, 2023 at 04:23:13PM -0700, Yosry Ahmed wrote:
> > Tejun, if having the lock be non-irq is a non-starter for you, I can
>
> This is an actual hazard. We see in prod these unprotected locks causing
> very big spikes in tail latencies, and they can be tricky to root cause
> too. Given the way the rstat lock is used, it's highly likely to be
> involved in those scenarios with the proposed change, so it's gonna be a
> nack from my end.

Butting in here, I'm fascinated. This is certainly not my area, I know
nothing about rstat, but this is the first time I ever heard someone
arguing for more disabling of interrupts rather than less.

An interrupt coming in while holding a contended resource can certainly
add to latencies, that I accept of course. But until now, I thought it
was agreed best practice to disable irqs only regretfully, when strictly
necessary.

If that has changed, I for one want to know about it. How should we
now judge which spinlocks should disable interrupts and which should not?
Page table locks are currently my main interest - should those be changed?

Thanks,
Hugh
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock
Hello, Hugh. How have you been?

On Wed, Mar 29, 2023 at 12:22:24PM -0700, Hugh Dickins wrote:
> Hi Tejun,
> Butting in here, I'm fascinated. This is certainly not my area, I know
> nothing about rstat, but this is the first time I ever heard someone
> arguing for more disabling of interrupts rather than less.
>
> An interrupt coming in while holding a contended resource can certainly
> add to latencies, that I accept of course. But until now, I thought it
> was agreed best practice to disable irqs only regretfully, when strictly
> necessary.
>
> If that has changed, I for one want to know about it. How should we
> now judge which spinlocks should disable interrupts and which should not?
> Page table locks are currently my main interest - should those be changed?

For rstat, it's a simple case because the global lock here wraps around
per-cpu locks which have to be irq-safe, so the only difference between
making the global lock irq-unsafe and keeping it irq-safe but releasing
it in between CPUs is:

Global lock held: G
IRQ disabled: I
Percpu lock held: P

1. IRQ unsafe

GGGGGGGGGGGGGGG~~GGGGG
IIII IIII IIII ~~ IIII
PPPP PPPP PPPP ~~ PPPP

2. IRQ safe, released in between cpus

GGGG GGGG GGGG ~~ GGGG
IIII IIII IIII ~~ IIII
PPPP PPPP PPPP ~~ PPPP

#2 seems like the obvious thing to do here given how the lock is used and
each P section may take a bit of time.
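
(Each P section is the per-cpu lock acquisition, roughly the sketch
below; it has to stay irq-safe no matter what the global lock does
because update paths can take it from irq context.)

raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
unsigned long flags;

/* taken by both the flush walk and irq-context update paths */
raw_spin_lock_irqsave(cpu_lock, flags);
/* ... pop and flush this cpu's updated tree ... */
raw_spin_unlock_irqrestore(cpu_lock, flags);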

So, in the rstat case, the choice is, at least to me, obvious, but even for
more generic cases where the bulk of actual work isn't done w/ irq disabled,
I don't think the picture is as simple as "use the least protected variant
possible" anymore because the underlying hardware changed.

For an SMP kernel running on a UP system, "the least protected variant" is
the obvious choice to make because you don't lose anything by holding a
spinlock longer than necessary. However, as you increase the number of
CPUs, a tradeoff arises between local irq servicing latency and global
lock contention.

Imagine a, say, 128 cpu system with a few cores servicing relatively high
frequency interrupts. Let's say there's a mildly hot lock. Usually, it
shows up in the system profile, but only just. Now let's say something
happens and the irq rate on those cores goes up to the point where,
whenever the lock is held on one of those cpus, irqs are likely to
intervene and lengthen how long the lock is held, sometimes significantly.
Because the lock is now held for much longer on average, it becomes a lot
hotter, as more CPUs stall on it, and depending on luck or lack thereof
these stalls can span many CPUs on the system for quite a while. This is
actually something we saw in production.

So, in general, there's a tradeoff between local irq service latency and
inducing global lock contention when using unprotected locks. With more and
more CPUs, the balance keeps shifting. The balance still very much depends
on the specifics of a given lock but yeah I think it's something we need to
be a lot more careful about now.

Thanks.

--
tejun
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock
On Wed, 29 Mar 2023, Tejun Heo wrote:

> Hello, Hugh. How have you been?
>
> On Wed, Mar 29, 2023 at 12:22:24PM -0700, Hugh Dickins wrote:
> > Hi Tejun,
> > Butting in here, I'm fascinated. This is certainly not my area, I know
> > nothing about rstat, but this is the first time I ever heard someone
> > arguing for more disabling of interrupts rather than less.
> >
> > An interrupt coming in while holding a contended resource can certainly
> > add to latencies, that I accept of course. But until now, I thought it
> > was agreed best practice to disable irqs only regretfully, when strictly
> > necessary.
> >
> > If that has changed, I for one want to know about it. How should we
> > now judge which spinlocks should disable interrupts and which should not?
> > Page table locks are currently my main interest - should those be changed?
>
> For rstat, it's a simple case because the global lock here wraps around
> per-cpu locks which have to be irq-safe, so the only difference between
> making the global lock irq-unsafe and keeping it irq-safe but releasing
> it in between CPUs is:
>
> Global lock held: G
> IRQ disabled: I
> Percpu lock held: P
>
> 1. IRQ unsafe
>
> GGGGGGGGGGGGGGG~~GGGGG
> IIII IIII IIII ~~ IIII
> PPPP PPPP PPPP ~~ PPPP
>
> 2. IRQ safe, released in between cpus
>
> GGGG GGGG GGGG ~~ GGGG
> IIII IIII IIII ~~ IIII
> PPPP PPPP PPPP ~~ PPPP
>
> #2 seems like the obvious thing to do here given how the lock is used and
> each P section may take a bit of time.

Many thanks for the detailed response. I'll leave it to the rstat folks
to agree or disagree with your analysis there.

>
> So, in the rstat case, the choice is, at least to me, obvious, but even for
> more generic cases where the bulk of actual work isn't done w/ irq disabled,
> I don't think the picture is as simple as "use the least protected variant
> possible" anymore because the underlying hardware changed.
>
> For an SMP kernel running on a UP system, "the least protected variant" is
> the obvious choice to make because you don't lose anything by holding a
> spinlock longer than necessary. However, as you increase the number of
> CPUs, a tradeoff arises between local irq servicing latency and global
> lock contention.
>
> Imagine a, say, 128 cpu system with a few cores servicing relatively high
> frequency interrupts. Let's say there's a mildly hot lock. Usually, it
> shows up in the system profile, but only just. Now let's say something
> happens and the irq rate on those cores goes up to the point where,
> whenever the lock is held on one of those cpus, irqs are likely to
> intervene and lengthen how long the lock is held, sometimes significantly.
> Because the lock is now held for much longer on average, it becomes a lot
> hotter, as more CPUs stall on it, and depending on luck or lack thereof
> these stalls can span many CPUs on the system for quite a while. This is
> actually something we saw in production.
>
> So, in general, there's a tradeoff between local irq service latency and
> inducing global lock contention when using unprotected locks. With more and
> more CPUs, the balance keeps shifting. The balance still very much depends
> on the specifics of a given lock but yeah I think it's something we need to
> be a lot more careful about now.

And this looks a very plausible argument to me: I'll let it sink in.

But I hadn't heard that the RT folks were clamouring for more irq disabling:
perhaps they partition their machines with more care, and are not devotees
of high CPU counts.

What I hope is that others will chime in one way or the other -
it does sound as if a reappraisal of the balances is overdue.

Thanks,
Hugh (disabling interrupts for as long as he can)
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock
Thanks for a great discussion, Tejun and Hugh.

On Wed, Mar 29, 2023 at 1:38 PM Hugh Dickins <hughd@google.com> wrote:
>
> On Wed, 29 Mar 2023, Tejun Heo wrote:
>
> > Hello, Hugh. How have you been?
> >
> > On Wed, Mar 29, 2023 at 12:22:24PM -0700, Hugh Dickins wrote:
> > > Hi Tejun,
> > > Butting in here, I'm fascinated. This is certainly not my area, I know
> > > nothing about rstat, but this is the first time I ever heard someone
> > > arguing for more disabling of interrupts rather than less.
> > >
> > > An interrupt coming in while holding a contended resource can certainly
> > > add to latencies, that I accept of course. But until now, I thought it
> > > was agreed best practice to disable irqs only regretfully, when strictly
> > > necessary.
> > >
> > > If that has changed, I for one want to know about it. How should we
> > > now judge which spinlocks should disable interrupts and which should not?
> > > Page table locks are currently my main interest - should those be changed?
> >
> > For rstat, it's a simple case because the global lock here wraps around
> > per-cpu locks which have to be irq-safe, so the only difference between
> > making the global lock irq-unsafe and keeping it irq-safe but releasing
> > it in between CPUs is:
> >
> > Global lock held: G
> > IRQ disabled: I
> > Percpu lock held: P
> >
> > 1. IRQ unsafe
> >
> > GGGGGGGGGGGGGGG~~GGGGG
> > IIII IIII IIII ~~ IIII
> > PPPP PPPP PPPP ~~ PPPP
> >
> > 2. IRQ safe, released in between cpus
> >
> > GGGG GGGG GGGG ~~ GGGG
> > IIII IIII IIII ~~ IIII
> > PPPP PPPP PPPP ~~ PPPP
> >
> > #2 seems like the obvious thing to do here given how the lock is used and
> > each P section may take a bit of time.
>
> Many thanks for the detailed response. I'll leave it to the rstat folks
> to agree or disagree with your analysis there.

Thanks for the analysis, Tejun; it does indeed make sense. I perf'd
releasing and reacquiring the lock at each CPU boundary and the
overhead seems to be minimal. It would be higher with contention, but
all memcg flushers should be held back by the memcg code, and flushers
outside memcg are not frequent (reading blkcg and cpu base stats from
user space, and when a cgroup is being removed).

I realized that after v2 of this patch series [1], we would only end
up with two atomic flushing contexts, mem_cgroup_wb_stats() and
mem_cgroup_usage(). The latter is already disabling irqs for other
reasons, so anything we do within the rstat core code doesn't really
help; it needs to be addressed separately. So only the call site in
mem_cgroup_wb_stats() would benefit from not having irqs disabled
throughout the flush.

I will hold off on sending a patch until I observe that this call site
is causing us pain and/or other atomic call sites emerge (or we have
to revert one of the ones we made non-atomic), so that we don't hurt
other flushers unnecessarily. Does this make sense to you?

[1] https://lore.kernel.org/linux-mm/20230328221644.803272-1-yosryahmed@google.com/

>
> >
> > So, in the rstat case, the choice is, at least to me, obvious, but even for
> > more generic cases where the bulk of actual work isn't done w/ irq disabled,
> > I don't think the picture is as simple as "use the least protected variant
> > possible" anymore because the underlying hardware changed.
> >
> > For an SMP kernel running on a UP system, "the least protected variant" is
> > the obvious choice to make because you don't lose anything by holding a
> > spinlock longer than necessary. However, as you increase the number of
> > CPUs, a tradeoff arises between local irq servicing latency and global
> > lock contention.
> >
> > Imagine a, say, 128 cpu system with a few cores servicing relatively high
> > frequency interrupts. Let's say there's a mildly hot lock. Usually, it
> > shows up in the system profile, but only just. Now let's say something
> > happens and the irq rate on those cores goes up to the point where,
> > whenever the lock is held on one of those cpus, irqs are likely to
> > intervene and lengthen how long the lock is held, sometimes significantly.
> > Because the lock is now held for much longer on average, it becomes a lot
> > hotter, as more CPUs stall on it, and depending on luck or lack thereof
> > these stalls can span many CPUs on the system for quite a while. This is
> > actually something we saw in production.
> >
> > So, in general, there's a tradeoff between local irq service latency and
> > inducing global lock contention when using unprotected locks. With more and
> > more CPUs, the balance keeps shifting. The balance still very much depends
> > on the specifics of a given lock but yeah I think it's something we need to
> > be a lot more careful about now.
>
> And this looks a very plausible argument to me: I'll let it sink in.
>
> But I hadn't heard that the RT folks were clamouring for more irq disabling:
> perhaps they partition their machines with more care, and are not devotees
> of high CPU counts.
>
> What I hope is that others will chime in one way or the other -
> it does sound as if a reappraisal of the balances is overdue.
>
> Thanks,
> Hugh (disabling interrupts for as long as he can)
Re: [RFC PATCH 1/7] cgroup: rstat: only disable interrupts for the percpu lock
Hello, Hugh.

On Wed, Mar 29, 2023 at 01:38:48PM -0700, Hugh Dickins wrote:
> > So, in general, there's a tradeoff between local irq service latency and
> > inducing global lock contention when using unprotected locks. With more and
> > more CPUs, the balance keeps shifting. The balance still very much depends
> > on the specifics of a given lock but yeah I think it's something we need to
> > be a lot more careful about now.
>
> And this looks a very plausible argument to me: I'll let it sink in.

Another somewhat relevant change is that flipping irq on/off used to be
relatively expensive on older x86 cpus. I forget all details about when and
how much but they should be a lot cheaper now. No idea about !x86 cpus tho.

> But I hadn't heard that the RT folks were clamouring for more irq disabling:
> perhaps they partition their machines with more care, and are not devotees
> of high CPU counts.

I think RT folks care a lot more about raw IRQ disables; the non-raw
spinlock irq variants shouldn't actually disable IRQs on RT kernels.

Thanks.

--
tejun
