
[PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello,

tl;dr
=====

Unbound workqueues used to spray work items inside each NUMA node, which
isn't great on CPUs w/ multiple L3 caches. This patchset implements
mechanisms to improve and configure execution locality.

Long patchset, so a bit of traffic control:

PeterZ
------

It uses p->wake_cpu to best-effort migrate workers on wake-ups. Migrating on
wake-ups fits the bill: it's cheaper, and it fits the semantics in that
workqueue is hinting the scheduler which CPU has the most continuity rather
than forcing a specific migration. However, p->wake_cpu is currently only
used in the scheduler proper, so this would be introducing a new way of using
the field.

Please see 0021 and 0024.
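
To make the intent concrete, here is a minimal, hedged sketch of the idea
(mine, not the exact code in 0021/0024): before waking an idle worker,
workqueue points p->wake_cpu at a CPU near the work item so the scheduler
starts its placement search there while remaining free to go elsewhere. The
helper name is made up for illustration and p->wake_cpu only exists under
CONFIG_SMP.

#include <linux/cpumask.h>
#include <linux/sched.h>

/* Illustrative only: hint the scheduler, don't force a migration. */
static void wake_worker_near(struct task_struct *worker, int hint_cpu)
{
	if (cpu_online(hint_cpu))
		worker->wake_cpu = hint_cpu;	/* best-effort locality hint */
	wake_up_process(worker);		/* scheduler may still place it elsewhere */
}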

Linus, PeterZ
-------------

Most of the patchset is workqueue-internal plumbing and probably isn't
terribly interesting. However, the performance picture turned out less
straightforward than I had hoped, most likely due to loss of
work-conservation from the scheduler in high fan-out scenarios. I'll describe
it in this cover letter. Please read on.

In terms of patches, 0021-0024 are probably the interesting ones.

Brian Norris, Nathan Huckleberry and others experiencing wq perf problems
-------------------------------------------------------------------------

Can you please test this patchset and see whether the performance problems
are resolved? After the patchset, unbound workqueues default to
soft-affining on cache boundaries, which should hopefully resolve the issues
that you guys have been seeing on recent kernels on heterogeneous CPUs.

If you want to try different settings, please read:

https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/tree/Documentation/core-api/workqueue.rst?h=affinity-scopes-v1&id=e8f3505e69a526cc5fe40a4da5d443b7f9231016#n350

Alasdair Kergon, Mike Snitzer, DM folks
---------------------------------------

I ran fio on top of dm-crypt to compare the performance of different
configurations. It mostly behaved as expected but please let me know if
anything doesn't look right. Also, DM_CRYPT_SAME_CPU can now be implemented
by applying a strict "cpu" scope to the unbound workqueue, and it would make
sense to add WQ_SYSFS to the kcryptd workqueue so that users can tune the
settings on the fly.
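
As a rough, hedged sketch (not a dm-crypt patch), that could look something
like the following; the attrs field and constant names (affn_scope,
affn_strict, WQ_AFFN_CPU) follow this series and may still change:

#include <linux/cpumask.h>
#include <linux/workqueue.h>

/* Sketch: unbound kcryptd with WQ_SYSFS for runtime tuning; emulate
 * DM_CRYPT_SAME_CPU by applying a strict per-CPU affinity scope. */
static struct workqueue_struct *alloc_kcryptd_same_cpu(const char *devname)
{
	struct workqueue_struct *wq;
	struct workqueue_attrs *attrs;

	wq = alloc_workqueue("kcryptd/%s", WQ_UNBOUND | WQ_SYSFS | WQ_MEM_RECLAIM,
			     num_online_cpus(), devname);
	attrs = alloc_workqueue_attrs();
	if (!wq || !attrs)
		goto out;

	attrs->affn_scope = WQ_AFFN_CPU;	/* every CPU in its own scope */
	attrs->affn_strict = true;		/* confined workers: same-CPU semantics */
	apply_workqueue_attrs(wq, attrs);
out:
	free_workqueue_attrs(attrs);		/* NULL-safe */
	return wq;
}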

Lai
---

I'd really appreciate it if you could go over the series. 0003, 0007, 0009,
0018-0021, 0023-0024 are the interesting ones. 0003, 0019, 0020 and 0024 are
the potentially risky ones. I've been testing them pretty extensively for
over a week but the series can really use your eyes.


Introduction
============

Recently, IIRC with ChromeOS moving to a newer kernel, there has been a
flurry of reports of unbound workqueues showing high execution latency and
low bandwidth. AFAICT, all were on heterogeneous CPUs with multiple L3
caches. Please see the discussions in the following threads.

https://lkml.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
https://lkml.kernel.org/r/ZFvpJb9Dh0FCkLQA@google.com

While details from the second thread indicate that the problem is unlikely
to originate solely from workqueue, workqueue does have an underlying
problem that can be amplified through behavior changes in other subsystems.

Being an async execution mechanism, workqueue inherently disconnects issuer
and executor. Worker pools are shared across workqueues whenever possible,
further severing the connection. While this disconnection is true for all
workqueues, for per-cpu workqueues it doesn't harm locality because everything
still ends up on the same CPU. For unbound workqueues though, we end up
spraying work items across CPUs, which blows away all locality.

This was already a noticeable issue with NUMA machines and 4c16bd327c74
("workqueue: implement NUMA affinity for unbound workqueues") implemented
NUMA support which segments unbound workqueues on NUMA node boundaries so
that each subset that's contained within a NUMA node is served by a separate
worker pool. Work items still get sprayed but they get sprayed only inside
their NUMA node.

This has been mostly fine but CPUs became a lot more complex with many more
cores and multiple L3 caches inside a single node with differing distances
across them, and it looks like it's high time to improve workqueue's
locality awareness.


Design
======

There are two basic ideas.

1. Expand the segmenting so that unbound workqueues can be segmented on
different boundaries including L3 caches.

2. The knowledge about CPU topology is already in the scheduler. Instead of
trying to steer things on its own, workqueue can inform the scheduler of
the to-be-executed work item's locality and get out of its way. (Linus's
suggestion)

#2 is nice in that it takes workqueue out of the business that it's not
good at. Even with #2, #1 is likely still useful as persistent locality of
workers affects cache hit ratio and memory distance on stack and shared data
structures.

Given the above, it could be that workqueue can do something like the
following and keep everyone more or less happy:

* Segment unbound workqueues according to L3 cache boundaries
(cpus_share_cache()).

* Don't hard-affine workers but make a best-effort attempt to locate them on
the same CPUs as the work items they are about to execute.

This way, the hot workers would stay mostly affine to the L3 cache domain
they're attached to and the scheduler would have the full knowledge of the
preferred locality when starting execution of a work item. As the workers
aren't strictly confined, the scheduler is free to move them outside that L3
cache domain or even across NUMA nodes according to the hardware
configuration and load situation. In short, we get both locality and
work-conservation.
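
As a reference point, here is a minimal, untested sketch (mine, not code from
the series) of what segmenting on cpus_share_cache() boundaries boils down
to; the pod_of[] array and helper name are purely illustrative and the actual
series builds per-scope cpumasks and worker pools instead:

#include <linux/cpumask.h>
#include <linux/sched/topology.h>

/* Illustrative only: group online CPUs into "pods" so that two CPUs end up
 * in the same pod iff cpus_share_cache() says they share an LLC. */
static int pod_of[NR_CPUS];

static void build_cache_pods(void)
{
	int cpu, peer, npods = 0;

	for_each_online_cpu(cpu)
		pod_of[cpu] = -1;

	for_each_online_cpu(cpu) {
		if (pod_of[cpu] >= 0)
			continue;
		pod_of[cpu] = npods;
		for_each_online_cpu(peer)
			if (pod_of[peer] < 0 && cpus_share_cache(cpu, peer))
				pod_of[peer] = npods;
		npods++;
	}
}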

Unfortunately, this plan didn't work out due to the pronounced inverse
correlation between locality and work-conservation, which will be
illustrated in the following "Results" section. As this meant that there is
not going to be one good enough behavior for everyone, I ended up
implementing a number of different behaviors that can be selected
dynamically per-workqueue.

There are three parameters:

* Affinity scope: Determines how unbound workqueues are segmented. Five
possible values.

CPU: Every CPU in its own scope. The unbound workqueue behaves like a
per-cpu one.

SMT: SMT siblings in the same scope.

CACHE: cpus_share_cache() defines the scopes, usually on L3 boundaries.

NUMA: NUMA nodes define the scopes. This is the same behavior as before
this patchset.

SYSTEM: A single scope across the whole system.

* Affinity strictness: If a scope is strict, every worker is confined inside
its scope and the scheduler can't move it outside. If non-strict, a worker
is brought back into its scope (called repatriation) when it starts to
execute a work item but then can be moved outside the scope if the
scheduler deems that the better course of action.

* Localization: If a workqueue enables localization, a worker is moved to
the CPU that issued the work item at the start of execution. Afterwards,
the scheduler can migrate the task according to the above affinity scope
and strictness settings. Interestingly, this turned out to be not very
useful in practice. It's implemented in the last two patches but they
aren't for upstream inclusion, at least for now. See the "Results"
section.

Combined, they allow a wide variety of configurations. It's easy to emulate
WQ_CPU_INTENSIVE per-cpu workqueues or the behavior before this patchset.
The settings can be changed anytime through both apply_workqueue_attrs() and
the sysfs interface allowing easy and flexible tuning.

Let's see how different configurations perform.


Experiment Setup
================

Here's the relevant part of the experiment setup.

* Ryzen 9 3900x - 12 cores / 24 threads spread across 4 L3 caches.
Core-to-core latencies across L3 caches are ~2.6x worse than within each
L3 cache. ie. it's worse but not hugely so. This means that the impact of
L3 cache locality is noticeable in these experiments but may be subdued
compared to other setups.

* The CPU's clock boost is disabled, so that all the clocks run very
close to the stock 3.8GHz for test consistency.

* dm-crypt setup with default parameters on Samsung 980 PRO SSD.

* Each combination is run five times.

Three fio scenarios are run directly against the dm-crypt device.

HIGH: Enough issuers and work spread across the machine

fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
--ioengine=libaio --iodepth=64 --runtime=60 --numjobs=24 --time_based \
--group_reporting --name=iops-test-job --verify=sha512

MED: Fewer issuers, enough work for saturation

fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
--ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 --time_based \
--group_reporting --name=iops-test-job --verify=sha512

LOW: Even fewer issuers, not enough work to saturate

fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
--ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 --time_based \
--group_reporting --name=iops-test-job --verify=sha512

They are the same except for the number of issuers. The issuers generate
data each time, write it, read it back and then verify the content. Each
issuer can have 64 IOs in flight.

Five workqueue configurations are used:

SYSTEM A single scope across the system. All workers belong to the
same pool and work items are sprayed all over. On this CPU,
this is the same behavior as before this patchset.

CACHE Four scopes, each serving an L3 cache domain. The scopes are
non-strict, so workers start inside their scopes but the
scheduler can move them outside as needed.

CACHE+STRICT The same as CACHE but the scopes are strict. Workers are
confined within their scopes.

SYSTEM+LOCAL A single scope across the system but with localization
turned on. A worker is moved to the issuing CPU of the work
item it's about to execute.

CACHE+LOCAL Four scopes, each serving an L3 cache domain, with localization
turned on. A worker is moved to the issuing CPU of the work
item it's about to execute and is a lot more likely to
be already inside the matching L3 cache domain.


Results
=======

1. HIGH: Enough issuers and work spread across the machine

Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
----------------------------------------------------------------------
SYSTEM 1159.40 ±1.34 99.31 ±0.02 0.00
CACHE 1166.40 ±0.89 99.34 ±0.01 +0.60
CACHE+STRICT 1166.00 ±0.71 99.35 ±0.01 +0.57
SYSTEM+LOCAL 1157.40 ±1.67 99.50 ±0.01 -0.17
CACHE+LOCAL 1160.40 ±0.55 99.50 ±0.02 +0.09

The differences are small. Here are some observations.

* CACHE and CACHE+STRICT are clearly performing better. The difference is
not big but consistent, which isn't too surprising given that these caches
are pretty close. The CACHE ones are doing more work for the same number
of CPU cycles compared to SYSTEM.

This scenario is the best case for CACHE+STRICT, so it's encouraging that
the non-strict variant can match.

* SYSTEM/CACHE+LOCAL aren't doing horribly but aren't doing better either.
They burned slightly more CPU, possibly from the scheduler needing to move
the workers away more frequently.


2. MED: Fewer issuers, enough work for saturation

Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
----------------------------------------------------------------------
SYSTEM 1155.40 ±0.89 97.41 ±0.05 0.00
CACHE 1154.40 ±1.14 96.15 ±0.09 -0.09
CACHE+STRICT 1112.00 ±4.64 93.26 ±0.35 -3.76
SYSTEM+LOCAL 1066.80 ±2.17 86.70 ±0.10 -7.67
CACHE+LOCAL 1034.60 ±1.52 83.00 ±0.07 -10.46

There are still eight issuers and plenty of work to go around. However, now,
depending on the configuration, we're starting to see a significant loss of
work-conservation where CPUs sit idle while there's work to do.

* CACHE is doing okay. It's just a bit slower. Further testing may be needed
to definitively confirm the bandwidth gap but the CPU util difference
seems real, so while minute, it did lose a bit of work-conservation.
Comparing it to CACHE+STRICT, it's starting to show the benefits of
non-strict scopes.

* CACHE+STRICT's lower bw and utilization make sense. It'd be easy to
mis-locate issuers so that some cache domains are temporarily underloaded.
With the scheduler not being allowed to move tasks around, there will be some
utilization bubbles.

* SYSTEM/CACHE+LOCAL are not doing great with significantly lower
utilizations. There is more than enough work to do in the system
(changing --numjobs to 6, for example, doesn't change the picture much for
SYSTEM) but the kernel is seemingly unable to dispatch workers to idle
CPUs fast enough.


3. LOW: Even fewer issuers, not enough work to saturate

Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
----------------------------------------------------------------------
SYSTEM 993.60 ±1.82 75.49 ±0.06 0.00
CACHE 973.40 ±1.52 74.90 ±0.07 -2.03
CACHE+STRICT 828.20 ±4.49 66.84 ±0.29 -16.65
SYSTEM+LOCAL 884.20 ±18.39 64.14 ±1.12 -11.01
CACHE+LOCAL 860.60 ±4.72 61.46 ±0.24 -13.39

Now, there isn't enough work to be done to saturate the whole system. This
amplifies the pattern which started to show up in the MED scenario.

* CACHE is now noticeably slower but not crazily so.

* CACHE+STRICT shows the severe work-conservation limitation of strict
scopes.

* SYSTEM/CACHE+LOCAL show the same pattern as in the MED scenario but with
larger work-conservation deficits.


Conclusions
===========

The tradeoff between efficiency gains from improved locality and bandwidth
loss from work-conservation deficit poses a conundrum. The scenarios used
for testing, while limited, are plausible representations of what heavy
workqueue usage might look like in practice, and they clearly show that
there is no one straight-forward option which can cover all cases.

* The efficiency advantage of CACHE over SYSTEM is, while consistent and
noticeable, small. However, the impact is dependent on the distances
between the scopes and may be more pronounced in processors with more
complex topologies.

While the loss of work-conservation in certain scenarios hurts, it is a
lot better than CACHE+STRICT and maximizing workqueue utilization is
unlikely to be the common case anyway. As such, CACHE is the default
affinity scope for unbound pools.

There's some risk to this decision and we might have to reassess depending
on how this works out in the wild.

* In all three scenarios, +LOCAL didn't show any advantage. There may be
scenarios where the boost from L1/2 cache locality is visible but the
severe work-conservation deficits really hurt.

I haven't looked into it too deeply here but the behavior where CPUs
sit idle while there's work to do is in line with what we (Meta) have
been seeing with CFS across multiple workloads and I believe Google has
similar experiences. Tuning load balancing and migration costs through
debugfs helps but not fully. Deeper discussion on this topic is a bit out
of scope here and there are on-going efforts on this front, so we can
revisit later.

So, at least in the current state, +LOCAL doesn't make sense. Let's table
it for now. I put the two patches to implement it at the end of the series
w/o SOB.


PATCHES
=======

This series contains the following patches.

0001-workqueue-Drop-the-special-locking-rule-for-worker-f.patch
0002-workqueue-Cleanups-around-process_scheduled_works.patch
0003-workqueue-Not-all-work-insertion-needs-to-wake-up-a-.patch
0004-workqueue-Rename-wq-cpu_pwqs-to-wq-cpu_pwq.patch
0005-workqueue-Relocate-worker-and-work-management-functi.patch
0006-workqueue-Remove-module-param-disable_numa-and-sysfs.patch
0007-workqueue-Use-a-kthread_worker-to-release-pool_workq.patch
0008-workqueue-Make-per-cpu-pool_workqueues-allocated-and.patch
0009-workqueue-Make-unbound-workqueues-to-use-per-cpu-poo.patch
0010-workqueue-Rename-workqueue_attrs-no_numa-to-ordered.patch
0011-workqueue-Rename-NUMA-related-names-to-use-pod-inste.patch
0012-workqueue-Move-wq_pod_init-below-workqueue_init.patch
0013-workqueue-Initialize-unbound-CPU-pods-later-in-the-b.patch
0014-workqueue-Generalize-unbound-CPU-pods.patch
0015-workqueue-Add-tools-workqueue-wq_dump.py-which-print.patch
0016-workqueue-Modularize-wq_pod_type-initialization.patch
0017-workqueue-Add-multiple-affinity-scopes-and-interface.patch
0018-workqueue-Factor-out-work-to-worker-assignment-and-c.patch
0019-workqueue-Factor-out-need_more_worker-check-and-work.patch
0020-workqueue-Add-workqueue_attrs-__pod_cpumask.patch
0021-workqueue-Implement-non-strict-affinity-scope-for-un.patch
0022-workqueue-Add-Affinity-Scopes-and-Performance-sectio.patch
0023-workqueue-Add-pool_workqueue-cpu.patch
0024-workqueue-Implement-localize-to-issuing-CPU-for-unbo.patch

This series is also available in the following git branch:

git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git affinity-scopes-v1

diffstat follows. Thanks.

Documentation/admin-guide/kernel-parameters.txt | 21
Documentation/core-api/workqueue.rst | 379 ++++++++++-
include/linux/workqueue.h | 82 ++
init/main.c | 1
kernel/workqueue.c | 1613 +++++++++++++++++++++++++++-----------------------
kernel/workqueue_internal.h | 2
tools/workqueue/wq_dump.py | 180 +++++
tools/workqueue/wq_monitor.py | 29
8 files changed, 1525 insertions(+), 782 deletions(-)
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Thu, May 18, 2023 at 5:17 PM Tejun Heo <tj@kernel.org> wrote:
>
> Most of the patchset is workqueue-internal plumbing and probably isn't
> terribly interesting. However, the performance picture turned out less
> straightforward than I had hoped, most likely due to loss of
> work-conservation from the scheduler in high fan-out scenarios. I'll
> describe it in this cover letter. Please read on.

So my reaction here is that I think your benchmarking was about
throughput, but the recent changes that triggered this discussion were
about latency for random small stuff.

Maybe your "LOW" tests might be close to that, but looking at that fio
benchmark line you quoted, I don't think so.

IOW, I think that what the fsverity code ended up seeing was literally
*serial* IO that was fast enough that it was better done on the local
CPU immediately, and that that was the reason for why it wanted to
remove WQ_UNBOUND.

IOW, I think you should go even lower than your "LOW", and test
basically "--iodepth=1" to a ramdisk. A load where scheduling to any
other CPU is literally *always* a mistake, because the IO is basically
entirely synchronous, and it's better to just do the work on the same
CPU and be done with it.

That may sound like an outlier thing, but I don't think it's
necessarily even all that odd. I think that "depth=1" is likely the
common case for many real loads.

That commit f959325e6ac3 ("fsverity: Remove WQ_UNBOUND from fsverity
read workqueue") really talks about startup costs. They are about
things like "page in the executable", which is all almost 100%
serialized with no parallelism at all. Even read-ahead ends up being
serial, in that it's likely one single contiguous IO.

Yes, latency tends to be harder to benchmark than throughput, but I
really think latency trumps throughput 95% of the time. And all your
benchmark loads looked like throughput loads to me: they just weren't
using *all* the CPU capacity you had.

Yes, writeback can have lovely throughput behavior and saturate the IO
because you have lots of parallelism. But reads are often 100% serial
for one thread, and often you don't *have* more than one thread.

So I think your "not enough work to saturate" is still ludicrously
over-the-top. You should not aim for "not enough work to saturate 24
threads". You should aim for "basically completely single-threaded".
Judging by your "CPU utilization of 60-70%", I think your "LOW" is off
by at least an order of magnitude.

Linus
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello, Linus.

On Thu, May 18, 2023 at 05:41:29PM -0700, Linus Torvalds wrote:
...
> That commit f959325e6ac3 ("fsverity: Remove WQ_UNBOUND from fsverity
> read workqueue") really talks about startup costs. They are about
> things like "page in the executable", which is all almost 100%
> serialized with no parallelism at all. Even read-ahead ends up being
> serial, in that it's likely one single contiguous IO.

I should have explained my thought process better here. I don't think the
fsverity and other similar recent reports on heterogeneous ARM CPUs are
caused directly by workqueue. Please take a look at the following message
from Brian Norris.

https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@google.com/T/#u

While it's difficult to tell anything definitive from the message, the
reported performance ordering is

4.19 > pinning worker to a CPU >> fixating CPU freq > kthread_worker

where the difference between 4.19 and pinning to one CPU is pretty small.
So, it does line up with other reports in that the source of the higher
latencies and lower performance is work items getting sprayed across
CPUs.

However, the two kernel versions tested are 4.19 and 5.15. I audited the
commits in-between and didn't spot anything which would materially change
unbound workers' affinities or how unbound work items would be assigned to
them.

Also, f959325e6ac3 ("fsverity: Remove WQ_UNBOUND from fsverity read
workqueue") is reporting 30 fold increase in scheduler latency, which I take
to be the time from a work item being queued to the start of execution. That's
unlikely to be just from crossing a cache boundary. There must be other
interactions (e.g. with some powersaving state transitions).

That said, workqueue, by spraying work items across cache boundaries, does
provide a condition in which this sort of problem can be significantly
amplified. workqueue isn't and can't fix the root cause of these problems;
however,

* The current workqueue behavior is silly on machines with multiple L3
caches such as recent AMD CPUs w/ chiplets and heterogeneous ARM ones.
Improving workqueue locality is likely to lessen the severity of the
recently reported latency problems, possibly to the extent that it won't
matter anymore.

* The fact that the remedy people are going for is switching to percpu
workqueue is bad. That is going to hurt other use cases, often severely,
and their only solution would be reverting back to unbound workqueues.

So, my expectation with the posted patchset is that most of the severe
chrome problems will go away, hopefully. Even in the case that that's not
sufficient, unbound workqueues now provide enough flexibility to work around
these problems without resorting to switching to per-cpu workqueues thus
avoiding affecting other use cases negatively.

The performance benchmarks are not directed towards the reported latency
problems. The reported magnitude seems very hardware specific to me and I
have no way of reproducing it. The benchmarks are more to justify switching the
default boundaries from NUMA to cache.

> Yes, latency tends to be harder to benchmark than throughput, but I
> really think latency trumps throughput 95% of the time. And all your
> benchmark loads looked like throughput loads to me: they just weren't
> using *all* the CPU capacity you had.
>
> Yes, writeback can have lovely throughput behavior and saturate the IO
> because you have lots of parallelism. But reads are often 100% serial
> for one thread, and often you don't *have* more than one thread.
>
> So I think your "not enough work to saturate" is still ludicrously
> over-the-top. You should not aim for "not enough work to saturate 24
> threads". You should aim for "basically completely single-threaded".
> Judging by your "CPU utilization of 60-70%", I think your "LOW" is off
> by at least an order of magnitude.


More Experiments
================

I ran several more test sets. An extra workqueue configuration CPU+STRICT is
added which is very similar to using CPU_INTENSIVE per-cpu workqueues. Also,
CPU util is now per-single-CPU, ie. 100% indicates using one full CPU rather
than all CPUs because otherwise the numbers were too small. The tests are
run against dm-crypt on a tmpfs backed loop device.


4. LLOW

Similar benchmark as before but --bs is now 512 bytes and --iodepth and
--numjobs are dialed down to 4 and 1, respectively. Compared to LOW, both IO
size and concurrency are 64 times lower.

taskset 0x8 fio --filename=/dev/dm-1 --direct=1 --rw=randrw --bs=512 \
--ioengine=libaio --iodepth=4 --runtime=60 --numjobs=1 \
--time_based --group_reporting --name=iops-test-job --verify=sha512

Note that fio is pinned to one CPU because otherwise I was getting random
bi-modal behaviors depending on what the scheduler was doing, making
comparisons difficult.

Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
----------------------------------------------------------------------
SYSTEM 55.08 ±0.29 290.80 ±0.64 0.00
CACHE 64.42 ±0.47 291.51 ±0.30 +16.96
CACHE+STRICT 65.18 ±1.14 303.74 ±1.79 +18.34
CPU+STRICT 32.86 ±0.34 159.05 ±0.37 -48.99
SYSTEM+LOCAL 56.76 ±0.30 286.78 ±0.50 +3.05
CACHE+LOCAL 57.74 ±0.11 291.65 ±0.80 +4.83

The polarities of +LOCAL's relative BWs flipped, showing gains over SYSTEM.
However, the fact that they're significantly worse than CACHE didn't change.

This is 4 in-flight 512 byte IOs, entirely plausible in systems of any size.
Even at this low level of concurrency, the downside of using per-cpu
workqueue (CPU+STRICT) is clear.


5. SYNC

Let's push it further. It's now single threaded synchronous read/write(2)'s
w/ 512 byte blocks. If we're going to be able to extract meaningful gains
from localizing to the issuing CPU, this should show it.

taskset 0x8 fio --filename=/dev/dm-1 --direct=1 --rw=randrw --bs=512 \
--ioengine=io_uring --iodepth=1 --runtime=60 --numjobs=1 \
--time_based --group_reporting --name=iops-test-job --verify=sha512

Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
----------------------------------------------------------------------
SYSTEM 18.64 ±0.19 88.41 ±0.47 0.00
CACHE 21.46 ±0.11 91.39 ±0.26 +15.13
CACHE+STRICT 20.80 ±0.23 90.68 ±0.15 +11.59
CPU+STRICT 21.12 ±0.61 95.29 ±0.51 -1.58
SYSTEM+LOCAL 20.80 ±0.20 90.12 ±0.34 +11.59
CACHE+LOCAL 21.46 ±0.09 93.18 ±0.04 +15.13

Now +LOCAL's are doing quite a bit better than SYSTEM but still can't beat
CACHE. Interestingly, CPU+STRICT performs noticeably worse than others while
occupying the CPU for longer. I haven't thought too much about why this would
be; maybe the benefits of using resources on other CPUs beat the
overhead of doing so?


Summary
=======

The overall conclusion doesn't change much. +LOCAL's fare better with lower
concurrency but still can't beat CACHE. I tried other combinations and
against SSD too. Nothing significantly deviates from the overall pattern.

I'm sure we can concoct a workload which uses significantly less than one
full CPU (so that utilizing other CPUs' resources doesn't bring enough gain)
and can stay within L1/2 to maximize the benefit of not having to go through
L3, but it looks like that's going to take some stretching. At least on the
CPU I'm testing, it looks like letting loose in each L3 domain is the right
thing to do.

And, ARM folks, AFAICS, workqueue is unlikely to be the root cause of the
significant regressions observed after switching to a recent kernel. However,
this patchset should hopefully alleviate the symptoms significantly, or,
failing that, at least make working around them a lot easier. So, please test and
please don't switch CPU-heavy unbound workqueues to per-cpu. That's
guaranteed to hurt other heavier-weight usages.

Thanks.

--
tejun
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Oh, a bit of addition.

Once below saturation, latency and bw are mostly two sides of the same
coin, but just to be sure, here are latency results. The single-threaded sync
IO is run with a 1ms interval between IOs.

taskset 0x8 fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=512 \
--ioengine=sync --iodepth=1 --runtime=60 --numjobs=1 --time_based \
--group_reporting --name=iops-test-job --verify=sha512 --thinktime=1ms

SYSTEM

read: IOPS=480, BW=240KiB/s (246kB/s)(14.1MiB/60001msec)
clat (usec): min=8, max=401, avg=30.96, stdev= 9.60
lat (usec): min=8, max=401, avg=31.01, stdev= 9.60
clat percentiles (usec):
| 1.00th=[ 11], 5.00th=[ 13], 10.00th=[ 25], 20.00th=[ 27],
| 30.00th=[ 28], 40.00th=[ 29], 50.00th=[ 29], 60.00th=[ 30],
| 70.00th=[ 32], 80.00th=[ 42], 90.00th=[ 44], 95.00th=[ 44],
| 99.00th=[ 46], 99.50th=[ 46], 99.90th=[ 56], 99.95th=[ 71],
| 99.99th=[ 253]
bw ( KiB/s): min= 214, max= 265, per=99.85%, avg=240.29, stdev=11.35, samples=119
iops : min= 428, max= 530, avg=480.59, stdev=22.70, samples=119

CPU_STRICT

read: IOPS=474, BW=237KiB/s (243kB/s)(385KiB/1624msec)
clat (usec): min=9, max=240, avg=28.00, stdev=11.20
lat (usec): min=9, max=240, avg=28.05, stdev=11.20
clat percentiles (usec):
| 1.00th=[ 12], 5.00th=[ 26], 10.00th=[ 26], 20.00th=[ 26],
| 30.00th=[ 27], 40.00th=[ 28], 50.00th=[ 28], 60.00th=[ 28],
| 70.00th=[ 29], 80.00th=[ 30], 90.00th=[ 31], 95.00th=[ 31],
| 99.00th=[ 32], 99.50th=[ 50], 99.90th=[ 241], 99.95th=[ 241],
| 99.99th=[ 241]

CACHE

read: IOPS=479, BW=240KiB/s (245kB/s)(14.0MiB/60002msec)
clat (nsec): min=7874, max=75922, avg=13342.34, stdev=6906.53
lat (nsec): min=7904, max=75952, avg=13386.08, stdev=6906.60
clat percentiles (nsec):
| 1.00th=[ 8384], 5.00th=[ 8896], 10.00th=[ 9152], 20.00th=[ 9408],
| 30.00th=[ 9536], 40.00th=[ 9920], 50.00th=[10432], 60.00th=[10688],
| 70.00th=[11072], 80.00th=[13632], 90.00th=[27264], 95.00th=[28288],
| 99.00th=[30592], 99.50th=[30848], 99.90th=[41216], 99.95th=[56064],
| 99.99th=[74240]
bw ( KiB/s): min= 204, max= 269, per=99.69%, avg=239.67, stdev=11.02, samples=119
iops : min= 408, max= 538, avg=479.34, stdev=22.04, samples=119


It's a bit confusing because fio switched to printing nsecs for CACHE, but
CPU_STRICT (per-cpu)'s average completion latency is, expectedly, better
than SYSTEM's - 28us vs. 31us - while CACHE's is way better at 13.3us.

Thanks.

--
tejun
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Fri, May 19, 2023 at 4:03 PM Tejun Heo <tj@kernel.org> wrote:
>
> Oh, a bit of addition.
>
> Once below saturation, latency and bw are mostly the two sides of the same
> coin but just to be sure, here are latency results.

Ok, looks good, consider me convinced.

I still would like to hear from the actual arm crowd that caused this
all and have those odd BIG.little cases.

Linus
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:

> Most of the patchset is workqueue-internal plumbing and probably isn't
> terribly interesting. However, the performance picture turned out less
> straightforward than I had hoped, most likely due to loss of
> work-conservation from the scheduler in high fan-out scenarios. I'll
> describe it in this cover letter. Please read on.

> Here's the relevant part of the experiment setup.
>
> * Ryzen 9 3900x - 12 cores / 24 threads spread across 4 L3 caches.
> Core-to-core latencies across L3 caches are ~2.6x worse than within each
> L3 cache. ie. it's worse but not hugely so. This means that the impact of
> L3 cache locality is noticeable in these experiments but may be subdued
> compared to other setups.

*blink*, 12 cores with 4 LLCs ? that's a grand total of 3 cores / 6
threads per LLC. That's puny.

/me goes google a bit.. So these are Zen2 things which nominally have 4
cores per CCX which has 16M of L3, but these must be binned parts with only
3 functional cores per CCX.

Zen3 then went to 8 cores per CCX with double the L3.

> 2. MED: Fewer issuers, enough work for saturation
>
> Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
> ----------------------------------------------------------------------
> SYSTEM 1155.40 ±0.89 97.41 ±0.05 0.00
> CACHE 1154.40 ±1.14 96.15 ±0.09 -0.09
> CACHE+STRICT 1112.00 ±4.64 93.26 ±0.35 -3.76
> SYSTEM+LOCAL 1066.80 ±2.17 86.70 ±0.10 -7.67
> CACHE+LOCAL 1034.60 ±1.52 83.00 ±0.07 -10.46
>
> There are still eight issuers and plenty of work to go around. However, now,
> depending on the configuration, we're starting to see a significant loss of
> work-conservation where CPUs sit idle while there's work to do.
>
> * CACHE is doing okay. It's just a bit slower. Further testing may be needed
> to definitively confirm the bandwidth gap but the CPU util difference
> seems real, so while minute, it did lose a bit of work-conservation.
> Comparing it to CACHE+STRICT, it's starting to show the benefits of
> non-strict scopes.

So wakeup based placement is mostly all about LLC, and given this thing
has dinky small LLCs it will pile up on the one LLC you target and leave
the others idle until the regular idle balancer decides to make an
appearance and move some around.

But if these are fairly short running tasks, I can well see that not
going to help much.


Much of this was tuned back when there was 1 L3 per Node; something
which is still more or less true for Intel but clearly not for these
things.


The below is a bit crude and completely untested, but it might help. The
flip side of that coin is of course that people are going to complain
about how selecting a CPU is more expensive now and how this hurts their
performance :/

Basically it will try and iterate all L3s in a node; wakeup will still
refuse to cross node boundaries.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0ca13ac..ddb7f16a07a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7027,6 +7027,33 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
return idle_cpu;
}

+static int
+select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ struct sched_domain *sd_node = rcu_dereference(per_cpu(sd_node, target));
+ struct sched_group *sg;
+
+ if (!sd_node || sd_node == sd)
+ return -1;
+
+ sg = sd_node->groups;
+ do {
+ int cpu = cpumask_first(sched_group_span(sg));
+ struct sched_domain *sd_child;
+
+ sd_child = per_cpu(sd_llc, cpu);
+ if (sd_child != sd) {
+ int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
+ if ((unsigned int)i < nr_cpumask_bits)
+ return i;
+ }
+
+ sg = sg->next;
+ } while (sg != sd_node->groups);
+
+ return -1;
+}
+
/*
* Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
* the task fits. If no CPU is big enough, but there are idle ones, try to
@@ -7199,6 +7226,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if ((unsigned)i < nr_cpumask_bits)
return i;

+ if (sched_feat(SIS_NODE)) {
+ i = select_idle_node(p, sd, target);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }
+
return target;
}

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..f965cd4a981e 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_NODE, true)

/*
* Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 678446251c35..d2e0e2e496a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1826,6 +1826,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DECLARE_PER_CPU(int, sd_llc_size);
DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_node);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ca4472281c28..d94cbc2164ca 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -667,6 +667,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_size);
DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_node);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
@@ -691,6 +692,18 @@ static void update_top_cache_domain(int cpu)
per_cpu(sd_llc_id, cpu) = id;
rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);

+ while (sd && sd->parent) {
+ /*
+ * SD_NUMA is the first domain spanning nodes, therefore @sd
+ * must be the domain that spans a single node.
+ */
+ if (sd->parent->flags & SD_NUMA)
+ break;
+
+ sd = sd->parent;
+ }
+ rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
+
sd = lowest_flag_domain(cpu, SD_NUMA);
rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Tue, 23 May 2023 at 13:18, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:
>
> > Most of the patchset is workqueue-internal plumbing and probably isn't
> > terribly interesting. However, the performance picture turned out less
> > straightforward than I had hoped, most likely due to loss of
> > work-conservation from the scheduler in high fan-out scenarios. I'll
> > describe it in this cover letter. Please read on.
>
> > Here's the relevant part of the experiment setup.
> >
> > * Ryzen 9 3900x - 12 cores / 24 threads spread across 4 L3 caches.
> > Core-to-core latencies across L3 caches are ~2.6x worse than within each
> > L3 cache. ie. it's worse but not hugely so. This means that the impact of
> > L3 cache locality is noticeable in these experiments but may be subdued
> > compared to other setups.
>
> *blink*, 12 cores with 4 LLCs ? that's a grand total of 3 cores / 6
> threads per LLC. That's puny.
>
> /me goes google a bit.. So these are Zen2 things which nominally have 4
> cores per CCX which has 16M of L3, but these must be binned parts with only
> 3 functional cores per CCX.
>
> Zen3 then went to 8 cores per CCX with double the L3.
>
> > 2. MED: Fewer issuers, enough work for saturation
> >
> > Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
> > ----------------------------------------------------------------------
> > SYSTEM 1155.40 ±0.89 97.41 ±0.05 0.00
> > CACHE 1154.40 ±1.14 96.15 ±0.09 -0.09
> > CACHE+STRICT 1112.00 ±4.64 93.26 ±0.35 -3.76
> > SYSTEM+LOCAL 1066.80 ±2.17 86.70 ±0.10 -7.67
> > CACHE+LOCAL 1034.60 ±1.52 83.00 ±0.07 -10.46
> >
> > There are still eight issuers and plenty of work to go around. However, now,
> > depending on the configuration, we're starting to see a significant loss of
> > work-conservation where CPUs sit idle while there's work to do.
> >
> > * CACHE is doing okay. It's just a bit slower. Further testing may be needed
> > to definitively confirm the bandwidth gap but the CPU util difference
> > seems real, so while minute, it did lose a bit of work-conservation.
> > Comparing it to CACHE+STRICT, it's starting to show the benefits of
> > non-strict scopes.
>
> So wakeup based placement is mostly all about LLC, and given this thing
> has dinky small LLCs it will pile up on the one LLC you target and leave
> the others idle until the regular idle balancer decides to make an
> appearance and move some around.
>
> But if these are fairly short running tasks, I can well see that not
> going to help much.
>
>
> Much of this was tuned back when there was 1 L3 per Node; something
> which is still more or less true for Intel but clearly not for these
> things.
>
>
> The below is a bit crude and completely untested, but it might help. The
> flip side of that coin is of course that people are going to complain
> about how selecting a CPU is more expensive now and how this hurts their
> performance :/
>
> Basically it will try and iterate all L3s in a node; wakeup will still
> refuse to cross node boundaries.

That reminds me of some discussion about systems with a fast on-die
interconnect where we would like to run wider than the LLC at wakeup (i.e.
DIE level), something like the CLUSTER level but on the other side of
MC

Another possibility to investigate would be that each wakeup of a
worker is mostly unrelated to the previous one and only the waker
matters, so we should use -1 for prev_cpu

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..ddb7f16a07a9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7027,6 +7027,33 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> return idle_cpu;
> }
>
> +static int
> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> + struct sched_domain *sd_node = rcu_dereference(per_cpu(sd_node, target));
> + struct sched_group *sg;
> +
> + if (!sd_node || sd_node == sd)
> + return -1;
> +
> + sg = sd_node->groups;
> + do {
> + int cpu = cpumask_first(sched_group_span(sg));
> + struct sched_domain *sd_child;
> +
> + sd_child = per_cpu(sd_llc, cpu);
> + if (sd_child != sd) {
> + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> + if ((unsigned int)i < nr_cpumask_bits)
> + return i;
> + }
> +
> + sg = sg->next;
> + } while (sg != sd_node->groups);
> +
> + return -1;
> +}
> +
> /*
> * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> * the task fits. If no CPU is big enough, but there are idle ones, try to
> @@ -7199,6 +7226,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
> + if (sched_feat(SIS_NODE)) {
> + i = select_idle_node(p, sd, target);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> + }
> +
> return target;
> }
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c76bd3..f965cd4a981e 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> */
> SCHED_FEAT(SIS_PROP, false)
> SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_NODE, true)
>
> /*
> * Issue a WARN when we do multiple update_rq_clock() calls
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 678446251c35..d2e0e2e496a6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1826,6 +1826,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> DECLARE_PER_CPU(int, sd_llc_size);
> DECLARE_PER_CPU(int, sd_llc_id);
> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_node);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index ca4472281c28..d94cbc2164ca 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -667,6 +667,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> DEFINE_PER_CPU(int, sd_llc_size);
> DEFINE_PER_CPU(int, sd_llc_id);
> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_node);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> @@ -691,6 +692,18 @@ static void update_top_cache_domain(int cpu)
> per_cpu(sd_llc_id, cpu) = id;
> rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>
> + while (sd && sd->parent) {
> + /*
> + * SD_NUMA is the first domain spanning nodes, therefore @sd
> + * must be the domain that spans a single node.
> + */
> + if (sd->parent->flags & SD_NUMA)
> + break;
> +
> + sd = sd->parent;
> + }
> + rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
> +
> sd = lowest_flag_domain(cpu, SD_NUMA);
> rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
>
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Mon, May 22, 2023 at 6:51 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ok, looks good, consider me convinced.

Ugh, and I just got the erofs pull, which made me aware of how there's
*another* hack for "worker threads don't work well on Android", which
now doubled down on setting special scheduler flags for them too.

See commit cf7f2732b4b8 ("erofs: use HIPRI by default if per-cpu
kthreads are enabled"), but the whole deeper problem goes back much
further (commit 3fffb589b9a6: "erofs: add per-cpu threads for
decompression as an option").

See also

https://lore.kernel.org/lkml/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com/

I really hate how we have random drivers and filesystems doing random
workarounds for "kthread workers don't work well enough, so add random
tweaks".

> I still would like to hear from the actual arm crowd that caused this
> all and have those odd BIG.little cases.

Sandeep, any chance that you could try out Tejun's series with plain
workers, and compare it to the percpu threads?

See

https://lore.kernel.org/lkml/20230519001709.2563-1-tj@kernel.org/

for the beginning of this thread. The aim really is to try to figure
out - and hopefully fix - the fact that Android loads seem to trigger
all these random hacks that don't make any sense on a high level, and
seem to be workarounds rather than real fixes. Scheduling percpu
workers is a horrible horrible fix and not at all good in general, and
it would be much better if we could just improve workqueue behavior in
the odd Android situation.

Linus
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Tue, 2023-05-23 at 10:59 -0700, Linus Torvalds wrote:
>
> I really hate how we have random drivers and filesystems doing random
> workarounds for "kthread workers don't work well enough, so add
> random
> tweaks".

Part of this seems to be due to the way CFS works.

CFS policy seems to make sense for a lot of workloads, but
there are some cases with kworkers where the CFS policies
just don't work quite right. Unfortunately the scheduler
problem space is not all that well explored, and it isn't
clear what the desired behavior of a scheduler should be
in every case.
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
>
> Sandeep, any chance that you could try out Tejun's series with plain
> workers, and compare it to the percpu threads?
>
> See
>
> https://lore.kernel.org/lkml/20230519001709.2563-1-tj@kernel.org/
>
Hi Linus,
I will try out the series and report back with the results.

Thanks,
Sandeep.
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Tue, May 23, 2023 at 06:12:45PM +0200, Vincent Guittot wrote:

> Another possibility to investigate would be that each wakeup of a
> worker is mostly unrelated to the previous one and only the waker
> matters, so we should use -1 for prev_cpu

Tejun is actually overriding p->wake_cpu in this series to target a
specific LLC -- with the explicit purpose to keep the workers near
enough.

But the problem is that with lots of short tasks we then overload the
LLC and are not running long enough for the idle load-balancer to spread
things, leading to idle time.

And that is specific to this lots of little LLC topologies.
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Wed, 24 May 2023 at 09:35, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, May 23, 2023 at 06:12:45PM +0200, Vincent Guittot wrote:
>
> > Another possibility to investigate would be that each wakeup of a
> > worker is mostly unrelated to the previous one and only the waker
> > matters, so we should use -1 for prev_cpu
>
> Tejun is actually overriding p->wake_cpu in this series to target a
> specific LLC -- with the explicit purpose to keep the workers near
> enough.

yes, so -1 for prev_cpu was a good way to forget the irrelevant prev
CPU without trying to abuse it in order to wake up close to the waker

>
> But the problem is that with lots of short tasks we then overload the
> LLC and are not running long enough for the idle load-balancer to spread
> things, leading to idle time.

I expect workers not to pile up in the same LLC if we keep the
workqueue CPU affinity at the die level. The workers will wake up in the LLCs
of the callers and the callers should be spread across the die

>
> And that is specific to this lots of little LLC topologies.
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello,

On Tue, May 23, 2023 at 01:18:18PM +0200, Peter Zijlstra wrote:
> > * Ryzen 9 3900x - 12 cores / 24 threads spread across 4 L3 caches.
> > Core-to-core latencies across L3 caches are ~2.6x worse than within each
> > L3 cache. ie. it's worse but not hugely so. This means that the impact of
> > L3 cache locality is noticeable in these experiments but may be subdued
> > compared to other setups.
>
> *blink*, 12 cores with 4 LLCs ? that's a grand total of 3 cores / 6
> threads per LLC. That's puny.
>
> /me goes google a bit.. So these are Zen2 things which nominally have 4
> cores per CCX which has 16M of L3, but these must be binned parts with only
> 3 functional cores per CCX.

Yeah.

> Zen3 then went to 8 cores per CCX with double the L3.

Yeah, they basically doubled each L3 domain.

> > 2. MED: Fewer issuers, enough work for saturation
> >
> > Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
> > ----------------------------------------------------------------------
> > SYSTEM 1155.40 ±0.89 97.41 ±0.05 0.00
> > CACHE 1154.40 ±1.14 96.15 ±0.09 -0.09
> > CACHE+STRICT 1112.00 ±4.64 93.26 ±0.35 -3.76
> > SYSTEM+LOCAL 1066.80 ±2.17 86.70 ±0.10 -7.67
> > CACHE+LOCAL 1034.60 ±1.52 83.00 ±0.07 -10.46
> >
> > There are still eight issuers and plenty of work to go around. However, now,
> > depending on the configuration, we're starting to see a significant loss of
> > work-conservation where CPUs sit idle while there's work to do.
> >
> > * CACHE is doing okay. It's just a bit slower. Further testing may be needed
> > to definitively confirm the bandwidth gap but the CPU util difference
> > seems real, so while minute, it did lose a bit of work-conservation.
> > Comparing it to CACHE+STRICT, it's starting to show the benefits of
> > non-strict scopes.
>
> So wakeup based placement is mostly all about LLC, and given this thing
> has dinky small LLCs it will pile up on the one LLC you target and leave
> the others idle until the regular idle balancer decides to make an
> appearance and move some around.

While this processor configuration isn't the most typical, I have a
difficult time imagining newer chiplet designs behaving much differently.
The core problem is that while there is benefit to improving locality within
each L3 domain, the distance across L3 domains isn't that far either, so
unless loads get balanced aggressively across L3 domains while staying
within L3 when immediately possible, you lose capacity.

> But if these are fairly short running tasks, I can well see that not
> going to help much.

Right, the tasks themselves aren't necessarily short-running but they do
behave that way due to discontinuation and repatriation across work item
boundaries.

> Much of this was tuned back when there was 1 L3 per Node; something
> which is still more or less true for Intel but clearly not for these
> things.

Yeah, the same for workqueue. It assumed that the distance within a NUMA
node is indistinguishable, which no longer holds.

> The below is a bit crude and completely untested, but it might help. The
> flip side of that coin is of course that people are going to complain
> about how selecting a CPU is more expensive now and how this hurts their
> performance :/
>
> Basically it will try and iterate all L3s in a node; wakeup will still
> refuse to cross node boundaries.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..ddb7f16a07a9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7027,6 +7027,33 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> return idle_cpu;
> }
>
> +static int
> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> + struct sched_domain *sd_node = rcu_dereference(per_cpu(sd_node, target));

I had to rename the local variable because it was masking the global percpu
one during initialization, but after that the result looks really good. I did
the same HIGH, MED, LOW scenarios. +SN indicates that SIS_NODE is turned on.
The machine set-up was slightly different so the baseline numbers aren't
directly comparable to the original results; however, the relative bw %
comparisons should hold.

RESULTS
=======

1. HIGH: Enough issuers and work spread across the machine

Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
----------------------------------------------------------------------
SYSTEM 1162.80 ±0.45 99.29 ±0.03 0.00
CACHE 1169.60 ±0.55 99.35 ±0.02 +0.58
CACHE+SN 1169.00 ±0.00 99.37 ±0.01 +0.53
CACHE+SN+LOCAL 1165.00 ±0.71 99.48 ±0.03 +0.19

This is an over-saturation scenario and the CPUs aren't having any trouble
finding work to do no matter what. The slight gain is most likely from
improved L3 locality and +SN doesn't behave noticeably differently from
CACHE.


2. MED: Fewer issuers, enough work for saturation

Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
----------------------------------------------------------------------
SYSTEM 1157.20 ±0.84 97.34 ±0.07 0.00
CACHE 1155.60 ±1.34 96.09 ±0.10 -0.14
CACHE+SN 1163.80 ±0.45 96.90 ±0.07 +0.57
CACHE+SN+LOCAL 1052.00 ±1.00 85.84 ±0.11 -0.09

+LOCAL is still sad but CACHE+SN is now maintaining a gain over SYSTEM
similar to that in the HIGH scenario. Compared to CACHE, CACHE+SN shows slight
but discernible increases in both bandwidth and CPU util, which is great.


3. LOW: Even fewer issuers, not enough work to saturate

Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
----------------------------------------------------------------------
SYSTEM 995.00 ±4.42 75.60 ±0.27 0.00
CACHE 971.00 ±3.39 74.86 ±0.18 -2.41
CACHE+SN 998.40 ±4.04 74.91 ±0.27 +0.34
CACHE+SN+LOCAL 957.60 ±2.51 69.80 ±0.17 -3.76

+LOCAL still sucks but CACHE+SN wins over SYSTEM albeit within a single
sigma. It's clear that CACHE+SN outperforms CACHE by a significant margin
and makes up the loss compared to SYSTEM.


CONCLUSION
==========

With SIS_NODE enabled, there's no downside to CACHE. It always
outperforms or matches SYSTEM. It's possible that the overhead of searching
further for an idle CPU is more pronounced on bigger machines but most
likely so will be the gains. This looks like a no brainer improvement to me.

Thanks.

--
tejun
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Thu, May 25, 2023 at 03:12:45PM -1000, Tejun Heo wrote:

> CONCLUSION
> ==========
>
> With the SIS_NODE enabled, there's no downside to CACHE. It always
> outperforms or matches SYSTEM. It's possible that the overhead of searching
> further for an idle CPU is more pronounced on bigger machines but most
> likely so will be the gains. This looks like a no brainer improvement to me.

OK, looking at it again, I think it can be done a little simpler, but it
should be mostly the same.

I'll go queue the below in sched/core, we'll see if anything comes up
negative.

---
Subject: sched/fair: Multi-LLC select_idle_sibling()
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue May 30 13:20:46 CEST 2023

Tejun reported that when he targets workqueues towards a specific LLC
on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
significant idle time.

This is, of course, because of how select_idle_sibling() will not
consider anything outside of the local LLC, and since all these tasks
are short running the periodic idle load balancer is ineffective.

And while it is good to keep work cache local, it is better to not
have significant idle time. Therefore, have select_idle_sibling() try
other LLCs inside the same node when the local one comes up empty.

Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
2 files changed, 39 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_s
}

/*
+ * For the multiple-LLC per node case, make sure to try the other LLC's if the
+ * local LLC comes up empty.
+ */
+static int
+select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ struct sched_domain *parent = sd->parent;
+ struct sched_group *sg;
+
+ /* Make sure to not cross nodes. */
+ if (!parent || parent->flags & SD_NUMA)
+ return -1;
+
+ sg = parent->groups;
+ do {
+ int cpu = cpumask_first(sched_group_span(sg));
+ struct sched_domain *sd_child;
+
+ sd_child = per_cpu(sd_llc, cpu);
+ if (sd_child != sd) {
+ int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }
+
+ sg = sg->next;
+ } while (sg != parent->groups);
+
+ return -1;
+}
+
+/*
* Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
* the task fits. If no CPU is big enough, but there are idle ones, try to
* maximize capacity.
@@ -7199,6 +7231,12 @@ static int select_idle_sibling(struct ta
if ((unsigned)i < nr_cpumask_bits)
return i;

+ if (sched_feat(SIS_NODE)) {
+ i = select_idle_node(p, sd, target);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }
+
return target;
}

--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_NODE, true)

/*
* Issue a WARN when we do multiple update_rq_clock() calls
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello Vincent,

On Tue, May 23, 2023 at 06:12:45PM +0200, Vincent Guittot wrote:
> On Tue, 23 May 2023 at 13:18, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So wakeup based placement is mostly all about LLC, and given this thing
> > has dinky small LLCs it will pile up on the one LLC you target and leave
> > the others idle until the regular idle balancer decides to make an
> > appearance and move some around.
> >
> > But if these are fairly short running tasks, I can well see that not
> > going to help much.
> >
> >
> > Much of this was tuned back when there was 1 L3 per Node; something
> > which is still more or less true for Intel but clearly not for these
> > things.
> >
> >
> > The below is a bit crude and completely untested, but it might help. The
> > flip side of that coin is of course that people are going to complain
> > about how selecting a CPU is more expensive now and how this hurts their
> > performance :/
> >
> > Basically it will try and iterate all L3s in a node; wakeup will still
> > refuse to cross node boundaries.
>
> That reminds me of some discussion about systems with a fast on-die
> interconnect where we would like to run wider than the LLC at wakeup (i.e.
> at DIE level), something like the CLUSTER level but on the other side of
> MC
>

Adding Libo Chen who was a part of this discussion. IIRC, the problem was
that there was no MC domain on that system, which would have made the
SMT domain the sd_llc. But since the cores are single threaded, the SMT
domain gets degenerated, thus leaving no domain which has the
SD_SHARE_PKG_RESOURCES flag.

If I understand correctly, Peter's patch won't help in such a
situation.
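
For reference, sd_llc is derived from the highest domain with
SD_SHARE_PKG_RESOURCES, so with SMT degenerated and no MC domain nothing
qualifies and the wakeup scan has nothing to work with. A simplified
paraphrase of update_top_cache_domain() in kernel/sched/topology.c
(details vary by kernel version):

static void update_top_cache_domain(int cpu)
{
	struct sched_domain_shared *sds = NULL;
	struct sched_domain *sd;
	int id = cpu, size = 1;

	/* Highest domain sharing package resources, i.e. the LLC. */
	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
	if (sd) {
		id = cpumask_first(sched_domain_span(sd));
		size = cpumask_weight(sched_domain_span(sd));
		sds = sd->shared;
	}

	/* If no such domain exists, sd_llc stays NULL and sd_llc_size is 1. */
	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
	per_cpu(sd_llc_size, cpu) = size;
	per_cpu(sd_llc_id, cpu) = id;
	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
}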

However, it should help POWER10 which has the SMT domain as the LLC
and previously it was observed that moving the wakeup search to the
parent domain was helpful (Ref:
https://lore.kernel.org/lkml/1617341874-1205-1-git-send-email-ego@linux.vnet.ibm.com/)


--
Thanks and Regards
gautham.


> Another possibility to investigate would be that each wakeup of a
> worker is mostly unrelated to the previous one and only cares about the
> waker, so we should use -1 for the prev_cpu
>
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 48b6f0ca13ac..ddb7f16a07a9 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7027,6 +7027,33 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > return idle_cpu;
> > }
> >
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > + struct sched_domain *sd_node = rcu_dereference(per_cpu(sd_node, target));
> > + struct sched_group *sg;
> > +
> > + if (!sd_node || sd_node == sd)
> > + return -1;
> > +
> > + sg = sd_node->groups;
> > + do {
> > + int cpu = cpumask_first(sched_group_span(sg));
> > + struct sched_domain *sd_child;
> > +
> > + sd_child = per_cpu(sd_llc, cpu);
> > + if (sd_child != sd) {
> > + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> > + if ((unsigned int)i < nr_cpumask_bits)
> > + return i;
> > + }
> > +
> > + sg = sg->next;
> > + } while (sg != sd_node->groups);
> > +
> > + return -1;
> > +}
> > +
> > /*
> > * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> > * the task fits. If no CPU is big enough, but there are idle ones, try to
> > @@ -7199,6 +7226,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > if ((unsigned)i < nr_cpumask_bits)
> > return i;
> >
> > + if (sched_feat(SIS_NODE)) {
> > + i = select_idle_node(p, sd, target);
> > + if ((unsigned)i < nr_cpumask_bits)
> > + return i;
> > + }
> > +
> > return target;
> > }
> >
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index ee7f23c76bd3..f965cd4a981e 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> > */
> > SCHED_FEAT(SIS_PROP, false)
> > SCHED_FEAT(SIS_UTIL, true)
> > +SCHED_FEAT(SIS_NODE, true)
> >
> > /*
> > * Issue a WARN when we do multiple update_rq_clock() calls
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 678446251c35..d2e0e2e496a6 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1826,6 +1826,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > DECLARE_PER_CPU(int, sd_llc_size);
> > DECLARE_PER_CPU(int, sd_llc_id);
> > DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_node);
> > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index ca4472281c28..d94cbc2164ca 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -667,6 +667,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > DEFINE_PER_CPU(int, sd_llc_size);
> > DEFINE_PER_CPU(int, sd_llc_id);
> > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_node);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> > @@ -691,6 +692,18 @@ static void update_top_cache_domain(int cpu)
> > per_cpu(sd_llc_id, cpu) = id;
> > rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> >
> > + while (sd && sd->parent) {
> > + /*
> > + * SD_NUMA is the first domain spanning nodes, therefore @sd
> > + * must be the domain that spans a single node.
> > + */
> > + if (sd->parent->flags & SD_NUMA)
> > + break;
> > +
> > + sd = sd->parent;
> > + }
> > + rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
> > +
> > sd = lowest_flag_domain(cpu, SD_NUMA);
> > rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
> >
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On 6/4/23 9:46 PM, Gautham R. Shenoy wrote:
> Hello Vincent,
>
> On Tue, May 23, 2023 at 06:12:45PM +0200, Vincent Guittot wrote:
>> On Tue, 23 May 2023 at 13:18, Peter Zijlstra <peterz@infradead.org> wrote:
>>> So wakeup based placement is mostly all about LLC, and given this thing
>>> has dinky small LLCs it will pile up on the one LLC you target and leave
>>> the others idle until the regular idle balancer decides to make an
>>> appearance and move some around.
>>>
>>> But if these are fairly short running tasks, I can well see that not
>>> going to help much.
>>>
>>>
>>> Much of this was tuned back when there was 1 L3 per Node; something
>>> which is still more or less true for Intel but clearly not for these
>>> things.
>>>
>>>
>>> The below is a bit crude and completely untested, but it might help. The
>>> flip side of that coin is of course that people are going to complain
>>> about how selecting a CPU is more expensive now and how this hurts their
>>> performance :/
>>>
>>> Basically it will try and iterate all L3s in a node; wakeup will still
>>> refuse to cross node boundaries.
>> That reminds me of some discussion about systems with a fast on-die
>> interconnect where we would like to run wider than the LLC at wakeup (i.e.
>> at DIE level), something like the CLUSTER level but on the other side of
>> MC
>>
> Adding Libo Chen who was a part of this discussion. IIRC, the problem was
> that there was no MC domain on that system, which would have made the
> SMT domain the sd_llc. But since the cores are single threaded, the SMT
> domain gets degenerated, thus leaving no domain which has the
> SD_SHARE_PKG_RESOURCES flag.
>
> If I understand correctly, Peter's patch won't help in such a
> situation.

Yes, there are some ARM platforms that have no L3 cache (they have SLCs, though, which are memory-side caches)
and no hyperthreading, so the lowest domain level is DIE and every single wakee task goes back to the previous
CPU no matter what because an LLC doesn't exist. For such platforms, we would want to expand the search range,
similarly to AMD, to take advantage of idle cores on the same DIE.

I think we need a new abstraction here to accommodate all these different cache topologies:
replace SD_LLC with, for example, SD_WAKEUP and allow a wider search range independent of the LLC.
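
Purely as an illustration of that idea (sd_wakeup and its selection are
hypothetical, not existing kernel symbols), the wakeup path could consult a
separately chosen per-CPU domain instead of sd_llc, with the topology code
picking LLC, DIE or a group of LLCs as appropriate:

/* Hypothetical sketch only; the names below are illustrative. */
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_wakeup);

static int select_idle_wakeup_domain(struct task_struct *p, int target)
{
	struct sched_domain *sd;
	int i;

	/* Chosen per platform: LLC where it exists, DIE on LLC-less parts, etc. */
	sd = rcu_dereference(per_cpu(sd_wakeup, target));
	if (!sd)
		return target;

	i = select_idle_cpu(p, sd, test_idle_cores(target), target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	return target;
}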


Libo


> However, it should help POWER10 which has the SMT domain as the LLC
> and previously it was observed that moving the wakeup search to the
> parent domain was helpful (Ref:
> https://lore.kernel.org/lkml/1617341874-1205-1-git-send-email-ego@linux.vnet.ibm.com/ )
>
>
> --
> Thanks and Regards
> gautham.
>
>
>> Another possibility to investigate would be that each wakeup of a
>> worker is mostly unrelated to the previous one and only cares about the
>> waker, so we should use -1 for the prev_cpu
>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 48b6f0ca13ac..ddb7f16a07a9 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7027,6 +7027,33 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>> return idle_cpu;
>>> }
>>>
>>> +static int
>>> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
>>> +{
>>> + struct sched_domain *sd_node = rcu_dereference(per_cpu(sd_node, target));
>>> + struct sched_group *sg;
>>> +
>>> + if (!sd_node || sd_node == sd)
>>> + return -1;
>>> +
>>> + sg = sd_node->groups;
>>> + do {
>>> + int cpu = cpumask_first(sched_group_span(sg));
>>> + struct sched_domain *sd_child;
>>> +
>>> + sd_child = per_cpu(sd_llc, cpu);
>>> + if (sd_child != sd) {
>>> + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
>>> + if ((unsigned int)i < nr_cpumask_bits)
>>> + return i;
>>> + }
>>> +
>>> + sg = sg->next;
>>> + } while (sg != sd_node->groups);
>>> +
>>> + return -1;
>>> +}
>>> +
>>> /*
>>> * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
>>> * the task fits. If no CPU is big enough, but there are idle ones, try to
>>> @@ -7199,6 +7226,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>>> if ((unsigned)i < nr_cpumask_bits)
>>> return i;
>>>
>>> + if (sched_feat(SIS_NODE)) {
>>> + i = select_idle_node(p, sd, target);
>>> + if ((unsigned)i < nr_cpumask_bits)
>>> + return i;
>>> + }
>>> +
>>> return target;
>>> }
>>>
>>> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
>>> index ee7f23c76bd3..f965cd4a981e 100644
>>> --- a/kernel/sched/features.h
>>> +++ b/kernel/sched/features.h
>>> @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>>> */
>>> SCHED_FEAT(SIS_PROP, false)
>>> SCHED_FEAT(SIS_UTIL, true)
>>> +SCHED_FEAT(SIS_NODE, true)
>>>
>>> /*
>>> * Issue a WARN when we do multiple update_rq_clock() calls
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 678446251c35..d2e0e2e496a6 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -1826,6 +1826,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>> DECLARE_PER_CPU(int, sd_llc_size);
>>> DECLARE_PER_CPU(int, sd_llc_id);
>>> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_node);
>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>> index ca4472281c28..d94cbc2164ca 100644
>>> --- a/kernel/sched/topology.c
>>> +++ b/kernel/sched/topology.c
>>> @@ -667,6 +667,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>> DEFINE_PER_CPU(int, sd_llc_size);
>>> DEFINE_PER_CPU(int, sd_llc_id);
>>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_node);
>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>>> @@ -691,6 +692,18 @@ static void update_top_cache_domain(int cpu)
>>> per_cpu(sd_llc_id, cpu) = id;
>>> rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>>>
>>> + while (sd && sd->parent) {
>>> + /*
>>> + * SD_NUMA is the first domain spanning nodes, therefore @sd
>>> + * must be the domain that spans a single node.
>>> + */
>>> + if (sd->parent->flags & SD_NUMA)
>>> + break;
>>> +
>>> + sd = sd->parent;
>>> + }
>>> + rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
>>> +
>>> sd = lowest_flag_domain(cpu, SD_NUMA);
>>> rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
>>>
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hi,

On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:
> In terms of patches, 0021-0024 are probably the interesting ones.
>
> Brian Norris, Nathan Huckleberry and others experiencing wq perf problems
> -------------------------------------------------------------------------
>
> Can you please test this patchset and see whether the performance problems
> are resolved? After the patchset, unbound workqueues default to
> soft-affining on cache boundaries, which should hopefully resolve the issues
> that you guys have been seeing on recent kernels on heterogeneous CPUs.
>
> If you want to try different settings, please read:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/tree/Documentation/core-api/workqueue.rst?h=affinity-scopes-v1&id=e8f3505e69a526cc5fe40a4da5d443b7f9231016#n350

Thanks for the CC; my colleague tried out your patches (ported to 5.15
with some minor difficulty), and aside from some crashes (already noted
by others, although we didn't pull the proposed v2 fixes), he didn't
notice a significant change in performance on our particular test system
and WiFi-throughput workload. I don't think we expected a lot though,
per the discussion at:

https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@google.com/

Brian
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello,

On Mon, Jun 12, 2023 at 04:56:06PM -0700, Brian Norris wrote:
> Thanks for the CC; my colleague tried out your patches (ported to 5.15
> with some minor difficulty), and aside from some crashes (already noted
> by others, although we didn't pull the proposed v2 fixes), he didn't

Yeah, there were a few subtle bugs that v2 fixes.

> notice a significant change in performance on our particular test system
> and WiFi-throughput workload. I don't think we expected a lot though,
> per the discussion at:
>
> https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@google.com/

That's disappointing. I was actually expecting that the default behavior
would restrain migrations across L3 boundaries strongly enough to make a
meaningful difference. Can you enable WQ_SYSFS and test the following
configs?

1. affinity_scope = cache, affinity_strict = 1

2. affinity_scope = cpu, affinity_strict = 0

3. affinity_scope = cpu, affinity_strict = 1

#3 basically turns it into a percpu workqueue, so it should perform more or
less the same as a percpu workqueue without affecting everyone else.
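
For reference, with WQ_SYSFS those knobs should show up under
/sys/devices/virtual/workqueue/<wqname>/ as affinity_scope and
affinity_strict. A rough sketch of setting config #1 from userspace (the
paths and attribute names assume this patchset; adjust the workqueue name):

/* wq_affinity.c - sketch: write workqueue affinity knobs via sysfs. */
#include <stdio.h>

static int wq_set(const char *wq, const char *attr, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/virtual/workqueue/%s/%s", wq, attr);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(int argc, char **argv)
{
	const char *wq = argv[1];	/* name of a WQ_SYSFS workqueue */

	if (argc < 2)
		return 1;

	/* config #1: affinity_scope = cache, affinity_strict = 1 */
	if (wq_set(wq, "affinity_scope", "cache") ||
	    wq_set(wq, "affinity_strict", "1"))
		return 1;
	return 0;
}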

Any chance you can post the topology details of the affected setup? How are
the caches and cores laid out?

Thanks.

--
tejun
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hi Tejun,

On Tue, Jun 13, 2023 at 10:48 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Mon, Jun 12, 2023 at 04:56:06PM -0700, Brian Norris wrote:
> > Thanks for the CC; my colleague tried out your patches (ported to 5.15
> > with some minor difficulty), and aside from some crashes (already noted
> > by others, although we didn't pull the proposed v2 fixes), he didn't
>
> Yeah, there were a few subtle bugs that v2 fixes.
>
> > notice a significant change in performance on our particular test system
> > and WiFi-throughput workload. I don't think we expected a lot though,
> > per the discussion at:
> >
> > https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@google.com/
>
> That's disappointing. I was actually expecting that the default behavior
> would restrain migrations across L3 boundaries strongly enough to make a
> meaningful difference. Can you enable WQ_SYSFS and test the following
> configs?
>
> 1. affinity_scope = cache, affinity_strict = 1
>
> 2. affinity_scope = cpu, affinity_strict = 0
>
> 3. affinity_scope = cpu, affinity_strict = 1

I pulled down the v2 series and tried these settings on our 5.15 kernel.
Unfortunately none of them showed a significant improvement in
throughput. It's hard to tell which one is the best because of the
noise, but the throughput is still all far from our 4.19 kernel or from
simply pinning everything to a single core.

All 4 settings (the 3 settings listed above plus the default) yield
results between 90 and 120 Mbps, while pinning tasks to a single core
consistently reaches >250 Mbps.
>
> #3 basically turns it into a percpu workqueue, so it should perform more or
> less the same as a percpu workqueue without affecting everyone else.
>
> Any chance you can post the topology details of the affected setup? How are
> the caches and cores laid out?

The core layout is listed at [1], and I'm not familiar with its cache
configuration either.

[1]: https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@google.com/

Best regards,
Pin-yen
>
> Thanks.
>
> --
> tejun
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello, Pin-yen.

On Tue, Jun 13, 2023 at 05:26:48PM +0800, Pin-yen Lin wrote:
...
> > 1. affinity_scope = cache, affinity_strict = 1
> >
> > 2. affinity_scope = cpu, affinity_strict = 0
> >
> > 3. affinity_scope = cpu, affinity_strict = 1
>
> I pulled down the v2 series and tried these settings on our 5.15 kernel.
> Unfortunately none of them showed a significant improvement in
> throughput. It's hard to tell which one is the best because of the
> noise, but the throughput is still all far from our 4.19 kernel or from
> simply pinning everything to a single core.
>
> All 4 settings (the 3 settings listed above plus the default) yield
> results between 90 and 120 Mbps, while pinning tasks to a single core
> consistently reaches >250 Mbps.

I find that perplexing given that switching to a per-cpu workqueue remedies
the situation quite a bit, which is how this patchset came to be. #3 is the
same as per-cpu workqueue, so if you're seeing noticeably different
performance numbers between #3 and per-cpu workqueue, there's something
wrong with either the code or test setup.

Also, if you have to pin to a single CPU or some subset of CPUs, you can
just set WQ_SYSFS for the workqueue and set affinities in its sysfs
interface instead of hard-coding the workaround for specific hardware.
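
For driver folks wondering how to opt in, it's just the WQ_SYSFS flag at
allocation time; a minimal sketch (the workqueue name and module are made
up for illustration):

/* Sketch: expose an unbound workqueue's attributes via sysfs. */
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *demo_wq;

static int __init demo_wq_init(void)
{
	/*
	 * WQ_SYSFS makes the workqueue appear under
	 * /sys/devices/virtual/workqueue/demo_wq/ so cpumask (and, with
	 * this patchset, affinity_scope/affinity_strict) can be tuned at
	 * runtime without kernel changes.
	 */
	demo_wq = alloc_workqueue("demo_wq", WQ_UNBOUND | WQ_SYSFS, 0);
	return demo_wq ? 0 : -ENOMEM;
}

static void __exit demo_wq_exit(void)
{
	destroy_workqueue(demo_wq);
}

module_init(demo_wq_init);
module_exit(demo_wq_exit);
MODULE_LICENSE("GPL");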

Thanks.

--
tejun
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Wed, 21 Jun 2023 at 12:16, Tejun Heo <tj@kernel.org> wrote:
>
> I find that perplexing given that switching to a per-cpu workqueue remedies
> the situation quite a bit, which is how this patchset came to be. #3 is the
> same as per-cpu workqueue, so if you're seeing noticeably different
> performance numbers between #3 and per-cpu workqueue, there's something
> wrong with either the code or test setup.

Or maybe there's some silly thinko in the wq code that is hidden by
the percpu code.

For example, WQ_UNBOUND triggers a lot of other overhead at least on
wq allocation and free. Maybe some of that stuff then indirectly
affects workqueue execution even when strict cpu affinity is set.

Pin-Yen Li - can you do a system-wide profile of the two cases (the
percpu case vs the "strict cpu affinity" one), to see if something
stands out?

Linus
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hi Linus and Tejun,

On Thu, Jun 22, 2023 at 3:32 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 21 Jun 2023 at 12:16, Tejun Heo <tj@kernel.org> wrote:
> >
> > I find that perplexing given that switching to a per-cpu workqueue remedies
> > the situation quite a bit, which is how this patchset came to be. #3 is the
> > same as per-cpu workqueue, so if you're seeing noticeably different
> > performance numbers between #3 and per-cpu workqueue, there's something
> > wrong with either the code or test setup.
>
In our case, per-cpu workqueue (removing WQ_UNBOUND) doesn't bring us
better results. But given that pinning tasks to a single CPU core
helps, we thought that the regression is related to the behavior of
WQ_UNBOUND. Our findings are listed in [1].

We already use WQ_SYSFS and the sysfs interface to pin the tasks, but
thanks for the suggestion.

[1]: https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@google.com/

> Or maybe there's some silly thinko in the wq code that is hidden by
> the percpu code.
>
> For example, WQ_UNBOUND triggers a lot of other overhead at least on
> wq allocation and free. Maybe some of that stuff then indirectly
> affects workqueue execution even when strict cpu affinity is set.
>
> Pin-Yen Li - can you do a system-wide profile of the two cases (the
> percpu case vs the "strict cpu affinity" one), to see if something
> stands out?

The two actually have similar performance, so I guess the profile is not
interesting to you. Please let me know if you want to see any data and
I'll be happy to collect it and update here.

Best regards,
Pin-yen
>
> Linus
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello, Pin-yen.

On Thu, Jun 29, 2023 at 05:49:27PM +0800, Pin-yen Lin wrote:
> > > I find that perplexing given that switching to a per-cpu workqueue remedies
> > > the situation quite a bit, which is how this patchset came to be. #3 is the
> > > same as per-cpu workqueue, so if you're seeing noticeably different
> > > performance numbers between #3 and per-cpu workqueue, there's something
> > > wrong with either the code or test setup.
> >
> In our case, per-cpu workqueue (removing WQ_UNBOUND) doesn't bring us
> better results. But given that pinning tasks to a single CPU core
> helps, we thought that the regression is related to the behavior of
> WQ_UNBOUND. Our findings are listed in [1].

I see.

> We already use WQ_SYSFS and the sysfs interface to pin the tasks, but
> thanks for the suggestion.

Yeah, I have no idea why there's such a drastic regression and the only way
to alleviate that is pinning the execution to a single CPU, which is also
different from other reports. It seems plausible that there are some
scheduling behavior changes and that's interacting negatively with power
saving but I'm sure that's the first thing you guys looked into.

From workqueue's POV, I'm afraid using WQ_SYSFS and pinning it as needed
seems like about what can be done for your specific case.

Thanks.

--
tejun
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello,

On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:
> Unbound workqueues used to spray work items inside each NUMA node, which
> isn't great on CPUs w/ multiple L3 caches. This patchset implements
> mechanisms to improve and configure execution locality.

The patchset shows minor perf improvements for some workloads but, more
importantly, gives users more control over worker placement, which helps
working around some of the recently reported performance regressions.
Prateek reported concerning regressions with tbench but I couldn't reproduce
them and can't see how tbench would be affected at all given that the
benchmark doesn't involve workqueue operations in any noticeable way.

Assuming that the tbench difference was a testing artifact, I'm applying the
patchset to wq/for-6.6 so that it can receive wider testing. Prateek, I'd
really appreciate if you could repeat the test and see whether the
difference persists.

Thanks.

--
tejun
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello Tejun,

On 8/8/2023 6:52 AM, Tejun Heo wrote:
> Hello,
>
> On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:
>> Unbound workqueues used to spray work items inside each NUMA node, which
>> isn't great on CPUs w/ multiple L3 caches. This patchset implements
>> mechanisms to improve and configure execution locality.
>
> The patchset shows minor perf improvements for some workloads but, more
> importantly, gives users more control over worker placement, which helps
> working around some of the recently reported performance regressions.
> Prateek reported concerning regressions with tbench but I couldn't reproduce
> them and can't see how tbench would be affected at all given that the
> benchmark doesn't involve workqueue operations in any noticeable way.
>
> Assuming that the tbench difference was a testing artifact, I'm applying the
> patchset to wq/for-6.6 so that it can receive wider testing. Prateek, I'd
> really appreciate if you could repeat the test and see whether the
> difference persists.

Sure. I'll retest with the for-6.6 branch and will post the results here once
the tests are done. I'll repeat the same process: test with the defaults, and
for the ones that show any difference in results, rerun them with various
affinity scopes.

Let me know if you have any suggestions.

--
Thanks and Regards,
Prateek
