
[PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello,

tl;dr
=====

Unbound workqueues used to spray work items inside each NUMA node, which
isn't great on CPUs w/ multiple L3 caches. This patchset implements
mechanisms to improve and configure execution locality.

Long patchset, so a bit of traffic control:

PeterZ
------

It uses p->wake_cpu to best-effort migrate workers on wake-ups. Migrating on
wake-ups fits the bill: it's cheaper, and it fits the semantics in that
workqueue is hinting the scheduler which CPU has the most continuity rather
than forcing a specific migration. However, p->wake_cpu is currently only
used in the scheduler proper, so this would be introducing a new way of using
the field.

Please see 0021 and 0024.
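
To make the intent concrete, here is a minimal, hedged sketch of the idea
(mine, not the exact code in 0021/0024): before waking an idle worker,
workqueue points p->wake_cpu at a CPU near the work item so the scheduler
starts its placement search there while remaining free to go elsewhere. The
helper name is made up for illustration and p->wake_cpu only exists under
CONFIG_SMP.

#include <linux/cpumask.h>
#include <linux/sched.h>

/* Illustrative only: hint the scheduler, don't force a migration. */
static void wake_worker_near(struct task_struct *worker, int hint_cpu)
{
	if (cpu_online(hint_cpu))
		worker->wake_cpu = hint_cpu;	/* best-effort locality hint */
	wake_up_process(worker);		/* scheduler may still place it elsewhere */
}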

Linus, PeterZ
-------------

Most of the patchset is workqueue-internal plumbing and probably isn't
terribly interesting. However, the performance picture turned out less
straightforward than I had hoped, most likely due to loss of
work-conservation from the scheduler in high fan-out scenarios. I'll describe
it in this cover letter. Please read on.

In terms of patches, 0021-0024 are probably the interesting ones.

Brian Norris, Nathan Huckleberry and others experiencing wq perf problems
-------------------------------------------------------------------------

Can you please test this patchset and see whether the performance problems
are resolved? After the patchset, unbound workqueues default to
soft-affining on cache boundaries, which should hopefully resolve the issues
that you guys have been seeing on recent kernels on heterogeneous CPUs.

If you want to try different settings, please read:

https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/tree/Documentation/core-api/workqueue.rst?h=affinity-scopes-v1&id=e8f3505e69a526cc5fe40a4da5d443b7f9231016#n350

Alasdair Kergon, Mike Snitzer, DM folks
---------------------------------------

I ran fio on top of dm-crypt to compare the performance of different
configurations. It mostly behaved as expected but please let me know if
anything doesn't look right. Also, DM_CRYPT_SAME_CPU can now be implemented
by applying a strict "cpu" scope to the unbound workqueue, and it would make
sense to add WQ_SYSFS to the kcryptd workqueue so that users can tune the
settings on the fly.
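
As a rough, hedged sketch (not a dm-crypt patch), that could look something
like the following; the attrs field and constant names (affn_scope,
affn_strict, WQ_AFFN_CPU) follow this series and may still change:

#include <linux/cpumask.h>
#include <linux/workqueue.h>

/* Sketch: unbound kcryptd with WQ_SYSFS for runtime tuning; emulate
 * DM_CRYPT_SAME_CPU by applying a strict per-CPU affinity scope. */
static struct workqueue_struct *alloc_kcryptd_same_cpu(const char *devname)
{
	struct workqueue_struct *wq;
	struct workqueue_attrs *attrs;

	wq = alloc_workqueue("kcryptd/%s", WQ_UNBOUND | WQ_SYSFS | WQ_MEM_RECLAIM,
			     num_online_cpus(), devname);
	attrs = alloc_workqueue_attrs();
	if (!wq || !attrs)
		goto out;

	attrs->affn_scope = WQ_AFFN_CPU;	/* every CPU in its own scope */
	attrs->affn_strict = true;		/* confined workers: same-CPU semantics */
	apply_workqueue_attrs(wq, attrs);
out:
	free_workqueue_attrs(attrs);		/* NULL-safe */
	return wq;
}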

Lai
---

I'd really appreciate it if you could go over the series. 0003, 0007, 0009,
0018-0021, 0023-0024 are the interesting ones. 0003, 0019, 0020 and 0024 are
the potentially risky ones. I've been testing them pretty extensively for
over a week but the series can really use your eyes.


Introduction
============

Recently, IIRC with ChromeOS moving to a newer kernel, there has been a
flurry of reports of unbound workqueues showing high execution latency and
low bandwidth. AFAICT, all were on heterogeneous CPUs with multiple L3
caches. Please see the discussions in the following threads.

https://lkml.kernel.org/r/CAHk-=wgE9kORADrDJ4nEsHHLirqPCZ1tGaEPAZejHdZ03qCOGg@mail.gmail.com
https://lkml.kernel.org/r/ZFvpJb9Dh0FCkLQA@google.com

While details from the second thread indicate that the problem is unlikely
to originate solely from workqueue, workqueue does have an underlying
problem that can be amplified through behavior changes in other subsystems.

Being an async execution mechanism, workqueue inherently disconnects issuer
and executor. Worker pools are shared across workqueues whenever possible,
further severing the connection. While this disconnection is true for all
workqueues, for per-cpu workqueues it doesn't harm locality because everything
still ends up on the same CPU. For unbound workqueues though, we end up
spraying work items across CPUs, which blows away all locality.

This was already a noticeable issue with NUMA machines and 4c16bd327c74
("workqueue: implement NUMA affinity for unbound workqueues") implemented
NUMA support which segments unbound workqueues on NUMA node boundaries so
that each subset that's contained within a NUMA node is served by a separate
worker pool. Work items still get sprayed but they get sprayed only inside
their NUMA node.

This has been mostly fine but CPUs became a lot more complex with many more
cores and multiple L3 caches inside a single node with differing distances
across them, and it looks like it's high time to improve workqueue's
locality awareness.


Design
======

There are two basic ideas.

1. Expand the segmenting so that unbound workqueues can be segmented on
different boundaries including L3 caches.

2. The knowledge about CPU topology is already in the scheduler. Instead of
trying to steer things on its own, workqueue can inform the scheduler of
the to-be-executed work item's locality and get out of its way. (Linus's
suggestion)

#2 is nice in that it takes workqueue out of the business that it's not
good at. Even with #2, #1 is likely still useful as persistent locality of
workers affects cache hit ratio and memory distance on stack and shared data
structures.

Given the above, it could be that workqueue can do something like the
following and keep everyone more or less happy:

* Segment unbound workqueues according to L3 cache boundaries
(cpus_share_cache()).

* Don't hard-affine workers but make a best-effort attempt to locate them on
the same CPUs as the work items they are about to execute.

This way, the hot workers would stay mostly affine to the L3 cache domain
they're attached to and the scheduler would have the full knowledge of the
preferred locality when starting execution of a work item. As the workers
aren't strictly confined, the scheduler is free to move them outside that L3
cache domain or even across NUMA nodes according to the hardware
configuration and load situation. In short, we get both locality and
work-conservation.
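
As a reference point, here is a minimal, untested sketch (mine, not code from
the series) of what segmenting on cpus_share_cache() boundaries boils down
to; the pod_of[] array and helper name are purely illustrative and the actual
series builds per-scope cpumasks and worker pools instead:

#include <linux/cpumask.h>
#include <linux/sched/topology.h>

/* Illustrative only: group online CPUs into "pods" so that two CPUs end up
 * in the same pod iff cpus_share_cache() says they share an LLC. */
static int pod_of[NR_CPUS];

static void build_cache_pods(void)
{
	int cpu, peer, npods = 0;

	for_each_online_cpu(cpu)
		pod_of[cpu] = -1;

	for_each_online_cpu(cpu) {
		if (pod_of[cpu] >= 0)
			continue;
		pod_of[cpu] = npods;
		for_each_online_cpu(peer)
			if (pod_of[peer] < 0 && cpus_share_cache(cpu, peer))
				pod_of[peer] = npods;
		npods++;
	}
}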

Unfortunately, this plan didn't work out due to the pronounced inverse
correlation between locality and work-conservation, which will be
illustrated in the following "Results" section. As this meant that there is
not going to be one good enough behavior for everyone, I ended up
implementing a number of different behaviors that can be selected
dynamically per-workqueue.

There are three parameters:

* Affinity scope: Determines how unbound workqueues are segmented. Five
possible values.

CPU: Every CPU in its own scope. The unbound workqueue behaves like a
per-cpu one.

SMT: SMT siblings in the same scope.

CACHE: cpus_share_cache() defines the scopes, usually on L3 boundaries.

NUMA: NUMA nodes define the scopes. This is the same behavior as before
this patchset.

SYSTEM: A single scope across the whole system.

* Affinity strictness: If a scope is strict, every worker is confined inside
its scope and the scheduler can't move it outside. If non-strict, a worker
is brought back into its scope (called repatriation) when it starts to
execute a work item but then can be moved outside the scope if the
scheduler deems that the better course of action.

* Localization: If a workqueue enables localization, a worker is moved to
the CPU that issued the work item at the start of execution. Afterwards,
the scheduler can migrate the task according to the above affinity scope
and strictness settings. Interestingly, this turned out to be not very
useful in practice. It's implemented in the last two patches but they
aren't for upstream inclusion, at least for now. See the "Results"
section.

Combined, they allow a wide variety of configurations. It's easy to emulate
WQ_CPU_INTENSIVE per-cpu workqueues or the behavior before this patchset.
The settings can be changed anytime through both apply_workqueue_attrs() and
the sysfs interface allowing easy and flexible tuning.

Let's see how different configurations perform.


Experiment Setup
================

Here's the relevant part of the experiment setup.

* Ryzen 9 3900x - 12 cores / 24 threads spread across 4 L3 caches.
Core-to-core latencies across L3 caches are ~2.6x worse than within each
L3 cache. ie. it's worse but not hugely so. This means that the impact of
L3 cache locality is noticeable in these experiments but may be subdued
compared to other setups.

* The CPU's clock boost is disabled, so that all the clocks run very
close to the stock 3.8GHz for test consistency.

* dm-crypt setup with default parameters on Samsung 980 PRO SSD.

* Each combination is run five times.

Three fio scenarios are run directly against the dm-crypt device.

HIGH: Enough issuers and work spread across the machine

fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
--ioengine=libaio --iodepth=64 --runtime=60 --numjobs=24 --time_based \
--group_reporting --name=iops-test-job --verify=sha512

MED: Fewer issuers, enough work for saturation

fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
--ioengine=libaio --iodepth=64 --runtime=60 --numjobs=8 --time_based \
--group_reporting --name=iops-test-job --verify=sha512

LOW: Even fewer issuers, not enough work to saturate

fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=32k \
--ioengine=libaio --iodepth=64 --runtime=60 --numjobs=4 --time_based \
--group_reporting --name=iops-test-job --verify=sha512

They are the same except for the number of issuers. The issuers generate
data each time, write it, read it back and then verify the content. Each
issuer can have 64 IOs in flight.

Five workqueue configurations are used:

SYSTEM A single scope across the system. All workers belong to the
same pool and work items are sprayed all over. On this CPU,
this is the same behavior as before this patchset.

CACHE Four scopes, each serving an L3 cache domain. The scopes are
non-strict, so workers start inside their scopes but the
scheduler can move them outside as needed.

CACHE+STRICT The same as CACHE but the scopes are strict. Workers are
confined within their scopes.

SYSTEM+LOCAL A single scope across the system but with localization
turned on. A worker is moved to the issuing CPU of the work
item it's about to execute.

CACHE+LOCAL Four scopes, each serving an L3 cache domain, with localization
turned on. A worker is moved to the issuing CPU of the work
item it's about to execute and is a lot more likely to
be already inside the matching L3 cache domain.


Results
=======

1. HIGH: Enough issuers and work spread across the machine

Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
----------------------------------------------------------------------
SYSTEM 1159.40 ±1.34 99.31 ±0.02 0.00
CACHE 1166.40 ±0.89 99.34 ±0.01 +0.60
CACHE+STRICT 1166.00 ±0.71 99.35 ±0.01 +0.57
SYSTEM+LOCAL 1157.40 ±1.67 99.50 ±0.01 -0.17
CACHE+LOCAL 1160.40 ±0.55 99.50 ±0.02 +0.09

The differences are small. Here are some observations.

* CACHE and CACHE+STRICT are clearly performing better. The difference is
not big but consistent, which isn't too surprising given that these caches
are pretty close. The CACHE ones are doing more work for the same number
of CPU cycles compared to SYSTEM.

This scenario is the best case for CACHE+STRICT, so it's encouraging that
the non-strict variant can match.

* SYSTEM/CACHE+LOCAL aren't doing horribly but aren't doing better either.
They burned slightly more CPU, possibly from the scheduler needing to move
the workers away more frequently.


2. MED: Fewer issuers, enough work for saturation

Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
----------------------------------------------------------------------
SYSTEM 1155.40 ±0.89 97.41 ±0.05 0.00
CACHE 1154.40 ±1.14 96.15 ±0.09 -0.09
CACHE+STRICT 1112.00 ±4.64 93.26 ±0.35 -3.76
SYSTEM+LOCAL 1066.80 ±2.17 86.70 ±0.10 -7.67
CACHE+LOCAL 1034.60 ±1.52 83.00 ±0.07 -10.46

There are still eight issuers and plenty of work to go around. However, now,
depending on the configuration, we're starting to see a significant loss of
work-conservation where CPUs sit idle while there's work to do.

* CACHE is doing okay. It's just a bit slower. Further testing may be needed
to definitively confirm the bandwidth gap but the CPU util difference
seems real, so while minute, it did lose a bit of work-conservation.
Comparing it to CACHE+STRICT, it's starting to show the benefits of
non-strict scopes.

* CACHE+STRICT's lower bw and utilization make sense. It'd be easy to
mis-locate issuers so that some cache domains are temporarily underloaded.
With the scheduler not being allowed to move tasks around, there will be some
utilization bubbles.

* SYSTEM/CACHE+LOCAL are not doing great with significantly lower
utilizations. There is more than enough work to do in the system
(changing --numjobs to 6, for example, doesn't change the picture much for
SYSTEM) but the kernel is seemingly unable to dispatch workers to idle
CPUs fast enough.


3. LOW: Even fewer issuers, not enough work to saturate

Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
----------------------------------------------------------------------
SYSTEM 993.60 ±1.82 75.49 ±0.06 0.00
CACHE 973.40 ±1.52 74.90 ±0.07 -2.03
CACHE+STRICT 828.20 ±4.49 66.84 ±0.29 -16.65
SYSTEM+LOCAL 884.20 ±18.39 64.14 ±1.12 -11.01
CACHE+LOCAL 860.60 ±4.72 61.46 ±0.24 -13.39

Now, there isn't enough work to be done to saturate the whole system. This
amplifies the pattern which started to show up in the MED scenario.

* CACHE is now noticeably slower but not crazily so.

* CACHE+STRICT shows the severe work-conservation limitation of strict
scopes.

* SYSTEM/CACHE+LOCAL show the same pattern as in the MED scenario but with
larger work-conservation deficits.


Conclusions
===========

The tradeoff between efficiency gains from improved locality and bandwidth
loss from work-conservation deficit poses a conundrum. The scenarios used
for testing, while limited, are plausible representations of what heavy
workqueue usage might look like in practice, and they clearly show that
there is no one straight-forward option which can cover all cases.

* The efficiency advantage of CACHE over SYSTEM is, while consistent and
noticeable, small. However, the impact is dependent on the distances
between the scopes and may be more pronounced in processors with more
complex topologies.

While the loss of work-conservation in certain scenarios hurts, it is a
lot better than CACHE+STRICT and maximizing workqueue utilization is
unlikely to be the common case anyway. As such, CACHE is the default
affinity scope for unbound pools.

There's some risk to this decision and we might have to reassess depending
on how this works out in the wild.

* In all three scenarios, +LOCAL didn't show any advantage. There may be
scenarios where the boost from L1/2 cache locality is visible but the
severe work-conservation deficits really hurt.

I haven't looked into it too deeply here but the behavior where CPUs
sit idle while there's work to do is in line with what we (Meta) have
been seeing with CFS across multiple workloads and I believe Google has
similar experiences. Tuning load balancing and migration costs through
debugfs helps but not fully. Deeper discussion on this topic is a bit out
of scope here and there are on-going efforts on this front, so we can
revisit later.

So, at least in the current state, +LOCAL doesn't make sense. Let's table
it for now. I put the two patches to implement it at the end of the series
w/o SOB.


PATCHES
=======

This series contains the following patches.

0001-workqueue-Drop-the-special-locking-rule-for-worker-f.patch
0002-workqueue-Cleanups-around-process_scheduled_works.patch
0003-workqueue-Not-all-work-insertion-needs-to-wake-up-a-.patch
0004-workqueue-Rename-wq-cpu_pwqs-to-wq-cpu_pwq.patch
0005-workqueue-Relocate-worker-and-work-management-functi.patch
0006-workqueue-Remove-module-param-disable_numa-and-sysfs.patch
0007-workqueue-Use-a-kthread_worker-to-release-pool_workq.patch
0008-workqueue-Make-per-cpu-pool_workqueues-allocated-and.patch
0009-workqueue-Make-unbound-workqueues-to-use-per-cpu-poo.patch
0010-workqueue-Rename-workqueue_attrs-no_numa-to-ordered.patch
0011-workqueue-Rename-NUMA-related-names-to-use-pod-inste.patch
0012-workqueue-Move-wq_pod_init-below-workqueue_init.patch
0013-workqueue-Initialize-unbound-CPU-pods-later-in-the-b.patch
0014-workqueue-Generalize-unbound-CPU-pods.patch
0015-workqueue-Add-tools-workqueue-wq_dump.py-which-print.patch
0016-workqueue-Modularize-wq_pod_type-initialization.patch
0017-workqueue-Add-multiple-affinity-scopes-and-interface.patch
0018-workqueue-Factor-out-work-to-worker-assignment-and-c.patch
0019-workqueue-Factor-out-need_more_worker-check-and-work.patch
0020-workqueue-Add-workqueue_attrs-__pod_cpumask.patch
0021-workqueue-Implement-non-strict-affinity-scope-for-un.patch
0022-workqueue-Add-Affinity-Scopes-and-Performance-sectio.patch
0023-workqueue-Add-pool_workqueue-cpu.patch
0024-workqueue-Implement-localize-to-issuing-CPU-for-unbo.patch

This series is also available in the following git branch:

git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git affinity-scopes-v1

diffstat follows. Thanks.

Documentation/admin-guide/kernel-parameters.txt | 21
Documentation/core-api/workqueue.rst | 379 ++++++++++-
include/linux/workqueue.h | 82 ++
init/main.c | 1
kernel/workqueue.c | 1613 +++++++++++++++++++++++++++-----------------------
kernel/workqueue_internal.h | 2
tools/workqueue/wq_dump.py | 180 +++++
tools/workqueue/wq_monitor.py | 29
8 files changed, 1525 insertions(+), 782 deletions(-)
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Thu, May 18, 2023 at 5:17 PM Tejun Heo <tj@kernel.org> wrote:
>
> Most of the patchset is workqueue-internal plumbing and probably isn't
> terribly interesting. However, the performance picture turned out less
> straightforward than I had hoped, most likely due to loss of
> work-conservation from the scheduler in high fan-out scenarios. I'll
> describe it in this cover letter. Please read on.

So my reaction here is that I think your benchmarking was about
throughput, but the recent changes that triggered this discussion were
about latency for random small stuff.

Maybe your "LOW" tests might be close to that, but looking at that fio
benchmark line you quoted, I don't think so.

IOW, I think that what the fsverity code ended up seeing was literally
*serial* IO that was fast enough that it was better done on the local
CPU immediately, and that that was the reason for why it wanted to
remove WQ_UNBOUND.

IOW, I think you should go even lower than your "LOW", and test
basically "--iodepth=1" to a ramdisk. A load where scheduling to any
other CPU is literally *always* a mistake, because the IO is basically
entirely synchronous, and it's better to just do the work on the same
CPU and be done with it.

That may sound like an outlier thing, but I don't think it's
necessarily even all that odd. I think that "depth=1" is likely the
common case for many real loads.

That commit f959325e6ac3 ("fsverity: Remove WQ_UNBOUND from fsverity
read workqueue") really talks about startup costs. They are about
things like "page in the executable", which is all almost 100%
serialized with no parallelism at all. Even read-ahead ends up being
serial, in that it's likely one single contiguous IO.

Yes, latency tends to be harder to benchmark than throughput, but I
really think latency trumps throughput 95% of the time. And all your
benchmark loads looked like throughput loads to me: they just weren't
using *all* the CPU capacity you had.

Yes, writeback can have lovely throughput behavior and saturate the IO
because you have lots of parallelism. But reads are often 100% serial
for one thread, and often you don't *have* more than one thread.

So I think your "not enough work to saturate" is still ludicrously
over-the-top. You should not aim for "not enough work to saturate 24
threads". You should aim for "basically completely single-threaded".
Judging by your "CPU utilization of 60-70%", I think your "LOW" is off
by at least an order of magnitude.

Linus
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello, Linus.

On Thu, May 18, 2023 at 05:41:29PM -0700, Linus Torvalds wrote:
...
> That commit f959325e6ac3 ("fsverity: Remove WQ_UNBOUND from fsverity
> read workqueue") really talks about startup costs. They are about
> things like "page in the executable", which is all almost 100%
> serialized with no parallelism at all. Even read-ahead ends up being
> serial, in that it's likely one single contiguous IO.

I should have explained my thought process better here. I don't think the
fsverity and other similar recent reports on heterogeneous ARM CPUs are
caused directly by workqueue. Please take a look at the following message
from Brian Norris.

https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@google.com/T/#u

While it's difficult to tell anything definitive from the message, the
reported performance ordering is

4.19 > pinning worker to a CPU >> fixating CPU freq > kthread_worker

where the difference between 4.19 and pinning to one CPU is pretty small.
So, it does line up with other reports in that the source of the higher
latencies and lower performance is work items getting sprayed across
CPUs.

However, the two kernel versions tested are 4.19 and 5.15. I audited the
commits in-between and didn't spot anything which would materially change
unbound workers' affinities or how unbound work items would be assigned to
them.

Also, f959325e6ac3 ("fsverity: Remove WQ_UNBOUND from fsverity read
workqueue") is reporting 30 fold increase in scheduler latency, which I take
to be the time from a work item being queued to the start of execution. That's
unlikely to be just from crossing a cache boundary. There must be other
interactions (e.g. with some powersaving state transitions).

That said, workqueue, by spraying work items across cache boundaries, does
provide a condition in which this sort of problem can be significantly
amplified. workqueue isn't and can't fix the root cause of these problems;
however,

* The current workqueue behavior is silly on machines with multiple L3
caches such as recent AMD CPUs w/ chiplets and heterogeneous ARM ones.
Improving workqueue locality is likely to lessen the severity of the
recently reported latency problems, possibly to the extent that it won't
matter anymore.

* The fact that the remedy people are going for is switching to percpu
workqueue is bad. That is going to hurt other use cases, often severely,
and their only solution would be reverting back to unbound workqueues.

So, my expectation with the posted patchset is that most of the severe
chrome problems will go away, hopefully. Even in the case that that's not
sufficient, unbound workqueues now provide enough flexibility to work around
these problems without resorting to switching to per-cpu workqueues thus
avoiding affecting other use cases negatively.

The performance benchmarks are not directed towards the reported latency
problems. The reported magnitude seems very hardware specific to me and I
have no way of reproducing it. The benchmarks are more to justify switching the
default boundaries from NUMA to cache.

> Yes, latency tends to be harder to benchmark than throughput, but I
> really think latency trumps throughput 95% of the time. And all your
> benchmark loads looked like throughput loads to me: they just weren't
> using *all* the CPU capacity you had.
>
> Yes, writeback can have lovely throughput behavior and saturate the IO
> because you have lots of parallelism. But reads are often 100% serial
> for one thread, and often you don't *have* more than one thread.
>
> So I think your "not enough work to saturate" is still ludicrously
> over-the-top. You should not aim for "not enough work to saturate 24
> threads". You should aim for "basically completely single-threaded".
> Judging by your "CPU utilization of 60-70%", I think your "LOW" is off
> by at least an order of magnitude.


More Experiments
================

I ran several more test sets. An extra workqueue configuration CPU+STRICT is
added which is very similar to using CPU_INTENSIVE per-cpu workqueues. Also,
CPU util is now per-single-CPU, ie. 100% indicates using one full CPU rather
than all CPUs because otherwise the numbers were too small. The tests are
run against dm-crypt on a tmpfs backed loop device.


4. LLOW

Similar benchmark as before but --bs is now 512 bytes and --iodepth and
--numjobs are dialed down to 4 and 1, respectively. Compared to LOW, both IO
size and concurrency are 64 times lower.

taskset 0x8 fio --filename=/dev/dm-1 --direct=1 --rw=randrw --bs=512 \
--ioengine=libaio --iodepth=4 --runtime=60 --numjobs=1 \
--time_based --group_reporting --name=iops-test-job --verify=sha512

Note that fio is pinned to one CPU because otherwise I was getting random
bi-modal behaviors depending on what the scheduler was doing, making
comparisons difficult.

Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
----------------------------------------------------------------------
SYSTEM 55.08 ±0.29 290.80 ±0.64 0.00
CACHE 64.42 ±0.47 291.51 ±0.30 +16.96
CACHE+STRICT 65.18 ±1.14 303.74 ±1.79 +18.34
CPU+STRICT 32.86 ±0.34 159.05 ±0.37 -48.99
SYSTEM+LOCAL 56.76 ±0.30 286.78 ±0.50 +3.05
CACHE+LOCAL 57.74 ±0.11 291.65 ±0.80 +4.83

The polarities of +LOCAL's relative BWs flipped, showing gains over SYSTEM.
However, the fact that they're significantly worse than CACHE didn't change.

This is 4 in-flight 512 byte IOs, entirely plausible in systems of any size.
Even at this low level of concurrency, the downside of using per-cpu
workqueue (CPU+STRICT) is clear.


5. SYNC

Let's push it further. It's now single threaded synchronous read/write(2)'s
w/ 512 byte blocks. If we're going to be able to extract meaningful gains
from localizing to the issuing CPU, this should show it.

taskset 0x8 fio --filename=/dev/dm-1 --direct=1 --rw=randrw --bs=512 \
--ioengine=io_uring --iodepth=1 --runtime=60 --numjobs=1 \
--time_based --group_reporting --name=iops-test-job --verify=sha512

Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
----------------------------------------------------------------------
SYSTEM 18.64 ±0.19 88.41 ±0.47 0.00
CACHE 21.46 ±0.11 91.39 ±0.26 +15.13
CACHE+STRICT 20.80 ±0.23 90.68 ±0.15 +11.59
CPU+STRICT 21.12 ±0.61 95.29 ±0.51 -1.58
SYSTEM+LOCAL 20.80 ±0.20 90.12 ±0.34 +11.59
CACHE+LOCAL 21.46 ±0.09 93.18 ±0.04 +15.13

Now +LOCAL's are doing quite a bit better than SYSTEM but still can't beat
CACHE. Interestingly, CPU+STRICT performs noticeably worse than others while
occupying the CPU for longer. I haven't thought too much about why this would
be; maybe the benefits of using resources on other CPUs beat the
overhead of doing so?


Summary
=======

The overall conclusion doesn't change much. +LOCAL's fare better with lower
concurrency but still can't beat CACHE. I tried other combinations and
against SSD too. Nothing significantly deviates from the overall pattern.

I'm sure we can concoct a workload which uses significantly less than one
full CPU (so that utilizing other CPUs' resources doesn't bring enough gain)
and can stay within L1/2 to maximize the benefit of not having to go through
L3, but it looks like that's going to take some stretching. At least on the
CPU I'm testing, it looks like letting loose in each L3 domain is the right
thing to do.

And, ARM folks, AFAICS, workqueue is unlikely to be the root cause of the
significant regressions observed after switching to a recent kernel. However,
this patchset should hopefully alleviate the symptoms significantly, or,
failing that, at least make working around them a lot easier. So, please test and
please don't switch CPU-heavy unbound workqueues to per-cpu. That's
guaranteed to hurt other heavier-weight usages.

Thanks.

--
tejun
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Oh, a bit of addition.

Once below saturation, latency and bw are mostly two sides of the same
coin, but just to be sure, here are latency results. The single-threaded sync
IO is run with a 1ms interval between IOs.

taskset 0x8 fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=512 \
--ioengine=sync --iodepth=1 --runtime=60 --numjobs=1 --time_based \
--group_reporting --name=iops-test-job --verify=sha512 --thinktime=1ms

SYSTEM

read: IOPS=480, BW=240KiB/s (246kB/s)(14.1MiB/60001msec)
clat (usec): min=8, max=401, avg=30.96, stdev= 9.60
lat (usec): min=8, max=401, avg=31.01, stdev= 9.60
clat percentiles (usec):
| 1.00th=[ 11], 5.00th=[ 13], 10.00th=[ 25], 20.00th=[ 27],
| 30.00th=[ 28], 40.00th=[ 29], 50.00th=[ 29], 60.00th=[ 30],
| 70.00th=[ 32], 80.00th=[ 42], 90.00th=[ 44], 95.00th=[ 44],
| 99.00th=[ 46], 99.50th=[ 46], 99.90th=[ 56], 99.95th=[ 71],
| 99.99th=[ 253]
bw ( KiB/s): min= 214, max= 265, per=99.85%, avg=240.29, stdev=11.35, samples=119
iops : min= 428, max= 530, avg=480.59, stdev=22.70, samples=119

CPU_STRICT

read: IOPS=474, BW=237KiB/s (243kB/s)(385KiB/1624msec)
clat (usec): min=9, max=240, avg=28.00, stdev=11.20
lat (usec): min=9, max=240, avg=28.05, stdev=11.20
clat percentiles (usec):
| 1.00th=[ 12], 5.00th=[ 26], 10.00th=[ 26], 20.00th=[ 26],
| 30.00th=[ 27], 40.00th=[ 28], 50.00th=[ 28], 60.00th=[ 28],
| 70.00th=[ 29], 80.00th=[ 30], 90.00th=[ 31], 95.00th=[ 31],
| 99.00th=[ 32], 99.50th=[ 50], 99.90th=[ 241], 99.95th=[ 241],
| 99.99th=[ 241]

CACHE

read: IOPS=479, BW=240KiB/s (245kB/s)(14.0MiB/60002msec)
clat (nsec): min=7874, max=75922, avg=13342.34, stdev=6906.53
lat (nsec): min=7904, max=75952, avg=13386.08, stdev=6906.60
clat percentiles (nsec):
| 1.00th=[ 8384], 5.00th=[ 8896], 10.00th=[ 9152], 20.00th=[ 9408],
| 30.00th=[ 9536], 40.00th=[ 9920], 50.00th=[10432], 60.00th=[10688],
| 70.00th=[11072], 80.00th=[13632], 90.00th=[27264], 95.00th=[28288],
| 99.00th=[30592], 99.50th=[30848], 99.90th=[41216], 99.95th=[56064],
| 99.99th=[74240]
bw ( KiB/s): min= 204, max= 269, per=99.69%, avg=239.67, stdev=11.02, samples=119
iops : min= 408, max= 538, avg=479.34, stdev=22.04, samples=119


It's a bit confusing because fio switched to printing nsecs for CACHE, but
CPU_STRICT (per-cpu)'s average completion latency is, expectedly, better
than SYSTEM's - 28us vs. 31us - while CACHE's is way better at 13.3us.

Thanks.

--
tejun
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Fri, May 19, 2023 at 4:03 PM Tejun Heo <tj@kernel.org> wrote:
>
> Oh, a bit of addition.
>
> Once below saturation, latency and bw are mostly the two sides of the same
> coin but just to be sure, here are latency results.

Ok, looks good, consider me convinced.

I still would like to hear from the actual arm crowd that caused this
all and have those odd BIG.little cases.

Linus
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:

> Most of the patchset is workqueue-internal plumbing and probably isn't
> terribly interesting. However, the performance picture turned out less
> straightforward than I had hoped, most likely due to loss of
> work-conservation from the scheduler in high fan-out scenarios. I'll
> describe it in this cover letter. Please read on.

> Here's the relevant part of the experiment setup.
>
> * Ryzen 9 3900x - 12 cores / 24 threads spread across 4 L3 caches.
> Core-to-core latencies across L3 caches are ~2.6x worse than within each
> L3 cache. ie. it's worse but not hugely so. This means that the impact of
> L3 cache locality is noticeable in these experiments but may be subdued
> compared to other setups.

*blink*, 12 cores with 4 LLCs ? that's a grand total of 3 cores / 6
threads per LLC. That's puny.

/me goes google a bit.. So these are Zen2 things which nominally have 4
cores per CCX which has 16M of L3, but these must be binned parts with only
3 functional cores per CCX.

Zen3 then went to 8 cores per CCX with double the L3.

> 2. MED: Fewer issuers, enough work for saturation
>
> Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
> ----------------------------------------------------------------------
> SYSTEM 1155.40 ±0.89 97.41 ±0.05 0.00
> CACHE 1154.40 ±1.14 96.15 ±0.09 -0.09
> CACHE+STRICT 1112.00 ±4.64 93.26 ±0.35 -3.76
> SYSTEM+LOCAL 1066.80 ±2.17 86.70 ±0.10 -7.67
> CACHE+LOCAL 1034.60 ±1.52 83.00 ±0.07 -10.46
>
> There are still eight issuers and plenty of work to go around. However, now,
> depending on the configuration, we're starting to see a significant loss of
> work-conservation where CPUs sit idle while there's work to do.
>
> * CACHE is doing okay. It's just a bit slower. Further testing may be needed
> to definitively confirm the bandwidth gap but the CPU util difference
> seems real, so while minute, it did lose a bit of work-conservation.
> Comparing it to CACHE+STRICT, it's starting to show the benefits of
> non-strict scopes.

So wakeup based placement is mostly all about LLC, and given this thing
has dinky small LLCs it will pile up on the one LLC you target and leave
the others idle until the regular idle balancer decides to make an
appearance and move some around.

But if these are fairly short running tasks, I can well see that not
going to help much.


Much of this was tuned back when there was 1 L3 per Node; something
which is still more or less true for Intel but clearly not for these
things.


The below is a bit crude and completely untested, but it might help. The
flip side of that coin is of course that people are going to complain
about how selecting a CPU is more expensive now and how this hurts their
performance :/

Basically it will try and iterate all L3s in a node; wakeup will still
refuse to cross node boundaries.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0ca13ac..ddb7f16a07a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7027,6 +7027,33 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
return idle_cpu;
}

+static int
+select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ struct sched_domain *sd_node = rcu_dereference(per_cpu(sd_node, target));
+ struct sched_group *sg;
+
+ if (!sd_node || sd_node == sd)
+ return -1;
+
+ sg = sd_node->groups;
+ do {
+ int cpu = cpumask_first(sched_group_span(sg));
+ struct sched_domain *sd_child;
+
+ sd_child = per_cpu(sd_llc, cpu);
+ if (sd_child != sd) {
+ int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
+ if ((unsigned int)i < nr_cpumask_bits)
+ return i;
+ }
+
+ sg = sg->next;
+ } while (sg != sd_node->groups);
+
+ return -1;
+}
+
/*
* Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
* the task fits. If no CPU is big enough, but there are idle ones, try to
@@ -7199,6 +7226,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if ((unsigned)i < nr_cpumask_bits)
return i;

+ if (sched_feat(SIS_NODE)) {
+ i = select_idle_node(p, sd, target);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }
+
return target;
}

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..f965cd4a981e 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_NODE, true)

/*
* Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 678446251c35..d2e0e2e496a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1826,6 +1826,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DECLARE_PER_CPU(int, sd_llc_size);
DECLARE_PER_CPU(int, sd_llc_id);
DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_node);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ca4472281c28..d94cbc2164ca 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -667,6 +667,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_size);
DEFINE_PER_CPU(int, sd_llc_id);
DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_node);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
@@ -691,6 +692,18 @@ static void update_top_cache_domain(int cpu)
per_cpu(sd_llc_id, cpu) = id;
rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);

+ while (sd && sd->parent) {
+ /*
+ * SD_NUMA is the first domain spanning nodes, therefore @sd
+ * must be the domain that spans a single node.
+ */
+ if (sd->parent->flags & SD_NUMA)
+ break;
+
+ sd = sd->parent;
+ }
+ rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
+
sd = lowest_flag_domain(cpu, SD_NUMA);
rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Tue, 23 May 2023 at 13:18, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:
>
> > Most of the patchset is workqueue-internal plumbing and probably isn't
> > terribly interesting. However, the performance picture turned out less
> > straightforward than I had hoped, most likely due to loss of
> > work-conservation from the scheduler in high fan-out scenarios. I'll
> > describe it in this cover letter. Please read on.
>
> > Here's the relevant part of the experiment setup.
> >
> > * Ryzen 9 3900x - 12 cores / 24 threads spread across 4 L3 caches.
> > Core-to-core latencies across L3 caches are ~2.6x worse than within each
> > L3 cache. ie. it's worse but not hugely so. This means that the impact of
> > L3 cache locality is noticeable in these experiments but may be subdued
> > compared to other setups.
>
> *blink*, 12 cores with 4 LLCs ? that's a grand total of 3 cores / 6
> threads per LLC. That's puny.
>
> /me goes google a bit.. So these are Zen2 things which nominally have 4
> cores per CCX which has 16M of L3, but these must be binned parts with only
> 3 functional cores per CCX.
>
> Zen3 then went to 8 cores per CCX with double the L3.
>
> > 2. MED: Fewer issuers, enough work for saturation
> >
> > Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
> > ----------------------------------------------------------------------
> > SYSTEM 1155.40 ±0.89 97.41 ±0.05 0.00
> > CACHE 1154.40 ±1.14 96.15 ±0.09 -0.09
> > CACHE+STRICT 1112.00 ±4.64 93.26 ±0.35 -3.76
> > SYSTEM+LOCAL 1066.80 ±2.17 86.70 ±0.10 -7.67
> > CACHE+LOCAL 1034.60 ±1.52 83.00 ±0.07 -10.46
> >
> > There are still eight issuers and plenty of work to go around. However, now,
> > depending on the configuration, we're starting to see a significant loss of
> > work-conservation where CPUs sit idle while there's work to do.
> >
> > * CACHE is doing okay. It's just a bit slower. Further testing may be needed
> > to definitively confirm the bandwidth gap but the CPU util difference
> > seems real, so while minute, it did lose a bit of work-conservation.
> > Comparing it to CACHE+STRICT, it's starting to show the benefits of
> > non-strict scopes.
>
> So wakeup based placement is mostly all about LLC, and given this thing
> has dinky small LLCs it will pile up on the one LLC you target and leave
> the others idle until the regular idle balancer decides to make an
> appearance and move some around.
>
> But if these are fairly short running tasks, I can well see that not
> going to help much.
>
>
> Much of this was tuned back when there was 1 L3 per Node; something
> which is still more or less true for Intel but clearly not for these
> things.
>
>
> The below is a bit crude and completely untested, but it might help. The
> flip side of that coin is of course that people are going to complain
> about how selecting a CPU is more expensive now and how this hurts their
> performance :/
>
> Basically it will try and iterate all L3s in a node; wakeup will still
> refuse to cross node boundaries.

That reminds me of some discussion about systems with a fast on-die
interconnect where we would like to run wider than the LLC at wakeup (i.e.
DIE level), something like the CLUSTER level but on the other side of
MC

Another possibility to investigate would be that each wakeup of a
worker is mostly unrelated to the previous one and only the waker
matters, so we should use -1 for prev_cpu

>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..ddb7f16a07a9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7027,6 +7027,33 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> return idle_cpu;
> }
>
> +static int
> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> + struct sched_domain *sd_node = rcu_dereference(per_cpu(sd_node, target));
> + struct sched_group *sg;
> +
> + if (!sd_node || sd_node == sd)
> + return -1;
> +
> + sg = sd_node->groups;
> + do {
> + int cpu = cpumask_first(sched_group_span(sg));
> + struct sched_domain *sd_child;
> +
> + sd_child = per_cpu(sd_llc, cpu);
> + if (sd_child != sd) {
> + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> + if ((unsigned int)i < nr_cpumask_bits)
> + return i;
> + }
> +
> + sg = sg->next;
> + } while (sg != sd_node->groups);
> +
> + return -1;
> +}
> +
> /*
> * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> * the task fits. If no CPU is big enough, but there are idle ones, try to
> @@ -7199,6 +7226,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> if ((unsigned)i < nr_cpumask_bits)
> return i;
>
> + if (sched_feat(SIS_NODE)) {
> + i = select_idle_node(p, sd, target);
> + if ((unsigned)i < nr_cpumask_bits)
> + return i;
> + }
> +
> return target;
> }
>
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index ee7f23c76bd3..f965cd4a981e 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> */
> SCHED_FEAT(SIS_PROP, false)
> SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_NODE, true)
>
> /*
> * Issue a WARN when we do multiple update_rq_clock() calls
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 678446251c35..d2e0e2e496a6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1826,6 +1826,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> DECLARE_PER_CPU(int, sd_llc_size);
> DECLARE_PER_CPU(int, sd_llc_id);
> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_node);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index ca4472281c28..d94cbc2164ca 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -667,6 +667,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> DEFINE_PER_CPU(int, sd_llc_size);
> DEFINE_PER_CPU(int, sd_llc_id);
> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_node);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> @@ -691,6 +692,18 @@ static void update_top_cache_domain(int cpu)
> per_cpu(sd_llc_id, cpu) = id;
> rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>
> + while (sd && sd->parent) {
> + /*
> + * SD_NUMA is the first domain spanning nodes, therefore @sd
> + * must be the domain that spans a single node.
> + */
> + if (sd->parent->flags & SD_NUMA)
> + break;
> +
> + sd = sd->parent;
> + }
> + rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
> +
> sd = lowest_flag_domain(cpu, SD_NUMA);
> rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
>
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Mon, May 22, 2023 at 6:51 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ok, looks good, consider me convinced.

Ugh, and I just got the erofs pull, which made me aware of how there's
*another* hack for "worker threads don't work well on Android", which
now doubled down on setting special scheduler flags for them too.

See commit cf7f2732b4b8 ("erofs: use HIPRI by default if per-cpu
kthreads are enabled"), but the whole deeper problem goes back much
further (commit 3fffb589b9a6: "erofs: add per-cpu threads for
decompression as an option").

See also

https://lore.kernel.org/lkml/CAB=BE-SBtO6vcoyLNA9F-9VaN5R0t3o_Zn+FW8GbO6wyUqFneQ@mail.gmail.com/

I really hate how we have random drivers and filesystems doing random
workarounds for "kthread workers don't work well enough, so add random
tweaks".

> I still would like to hear from the actual arm crowd that caused this
> all and have those odd BIG.little cases.

Sandeep, any chance that you could try out Tejun's series with plain
workers, and compare it to the percpu threads?

See

https://lore.kernel.org/lkml/20230519001709.2563-1-tj@kernel.org/

for the beginning of this thread. The aim really is to try to figure
out - and hopefully fix - the fact that Android loads seem to trigger
all these random hacks that don't make any sense on a high level, and
seem to be workarounds rather than real fixes. Scheduling percpu
workers is a horrible horrible fix and not at all good in general, and
it would be much better if we could just improve workqueue behavior in
the odd Android situation.

Linus
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Tue, 2023-05-23 at 10:59 -0700, Linus Torvalds wrote:
>
> I really hate how we have random drivers and filesystems doing random
> workarounds for "kthread workers don't work well enough, so add
> random
> tweaks".

Part of this seems to be due to the way CFS works.

CFS policy seems to make sense for a lot of workloads, but
there are some cases with kworkers where the CFS policies
just don't work quite right. Unfortunately the scheduler
problem space is not all that well explored, and it isn't
clear what the desired behavior of a scheduler should be
in every case.
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
>
> Sandeep, any chance that you could try out Tejun's series with plain
> workers, and compare it to the percpu threads?
>
> See
>
> https://lore.kernel.org/lkml/20230519001709.2563-1-tj@kernel.org/
>
Hi Linus,
I will try out the series and report back with the results.

Thanks,
Sandeep.
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Tue, May 23, 2023 at 06:12:45PM +0200, Vincent Guittot wrote:

> Another possibility to investigate would be that each wakeup of a
> worker is mostly unrelated to the previous one and only the waker
> matters, so we should use -1 for prev_cpu

Tejun is actually overriding p->wake_cpu in this series to target a
specific LLC -- with the explicit purpose to keep the workers near
enough.

But the problem is that with lots of short tasks we then overload the
LLC and are not running long enough for the idle load-balancer to spread
things, leading to idle time.

And that is specific to this lots of little LLC topologies.
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Wed, 24 May 2023 at 09:35, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, May 23, 2023 at 06:12:45PM +0200, Vincent Guittot wrote:
>
> > Another possibility to investigate would be that each wakeup of a
> > worker is mostly unrelated to the previous one and only the waker
> > matters, so we should use -1 for prev_cpu
>
> Tejun is actually overriding p->wake_cpu in this series to target a
> specific LLC -- with the explicit purpose to keep the workers near
> enough.

yes, so -1 for prev_cpu was a good way to forget the irrelevant prev
CPU without trying to abuse it in order to wake up close to the waker

>
> But the problem is that with lots of short tasks we then overload the
> LLC and are not running long enough for the idle load-balancer to spread
> things, leading to idle time.

I expect workers not to pile up in the same LLC if we keep the
workqueue CPU affinity at the die level. The workers will wake up in the LLCs
of the callers and the callers should be spread across the die

>
> And that is specific to this lots of little LLC topologies.
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello,

On Tue, May 23, 2023 at 01:18:18PM +0200, Peter Zijlstra wrote:
> > * Ryzen 9 3900x - 12 cores / 24 threads spread across 4 L3 caches.
> > Core-to-core latencies across L3 caches are ~2.6x worse than within each
> > L3 cache. ie. it's worse but not hugely so. This means that the impact of
> > L3 cache locality is noticeable in these experiments but may be subdued
> > compared to other setups.
>
> *blink*, 12 cores with 4 LLCs ? that's a grand total of 3 cores / 6
> threads per LLC. That's puny.
>
> /me goes google a bit.. So these are Zen2 things which nominally have 4
> cores per CCX which has 16M of L3, but these must be binned parts with only
> 3 functional cores per CCX.

Yeah.

> Zen3 then went to 8 cores per CCX with double the L3.

Yeah, they basically doubled each L3 domain.

> > 2. MED: Fewer issuers, enough work for saturation
> >
> > Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
> > ----------------------------------------------------------------------
> > SYSTEM 1155.40 ±0.89 97.41 ±0.05 0.00
> > CACHE 1154.40 ±1.14 96.15 ±0.09 -0.09
> > CACHE+STRICT 1112.00 ±4.64 93.26 ±0.35 -3.76
> > SYSTEM+LOCAL 1066.80 ±2.17 86.70 ±0.10 -7.67
> > CACHE+LOCAL 1034.60 ±1.52 83.00 ±0.07 -10.46
> >
> > There are still eight issuers and plenty of work to go around. However, now,
> > depending on the configuration, we're starting to see a significant loss of
> > work-conservation where CPUs sit idle while there's work to do.
> >
> > * CACHE is doing okay. It's just a bit slower. Further testing may be needed
> > to definitively confirm the bandwidth gap but the CPU util difference
> > seems real, so while minute, it did lose a bit of work-conservation.
> > Comparing it to CACHE+STRICT, it's starting to show the benefits of
> > non-strict scopes.
>
> So wakeup based placement is mostly all about LLC, and given this thing
> has dinky small LLCs it will pile up on the one LLC you target and leave
> the others idle until the regular idle balancer decides to make an
> appearance and move some around.

While this processor configuration isn't the most typical, I have a
difficult time imagining newer chiplet designs behaving much differently.
The core problem is that while there is benefit to improving locality within
each L3 domain, the distance across L3 domains isn't that far either, so
unless loads get balanced aggressively across L3 domains while staying
within L3 when immediately possible, you lose capacity.

> But if these are fairly short running tasks, I can well see that not
> going to help much.

Right, the tasks themselves aren't necessarily short-running but they do
behave that way due to discontinuation and repatriation across work item
boundaries.

> Much of this was tuned back when there was 1 L3 per Node; something
> which is still more or less true for Intel but clearly not for these
> things.

Yeah, the same for workqueue. It assumed that the distance within a NUMA
node is indistinguishable, which no longer holds.

> The below is a bit crude and completely untested, but it might help. The
> flip side of that coin is of course that people are going to complain
> about how selecting a CPU is more expensive now and how this hurts their
> performance :/
>
> Basically it will try and iterate all L3s in a node; wakeup will still
> refuse to cross node boundaries.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 48b6f0ca13ac..ddb7f16a07a9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7027,6 +7027,33 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> return idle_cpu;
> }
>
> +static int
> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> +{
> + struct sched_domain *sd_node = rcu_dereference(per_cpu(sd_node, target));

I had to rename the local variable because it was masking the global percpu
one during initialization, but after that the result looks really good. I did
the same HIGH, MED, LOW scenarios. +SN indicates that SIS_NODE is turned on.
The machine set-up was slightly different so the baseline numbers aren't
directly comparable to the original results; however, the relative bw %
comparisons should hold.

RESULTS
=======

1. HIGH: Enough issuers and work spread across the machine

Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
----------------------------------------------------------------------
SYSTEM 1162.80 ±0.45 99.29 ±0.03 0.00
CACHE 1169.60 ±0.55 99.35 ±0.02 +0.58
CACHE+SN 1169.00 ±0.00 99.37 ±0.01 +0.53
CACHE+SN+LOCAL 1165.00 ±0.71 99.48 ±0.03 +0.19

This is an over-saturation scenario and the CPUs aren't having any trouble
finding work to do no matter what. The slight gain is most likely from
improved L3 locality and +SN doesn't behave noticeably differently from
CACHE.


2. MED: Fewer issuers, enough work for saturation

Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
----------------------------------------------------------------------
SYSTEM 1157.20 ±0.84 97.34 ±0.07 0.00
CACHE 1155.60 ±1.34 96.09 ±0.10 -0.14
CACHE+SN 1163.80 ±0.45 96.90 ±0.07 +0.57
CACHE+SN+LOCAL 1052.00 ±1.00 85.84 ±0.11 -0.09

+LOCAL is still sad but CACHE+SN is now maintaining a gain over SYSTEM
similar to that in the HIGH scenario. Compared to CACHE, CACHE+SN shows slight
but discernible increases in both bandwidth and CPU util, which is great.


3. LOW: Even fewer issuers, not enough work to saturate

Bandwidth (MiBps) CPU util (%) vs. SYSTEM BW (%)
----------------------------------------------------------------------
SYSTEM 995.00 ±4.42 75.60 ±0.27 0.00
CACHE 971.00 ±3.39 74.86 ±0.18 -2.41
CACHE+SN 998.40 ±4.04 74.91 ±0.27 +0.34
CACHE+SN+LOCAL 957.60 ±2.51 69.80 ±0.17 -3.76

+LOCAL still sucks but CACHE+SN wins over SYSTEM albeit within a single
sigma. It's clear that CACHE+SN outperforms CACHE by a significant margin
and makes up the loss compared to SYSTEM.


CONCLUSION
==========

With SIS_NODE enabled, there's no downside to CACHE. It always
outperforms or matches SYSTEM. It's possible that the overhead of searching
further for an idle CPU is more pronounced on bigger machines but most
likely so will be the gains. This looks like a no brainer improvement to me.

Thanks.

--
tejun
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Thu, May 25, 2023 at 03:12:45PM -1000, Tejun Heo wrote:

> CONCLUSION
> ==========
>
> With the SIS_NODE enabled, there's no downside to CACHE. It always
> outperforms or matches SYSTEM. It's possible that the overhead of searching
> further for an idle CPU is more pronounced on bigger machines but most
> likely so will be the gains. This looks like a no brainer improvement to me.

OK, looking at it again, I think it can be done a little simpler, but it
should be mostly the same.

I'll go queue the below in sched/core, we'll see if anything comes up
negative.

---
Subject: sched/fair: Multi-LLC select_idle_sibling()
From: Peter Zijlstra <peterz@infradead.org>
Date: Tue May 30 13:20:46 CEST 2023

Tejun reported that when he targets workqueues towards a specific LLC
on his Zen2 machine with 3 cores / LLC and 4 LLCs in total, he gets
significant idle time.

This is, of course, because of how select_idle_sibling() will not
consider anything outside of the local LLC, and since all these tasks
are short running the periodic idle load balancer is ineffective.

And while it is good to keep work cache local, it is better to not
have significant idle time. Therefore, have select_idle_sibling() try
other LLCs inside the same node when the local one comes up empty.

Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 1 +
2 files changed, 39 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7028,6 +7028,38 @@ static int select_idle_cpu(struct task_s
}

/*
+ * For the multiple-LLC per node case, make sure to try the other LLC's if the
+ * local LLC comes up empty.
+ */
+static int
+select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ struct sched_domain *parent = sd->parent;
+ struct sched_group *sg;
+
+ /* Make sure to not cross nodes. */
+ if (!parent || parent->flags & SD_NUMA)
+ return -1;
+
+ sg = parent->groups;
+ do {
+ int cpu = cpumask_first(sched_group_span(sg));
+ struct sched_domain *sd_child;
+
+ sd_child = per_cpu(sd_llc, cpu);
+ if (sd_child != sd) {
+ int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }
+
+ sg = sg->next;
+ } while (sg != parent->groups);
+
+ return -1;
+}
+
+/*
* Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
* the task fits. If no CPU is big enough, but there are idle ones, try to
* maximize capacity.
@@ -7199,6 +7231,12 @@ static int select_idle_sibling(struct ta
if ((unsigned)i < nr_cpumask_bits)
return i;

+ if (sched_feat(SIS_NODE)) {
+ i = select_idle_node(p, sd, target);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }
+
return target;
}

--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_NODE, true)

/*
* Issue a WARN when we do multiple update_rq_clock() calls
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello Vincent,

On Tue, May 23, 2023 at 06:12:45PM +0200, Vincent Guittot wrote:
> On Tue, 23 May 2023 at 13:18, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So wakeup based placement is mostly all about LLC, and given this thing
> > has dinky small LLCs it will pile up on the one LLC you target and leave
> > the others idle until the regular idle balancer decides to make an
> > appearance and move some around.
> >
> > But if these are fairly short running tasks, I can well see that not
> > going to help much.
> >
> >
> > Much of this was tuned back when there was 1 L3 per Node; something
> > which is still more or less true for Intel but clearly not for these
> > things.
> >
> >
> > The below is a bit crude and completely untested, but it might help. The
> > flip side of that coin is of course that people are going to complain
> > about how selecting a CPU is more expensive now and how this hurts their
> > performance :/
> >
> > Basically it will try and iterate all L3s in a node; wakeup will still
> > refuse to cross node boundaries.
>
> That reminds me of some discussion about systems with a fast on-die
> interconnect where we would like to run wider than the LLC at wakeup (i.e.
> at DIE level), something like the CLUSTER level but on the other side of
> MC
>

Adding Libo Chen who was a part of this discussion. IIRC, the problem was
that there was no MC domain on that system, which would have made the
SMT domain the sd_llc. But since the cores are single threaded, the SMT
domain gets degenerated, thus leaving no domain which has the
SD_SHARE_PKG_RESOURCES flag.

If I understand correctly, Peter's patch won't help in such a
situation.
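
For reference, sd_llc is derived from the highest domain with
SD_SHARE_PKG_RESOURCES, so with SMT degenerated and no MC domain nothing
qualifies and the wakeup scan has nothing to work with. A simplified
paraphrase of update_top_cache_domain() in kernel/sched/topology.c
(details vary by kernel version):

static void update_top_cache_domain(int cpu)
{
	struct sched_domain_shared *sds = NULL;
	struct sched_domain *sd;
	int id = cpu, size = 1;

	/* Highest domain sharing package resources, i.e. the LLC. */
	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
	if (sd) {
		id = cpumask_first(sched_domain_span(sd));
		size = cpumask_weight(sched_domain_span(sd));
		sds = sd->shared;
	}

	/* If no such domain exists, sd_llc stays NULL and sd_llc_size is 1. */
	rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
	per_cpu(sd_llc_size, cpu) = size;
	per_cpu(sd_llc_id, cpu) = id;
	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
}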

However, it should help POWER10 which has the SMT domain as the LLC
and previously it was observed that moving the wakeup search to the
parent domain was helpful (Ref:
https://lore.kernel.org/lkml/1617341874-1205-1-git-send-email-ego@linux.vnet.ibm.com/)


--
Thanks and Regards
gautham.


> Another possibility to investigate would be that each wakeup of a
> worker is mostly unrelated to the previous one and only cares about the
> waker, so we should use -1 for the prev_cpu
>
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 48b6f0ca13ac..ddb7f16a07a9 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7027,6 +7027,33 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > return idle_cpu;
> > }
> >
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > + struct sched_domain *sd_node = rcu_dereference(per_cpu(sd_node, target));
> > + struct sched_group *sg;
> > +
> > + if (!sd_node || sd_node == sd)
> > + return -1;
> > +
> > + sg = sd_node->groups;
> > + do {
> > + int cpu = cpumask_first(sched_group_span(sg));
> > + struct sched_domain *sd_child;
> > +
> > + sd_child = per_cpu(sd_llc, cpu);
> > + if (sd_child != sd) {
> > + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
> > + if ((unsigned int)i < nr_cpumask_bits)
> > + return i;
> > + }
> > +
> > + sg = sg->next;
> > + } while (sg != sd_node->groups);
> > +
> > + return -1;
> > +}
> > +
> > /*
> > * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> > * the task fits. If no CPU is big enough, but there are idle ones, try to
> > @@ -7199,6 +7226,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> > if ((unsigned)i < nr_cpumask_bits)
> > return i;
> >
> > + if (sched_feat(SIS_NODE)) {
> > + i = select_idle_node(p, sd, target);
> > + if ((unsigned)i < nr_cpumask_bits)
> > + return i;
> > + }
> > +
> > return target;
> > }
> >
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index ee7f23c76bd3..f965cd4a981e 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
> > */
> > SCHED_FEAT(SIS_PROP, false)
> > SCHED_FEAT(SIS_UTIL, true)
> > +SCHED_FEAT(SIS_NODE, true)
> >
> > /*
> > * Issue a WARN when we do multiple update_rq_clock() calls
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 678446251c35..d2e0e2e496a6 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1826,6 +1826,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > DECLARE_PER_CPU(int, sd_llc_size);
> > DECLARE_PER_CPU(int, sd_llc_id);
> > DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_node);
> > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index ca4472281c28..d94cbc2164ca 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -667,6 +667,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
> > DEFINE_PER_CPU(int, sd_llc_size);
> > DEFINE_PER_CPU(int, sd_llc_id);
> > DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
> > +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_node);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
> > DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
> > @@ -691,6 +692,18 @@ static void update_top_cache_domain(int cpu)
> > per_cpu(sd_llc_id, cpu) = id;
> > rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
> >
> > + while (sd && sd->parent) {
> > + /*
> > + * SD_NUMA is the first domain spanning nodes, therefore @sd
> > + * must be the domain that spans a single node.
> > + */
> > + if (sd->parent->flags & SD_NUMA)
> > + break;
> > +
> > + sd = sd->parent;
> > + }
> > + rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
> > +
> > sd = lowest_flag_domain(cpu, SD_NUMA);
> > rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
> >
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On 6/4/23 9:46 PM, Gautham R. Shenoy wrote:
> Hello Vincent,
>
> On Tue, May 23, 2023 at 06:12:45PM +0200, Vincent Guittot wrote:
>> On Tue, 23 May 2023 at 13:18, Peter Zijlstra <peterz@infradead.org> wrote:
>>> So wakeup based placement is mostly all about LLC, and given this thing
>>> has dinky small LLCs it will pile up on the one LLC you target and leave
>>> the others idle until the regular idle balancer decides to make an
>>> appearance and move some around.
>>>
>>> But if these are fairly short running tasks, I can well see that not
>>> going to help much.
>>>
>>>
>>> Much of this was tuned back when there was 1 L3 per Node; something
>>> which is still more or less true for Intel but clearly not for these
>>> things.
>>>
>>>
>>> The below is a bit crude and completely untested, but it might help. The
>>> flip side of that coin is of course that people are going to complain
>>> about how selecting a CPU is more expensive now and how this hurts their
>>> performance :/
>>>
>>> Basically it will try and iterate all L3s in a node; wakeup will still
>>> refuse to cross node boundaries.
>> That reminds me of some discussion about systems with a fast on-die
>> interconnect where we would like to run wider than the LLC at wakeup (i.e.
>> at DIE level), something like the CLUSTER level but on the other side of
>> MC
>>
> Adding Libo Chen who was a part of this discussion. IIRC, the problem was
> that there was no MC domain on that system, which would have made the
> SMT domain the sd_llc. But since the cores are single threaded, the SMT
> domain gets degenerated, thus leaving no domain which has the
> SD_SHARE_PKG_RESOURCES flag.
>
> If I understand correctly, Peter's patch won't help in such a
> situation.

Yes, there are some ARM platforms that have no L3 cache (they have SLCs, though, which are memory-side caches)
and no hyperthreading, so the lowest domain level is DIE and every single wakee task goes back to the previous
CPU no matter what because an LLC doesn't exist. For such platforms, we would want to expand the search range,
similarly to AMD, to take advantage of idle cores on the same DIE.

I think we need a new abstraction here to accommodate all these different cache topologies:
replace SD_LLC with, for example, SD_WAKEUP and allow a wider search range independent of the LLC.
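
Purely as an illustration of that idea (sd_wakeup and its selection are
hypothetical, not existing kernel symbols), the wakeup path could consult a
separately chosen per-CPU domain instead of sd_llc, with the topology code
picking LLC, DIE or a group of LLCs as appropriate:

/* Hypothetical sketch only; the names below are illustrative. */
DECLARE_PER_CPU(struct sched_domain __rcu *, sd_wakeup);

static int select_idle_wakeup_domain(struct task_struct *p, int target)
{
	struct sched_domain *sd;
	int i;

	/* Chosen per platform: LLC where it exists, DIE on LLC-less parts, etc. */
	sd = rcu_dereference(per_cpu(sd_wakeup, target));
	if (!sd)
		return target;

	i = select_idle_cpu(p, sd, test_idle_cores(target), target);
	if ((unsigned)i < nr_cpumask_bits)
		return i;

	return target;
}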


Libo


> However, it should help POWER10 which has the SMT domain as the LLC
> and previously it was observed that moving the wakeup search to the
> parent domain was helpful (Ref:
> https://lore.kernel.org/lkml/1617341874-1205-1-git-send-email-ego@linux.vnet.ibm.com/ )
>
>
> --
> Thanks and Regards
> gautham.
>
>
>> Another possibility to investigate would be that each wakeup of a
>> worker is mostly unrelated to the previous one and only cares about the
>> waker, so we should use -1 for the prev_cpu
>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 48b6f0ca13ac..ddb7f16a07a9 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -7027,6 +7027,33 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>> return idle_cpu;
>>> }
>>>
>>> +static int
>>> +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
>>> +{
>>> + struct sched_domain *sd_node = rcu_dereference(per_cpu(sd_node, target));
>>> + struct sched_group *sg;
>>> +
>>> + if (!sd_node || sd_node == sd)
>>> + return -1;
>>> +
>>> + sg = sd_node->groups;
>>> + do {
>>> + int cpu = cpumask_first(sched_group_span(sg));
>>> + struct sched_domain *sd_child;
>>> +
>>> + sd_child = per_cpu(sd_llc, cpu);
>>> + if (sd_child != sd) {
>>> + int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);
>>> + if ((unsigned int)i < nr_cpumask_bits)
>>> + return i;
>>> + }
>>> +
>>> + sg = sg->next;
>>> + } while (sg != sd_node->groups);
>>> +
>>> + return -1;
>>> +}
>>> +
>>> /*
>>> * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
>>> * the task fits. If no CPU is big enough, but there are idle ones, try to
>>> @@ -7199,6 +7226,12 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>>> if ((unsigned)i < nr_cpumask_bits)
>>> return i;
>>>
>>> + if (sched_feat(SIS_NODE)) {
>>> + i = select_idle_node(p, sd, target);
>>> + if ((unsigned)i < nr_cpumask_bits)
>>> + return i;
>>> + }
>>> +
>>> return target;
>>> }
>>>
>>> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
>>> index ee7f23c76bd3..f965cd4a981e 100644
>>> --- a/kernel/sched/features.h
>>> +++ b/kernel/sched/features.h
>>> @@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>>> */
>>> SCHED_FEAT(SIS_PROP, false)
>>> SCHED_FEAT(SIS_UTIL, true)
>>> +SCHED_FEAT(SIS_NODE, true)
>>>
>>> /*
>>> * Issue a WARN when we do multiple update_rq_clock() calls
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 678446251c35..d2e0e2e496a6 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -1826,6 +1826,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>> DECLARE_PER_CPU(int, sd_llc_size);
>>> DECLARE_PER_CPU(int, sd_llc_id);
>>> DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>> +DECLARE_PER_CPU(struct sched_domain __rcu *, sd_node);
>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>> DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>> index ca4472281c28..d94cbc2164ca 100644
>>> --- a/kernel/sched/topology.c
>>> +++ b/kernel/sched/topology.c
>>> @@ -667,6 +667,7 @@ DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>> DEFINE_PER_CPU(int, sd_llc_size);
>>> DEFINE_PER_CPU(int, sd_llc_id);
>>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>> +DEFINE_PER_CPU(struct sched_domain __rcu *, sd_node);
>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
>>> @@ -691,6 +692,18 @@ static void update_top_cache_domain(int cpu)
>>> per_cpu(sd_llc_id, cpu) = id;
>>> rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>>>
>>> + while (sd && sd->parent) {
>>> + /*
>>> + * SD_NUMA is the first domain spanning nodes, therefore @sd
>>> + * must be the domain that spans a single node.
>>> + */
>>> + if (sd->parent->flags & SD_NUMA)
>>> + break;
>>> +
>>> + sd = sd->parent;
>>> + }
>>> + rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
>>> +
>>> sd = lowest_flag_domain(cpu, SD_NUMA);
>>> rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
>>>
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hi,

On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:
> In terms of patches, 0021-0024 are probably the interesting ones.
>
> Brian Norris, Nathan Huckleberry and others experiencing wq perf problems
> -------------------------------------------------------------------------
>
> Can you please test this patchset and see whether the performance problems
> are resolved? After the patchset, unbound workqueues default to
> soft-affining on cache boundaries, which should hopefully resolve the issues
> that you guys have been seeing on recent kernels on heterogeneous CPUs.
>
> If you want to try different settings, please read:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/tree/Documentation/core-api/workqueue.rst?h=affinity-scopes-v1&id=e8f3505e69a526cc5fe40a4da5d443b7f9231016#n350

Thanks for the CC; my colleague tried out your patches (ported to 5.15
with some minor difficulty), and aside from some crashes (already noted
by others, although we didn't pull the proposed v2 fixes), he didn't
notice a significant change in performance on our particular test system
and WiFi-throughput workload. I don't think we expected a lot though,
per the discussion at:

https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@google.com/

Brian
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello,

On Mon, Jun 12, 2023 at 04:56:06PM -0700, Brian Norris wrote:
> Thanks for the CC; my colleague tried out your patches (ported to 5.15
> with some minor difficulty), and aside from some crashes (already noted
> by others, although we didn't pull the proposed v2 fixes), he didn't

Yeah, there were a few subtle bugs that v2 fixes.

> notice a significant change in performance on our particular test system
> and WiFi-throughput workload. I don't think we expected a lot though,
> per the discussion at:
>
> https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@google.com/

That's disappointing. I was actually expecting that the default behavior
would restrain migrations across L3 boundaries strongly enough to make a
meaningful difference. Can you enable WQ_SYSFS and test the following
configs?

1. affinity_scope = cache, affinity_strict = 1

2. affinity_scope = cpu, affinity_strict = 0

3. affinity_scope = cpu, affinity_strict = 1

#3 basically turns it into a percpu workqueue, so it should perform more or
less the same as a percpu workqueue without affecting everyone else.
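
For reference, with WQ_SYSFS those knobs should show up under
/sys/devices/virtual/workqueue/<wqname>/ as affinity_scope and
affinity_strict. A rough sketch of setting config #1 from userspace (the
paths and attribute names assume this patchset; adjust the workqueue name):

/* wq_affinity.c - sketch: write workqueue affinity knobs via sysfs. */
#include <stdio.h>

static int wq_set(const char *wq, const char *attr, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/virtual/workqueue/%s/%s", wq, attr);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(int argc, char **argv)
{
	const char *wq = argv[1];	/* name of a WQ_SYSFS workqueue */

	if (argc < 2)
		return 1;

	/* config #1: affinity_scope = cache, affinity_strict = 1 */
	if (wq_set(wq, "affinity_scope", "cache") ||
	    wq_set(wq, "affinity_strict", "1"))
		return 1;
	return 0;
}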

Any chance you can post the topology details of the affected setup? How are
the caches and cores laid out?

Thanks.

--
tejun
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hi Tejun,

On Tue, Jun 13, 2023 at 10:48 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Mon, Jun 12, 2023 at 04:56:06PM -0700, Brian Norris wrote:
> > Thanks for the CC; my colleague tried out your patches (ported to 5.15
> > with some minor difficulty), and aside from some crashes (already noted
> > by others, although we didn't pull the proposed v2 fixes), he didn't
>
> Yeah, there were a few subtle bugs that v2 fixes.
>
> > notice a significant change in performance on our particular test system
> > and WiFi-throughput workload. I don't think we expected a lot though,
> > per the discussion at:
> >
> > https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@google.com/
>
> That's disappointing. I was actually expecting that the default behavior
> would restrain migrations across L3 boundaries strongly enough to make a
> meaningful difference. Can you enable WQ_SYSFS and test the following
> configs?
>
> 1. affinity_scope = cache, affinity_strict = 1
>
> 2. affinity_scope = cpu, affinity_strict = 0
>
> 3. affinity_scope = cpu, affinity_strict = 1

I pulled down the v2 series and tried these settings on our 5.15 kernel.
Unfortunately none of them showed a significant improvement in
throughput. It's hard to tell which one is the best because of the
noise, but the throughput is still all far from our 4.19 kernel or from
simply pinning everything to a single core.

All 4 settings (the 3 settings listed above plus the default) yield
results between 90 and 120 Mbps, while pinning tasks to a single core
consistently reaches >250 Mbps.
>
> #3 basically turns it into a percpu workqueue, so it should perform more or
> less the same as a percpu workqueue without affecting everyone else.
>
> Any chance you can post the topology details of the affected setup? How are
> the caches and cores laid out?

The core layout is listed at [1], and I'm not familiar with its cache
configuration either.

[1]: https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@google.com/

Best regards,
Pin-yen
>
> Thanks.
>
> --
> tejun
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello, Pin-yen.

On Tue, Jun 13, 2023 at 05:26:48PM +0800, Pin-yen Lin wrote:
...
> > 1. affinity_scope = cache, affinity_strict = 1
> >
> > 2. affinity_scope = cpu, affinity_strict = 0
> >
> > 3. affinity_scope = cpu, affinity_strict = 1
>
> I pulled down the v2 series and tried these settings on our 5.15 kernel.
> Unfortunately none of them showed a significant improvement in
> throughput. It's hard to tell which one is the best because of the
> noise, but the throughput is still all far from our 4.19 kernel or from
> simply pinning everything to a single core.
>
> All 4 settings (the 3 settings listed above plus the default) yield
> results between 90 and 120 Mbps, while pinning tasks to a single core
> consistently reaches >250 Mbps.

I find that perplexing given that switching to a per-cpu workqueue remedies
the situation quite a bit, which is how this patchset came to be. #3 is the
same as per-cpu workqueue, so if you're seeing noticeably different
performance numbers between #3 and per-cpu workqueue, there's something
wrong with either the code or test setup.

Also, if you have to pin to a single CPU or some subset of CPUs, you can
just set WQ_SYSFS for the workqueue and set affinities in its sysfs
interface instead of hard-coding the workaround for specific hardware.
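
For driver folks wondering how to opt in, it's just the WQ_SYSFS flag at
allocation time; a minimal sketch (the workqueue name and module are made
up for illustration):

/* Sketch: expose an unbound workqueue's attributes via sysfs. */
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *demo_wq;

static int __init demo_wq_init(void)
{
	/*
	 * WQ_SYSFS makes the workqueue appear under
	 * /sys/devices/virtual/workqueue/demo_wq/ so cpumask (and, with
	 * this patchset, affinity_scope/affinity_strict) can be tuned at
	 * runtime without kernel changes.
	 */
	demo_wq = alloc_workqueue("demo_wq", WQ_UNBOUND | WQ_SYSFS, 0);
	return demo_wq ? 0 : -ENOMEM;
}

static void __exit demo_wq_exit(void)
{
	destroy_workqueue(demo_wq);
}

module_init(demo_wq_init);
module_exit(demo_wq_exit);
MODULE_LICENSE("GPL");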

Thanks.

--
tejun
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
On Wed, 21 Jun 2023 at 12:16, Tejun Heo <tj@kernel.org> wrote:
>
> I find that perplexing given that switching to a per-cpu workqueue remedies
> the situation quite a bit, which is how this patchset came to be. #3 is the
> same as per-cpu workqueue, so if you're seeing noticeably different
> performance numbers between #3 and per-cpu workqueue, there's something
> wrong with either the code or test setup.

Or maybe there's some silly thinko in the wq code that is hidden by
the percpu code.

For example, WQ_UNBOUND triggers a lot of other overhead at least on
wq allocation and free. Maybe some of that stuff then indirectly
affects workqueue execution even when strict cpu affinity is set.

Pin-Yen Li - can you do a system-wide profile of the two cases (the
percpu case vs the "strict cpu affinity" one), to see if something
stands out?

Linus
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hi Linus and Tejun,

On Thu, Jun 22, 2023 at 3:32 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 21 Jun 2023 at 12:16, Tejun Heo <tj@kernel.org> wrote:
> >
> > I find that perplexing given that switching to a per-cpu workqueue remedies
> > the situation quite a bit, which is how this patchset came to be. #3 is the
> > same as per-cpu workqueue, so if you're seeing noticeably different
> > performance numbers between #3 and per-cpu workqueue, there's something
> > wrong with either the code or test setup.
>
In our case, per-cpu workqueue (removing WQ_UNBOUND) doesn't bring us
better results. But given that pinning tasks to a single CPU core
helps, we thought that the regression is related to the behavior of
WQ_UNBOUND. Our findings are listed in [1].

We already use WQ_SYSFS and the sysfs interface to pin the tasks, but
thanks for the suggestion.

[1]: https://lore.kernel.org/all/ZFvpJb9Dh0FCkLQA@google.com/

> Or maybe there's some silly thinko in the wq code that is hidden by
> the percpu code.
>
> For example, WQ_UNBOUND triggers a lot of other overhead at least on
> wq allocation and free. Maybe some of that stuff then indirectly
> affects workqueue execution even when strict cpu affinity is set.
>
> Pin-Yen Li - can you do a system-wide profile of the two cases (the
> percpu case vs the "strict cpu affinity" one), to see if something
> stands out?

The two actually have similar performance, so I guess the profile is not
interesting to you. Please let me know if you want to see any data and
I'll be happy to collect it and update here.

Best regards,
Pin-yen
>
> Linus
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello, Pin-yen.

On Thu, Jun 29, 2023 at 05:49:27PM +0800, Pin-yen Lin wrote:
> > > I find that perplexing given that switching to a per-cpu workqueue remedies
> > > the situation quite a bit, which is how this patchset came to be. #3 is the
> > > same as per-cpu workqueue, so if you're seeing noticeably different
> > > performance numbers between #3 and per-cpu workqueue, there's something
> > > wrong with either the code or test setup.
> >
> In our case, per-cpu workqueue (removing WQ_UNBOUND) doesn't bring us
> better results. But given that pinning tasks to a single CPU core
> helps, we thought that the regression is related to the behavior of
> WQ_UNBOUND. Our findings are listed in [1].

I see.

> We already use WQ_SYSFS and the sysfs interface to pin the tasks, but
> thanks for the suggestion.

Yeah, I have no idea why there's such a drastic regression and the only way
to alleviate that is pinning the execution to a single CPU, which is also
different from other reports. It seems plausible that there are some
scheduling behavior changes and that's interacting negatively with power
saving but I'm sure that's the first thing you guys looked into.

From workqueue's POV, I'm afraid using WQ_SYSFS and pinning it as needed
seems like about what can be done for your specific case.

Thanks.

--
tejun
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello,

On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:
> Unbound workqueues used to spray work items inside each NUMA node, which
> isn't great on CPUs w/ multiple L3 caches. This patchset implements
> mechanisms to improve and configure execution locality.

The patchset shows minor perf improvements for some workloads but, more
importantly, gives users more control over worker placement, which helps
working around some of the recently reported performance regressions.
Prateek reported concerning regressions with tbench but I couldn't reproduce
them and can't see how tbench would be affected at all given that the
benchmark doesn't involve workqueue operations in any noticeable way.

Assuming that the tbench difference was a testing artifact, I'm applying the
patchset to wq/for-6.6 so that it can receive wider testing. Prateek, I'd
really appreciate if you could repeat the test and see whether the
difference persists.

Thanks.

--
tejun
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello Tejun,

On 8/8/2023 6:52 AM, Tejun Heo wrote:
> Hello,
>
> On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:
>> Unbound workqueues used to spray work items inside each NUMA node, which
>> isn't great on CPUs w/ multiple L3 caches. This patchset implements
>> mechanisms to improve and configure execution locality.
>
> The patchset shows minor perf improvements for some workloads but, more
> importantly, gives users more control over worker placement, which helps
> working around some of the recently reported performance regressions.
> Prateek reported concerning regressions with tbench but I couldn't reproduce
> them and can't see how tbench would be affected at all given that the
> benchmark doesn't involve workqueue operations in any noticeable way.
>
> Assuming that the tbench difference was a testing artifact, I'm applying the
> patchset to wq/for-6.6 so that it can receive wider testing. Prateek, I'd
> really appreciate if you could repeat the test and see whether the
> difference persists.

Sure. I'll retest with the for-6.6 branch and will post the results here once
the tests are done. I'll repeat the same process: test with the defaults, and
for the ones that show any difference in results, rerun them with various
affinity scopes.

Let me know if you have any suggestions.

--
Thanks and Regards,
Prateek
