Mailing List Archive

Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello,

On Tue, Aug 08, 2023 at 08:28:53AM +0530, K Prateek Nayak wrote:
> > Assuming that the tbench difference was a testing artifact, I'm applying the
> > patchset to wq/for-6.6 so that it can receive wider testing. Prateek, I'd
> > really appreciate if you could repeat the test and see whether the
> > difference persists.
>
> Sure. I'll retest with for-6.6 branch. Will post the results here once the
> tests are done. I'll repeat the same - test with the defaults and the ones
> that show any difference in results, I'll rerun them with various affinity
> scopes.

I think it'd be helpful to pick one benchmark setup which shows clear
difference and repeat it multiple times while taking measures to avoid
systemic biases (e.g. instead of running all of control followed by all of
test, separate them into several segments and interleave them).
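(A minimal sketch of the interleaving suggested above; `interleaved_schedule` and the segment/iteration counts are hypothetical placeholders, not part of any existing test harness.)

```python
def interleaved_schedule(segments=5, iters_per_segment=3):
    """Return a run order that alternates control and test segments
    instead of running all of one side back to back, so slow drifts
    (thermal state, background load) spread across both kernels."""
    order = []
    for _ in range(segments):
        for kernel in ("control", "test"):
            order.extend([kernel] * iters_per_segment)
    return order
```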

Thanks.

--
tejun
Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality
Hello Tejun,

On 8/8/2023 8:28 AM, K Prateek Nayak wrote:
> Hello Tejun,
>
> On 8/8/2023 6:52 AM, Tejun Heo wrote:
>> Hello,
>>
>> On Thu, May 18, 2023 at 02:16:45PM -1000, Tejun Heo wrote:
>>> Unbound workqueues used to spray work items inside each NUMA node, which
>>> isn't great on CPUs w/ multiple L3 caches. This patchset implements
>>> mechanisms to improve and configure execution locality.
>>
>> The patchset shows minor perf improvements for some but more importantly
>> gives users more control over worker placement which helps working around
>> some of the recently reported performance regressions. Prateek reported
>> concerning regressions with tbench but I couldn't reproduce it and can't see
>> how tbench would be affected at all given the benchmark doesn't involve
>> workqueue operations in any noticeable way.
>>
>> Assuming that the tbench difference was a testing artifact, I'm applying the
>> patchset to wq/for-6.6 so that it can receive wider testing. Prateek, I'd
>> really appreciate if you could repeat the test and see whether the
>> difference persists.
>
> Sure. I'll retest with for-6.6 branch. Will post the results here once the
> tests are done. I'll repeat the same - test with the defaults and the ones
> that show any difference in results, I'll rerun them with various affinity
> scopes.

Sorry, I'm lagging on the test queue, but following are the results of the
standard benchmarks running on a dual socket 3rd Generation EPYC system
(2 x 64C/128T)

tl;dr

- No noticeable difference in performance.
- The netperf and tbench regressions are gone now, and the base numbers
are also much higher than before (sorry for the false alarm!)

Following are the results:

base: affinity-scopes-v2 branch at commit 18c8ae813156 ("workqueue:
Disable per-cpu CPU hog detection when wq_cpu_intensive_thresh_us
is 0")

affinity-scope: affinity-scopes-v2 branch at commit a4da9f618d3e
("workqueue: Add "Affinity Scopes and Performance" section to
documentation")

==================================================================
Test : hackbench
Units : Normalized time in seconds
Interpretation: Lower is better
Statistic : AMean
==================================================================
Case: base[pct imp](CV) affinity-scope[pct imp](CV)
1-groups 1.00 [ -0.00]( 1.76) 0.99 [ 0.56]( 3.02)
2-groups 1.00 [ -0.00]( 1.52) 1.01 [ -0.94]( 2.36)
4-groups 1.00 [ -0.00]( 1.49) 1.02 [ -2.20]( 1.91)
8-groups 1.00 [ -0.00]( 1.12) 1.00 [ -0.00]( 0.93)
16-groups 1.00 [ -0.00]( 3.64) 1.01 [ -0.87]( 2.66)
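(For reference, the [pct imp] and (CV) columns in these tables can be derived from raw per-run samples roughly as below; this is a sketch of the arithmetic, not the actual reporting script.)

```python
import statistics

def summarize(base_samples, test_samples):
    """Percent improvement of the test mean over the base mean, plus
    the coefficient of variation (stdev / mean, in %) of each side.
    Written for a higher-is-better metric; for lower-is-better ones
    (e.g. hackbench time) the sign of the improvement flips."""
    base_mean = statistics.mean(base_samples)
    test_mean = statistics.mean(test_samples)
    pct_imp = (test_mean - base_mean) / base_mean * 100.0

    def cv(samples):
        return statistics.stdev(samples) / statistics.mean(samples) * 100.0

    return pct_imp, cv(base_samples), cv(test_samples)
```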


==================================================================
Test : tbench
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: base[pct imp](CV) affinity-scope[pct imp](CV)
1 1.00 [ 0.00]( 0.47) 1.00 [ -0.21]( 1.03)
2 1.00 [ 0.00]( 0.10) 1.00 [ 0.00]( 0.45)
4 1.00 [ 0.00]( 1.60) 1.00 [ -0.18]( 0.83)
8 1.00 [ 0.00]( 0.13) 1.00 [ -0.26]( 0.59)
16 1.00 [ 0.00]( 1.69) 1.02 [ 2.05]( 1.08)
32 1.00 [ 0.00]( 0.35) 1.00 [ -0.36]( 2.47)
64 1.00 [ 0.00]( 0.43) 1.00 [ 0.45]( 2.54)
128 1.00 [ 0.00]( 0.31) 0.99 [ -0.82]( 0.58)
256 1.00 [ 0.00]( 1.81) 0.98 [ -1.84]( 1.80)
512 1.00 [ 0.00]( 0.54) 1.00 [ 0.04]( 0.06)
1024 1.00 [ 0.00]( 0.13) 1.01 [ 1.01]( 0.42)


==================================================================
Test : stream-10
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: base[pct imp](CV) affinity-scope[pct imp](CV)
Copy 1.00 [ 0.00]( 6.45) 1.03 [ 2.50]( 5.75)
Scale 1.00 [ 0.00]( 6.21) 1.03 [ 3.36]( 0.75)
Add 1.00 [ 0.00]( 6.10) 1.04 [ 4.23]( 1.81)
Triad 1.00 [ 0.00]( 7.24) 1.03 [ 3.49]( 3.41)


==================================================================
Test : stream-100
Units : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic : HMean
==================================================================
Test: base[pct imp](CV) affinity-scope[pct imp](CV)
Copy 1.00 [ 0.00]( 1.98) 1.00 [ 0.40]( 2.57)
Scale 1.00 [ 0.00]( 4.88) 1.00 [ -0.07]( 5.11)
Add 1.00 [ 0.00]( 4.60) 1.00 [ 0.23]( 5.21)
Triad 1.00 [ 0.00]( 6.21) 1.03 [ 2.85]( 2.55)


==================================================================
Test : netperf
Units : Normalized Throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: base[pct imp](CV) affinity-scope[pct imp](CV)
1-clients 1.00 [ 0.00]( 1.84) 1.01 [ 0.99]( 0.72)
2-clients 1.00 [ 0.00]( 0.64) 1.01 [ 0.53]( 0.77)
4-clients 1.00 [ 0.00]( 0.75) 1.01 [ 0.54]( 0.96)
8-clients 1.00 [ 0.00]( 0.83) 1.00 [ -0.21]( 1.03)
16-clients 1.00 [ 0.00]( 0.75) 1.00 [ 0.31]( 0.81)
32-clients 1.00 [ 0.00]( 0.82) 1.00 [ 0.12]( 1.57)
64-clients 1.00 [ 0.00]( 2.30) 1.00 [ -0.28]( 2.39)
128-clients 1.00 [ 0.00]( 2.54) 0.99 [ -1.01]( 2.61)
256-clients 1.00 [ 0.00]( 4.37) 1.01 [ 1.23]( 2.69)
512-clients 1.00 [ 0.00](48.73) 1.01 [ 0.99](46.07)


==================================================================
Test : schbench
Units : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic : Median
==================================================================
#workers: base[pct imp](CV) affinity-scope[pct imp](CV)
1 1.00 [ -0.00]( 2.28) 1.00 [ -0.00]( 2.28)
2 1.00 [ -0.00]( 8.55) 0.96 [ 4.00]( 4.17)
4 1.00 [ -0.00]( 3.81) 0.94 [ 6.45]( 8.78)
8 1.00 [ -0.00]( 2.78) 0.97 [ 2.78]( 4.81)
16 1.00 [ -0.00]( 1.22) 0.96 [ 4.26]( 1.27)
32 1.00 [ -0.00]( 2.02) 0.97 [ 2.63]( 3.99)
64 1.00 [ -0.00]( 5.65) 0.99 [ 0.62]( 1.65)
128 1.00 [ -0.00]( 5.17) 0.98 [ 1.91]( 8.12)
256 1.00 [ -0.00](10.79) 1.07 [ -6.82]( 7.18)
512 1.00 [ -0.00]( 1.24) 0.99 [ 0.54]( 1.37)



==================================================================
Test : Unixbench
Units : Various, Throughput
Interpretation: Higher is better
Statistic : AMean, Hmean (Specified)
==================================================================
base affinity-scope
Hmean unixbench-dhry2reg-1 40947261.77 ( 0.00%) 41078213.81 ( 0.32%)
Hmean unixbench-dhry2reg-512 6243140251.68 ( 0.00%) 6240938691.75 ( -0.04%)
Amean unixbench-syscall-1 2932806.37 ( 0.00%) 2871035.50 * 2.11%*
Amean unixbench-syscall-512 7689448.00 ( 0.00%) 8406697.27 * 9.33%*
Hmean unixbench-pipe-1 2577667.42 ( 0.00%) 2497979.59 * -3.09%*
Hmean unixbench-pipe-512 363366036.45 ( 0.00%) 356991588.20 * -1.75%*
Hmean unixbench-spawn-1 4446.97 ( 0.00%) 4760.91 * 7.06%*
Hmean unixbench-spawn-512 68983.49 ( 0.00%) 68464.78 * -0.75%*
Hmean unixbench-execl-1 3894.20 ( 0.00%) 3857.78 ( -0.94%)
Hmean unixbench-execl-512 12716.76 ( 0.00%) 13067.63 ( 2.76%)


==================================================================
Test : tbench (Various Affinity Scopes)
Units : Normalized throughput
Interpretation: Higher is better
Statistic : AMean
==================================================================
Clients: base[pct imp](CV) cpu[pct imp](CV) smt[pct imp](CV) cache[pct imp](CV) numa[pct imp](CV) system[pct imp](CV)
1 1.00 [ 0.00]( 0.47) 1.00 [ 0.11]( 0.95) 1.00 [ 0.23]( 1.97) 1.01 [ 1.01]( 0.29) 1.00 [ 0.07]( 0.57) 1.01 [ 1.36]( 0.36)
2 1.00 [ 0.00]( 0.10) 1.01 [ 1.14]( 0.27) 0.99 [ -0.84]( 0.51) 1.01 [ 1.05]( 0.50) 1.00 [ 0.24]( 0.75) 1.00 [ -0.29]( 1.22)
4 1.00 [ 0.00]( 1.60) 1.02 [ 2.07]( 1.42) 1.02 [ 1.65]( 0.46) 1.02 [ 2.45]( 0.83) 1.00 [ 0.36]( 1.33) 1.02 [ 2.37]( 0.57)
8 1.00 [ 0.00]( 0.13) 1.00 [ -0.02]( 0.61) 1.00 [ 0.14]( 0.57) 1.01 [ 0.88]( 0.33) 1.00 [ -0.26]( 0.30) 1.01 [ 0.90]( 1.48)
16 1.00 [ 0.00]( 1.69) 1.03 [ 3.10]( 0.69) 1.04 [ 3.66]( 1.36) 1.02 [ 2.36]( 0.62) 1.02 [ 1.61]( 1.63) 1.04 [ 3.77]( 1.00)
32 1.00 [ 0.00]( 0.35) 0.97 [ -3.49]( 0.62) 0.97 [ -3.21]( 0.77) 1.00 [ -0.24]( 3.77) 0.96 [ -4.08]( 4.43) 0.97 [ -2.81]( 3.50)
64 1.00 [ 0.00]( 0.43) 1.00 [ 0.20]( 1.66) 0.99 [ -0.61]( 0.81) 1.03 [ 2.87]( 0.55) 1.02 [ 2.16]( 2.31) 0.98 [ -2.32]( 3.63)
128 1.00 [ 0.00]( 0.31) 1.01 [ 1.44]( 1.33) 1.01 [ 0.72]( 0.46) 1.01 [ 1.33]( 0.67) 1.00 [ 0.38]( 0.58) 1.01 [ 1.44]( 1.35)
256 1.00 [ 0.00]( 1.81) 0.98 [ -2.10]( 1.05) 0.97 [ -2.50]( 0.42) 0.97 [ -3.46]( 0.91) 0.99 [ -0.79]( 0.85) 0.96 [ -3.83]( 0.29)
512 1.00 [ 0.00]( 0.54) 1.00 [ 0.37]( 1.12) 0.99 [ -1.33]( 0.44) 1.00 [ -0.19]( 0.94) 1.01 [ 0.87]( 1.05) 0.99 [ -1.08]( 0.12)
1024 1.00 [ 0.00]( 0.13) 1.01 [ 1.10]( 0.49) 1.00 [ 0.47]( 0.28) 1.00 [ 0.33]( 0.73) 1.00 [ 0.48]( 0.69) 1.00 [ 0.01]( 0.47)

==================================================================

ycsb-mongodb and DeathStarBench do not show any difference in
performance. I'll go on to test more NPS modes / more machines.
Meanwhile, please feel free to add:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

--
Thanks and Regards,
Prateek
