Mailing List Archive

[PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
Hi everyone,

We’re validating a new kernel in the fleet, and compared with v5.2,
performance is ~2-3% lower for some of our workloads. After some
digging, Johannes found that our involuntary context switch rate was ~2x
higher, and we were leaving a CPU idle a higher percentage of the time,
even though the workload was trying to saturate the system.

We were able to reproduce the problem with schbench, and Johannes
bisected down to:

commit 0b0695f2b34a4afa3f6e9aa1ff0e5336d8dad912
Author: Vincent Guittot <vincent.guittot@linaro.org>
Date: Fri Oct 18 15:26:31 2019 +0200

sched/fair: Rework load_balance()

Our working theory is that the load balancing changes are leaving processes
queued behind busy CPUs instead of moving them onto idle ones. I made a few
schbench modifications to make this easier to demonstrate:

https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/

My VM has 40 cpus (20 cores, 2 threads per core), and my schbench
command line is:

schbench -t 20 -r 0 -c 1000000 -s 1000 -i 30 -z 120

This has two message threads, and 20 workers per message thread. Once
woken up, the workers think for a full second, which means you’ll have
some long latencies if you’re stuck behind one of these workers in the
runqueue. The message thread does a little bit of work and then sleeps,
so we end up with 40 threads hammering full blast on the CPU and 2
threads popping in and out of idle.
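
To make the shape of the workload concrete, here is a rough sketch of that
message/worker pattern. This is not schbench's actual source; the helpers,
timing details, and flag mapping are illustrative only, and error handling
and spurious wakeups are ignored:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static struct timespec wake_time;

static uint64_t usec_since(const struct timespec *t)
{
	struct timespec now;

	clock_gettime(CLOCK_MONOTONIC, &now);
	return (now.tv_sec - t->tv_sec) * 1000000ULL +
	       (now.tv_nsec - t->tv_nsec) / 1000;
}

static void burn_cpu_usec(uint64_t usec)
{
	struct timespec start;

	clock_gettime(CLOCK_MONOTONIC, &start);
	while (usec_since(&start) < usec)
		;				/* spin: the "thinking" phase */
}

/* 20 of these per message thread (-t 20) */
static void *worker(void *arg)
{
	for (;;) {
		pthread_mutex_lock(&lock);
		pthread_cond_wait(&cond, &lock);	/* wait to be woken */
		/* wakeup-to-run delay: what the percentiles report */
		printf("latency %llu usec\n",
		       (unsigned long long)usec_since(&wake_time));
		pthread_mutex_unlock(&lock);

		burn_cpu_usec(1000000);		/* -c 1000000: think ~1 second */
	}
	return NULL;
}

/* two of these in the run described above */
static void *message_thread(void *arg)
{
	for (;;) {
		pthread_mutex_lock(&lock);
		clock_gettime(CLOCK_MONOTONIC, &wake_time);
		pthread_cond_broadcast(&cond);	/* wake the workers */
		pthread_mutex_unlock(&lock);
		usleep(10000);			/* a little work, then sleep */
	}
	return NULL;
}

int main(void)
{
	pthread_t workers[20], msg;
	int i;

	for (i = 0; i < 20; i++)
		pthread_create(&workers[i], NULL, worker, NULL);
	pthread_create(&msg, NULL, message_thread, NULL);
	pause();				/* run until interrupted */
	return 0;
}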

schbench times the delay from when a message thread wakes a worker to
when the worker runs. On a good kernel, the output looks like this:

Latency percentiles (usec) runtime 1290 (s) (3280 total samples)
50.0th: 155 (1653 samples)
75.0th: 189 (808 samples)
90.0th: 216 (501 samples)
95.0th: 227 (163 samples)
*99.0th: 256 (123 samples)
99.5th: 1510 (16 samples)
99.9th: 3132 (13 samples)
min=21, max=3286

With 0b0695f2b34a, we get this:

Latency percentiles (usec) runtime 1440 (s) (4480 total samples)
50.0th: 147 (2261 samples)
75.0th: 182 (1116 samples)
90.0th: 205 (671 samples)
95.0th: 224 (215 samples)
*99.0th: 12240 (173 samples) <—— much higher p99 and up
99.5th: 12752 (22 samples)
99.9th: 13104 (18 samples)
min=21, max=13172

Since the idea is to fully load the machine with schbench, use schbench
-t <your_num_cpus/2>, and make sure the box doesn’t have other stuff
running in the background. I used a VM because it ended up giving more
consistent results on our kernel test machines, which have some periodic
noise running in the background.

We’ve tried a few different approaches, but don’t quite have a solid
fix yet. I thought I’d kick off the discussion with my most useful
hunks so far:

diff a/kernel/sched/fair.c b/kernel/sched/fair.c
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c

-chris
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
Hi Chris

On Sat, 24 Oct 2020 at 01:49, Chris Mason <clm@fb.com> wrote:
>
> Hi everyone,
>
> We’re validating a new kernel in the fleet, and compared with v5.2,

Which version are you using?
Several improvements have been added since v5.5 and the rework of load_balance.

> performance is ~2-3% lower for some of our workloads. After some
> digging, Johannes found that our involuntary context switch rate was ~2x
> higher, and we were leaving a CPU idle a higher percentage of the time,
> even though the workload was trying to saturate the system.
>
> We were able to reproduce the problem with schbench, and Johannes
> bisected down to:
>
> commit 0b0695f2b34a4afa3f6e9aa1ff0e5336d8dad912
> Author: Vincent Guittot <vincent.guittot@linaro.org>
> Date: Fri Oct 18 15:26:31 2019 +0200
>
> sched/fair: Rework load_balance()
>
> Our working theory is the load balancing changes are leaving processes
> behind busy CPUs instead of moving them onto idle ones. I made a few
> schbench modifications to make this easier to demonstrate:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/
>
> My VM has 40 cpus (20 cores, 2 threads per core), and my schbench
> command line is:

What is the topology? Are they all part of the same LLC?

>
> schbench -t 20 -r 0 -c 1000000 -s 1000 -i 30 -z 120
>
> This has two message threads, and 20 workers per message thread. Once
> woken up, the workers think for a full second, which means you’ll have
> some long latencies if you’re stuck behind one of these workers in the
> runqueue. The message thread does a little bit of work and then sleeps,
> so we end up with 40 threads hammering full blast on the CPU and 2
> threads popping in and out of idle.
>
> schbench times the delay from when a message thread wakes a worker to
> when the worker runs. On a good kernel, the output looks like this:
>
> Latency percentiles (usec) runtime 1290 (s) (3280 total samples)
> 50.0th: 155 (1653 samples)
> 75.0th: 189 (808 samples)
> 90.0th: 216 (501 samples)
> 95.0th: 227 (163 samples)
> *99.0th: 256 (123 samples)
> 99.5th: 1510 (16 samples)
> 99.9th: 3132 (13 samples)
> min=21, max=3286
>
> With 0b0695f2b34a, we get this:
>
> Latency percentiles (usec) runtime 1440 (s) (4480 total samples)
> 50.0th: 147 (2261 samples)
> 75.0th: 182 (1116 samples)
> 90.0th: 205 (671 samples)
> 95.0th: 224 (215 samples)
> *99.0th: 12240 (173 samples) <—— much higher p99 and up
> 99.5th: 12752 (22 samples)
> 99.9th: 13104 (18 samples)
> min=21, max=13172
>
> Since the idea is to fully load the machine with schbench, use schbench
> -t <your_num_cpus/2>, and make sure the box doesn’t have other stuff
> running in the background. I used a VM because it ended up giving more
> consistent results on our kernel test machines, which have some periodic
> noise running in the background.
>
> We’ve tried a few different approaches, but don’t quite have a solid
> fix yet. I thought I’d kick off the discussion with my most useful
> hunks so far:
>
> diff a/kernel/sched/fair.c b/kernel/sched/fair.c
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
>
> -chris
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On 26 Oct 2020, at 4:39, Vincent Guittot wrote:

> Hi Chris
>
> On Sat, 24 Oct 2020 at 01:49, Chris Mason <clm@fb.com> wrote:
>>
>> Hi everyone,
>>
>> We’re validating a new kernel in the fleet, and compared with v5.2,
>
> Which version are you using ?
> several improvements have been added since v5.5 and the rework of
> load_balance

We’re validating v5.6, but all of the numbers referenced in this patch
are against v5.9. I usually try to back port my way to victory on this
kind of thing, but mainline seems to behave exactly the same as
0b0695f2b34a wrt this benchmark.

>
>> performance is ~2-3% lower for some of our workloads. After some
>> digging, Johannes found that our involuntary context switch rate was
>> ~2x
>> higher, and we were leaving a CPU idle a higher percentage of the
>> time,
>> even though the workload was trying to saturate the system.
>>
>> We were able to reproduce the problem with schbench, and Johannes
>> bisected down to:
>>
>> commit 0b0695f2b34a4afa3f6e9aa1ff0e5336d8dad912
>> Author: Vincent Guittot <vincent.guittot@linaro.org>
>> Date: Fri Oct 18 15:26:31 2019 +0200
>>
>> sched/fair: Rework load_balance()
>>
>> Our working theory is the load balancing changes are leaving
>> processes
>> behind busy CPUs instead of moving them onto idle ones. I made a few
>> schbench modifications to make this easier to demonstrate:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/
>>
>> My VM has 40 cpus (20 cores, 2 threads per core), and my schbench
>> command line is:
>
> What is the topology ? are they all part of the same LLC ?

We’ve seen the regression on both single socket and dual socket bare
metal intel systems. On the VM I reproduced with, I saw similar
latencies with and without siblings configured into the topology.

-chris
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On Monday, 26 Oct 2020 at 08:45:27 (-0400), Chris Mason wrote:
> On 26 Oct 2020, at 4:39, Vincent Guittot wrote:
>
> > Hi Chris
> >
> > On Sat, 24 Oct 2020 at 01:49, Chris Mason <clm@fb.com> wrote:
> > >
> > > Hi everyone,
> > >
> > > We’re validating a new kernel in the fleet, and compared with v5.2,
> >
> > Which version are you using ?
> > several improvements have been added since v5.5 and the rework of
> > load_balance
>
> We’re validating v5.6, but all of the numbers referenced in this patch are
> against v5.9. I usually try to back port my way to victory on this kind of
> thing, but mainline seems to behave exactly the same as 0b0695f2b34a wrt
> this benchmark.

ok. Thanks for the confirmation

I have been able to reproduce the problem on my setup.

Could you try the fix below?

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9049,7 +9049,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 * emptying busiest.
 	 */
 	if (local->group_type == group_has_spare) {
-		if (busiest->group_type > group_fully_busy) {
+		if ((busiest->group_type > group_fully_busy) &&
+		    (busiest->group_weight > 1)) {
 			/*
 			 * If busiest is overloaded, try to fill spare
 			 * capacity. This might end up creating spare capacity


When we calculate an imbalance at the smallest level, i.e. between CPUs (group_weight == 1),
we should try to spread tasks on CPUs instead of trying to fill spare capacity.
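
To make that concrete, the decision being changed looks roughly like below.
This is a simplified paraphrase of the group_has_spare case, not the verbatim
kernel code, and local_spare_capacity is just an illustrative name:

if (local->group_type == group_has_spare) {
	if (busiest->group_type > group_fully_busy &&
	    busiest->group_weight > 1) {
		/* Multi-CPU busiest group: pull utilization, bounded
		 * by the local group's spare capacity. */
		env->migration_type = migrate_util;
		env->imbalance = local_spare_capacity;
		return;
	}
	/* A single-CPU busiest group now falls through to the
	 * task-count path and simply evens out nr_running. */
	env->migration_type = migrate_task;
	env->imbalance = (busiest->sum_nr_running -
			  local->sum_nr_running) / 2;
	return;
}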


>
> >
> > > performance is ~2-3% lower for some of our workloads. After some
> > > digging, Johannes found that our involuntary context switch rate was
> > > ~2x
> > > higher, and we were leaving a CPU idle a higher percentage of the
> > > time,
> > > even though the workload was trying to saturate the system.
> > >
> > > We were able to reproduce the problem with schbench, and Johannes
> > > bisected down to:
> > >
> > > commit 0b0695f2b34a4afa3f6e9aa1ff0e5336d8dad912
> > > Author: Vincent Guittot <vincent.guittot@linaro.org>
> > > Date: Fri Oct 18 15:26:31 2019 +0200
> > >
> > > sched/fair: Rework load_balance()
> > >
> > > Our working theory is the load balancing changes are leaving
> > > processes
> > > behind busy CPUs instead of moving them onto idle ones. I made a few
> > > schbench modifications to make this easier to demonstrate:
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/
> > >
> > > My VM has 40 cpus (20 cores, 2 threads per core), and my schbench
> > > command line is:
> >
> > What is the topology ? are they all part of the same LLC ?
>
> We’ve seen the regression on both single socket and dual socket bare metal
> intel systems. On the VM I reproduced with, I saw similar latencies with
> and without siblings configured into the topology.
>
> -chris
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On Mon, 2020-10-26 at 15:24 +0100, Vincent Guittot wrote:
> On Monday, 26 Oct 2020 at 08:45:27 (-0400), Chris Mason wrote:
> > On 26 Oct 2020, at 4:39, Vincent Guittot wrote:
> >
> > > Hi Chris
> > >
> > > On Sat, 24 Oct 2020 at 01:49, Chris Mason <clm@fb.com> wrote:
> > > > Hi everyone,
> > > >
> > > > We’re validating a new kernel in the fleet, and compared with
> > > > v5.2,
> > >
> > > Which version are you using ?
> > > several improvements have been added since v5.5 and the rework of
> > > load_balance
> >
> > We’re validating v5.6, but all of the numbers referenced in this
> > patch are
> > against v5.9. I usually try to back port my way to victory on this
> > kind of
> > thing, but mainline seems to behave exactly the same as
> > 0b0695f2b34a wrt
> > this benchmark.
>
> ok. Thanks for the confirmation
>
> I have been able to reproduce the problem on my setup.
>
> Could you try the fix below ?
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9049,7 +9049,8 @@ static inline void calculate_imbalance(struct
> lb_env *env, struct sd_lb_stats *s
> * emptying busiest.
> */
> if (local->group_type == group_has_spare) {
> - if (busiest->group_type > group_fully_busy) {
> + if ((busiest->group_type > group_fully_busy) &&
> + (busiest->group_weight > 1)) {
> /*
> * If busiest is overloaded, try to fill
> spare
> * capacity. This might end up creating spare
> capacity
>
>
> When we calculate an imbalance at te smallest level, ie between CPUs
> (group_weight == 1),
> we should try to spread tasks on cpus instead of trying to fill spare
> capacity.

Should we also spread tasks when balancing between
multi-threaded CPU cores on the same socket?

Say we have groups of CPUs
(0, 2) and CPUs (1, 3),
with CPU 2 idle, and 3 tasks spread between CPUs
1 & 3.

Since they are all on the same LLC, and the task
wakeup code has absolutely no hesitation in moving
them around, should the load balancer also try to
keep tasks within a socket spread across all CPUs?

--
All Rights Reversed.
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On Mon, 26 Oct 2020 at 15:38, Rik van Riel <riel@surriel.com> wrote:
>
> On Mon, 2020-10-26 at 15:24 +0100, Vincent Guittot wrote:
> > On Monday, 26 Oct 2020 at 08:45:27 (-0400), Chris Mason wrote:
> > > On 26 Oct 2020, at 4:39, Vincent Guittot wrote:
> > >
> > > > Hi Chris
> > > >
> > > > On Sat, 24 Oct 2020 at 01:49, Chris Mason <clm@fb.com> wrote:
> > > > > Hi everyone,
> > > > >
> > > > > We’re validating a new kernel in the fleet, and compared with
> > > > > v5.2,
> > > >
> > > > Which version are you using ?
> > > > several improvements have been added since v5.5 and the rework of
> > > > load_balance
> > >
> > > We’re validating v5.6, but all of the numbers referenced in this
> > > patch are
> > > against v5.9. I usually try to back port my way to victory on this
> > > kind of
> > > thing, but mainline seems to behave exactly the same as
> > > 0b0695f2b34a wrt
> > > this benchmark.
> >
> > ok. Thanks for the confirmation
> >
> > I have been able to reproduce the problem on my setup.
> >
> > Could you try the fix below ?
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9049,7 +9049,8 @@ static inline void calculate_imbalance(struct
> > lb_env *env, struct sd_lb_stats *s
> > * emptying busiest.
> > */
> > if (local->group_type == group_has_spare) {
> > - if (busiest->group_type > group_fully_busy) {
> > + if ((busiest->group_type > group_fully_busy) &&
> > + (busiest->group_weight > 1)) {
> > /*
> > * If busiest is overloaded, try to fill
> > spare
> > * capacity. This might end up creating spare
> > capacity
> >
> >
> > When we calculate an imbalance at te smallest level, ie between CPUs
> > (group_weight == 1),
> > we should try to spread tasks on cpus instead of trying to fill spare
> > capacity.
>
> Should we also spread tasks when balancing between
> multi-threaded CPU cores on the same socket?

My explanation is probably misleading. In fact we already try to
spread tasks. We just use spare capacity instead of nr_running when
there is more than 1 CPU in the group and the group is overloaded.
Using spare capacity is a bit more conservative because it tries
not to pull more utilization than there is spare capacity.
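
As an example of how that conservatism can leave a CPU idle (the numbers
below are made up): if the local CPU is idle with ~400 of spare capacity,
env->imbalance is ~400 under migrate_util, and if every runnable task on the
busiest CPU has task_util_est() around 600, the check in detach_tasks()
skips every candidate and nothing moves:

	case migrate_util:
		util = task_util_est(p);	/* ~600 in this made-up example */

		if (util > env->imbalance)	/* 600 > 400: skip this task */
			goto next;

		env->imbalance -= util;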

>
> Say we have groups of CPUs
> (0, 2) and CPUs (1, 3),
> with CPU 2 idle, and 3 tasks spread between CPUs
> 1 & 3.
>
> Since they are all on the same LLC, and the task
> wakeup code has absolutely no hesitation in moving
> them around, should the load balancer also try to
> keep tasks within a socket spread across all CPUs?
>
> --
> All Rights Reversed.
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On Mon, 2020-10-26 at 15:56 +0100, Vincent Guittot wrote:
> On Mon, 26 Oct 2020 at 15:38, Rik van Riel <riel@surriel.com> wrote:
> > On Mon, 2020-10-26 at 15:24 +0100, Vincent Guittot wrote:
> > > On Monday, 26 Oct 2020 at 08:45:27 (-0400), Chris Mason wrote:
> > > > On 26 Oct 2020, at 4:39, Vincent Guittot wrote:
> > > >
> > > > > Hi Chris
> > > > >
> > > > > On Sat, 24 Oct 2020 at 01:49, Chris Mason <clm@fb.com> wrote:
> > > > > > Hi everyone,
> > > > > >
> > > > > > We’re validating a new kernel in the fleet, and compared
> > > > > > with
> > > > > > v5.2,
> > > > >
> > > > > Which version are you using ?
> > > > > several improvements have been added since v5.5 and the
> > > > > rework of
> > > > > load_balance
> > > >
> > > > We’re validating v5.6, but all of the numbers referenced in
> > > > this
> > > > patch are
> > > > against v5.9. I usually try to back port my way to victory on
> > > > this
> > > > kind of
> > > > thing, but mainline seems to behave exactly the same as
> > > > 0b0695f2b34a wrt
> > > > this benchmark.
> > >
> > > ok. Thanks for the confirmation
> > >
> > > I have been able to reproduce the problem on my setup.
> > >
> > > Could you try the fix below ?
> > >
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -9049,7 +9049,8 @@ static inline void
> > > calculate_imbalance(struct
> > > lb_env *env, struct sd_lb_stats *s
> > > * emptying busiest.
> > > */
> > > if (local->group_type == group_has_spare) {
> > > - if (busiest->group_type > group_fully_busy) {
> > > + if ((busiest->group_type > group_fully_busy) &&
> > > + (busiest->group_weight > 1)) {
> > > /*
> > > * If busiest is overloaded, try to fill
> > > spare
> > > * capacity. This might end up creating
> > > spare
> > > capacity
> > >
> > >
> > > When we calculate an imbalance at te smallest level, ie between
> > > CPUs
> > > (group_weight == 1),
> > > we should try to spread tasks on cpus instead of trying to fill
> > > spare
> > > capacity.
> >
> > Should we also spread tasks when balancing between
> > multi-threaded CPU cores on the same socket?
>
> My explanation is probably misleading. In fact we already try to
> spread tasks. we just use spare capacity instead of nr_running when
> there is more than 1 CPU in the group and the group is overloaded.
> Using spare capacity is a bit more conservative because it tries to
> not pull more utilization than spare capacity

Could utilization estimates be off, either lagging or
simply having a wrong estimate for a task, resulting
in no task getting pulled sometimes, while doing a
migrate_task imbalance always moves over something?

Within an LLC we might not need to worry too much
about spare capacity, considering select_idle_sibling
doesn't give a hoot about capacity, either.

--
All Rights Reversed.
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On 26 Oct 2020, at 10:24, Vincent Guittot wrote:

> On Monday, 26 Oct 2020 at 08:45:27 (-0400), Chris Mason wrote:
>> On 26 Oct 2020, at 4:39, Vincent Guittot wrote:
>>
>>> Hi Chris
>>>
>>> On Sat, 24 Oct 2020 at 01:49, Chris Mason <clm@fb.com> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> We’re validating a new kernel in the fleet, and compared with
>>>> v5.2,
>>>
>>> Which version are you using ?
>>> several improvements have been added since v5.5 and the rework of
>>> load_balance
>>
>> We’re validating v5.6, but all of the numbers referenced in this
>> patch are
>> against v5.9. I usually try to back port my way to victory on this
>> kind of
>> thing, but mainline seems to behave exactly the same as 0b0695f2b34a
>> wrt
>> this benchmark.
>
> ok. Thanks for the confirmation
>
> I have been able to reproduce the problem on my setup.

Thanks for taking a look! Can I ask what parameters you used on
schbench, and what kind of results you saw? Mostly I’m trying to make
sure it’s a useful tool, but also the patch didn’t change things
here.

>
> Could you try the fix below ?
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9049,7 +9049,8 @@ static inline void calculate_imbalance(struct
> lb_env *env, struct sd_lb_stats *s
> * emptying busiest.
> */
> if (local->group_type == group_has_spare) {
> - if (busiest->group_type > group_fully_busy) {
> + if ((busiest->group_type > group_fully_busy) &&
> + (busiest->group_weight > 1)) {
> /*
> * If busiest is overloaded, try to fill spare
> * capacity. This might end up creating spare
> capacity
>
>
> When we calculate an imbalance at te smallest level, ie between CPUs
> (group_weight == 1),
> we should try to spread tasks on cpus instead of trying to fill spare
> capacity.

With this patch on top of v5.9, my latencies are unchanged. I’m
building against current Linus now just in case I’m missing other
fixes.

-chris
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On Monday, 26 Oct 2020 at 11:05:35 (-0400), Chris Mason wrote:
>
>
> On 26 Oct 2020, at 10:24, Vincent Guittot wrote:
>
> > On Monday, 26 Oct 2020 at 08:45:27 (-0400), Chris Mason wrote:
> > > On 26 Oct 2020, at 4:39, Vincent Guittot wrote:
> > >
> > > > Hi Chris
> > > >
> > > > On Sat, 24 Oct 2020 at 01:49, Chris Mason <clm@fb.com> wrote:
> > > > >
> > > > > Hi everyone,
> > > > >
> > > > > We’re validating a new kernel in the fleet, and compared
> > > > > with v5.2,
> > > >
> > > > Which version are you using ?
> > > > several improvements have been added since v5.5 and the rework of
> > > > load_balance
> > >
> > > We’re validating v5.6, but all of the numbers referenced in this
> > > patch are
> > > against v5.9. I usually try to back port my way to victory on this
> > > kind of
> > > thing, but mainline seems to behave exactly the same as 0b0695f2b34a
> > > wrt
> > > this benchmark.
> >
> > ok. Thanks for the confirmation
> >
> > I have been able to reproduce the problem on my setup.
>
> Thanks for taking a look! Can I ask what parameters you used on schbench,
> and what kind of results you saw? Mostly I’m trying to make sure it’s a
> useful tool, but also the patch didn’t change things here.
>

With the latest tip/sched/core on my dual quad-core system:
schbench -t 4 -r 10 -c 1000000 -s 1000
Latency percentiles (usec)
50.0th: 16
75.0th: 23
90.0th: 32
95.0th: 41
*99.0th: 15120
99.5th: 15120
99.9th: 15120
min=0, max=15130

with the patch :
schbench -t 4 -r 10 -c 1000000 -s 1000
Latency percentiles (usec)
50.0th: 28
75.0th: 32
90.0th: 36
95.0th: 56
*99.0th: 1310
99.5th: 1310
99.9th: 1310
min=0, max=1309

> >
> > Could you try the fix below ?
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9049,7 +9049,8 @@ static inline void calculate_imbalance(struct
> > lb_env *env, struct sd_lb_stats *s
> > * emptying busiest.
> > */
> > if (local->group_type == group_has_spare) {
> > - if (busiest->group_type > group_fully_busy) {
> > + if ((busiest->group_type > group_fully_busy) &&
> > + (busiest->group_weight > 1)) {
> > /*
> > * If busiest is overloaded, try to fill spare
> > * capacity. This might end up creating spare
> > capacity
> >
> >
> > When we calculate an imbalance at te smallest level, ie between CPUs
> > (group_weight == 1),
> > we should try to spread tasks on cpus instead of trying to fill spare
> > capacity.
>
> With this patch on top of v5.9, my latencies are unchanged. I’m building
> against current Linus now just in case I’m missing other fixes.
>

I can't remember any changes in mainline that would make a difference

I had another way to fix it, but it could impact other use cases more and the improvement
was smaller.

---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ebe15e36f336..415927885228 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7707,7 +7707,7 @@ static int detach_tasks(struct lb_env *env)
 		case migrate_util:
 			util = task_util_est(p);
 
-			if (util > env->imbalance)
+			if ((util >> env->sd->nr_balance_failed) > env->imbalance)
 				goto next;
 
 			env->imbalance -= util;
--
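
The idea, as I read it, is that nr_balance_failed counts consecutive failed
balance attempts on the domain, so each failure halves the utilization a task
appears to have and something eventually gets pulled:

/*
 * Illustrative numbers only, with env->imbalance == 400:
 *   nr_balance_failed == 0:  600 >> 0 == 600 > 400  -> skip the task
 *   nr_balance_failed == 1:  600 >> 1 == 300 <= 400 -> migrate it
 * Repeated failures progressively relax the utilization check.
 */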


>
> -chris
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On 26 Oct 2020, at 11:05, Chris Mason wrote:

> On 26 Oct 2020, at 10:24, Vincent Guittot wrote:
>
>>
>> Could you try the fix below ?
>>
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9049,7 +9049,8 @@ static inline void calculate_imbalance(struct
>> lb_env *env, struct sd_lb_stats *s
>> * emptying busiest.
>> */
>> if (local->group_type == group_has_spare) {
>> - if (busiest->group_type > group_fully_busy) {
>> + if ((busiest->group_type > group_fully_busy) &&
>> + (busiest->group_weight > 1)) {
>> /*
>> * If busiest is overloaded, try to fill
>> spare
>> * capacity. This might end up creating spare
>> capacity
>>
>>
>> When we calculate an imbalance at te smallest level, ie between CPUs
>> (group_weight == 1),
>> we should try to spread tasks on cpus instead of trying to fill spare
>> capacity.
>
> With this patch on top of v5.9, my latencies are unchanged. I’m
> building against current Linus now just in case I’m missing other
> fixes.
>

I reran things to make sure that nothing changed on my test box this
weekend:

5.4.0-rc1-00009-gfcf0553db6f4 (last good kernel)
Latency percentiles (usec) runtime 30 (s) (1000 total samples)
50.0th: 180 (502 samples)
75.0th: 227 (251 samples)
90.0th: 268 (147 samples)
95.0th: 300 (50 samples)
*99.0th: 338 (41 samples)
99.5th: 344 (4 samples)
99.9th: 1186 (5 samples)
min=25, max=1185

5.4.0-rc1-00010-g0b0695f2b34a (first bad kernel)
Latency percentiles (usec) runtime 150 (s) (960 total samples)
50.0th: 166 (488 samples)
75.0th: 210 (232 samples)
90.0th: 254 (145 samples)
95.0th: 299 (47 samples)
*99.0th: 12688 (39 samples)
99.5th: 13008 (5 samples)
99.9th: 13104 (4 samples)
min=24, max=13100

3650b228f83adda7e5ee532e2b90429c03f7b9ec (v5.10-rc1) + your patch

Latency percentiles (usec) runtime 30 (s) (1000 total samples)
50.0th: 169 (505 samples)
75.0th: 210 (246 samples)
90.0th: 267 (151 samples)
95.0th: 305 (48 samples)
*99.0th: 12656 (40 samples)
99.5th: 12944 (5 samples)
99.9th: 13168 (5 samples)
min=44, max=13155

-chris
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On Mon, 26 Oct 2020 at 16:04, Rik van Riel <riel@surriel.com> wrote:
>
> On Mon, 2020-10-26 at 15:56 +0100, Vincent Guittot wrote:
> > On Mon, 26 Oct 2020 at 15:38, Rik van Riel <riel@surriel.com> wrote:
> > > On Mon, 2020-10-26 at 15:24 +0100, Vincent Guittot wrote:
> > > > On Monday, 26 Oct 2020 at 08:45:27 (-0400), Chris Mason wrote:
> > > > > On 26 Oct 2020, at 4:39, Vincent Guittot wrote:
> > > > >
> > > > > > Hi Chris
> > > > > >
> > > > > > On Sat, 24 Oct 2020 at 01:49, Chris Mason <clm@fb.com> wrote:
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > We’re validating a new kernel in the fleet, and compared
> > > > > > > with
> > > > > > > v5.2,
> > > > > >
> > > > > > Which version are you using ?
> > > > > > several improvements have been added since v5.5 and the
> > > > > > rework of
> > > > > > load_balance
> > > > >
> > > > > We’re validating v5.6, but all of the numbers referenced in
> > > > > this
> > > > > patch are
> > > > > against v5.9. I usually try to back port my way to victory on
> > > > > this
> > > > > kind of
> > > > > thing, but mainline seems to behave exactly the same as
> > > > > 0b0695f2b34a wrt
> > > > > this benchmark.
> > > >
> > > > ok. Thanks for the confirmation
> > > >
> > > > I have been able to reproduce the problem on my setup.
> > > >
> > > > Could you try the fix below ?
> > > >
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -9049,7 +9049,8 @@ static inline void
> > > > calculate_imbalance(struct
> > > > lb_env *env, struct sd_lb_stats *s
> > > > * emptying busiest.
> > > > */
> > > > if (local->group_type == group_has_spare) {
> > > > - if (busiest->group_type > group_fully_busy) {
> > > > + if ((busiest->group_type > group_fully_busy) &&
> > > > + (busiest->group_weight > 1)) {
> > > > /*
> > > > * If busiest is overloaded, try to fill
> > > > spare
> > > > * capacity. This might end up creating
> > > > spare
> > > > capacity
> > > >
> > > >
> > > > When we calculate an imbalance at te smallest level, ie between
> > > > CPUs
> > > > (group_weight == 1),
> > > > we should try to spread tasks on cpus instead of trying to fill
> > > > spare
> > > > capacity.
> > >
> > > Should we also spread tasks when balancing between
> > > multi-threaded CPU cores on the same socket?
> >
> > My explanation is probably misleading. In fact we already try to
> > spread tasks. we just use spare capacity instead of nr_running when
> > there is more than 1 CPU in the group and the group is overloaded.
> > Using spare capacity is a bit more conservative because it tries to
> > not pull more utilization than spare capacity
>
> Could utilization estimates be off, either lagging or
> simply having a wrong estimate for a task, resulting
> in no task getting pulled sometimes, while doing a
> migrate_task imbalance always moves over something?

Task and CPU utilization are not always fully in sync and may lag
a bit, which explains why LB can sometimes fail to migrate when the
diff is small.

>
> Within an LLC we might not need to worry too much
> about spare capacity, considering select_idle_sibling
> doesn't give a hoot about capacity, either.
>
> --
> All Rights Reversed.
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On Mon, 26 Oct 2020 at 16:42, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Mon, 26 Oct 2020 at 16:04, Rik van Riel <riel@surriel.com> wrote:
> >
> > On Mon, 2020-10-26 at 15:56 +0100, Vincent Guittot wrote:
> > > On Mon, 26 Oct 2020 at 15:38, Rik van Riel <riel@surriel.com> wrote:
> > > > On Mon, 2020-10-26 at 15:24 +0100, Vincent Guittot wrote:
> > > > > On Monday, 26 Oct 2020 at 08:45:27 (-0400), Chris Mason wrote:
> > > > > > On 26 Oct 2020, at 4:39, Vincent Guittot wrote:
> > > > > >
> > > > > > > Hi Chris
> > > > > > >
> > > > > > > On Sat, 24 Oct 2020 at 01:49, Chris Mason <clm@fb.com> wrote:
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > We’re validating a new kernel in the fleet, and compared
> > > > > > > > with
> > > > > > > > v5.2,
> > > > > > >
> > > > > > > Which version are you using ?
> > > > > > > several improvements have been added since v5.5 and the
> > > > > > > rework of
> > > > > > > load_balance
> > > > > >
> > > > > > We’re validating v5.6, but all of the numbers referenced in
> > > > > > this
> > > > > > patch are
> > > > > > against v5.9. I usually try to back port my way to victory on
> > > > > > this
> > > > > > kind of
> > > > > > thing, but mainline seems to behave exactly the same as
> > > > > > 0b0695f2b34a wrt
> > > > > > this benchmark.
> > > > >
> > > > > ok. Thanks for the confirmation
> > > > >
> > > > > I have been able to reproduce the problem on my setup.
> > > > >
> > > > > Could you try the fix below ?
> > > > >
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -9049,7 +9049,8 @@ static inline void
> > > > > calculate_imbalance(struct
> > > > > lb_env *env, struct sd_lb_stats *s
> > > > > * emptying busiest.
> > > > > */
> > > > > if (local->group_type == group_has_spare) {
> > > > > - if (busiest->group_type > group_fully_busy) {
> > > > > + if ((busiest->group_type > group_fully_busy) &&
> > > > > + (busiest->group_weight > 1)) {
> > > > > /*
> > > > > * If busiest is overloaded, try to fill
> > > > > spare
> > > > > * capacity. This might end up creating
> > > > > spare
> > > > > capacity
> > > > >
> > > > >
> > > > > When we calculate an imbalance at te smallest level, ie between
> > > > > CPUs
> > > > > (group_weight == 1),
> > > > > we should try to spread tasks on cpus instead of trying to fill
> > > > > spare
> > > > > capacity.
> > > >
> > > > Should we also spread tasks when balancing between
> > > > multi-threaded CPU cores on the same socket?
> > >
> > > My explanation is probably misleading. In fact we already try to
> > > spread tasks. we just use spare capacity instead of nr_running when
> > > there is more than 1 CPU in the group and the group is overloaded.
> > > Using spare capacity is a bit more conservative because it tries to
> > > not pull more utilization than spare capacity
> >
> > Could utilization estimates be off, either lagging or
> > simply having a wrong estimate for a task, resulting
> > in no task getting pulled sometimes, while doing a
> > migrate_task imbalance always moves over something?
>
> task and cpu utilization are not always up to fully synced and may lag
> a bit which explains that sometimes LB can fail to migrate for a small
> diff

And also from util_est, which reports the max utilization of the task,
to make sure that LB migrates a task to a CPU that will have enough
available capacity.

>
> >
> > Within an LLC we might not need to worry too much
> > about spare capacity, considering select_idle_sibling
> > doesn't give a hoot about capacity, either.
> >
> > --
> > All Rights Reversed.
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On Mon, 26 Oct 2020 16:42:14 +0100
Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On Mon, 26 Oct 2020 at 16:04, Rik van Riel <riel@surriel.com> wrote:

> > Could utilization estimates be off, either lagging or
> > simply having a wrong estimate for a task, resulting
> > in no task getting pulled sometimes, while doing a
> > migrate_task imbalance always moves over something?
>
> task and cpu utilization are not always up to fully synced and may lag
> a bit which explains that sometimes LB can fail to migrate for a small
> diff

OK, running with this little snippet below, I see latencies
improve back to near where they used to be:

Latency percentiles (usec) runtime 150 (s)
50.0th: 13
75.0th: 31
90.0th: 69
95.0th: 90
*99.0th: 761
99.5th: 2268
99.9th: 9104
min=1, max=16158

I suspect the right/cleaner approach might be to use
migrate_task more in !CPU_NOT_IDLE cases?

Moving a task to an idle CPU immediately, instead of refusing
to have the load balancer move it, improves latencies for fairly
obvious reasons.

I am not entirely clear on why the load balancer should need to
be any more conservative about moving tasks than the wakeup
path is in eg. select_idle_sibling.


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 35bdc0cccfa6..60acf71a2d39 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7415,7 +7415,7 @@ static int detach_tasks(struct lb_env *env)
 		case migrate_util:
 			util = task_util_est(p);
 
-			if (util > env->imbalance)
+			if (util > env->imbalance && env->idle == CPU_NOT_IDLE)
 				goto next;
 
 			env->imbalance -= util;
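
For reference (my reading of the balancer, not part of the patch): env->idle
describes the CPU doing the pull, so the extra condition only keeps the
conservative check when the destination is already busy:

/*
 * env->idle is the state of the pulling CPU:
 *   CPU_IDLE / CPU_NEWLY_IDLE: destination has nothing to run, so pull
 *                              the task even when util > imbalance
 *   CPU_NOT_IDLE:              destination is busy, keep the
 *                              utilization check as before
 */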
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On Monday, 26 Oct 2020 at 12:04:45 (-0400), Rik van Riel wrote:
> On Mon, 26 Oct 2020 16:42:14 +0100
> Vincent Guittot <vincent.guittot@linaro.org> wrote:
> > On Mon, 26 Oct 2020 at 16:04, Rik van Riel <riel@surriel.com> wrote:
>
> > > Could utilization estimates be off, either lagging or
> > > simply having a wrong estimate for a task, resulting
> > > in no task getting pulled sometimes, while doing a
> > > migrate_task imbalance always moves over something?
> >
> > task and cpu utilization are not always up to fully synced and may lag
> > a bit which explains that sometimes LB can fail to migrate for a small
> > diff
>
> OK, running with this little snippet below, I see latencies
> improve back to near where they used to be:
>
> Latency percentiles (usec) runtime 150 (s)
> 50.0th: 13
> 75.0th: 31
> 90.0th: 69
> 95.0th: 90
> *99.0th: 761
> 99.5th: 2268
> 99.9th: 9104
> min=1, max=16158
>
> I suspect the right/cleaner approach might be to use
> migrate_task more in !CPU_NOT_IDLE cases?
>
> Running a task to an idle CPU immediately, instead of refusing
> to have the load balancer move it, improves latencies for fairly
> obvious reasons.
>
> I am not entirely clear on why the load balancer should need to
> be any more conservative about moving tasks than the wakeup
> path is in eg. select_idle_sibling.


What you are suggesting is something like:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4978964e75e5..3b6fbf33abc2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9156,7 +9156,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 * emptying busiest.
 	 */
 	if (local->group_type == group_has_spare) {
-		if (busiest->group_type > group_fully_busy) {
+		if ((busiest->group_type > group_fully_busy) &&
+		    !(env->sd->flags & SD_SHARE_PKG_RESOURCES)) {
 			/*
 			 * If busiest is overloaded, try to fill spare
 			 * capacity. This might end up creating spare capacity

which also fixes the problem for me and aligns the LB with the wakeup path regarding migration
within the LLC.
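
For reference, SD_SHARE_PKG_RESOURCES is set on the domains whose CPUs share
the LLC, so the effect of the check above, as I read it, is:

/*
 * Sketch of the intent, not a definitive description:
 *   domain within an LLC  (flag set):   spread by nr_running, matching
 *                                       what select_idle_sibling does
 *                                       at wakeup
 *   domain across LLCs    (flag clear): keep the conservative
 *                                       utilization-based pull
 */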

>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 35bdc0cccfa6..60acf71a2d39 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7415,7 +7415,7 @@ static int detach_tasks(struct lb_env *env)
> case migrate_util:
> util = task_util_est(p);
>
> - if (util > env->imbalance)
> + if (util > env->imbalance && env->idle == CPU_NOT_IDLE)

We don't want to be so aggressive outside the LLC, and such a bypass would defeat the purpose of
using utilization instead of nr_running.


> goto next;
>
> env->imbalance -= util;
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On 26 Oct 2020, at 12:20, Vincent Guittot wrote:

> On Monday, 26 Oct 2020 at 12:04:45 (-0400), Rik van Riel wrote:
>> On Mon, 26 Oct 2020 16:42:14 +0100
>> Vincent Guittot <vincent.guittot@linaro.org> wrote:
>>> On Mon, 26 Oct 2020 at 16:04, Rik van Riel <riel@surriel.com> wrote:
>>
>>>> Could utilization estimates be off, either lagging or
>>>> simply having a wrong estimate for a task, resulting
>>>> in no task getting pulled sometimes, while doing a
>>>> migrate_task imbalance always moves over something?
>>>
>>> task and cpu utilization are not always up to fully synced and may
>>> lag
>>> a bit which explains that sometimes LB can fail to migrate for a
>>> small
>>> diff
>>
>> OK, running with this little snippet below, I see latencies
>> improve back to near where they used to be:
>>
>> Latency percentiles (usec) runtime 150 (s)
>> 50.0th: 13
>> 75.0th: 31
>> 90.0th: 69
>> 95.0th: 90
>> *99.0th: 761
>> 99.5th: 2268
>> 99.9th: 9104
>> min=1, max=16158
>>
>> I suspect the right/cleaner approach might be to use
>> migrate_task more in !CPU_NOT_IDLE cases?
>>
>> Running a task to an idle CPU immediately, instead of refusing
>> to have the load balancer move it, improves latencies for fairly
>> obvious reasons.
>>
>> I am not entirely clear on why the load balancer should need to
>> be any more conservative about moving tasks than the wakeup
>> path is in eg. select_idle_sibling.
>
>
> what you are suggesting is something like:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4978964e75e5..3b6fbf33abc2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9156,7 +9156,8 @@ static inline void calculate_imbalance(struct
> lb_env *env, struct sd_lb_stats *s
> * emptying busiest.
> */
> if (local->group_type == group_has_spare) {
> - if (busiest->group_type > group_fully_busy) {
> + if ((busiest->group_type > group_fully_busy) &&
> + !(env->sd->flags & SD_SHARE_PKG_RESOURCES)) {
> /*
> * If busiest is overloaded, try to fill spare
> * capacity. This might end up creating spare
> capacity
>
> which also fixes the problem for me and alignes LB with wakeup path
> regarding the migration
> in the LLC

Vincent’s patch on top of 5.10-rc1 looks pretty great:

Latency percentiles (usec) runtime 90 (s) (3320 total samples)
50.0th: 161 (1687 samples)
75.0th: 200 (817 samples)
90.0th: 228 (488 samples)
95.0th: 254 (164 samples)
*99.0th: 314 (131 samples)
99.5th: 330 (17 samples)
99.9th: 356 (13 samples)
min=29, max=358

Next we test in prod, which probably won’t have answers until
tomorrow. Thanks again Vincent!

-chris
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On Mon, 26 Oct 2020 at 17:48, Chris Mason <clm@fb.com> wrote:
>
> On 26 Oct 2020, at 12:20, Vincent Guittot wrote:
>
> > On Monday, 26 Oct 2020 at 12:04:45 (-0400), Rik van Riel wrote:
> >> On Mon, 26 Oct 2020 16:42:14 +0100
> >> Vincent Guittot <vincent.guittot@linaro.org> wrote:
> >>> On Mon, 26 Oct 2020 at 16:04, Rik van Riel <riel@surriel.com> wrote:
> >>
> >>>> Could utilization estimates be off, either lagging or
> >>>> simply having a wrong estimate for a task, resulting
> >>>> in no task getting pulled sometimes, while doing a
> >>>> migrate_task imbalance always moves over something?
> >>>
> >>> task and cpu utilization are not always up to fully synced and may
> >>> lag
> >>> a bit which explains that sometimes LB can fail to migrate for a
> >>> small
> >>> diff
> >>
> >> OK, running with this little snippet below, I see latencies
> >> improve back to near where they used to be:
> >>
> >> Latency percentiles (usec) runtime 150 (s)
> >> 50.0th: 13
> >> 75.0th: 31
> >> 90.0th: 69
> >> 95.0th: 90
> >> *99.0th: 761
> >> 99.5th: 2268
> >> 99.9th: 9104
> >> min=1, max=16158
> >>
> >> I suspect the right/cleaner approach might be to use
> >> migrate_task more in !CPU_NOT_IDLE cases?
> >>
> >> Running a task to an idle CPU immediately, instead of refusing
> >> to have the load balancer move it, improves latencies for fairly
> >> obvious reasons.
> >>
> >> I am not entirely clear on why the load balancer should need to
> >> be any more conservative about moving tasks than the wakeup
> >> path is in eg. select_idle_sibling.
> >
> >
> > what you are suggesting is something like:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 4978964e75e5..3b6fbf33abc2 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9156,7 +9156,8 @@ static inline void calculate_imbalance(struct
> > lb_env *env, struct sd_lb_stats *s
> > * emptying busiest.
> > */
> > if (local->group_type == group_has_spare) {
> > - if (busiest->group_type > group_fully_busy) {
> > + if ((busiest->group_type > group_fully_busy) &&
> > + !(env->sd->flags & SD_SHARE_PKG_RESOURCES)) {
> > /*
> > * If busiest is overloaded, try to fill spare
> > * capacity. This might end up creating spare
> > capacity
> >
> > which also fixes the problem for me and alignes LB with wakeup path
> > regarding the migration
> > in the LLC
>
> Vincent’s patch on top of 5.10-rc1 looks pretty great:
>
> Latency percentiles (usec) runtime 90 (s) (3320 total samples)
> 50.0th: 161 (1687 samples)
> 75.0th: 200 (817 samples)
> 90.0th: 228 (488 samples)
> 95.0th: 254 (164 samples)
> *99.0th: 314 (131 samples)
> 99.5th: 330 (17 samples)
> 99.9th: 356 (13 samples)
> min=29, max=358
>
> Next we test in prod, which probably won’t have answers until
> tomorrow. Thanks again Vincent!

Great !

I'm going to run more tests on my setup as well to make sure that it
doesn't generate unexpected side effects on other kinds of use cases.

>
> -chris
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On Mon, 2020-10-26 at 17:52 +0100, Vincent Guittot wrote:
> On Mon, 26 Oct 2020 at 17:48, Chris Mason <clm@fb.com> wrote:
> > On 26 Oct 2020, at 12:20, Vincent Guittot wrote:
> >
> > > what you are suggesting is something like:
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 4978964e75e5..3b6fbf33abc2 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -9156,7 +9156,8 @@ static inline void
> > > calculate_imbalance(struct
> > > lb_env *env, struct sd_lb_stats *s
> > > * emptying busiest.
> > > */
> > > if (local->group_type == group_has_spare) {
> > > - if (busiest->group_type > group_fully_busy) {
> > > + if ((busiest->group_type > group_fully_busy) &&
> > > + !(env->sd->flags & SD_SHARE_PKG_RESOURCES)) {
> > > /*
> > > * If busiest is overloaded, try to fill
> > > spare
> > > * capacity. This might end up creating
> > > spare
> > > capacity
> > >
> > > which also fixes the problem for me and alignes LB with wakeup
> > > path
> > > regarding the migration
> > > in the LLC
> >
> > Vincent’s patch on top of 5.10-rc1 looks pretty great:
> >
> > Latency percentiles (usec) runtime 90 (s) (3320 total samples)
> > 50.0th: 161 (1687 samples)
> > 75.0th: 200 (817 samples)
> > 90.0th: 228 (488 samples)
> > 95.0th: 254 (164 samples)
> > *99.0th: 314 (131 samples)
> > 99.5th: 330 (17 samples)
> > 99.9th: 356 (13 samples)
> > min=29, max=358
> >
> > Next we test in prod, which probably won’t have answers until
> > tomorrow. Thanks again Vincent!
>
> Great !
>
> I'm going to run more tests on my setup as well to make sure that it
> doesn't generate unexpected side effects on other kinds of use cases.

We have tested the patch with several pretty demanding
workloads for the past several days, and it seems to
do the trick!

With all the current scheduler code from the Linus tree,
plus this patch on top, performance is as good as it ever
was before with one workload, and slightly better with
the other.

--
All Rights Reversed.
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On Fri, 30 Oct 2020 at 03:11, Rik van Riel <riel@surriel.com> wrote:
>
> On Mon, 2020-10-26 at 17:52 +0100, Vincent Guittot wrote:
> > On Mon, 26 Oct 2020 at 17:48, Chris Mason <clm@fb.com> wrote:
> > > On 26 Oct 2020, at 12:20, Vincent Guittot wrote:
> > >
> > > > what you are suggesting is something like:
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 4978964e75e5..3b6fbf33abc2 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -9156,7 +9156,8 @@ static inline void
> > > > calculate_imbalance(struct
> > > > lb_env *env, struct sd_lb_stats *s
> > > > * emptying busiest.
> > > > */
> > > > if (local->group_type == group_has_spare) {
> > > > - if (busiest->group_type > group_fully_busy) {
> > > > + if ((busiest->group_type > group_fully_busy) &&
> > > > + !(env->sd->flags & SD_SHARE_PKG_RESOURCES)) {
> > > > /*
> > > > * If busiest is overloaded, try to fill
> > > > spare
> > > > * capacity. This might end up creating
> > > > spare
> > > > capacity
> > > >
> > > > which also fixes the problem for me and alignes LB with wakeup
> > > > path
> > > > regarding the migration
> > > > in the LLC
> > >
> > > Vincent’s patch on top of 5.10-rc1 looks pretty great:
> > >
> > > Latency percentiles (usec) runtime 90 (s) (3320 total samples)
> > > 50.0th: 161 (1687 samples)
> > > 75.0th: 200 (817 samples)
> > > 90.0th: 228 (488 samples)
> > > 95.0th: 254 (164 samples)
> > > *99.0th: 314 (131 samples)
> > > 99.5th: 330 (17 samples)
> > > 99.9th: 356 (13 samples)
> > > min=29, max=358
> > >
> > > Next we test in prod, which probably won’t have answers until
> > > tomorrow. Thanks again Vincent!
> >
> > Great !
> >
> > I'm going to run more tests on my setup as well to make sure that it
> > doesn't generate unexpected side effects on other kinds of use cases.
>
> We have tested the patch with several pretty demanding
> workloads for the past several days, and it seems to
> do the trick!
>
> With all the current scheduler code from the Linus tree,
> plus this patch on top, performance is as good as it ever
> was before with one workload, and slightly better with
> the other.

Thanks for the test results.
I still have a few tests to run on my systems but current results look
good for me too.


>
> --
> All Rights Reversed.