Mailing List Archive

Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ
Hi Tim,

On Tuesday 23 March 2021 at 14:37:59 (-0700), Tim Chen wrote:
>
>
> On 1/29/21 9:27 AM, Vincent Guittot wrote:
> >
> > The patch below moves the update of the blocked load of CPUs outside newidle_balance().
>
> On a well-known database workload, we also saw a lot of overhead from update_blocked_averages()
> in newidle_balance(), so changes to reduce this overhead are much welcomed.
>
> Turning on cgroup induces a 9% throughput degradation on a 2-socket Icelake system with 40 cores per socket.
>
> A big part of the overhead in our database workload comes from updating
> blocked averages in newidle_balance, caused by I/O threads making
> some CPUs go in and out of idle frequently in the following code path:
>
> ----__blkdev_direct_IO_simple
> |
> |----io_schedule_timeout
> | |
> | ----schedule_timeout
> | |
> | ----schedule
> | |
> | ----__schedule
> | |
> | ----pick_next_task_fair
> | |
> | ----newidle_balance
> | |
> ----update_blocked_averages
>
> We found that update_blocked_averages() now consumes the most CPU time, eating up 2% of the CPU cycles once cgroup
> is turned on.
>
> I hacked up Joe's original patch to rate limit the update of blocked
> averages called from newidle_balance(). The 9% throughput degradation was reduced to
> 5.4%. We'll be testing Vincent's change to see if it can give a
> similar performance improvement.
>
> Though in our test environment, sysctl_sched_migration_cost was kept
> much lower (25000) than the default (500000), to encourage migrations to idle CPUs
> and reduce latency. We got quite a lot of calls that run update_blocked_averages directly
> and then try to load_balance in newidle_balance, instead of relegating
> the responsibility to the idle load balancer. (See the code snippet from newidle_balance below.)
>
>
>         if (this_rq->avg_idle < sysctl_sched_migration_cost ||    <----- sched_migration_cost check
>             !READ_ONCE(this_rq->rd->overload)) {
>
>                 rcu_read_lock();
>                 sd = rcu_dereference_check_sched_domain(this_rq->sd);
>                 if (sd)
>                         update_next_balance(sd, &next_balance);
>                 rcu_read_unlock();
>
>                 goto out;    <----- invoke idle load balancer
>         }
>
>         raw_spin_unlock(&this_rq->lock);
>
>         update_blocked_averages(this_cpu);
>
>         .... followed by load balance code ---
>

> So the update_blocked_averages() offload to the idle load balancer in Vincent's patch is less
> effective in this case with a small sched_migration_cost.
>
> Looking at the code a bit more, we don't actually load balance every time in this code path
> unless our avg_idle time exceeds some threshold. Doing update_blocked_averages immediately

IIUC your problem, we call update_blocked_averages() but because of:

        if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
                update_next_balance(sd, &next_balance);
                break;
        }

the for_each_domain loop stops even before running load_balance on the 1st
sched domain level, which means that update_blocked_averages() was called
unnecessarily.

And this is even more true with a small sysctl_sched_migration_cost, which allows newly
idle LB for a very small this_rq->avg_idle. One could wonder why you set such a low value
for sysctl_sched_migration_cost, lower than the max_newidle_lb_cost of the
smallest domain, but that's probably because of task_hot().
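
For context, a simplified sketch of the heuristic task_hot() applies (illustration only,
not the exact kernel code): a task that ran within the last sysctl_sched_migration_cost
nanoseconds is treated as cache-hot and is resisted from being migrated, which is why the
sysctl is sometimes tuned down to make migration to idle CPUs more aggressive.

/*
 * Simplified sketch of the task_hot() heuristic (illustration only,
 * not the exact kernel implementation): a recently run task is
 * considered cache-hot and kept where it is.
 */
static int task_is_cache_hot(u64 now_ns, u64 last_exec_start_ns)
{
        s64 delta = now_ns - last_exec_start_ns;

        return delta < (s64)sysctl_sched_migration_cost;
}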

If avg_idle is lower than the sd->max_newidle_lb_cost of the 1st sched_domain, we should
skip the spin_unlock/lock and the for_each_domain() loop entirely.

Maybe something like below:


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 76e33a70d575..08933e0d87ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10723,17 +10723,21 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
          */
         rq_unpin_lock(this_rq, rf);
 
+        rcu_read_lock();
+        sd = rcu_dereference_check_sched_domain(this_rq->sd);
+
         if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-            !READ_ONCE(this_rq->rd->overload)) {
+            !READ_ONCE(this_rq->rd->overload) ||
+            (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
-                rcu_read_lock();
-                sd = rcu_dereference_check_sched_domain(this_rq->sd);
                 if (sd)
                         update_next_balance(sd, &next_balance);
                 rcu_read_unlock();
 
                 goto out;
         }
+        rcu_read_unlock();
+
 
         raw_spin_unlock(&this_rq->lock);


> is only needed if we do call load_balance(). If we don't do any load balance in the code path,
> we can let the idle load balancer update the blocked averages lazily.
>
> Something like the following perhaps, on top of Vincent's patch? We haven't really tested
> this change yet, but want to see if it makes sense to you.
>
> Tim
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 63950d80fd0b..b93f5f52658a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10591,6 +10591,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>          struct sched_domain *sd;
>          int pulled_task = 0;
>          u64 curr_cost = 0;
> +        bool updated_blocked_avg = false;
>
>          update_misfit_status(NULL, this_rq);
>          /*
> @@ -10627,7 +10628,6 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
>          raw_spin_unlock(&this_rq->lock);
>
> -        update_blocked_averages(this_cpu);
>          rcu_read_lock();
>          for_each_domain(this_cpu, sd) {
>                  int continue_balancing = 1;
> @@ -10639,6 +10639,11 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>                  }
>
>                  if (sd->flags & SD_BALANCE_NEWIDLE) {
> +                        if (!updated_blocked_avg) {
> +                                update_blocked_averages(this_cpu);
> +                                updated_blocked_avg = true;
> +                        }
> +
>                          t0 = sched_clock_cpu(this_cpu);
>
>                          pulled_task = load_balance(this_cpu, this_rq,
>
>
Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ
On 3/24/21 6:44 AM, Vincent Guittot wrote:
> Hi Tim,

>
> IIUC your problem, we call update_blocked_averages() but because of:
>
> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> update_next_balance(sd, &next_balance);
> break;
> }
>
> the for_each_domain loop stops even before running load_balance on the 1st
> sched domain level which means that update_blocked_averages() was called
> unnecessarily.
>

That's right

> And this is even more true with a small sysctl_sched_migration_cost which allows newly
> idle LB for very small this_rq->avg_idle. We could wonder why you set such a low value
> for sysctl_sched_migration_cost which is lower than the max_newidle_lb_cost of the
> smallest domain but that's probably because of task_hot().
>
> if avg_idle is lower than the sd->max_newidle_lb_cost of the 1st sched_domain, we should
> skip spin_unlock/lock and for_each_domain() loop entirely
>
> Maybe something like below:
>

The patch makes sense. I'll ask our benchmark team to queue this patch for testing.

Tim
Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ
Hi Tim,

On Wed, 24 Mar 2021 at 17:05, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
>
>
> On 3/24/21 6:44 AM, Vincent Guittot wrote:
> > Hi Tim,
>
> >
> > IIUC your problem, we call update_blocked_averages() but because of:
> >
> > if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> > update_next_balance(sd, &next_balance);
> > break;
> > }
> >
> > the for_each_domain loop stops even before running load_balance on the 1st
> > sched domain level which means that update_blocked_averages() was called
> > unnecessarily.
> >
>
> That's right
>
> > And this is even more true with a small sysctl_sched_migration_cost which allows newly
> > idle LB for very small this_rq->avg_idle. We could wonder why you set such a low value
> > for sysctl_sched_migration_cost which is lower than the max_newidle_lb_cost of the
> > smallest domain but that's probably because of task_hot().
> >
> > if avg_idle is lower than the sd->max_newidle_lb_cost of the 1st sched_domain, we should
> > skip spin_unlock/lock and for_each_domain() loop entirely
> >
> > Maybe something like below:
> >
>
> The patch makes sense. I'll ask our benchmark team to queue this patch for testing.

Do you have feedback from your benchmark team?

Regards,
Vincent
>
> Tim
>
>
Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ
On 4/7/21 7:02 AM, Vincent Guittot wrote:
> Hi Tim,
>
> On Wed, 24 Mar 2021 at 17:05, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>>
>>
>>
>> On 3/24/21 6:44 AM, Vincent Guittot wrote:
>>> Hi Tim,
>>
>>>
>>> IIUC your problem, we call update_blocked_averages() but because of:
>>>
>>> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>>> update_next_balance(sd, &next_balance);
>>> break;
>>> }
>>>
>>> the for_each_domain loop stops even before running load_balance on the 1st
>>> sched domain level which means that update_blocked_averages() was called
>>> unnecessarily.
>>>
>>
>> That's right
>>
>>> And this is even more true with a small sysctl_sched_migration_cost which allows newly
>>> idle LB for very small this_rq->avg_idle. We could wonder why you set such a low value
>>> for sysctl_sched_migration_cost which is lower than the max_newidle_lb_cost of the
>>> smallest domain but that's probably because of task_hot().
>>>
>>> if avg_idle is lower than the sd->max_newidle_lb_cost of the 1st sched_domain, we should
>>> skip spin_unlock/lock and for_each_domain() loop entirely
>>>
>>> Maybe something like below:
>>>
>>
>> The patch makes sense. I'll ask our benchmark team to queue this patch for testing.
>
> Do you have feedback from your benchmark team ?
>

Vincent,

Thanks for following up. I just got some data back from the benchmark team.
The performance didn't change with your patch, and the overall cpu% of update_blocked_averages
also remained at about the same level. My first thought was that perhaps this update
still didn't catch all the calls to update_blocked_averages:

        if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-            !READ_ONCE(this_rq->rd->overload)) {
+            !READ_ONCE(this_rq->rd->overload) ||
+            (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

To experiment, I added one more check on next_balance to further limit
the path that actually does an idle load balance, using the next_balance time.

        if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-            !READ_ONCE(this_rq->rd->overload)) {
+            time_before(jiffies, this_rq->next_balance) ||
+            !READ_ONCE(this_rq->rd->overload) ||
+            (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

I was surprised to find that the overall cpu% consumption of update_blocked_averages
and the throughput of the benchmark still didn't change much. So I took a
peek into the profile and found the update_blocked_averages calls had shifted to the idle load balancer.
The calls to update_blocked_averages were reduced in newidle_balance, so the patch did
what we intended. But the overall rate of calls to
update_blocked_averages remained roughly the same, shifting from
newidle_balance to run_rebalance_domains.

100.00% (ffffffff810cf070)
|
---update_blocked_averages
|
|--95.47%--run_rebalance_domains
| __do_softirq
| |
| |--94.27%--asm_call_irq_on_stack
| | do_softirq_own_stack
| | |
| | |--93.74%--irq_exit_rcu
| | | |
| | | |--88.20%--sysvec_apic_timer_interrupt
| | | | asm_sysvec_apic_timer_interrupt
| | | | |
...
|
|
--4.53%--newidle_balance
pick_next_task_fair

I was expecting the idle load balancer to be rate limited to 60 Hz, which
should be 15 jiffies apart on the test system with CONFIG_HZ_250.
When I did a trace on a single CPU, I saw that update_blocked_averages
is often called between 1 to 4 jiffies apart, which is a much higher
rate than I expected. I haven't taken a closer look yet, but you may
have a better idea. I won't have access to the test system and workload
till probably next week.

Thanks.

Tim
Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ
On Wed, 7 Apr 2021 at 19:19, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
>
>
> On 4/7/21 7:02 AM, Vincent Guittot wrote:
> > Hi Tim,
> >
> > On Wed, 24 Mar 2021 at 17:05, Tim Chen <tim.c.chen@linux.intel.com> wrote:
> >>
> >>
> >>
> >> On 3/24/21 6:44 AM, Vincent Guittot wrote:
> >>> Hi Tim,
> >>
> >>>
> >>> IIUC your problem, we call update_blocked_averages() but because of:
> >>>
> >>> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> >>> update_next_balance(sd, &next_balance);
> >>> break;
> >>> }
> >>>
> >>> the for_each_domain loop stops even before running load_balance on the 1st
> >>> sched domain level which means that update_blocked_averages() was called
> >>> unnecessarily.
> >>>
> >>
> >> That's right
> >>
> >>> And this is even more true with a small sysctl_sched_migration_cost which allows newly
> >>> idle LB for very small this_rq->avg_idle. We could wonder why you set such a low value
> >>> for sysctl_sched_migration_cost which is lower than the max_newidle_lb_cost of the
> >>> smallest domain but that's probably because of task_hot().
> >>>
> >>> if avg_idle is lower than the sd->max_newidle_lb_cost of the 1st sched_domain, we should
> >>> skip spin_unlock/lock and for_each_domain() loop entirely
> >>>
> >>> Maybe something like below:
> >>>
> >>
> >> The patch makes sense. I'll ask our benchmark team to queue this patch for testing.
> >
> > Do you have feedback from your benchmark team ?
> >
>
> Vincent,
>
> Thanks for following up. I just got some data back from the benchmark team.
> The performance didn't change with your patch. And the overall cpu% of update_blocked_averages
> also remain at about the same level. My first thought was perhaps this update
> still didn't catch all the calls to update_blocked_averages
>
> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> - !READ_ONCE(this_rq->rd->overload)) {
> + !READ_ONCE(this_rq->rd->overload) ||
> + (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> To experiment, I added one more check on the next_balance to further limit
> the path to actually do idle load balance with the next_balance time.
>
> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> - !READ_ONCE(this_rq->rd->overload)) {
> + time_before(jiffies, this_rq->next_balance) ||
> + !READ_ONCE(this_rq->rd->overload) ||
> + (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> I was suprised to find the overall cpu% consumption of update_blocked_averages
> and throughput of the benchmark still didn't change much. So I took a
> peek into the profile and found the update_blocked_averages calls shifted to the idle load balancer.
> The call to update_locked_averages was reduced in newidle_balance so the patch did
> what we intended. But the overall rate of calls to

At least, we have removed the useless call to update_blocked_averages
in newidle_balance when we will not perform any newly idle load
balance.

> update_blocked_averages remain roughly the same, shifting from
> newidle_balance to run_rebalance_domains.
>
> 100.00% (ffffffff810cf070)
> |
> ---update_blocked_averages
> |
> |--95.47%--run_rebalance_domains
> | __do_softirq
> | |
> | |--94.27%--asm_call_irq_on_stack
> | | do_softirq_own_stack

The calls to update_blocked_averages mainly come from SCHED_SOFTIRQ,
and as a result not from the new path
do_idle()->nohz_run_idle_balance(), which has been added by this patch
to defer the call to update_nohz_stats() until after newidle_balance()
and before entering idle.

> | | |
> | | |--93.74%--irq_exit_rcu
> | | | |
> | | | |--88.20%--sysvec_apic_timer_interrupt
> | | | | asm_sysvec_apic_timer_interrupt
> | | | | |
> ...
> |
> |
> --4.53%--newidle_balance
> pick_next_task_fair
>
> I was expecting idle load balancer to be rate limited to 60 Hz, which

Why 60 Hz?

> should be 15 jiffies apart on the test system with CONFIG_HZ_250.
> When I did a trace on a single CPU, I see that update_blocked_averages
> are often called between 1 to 4 jiffies apart, which is at a much higher
> rate than I expected. I haven't taken a closer look yet. But you may

Two things can trigger a SCHED_SOFTIRQ/run_rebalance_domains (illustrated just below):
- the need for an update of blocked load, which should not happen more
  than once every 32ms, which means a rate of around 30Hz
- the need for a load balance of a sched_domain. The min interval for
  a sched_domain is its weight when the CPU is idle, which is usually a few
  jiffies
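
As an illustration of those two triggers, a minimal sketch only; the helper and parameter
names below are placeholders, not the kernel's actual code:

/*
 * Illustrative sketch, not the kernel's actual code: a SCHED_SOFTIRQ
 * rebalance is wanted either when the blocked-load update is due
 * (roughly every 32ms) or when some sched_domain's balance interval
 * (a few jiffies when the CPU is idle) has elapsed.
 */
static bool softirq_rebalance_due(unsigned long now,
                                  unsigned long next_blocked_update,
                                  unsigned long next_domain_balance)
{
        return time_after_eq(now, next_blocked_update) ||
               time_after_eq(now, next_domain_balance);
}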

The only idea that I have for now is that we spend less time in
newidle_balance(), which changes the dynamics of your system.

In your trace, could you check if update_blocked_averages is called
during the tick? And is the current task the idle task?

Vincent

> have a better idea. I won't have access to the test system and workload
> till probably next week.
>
> Thanks.
>
> Tim
Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ
On 4/8/21 7:51 AM, Vincent Guittot wrote:

>> I was suprised to find the overall cpu% consumption of update_blocked_averages
>> and throughput of the benchmark still didn't change much. So I took a
>> peek into the profile and found the update_blocked_averages calls shifted to the idle load balancer.
>> The call to update_locked_averages was reduced in newidle_balance so the patch did
>> what we intended. But the overall rate of calls to
>
> At least , we have removed the useless call to update_blocked_averages
> in newidle_balance when we will not perform any newly idle load
> balance
>
>> update_blocked_averages remain roughly the same, shifting from
>> newidle_balance to run_rebalance_domains.
>>
>> 100.00% (ffffffff810cf070)
>> |
>> ---update_blocked_averages
>> |
>> |--95.47%--run_rebalance_domains
>> | __do_softirq
>> | |
>> | |--94.27%--asm_call_irq_on_stack
>> | | do_softirq_own_stack
>
> The call of update_blocked_averages mainly comes from SCHED_SOFTIRQ.
> And as a result, not from the new path
> do_idle()->nohz_run_idle_balance() which has been added by this patch
> to defer the call to update_nohz_stats() after newlyidle_balance and
> before entering idle.
>
>> | | |
>> | | |--93.74%--irq_exit_rcu
>> | | | |
>> | | | |--88.20%--sysvec_apic_timer_interrupt
>> | | | | asm_sysvec_apic_timer_interrupt
>> | | | | |
>> ...
>> |
>> |
>> --4.53%--newidle_balance
>> pick_next_task_fair
>>
>> I was expecting idle load balancer to be rate limited to 60 Hz, which
>
> Why 60Hz ?
>

My thinking is we will trigger load balance only after rq->next_balance.

void trigger_load_balance(struct rq *rq)
{
        /* Don't need to rebalance while attached to NULL domain */
        if (unlikely(on_null_domain(rq)))
                return;

        if (time_after_eq(jiffies, rq->next_balance))
                raise_softirq(SCHED_SOFTIRQ);

        nohz_balancer_kick(rq);
}

And it seems like next_balance is set to be 60 Hz

static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
{
        int continue_balancing = 1;
        int cpu = rq->cpu;
        int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
        unsigned long interval;
        struct sched_domain *sd;
        /* Earliest time when we have to do rebalance again */
        unsigned long next_balance = jiffies + 60*HZ;


>> should be 15 jiffies apart on the test system with CONFIG_HZ_250.
>> When I did a trace on a single CPU, I see that update_blocked_averages
>> are often called between 1 to 4 jiffies apart, which is at a much higher
>> rate than I expected. I haven't taken a closer look yet. But you may
>
> 2 things can trigger a SCHED_SOFTIRQ/run_rebalance_domains:
> - the need for an update of blocked load which should not happen more
> than once every 32ms which means a rate of around 30Hz
> - the need for a load balance of a sched_domain. The min interval for
> a sched_domain is its weight when the CPU is idle which is usually few
> jiffies
>
> The only idea that I have for now is that we spend less time in
> newidle_balance which changes the dynamic of your system.
>
> In your trace, could you check if update_blocked_averages is called
> during the tick ? and Is the current task idle task ?

Here's a snapshot of the trace. However, I didn't have the current task in my trace.
You can tell the frequency at which update_blocked_averages is called on
cpu 2 by the jiffies values; they are quite close together (1 to 3 jiffies apart).
When I have a chance to get on the machine, I'll take another look
at the current task and whether we got to trigger_load_balance() from scheduler_tick().


3.505 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb731
4.505 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb732
6.484 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb733
6.506 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb734
9.503 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb737
11.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb739
11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
11.642 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
11.645 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
11.977 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
12.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
12.015 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
12.043 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
12.567 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73a
13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.159 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.203 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.223 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.301 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73c
14.637 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
14.666 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
15.059 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
15.083 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
15.100 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
15.103 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
15.150 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
15.227 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
15.248 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
15.311 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
15.503 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73d
16.140 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73d
16.185 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
16.224 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
16.340 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
16.384 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
16.503 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73e
16.993 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73e
17.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73f
17.630 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73f
17.830 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
18.015 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
18.031 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
18.036 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
18.040 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
18.502 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb740

Thanks for taking a look.

Tim
Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ
On Fri, 9 Apr 2021 at 01:05, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
>
>
>
> On 4/8/21 7:51 AM, Vincent Guittot wrote:
>
> >> I was suprised to find the overall cpu% consumption of update_blocked_averages
> >> and throughput of the benchmark still didn't change much. So I took a
> >> peek into the profile and found the update_blocked_averages calls shifted to the idle load balancer.
> >> The call to update_locked_averages was reduced in newidle_balance so the patch did
> >> what we intended. But the overall rate of calls to
> >
> > At least , we have removed the useless call to update_blocked_averages
> > in newidle_balance when we will not perform any newly idle load
> > balance
> >
> >> update_blocked_averages remain roughly the same, shifting from
> >> newidle_balance to run_rebalance_domains.
> >>
> >> 100.00% (ffffffff810cf070)
> >> |
> >> ---update_blocked_averages
> >> |
> >> |--95.47%--run_rebalance_domains
> >> | __do_softirq
> >> | |
> >> | |--94.27%--asm_call_irq_on_stack
> >> | | do_softirq_own_stack
> >
> > The call of update_blocked_averages mainly comes from SCHED_SOFTIRQ.
> > And as a result, not from the new path
> > do_idle()->nohz_run_idle_balance() which has been added by this patch
> > to defer the call to update_nohz_stats() after newlyidle_balance and
> > before entering idle.
> >
> >> | | |
> >> | | |--93.74%--irq_exit_rcu
> >> | | | |
> >> | | | |--88.20%--sysvec_apic_timer_interrupt
> >> | | | | asm_sysvec_apic_timer_interrupt
> >> | | | | |
> >> ...
> >> |
> >> |
> >> --4.53%--newidle_balance
> >> pick_next_task_fair
> >>
> >> I was expecting idle load balancer to be rate limited to 60 Hz, which
> >
> > Why 60Hz ?
> >
>
> My thinking is we will trigger load balance only after rq->next_balance.
>
> void trigger_load_balance(struct rq *rq)
> {
> /* Don't need to rebalance while attached to NULL domain */
> if (unlikely(on_null_domain(rq)))
> return;
>
> if (time_after_eq(jiffies, rq->next_balance))
> raise_softirq(SCHED_SOFTIRQ);
>
> nohz_balancer_kick(rq);
> }
>
> And it seems like next_balance is set to be 60 Hz
>
> static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
> {
> int continue_balancing = 1;
> int cpu = rq->cpu;
> int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
> unsigned long interval;
> struct sched_domain *sd;
> /* Earliest time when we have to do rebalance again */
> unsigned long next_balance = jiffies + 60*HZ;

This doesn't mean a 60 Hz period but 60*HZ, with HZ being the number of
jiffies per second. We init next_balance with now + 60 sec to make
sure it's far later than the next balance of the sched_domains.

Then, update_next_balance() keeps track of the first balance to happen next time.
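
To make the arithmetic concrete (illustration only, assuming CONFIG_HZ_250 as mentioned above):

        /* With CONFIG_HZ_250, HZ == 250 jiffies per second */
        unsigned long far_future  = jiffies + 60*HZ;  /* 15000 jiffies, i.e. 60 seconds ahead     */
        unsigned long hz60_period = HZ / 60;          /* ~4 jiffies: what a true 60 Hz period is  */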

>
>
> >> should be 15 jiffies apart on the test system with CONFIG_HZ_250.
> >> When I did a trace on a single CPU, I see that update_blocked_averages
> >> are often called between 1 to 4 jiffies apart, which is at a much higher
> >> rate than I expected. I haven't taken a closer look yet. But you may
> >
> > 2 things can trigger a SCHED_SOFTIRQ/run_rebalance_domains:
> > - the need for an update of blocked load which should not happen more
> > than once every 32ms which means a rate of around 30Hz
> > - the need for a load balance of a sched_domain. The min interval for
> > a sched_domain is its weight when the CPU is idle which is usually few
> > jiffies
> >
> > The only idea that I have for now is that we spend less time in
> > newidle_balance which changes the dynamic of your system.
> >
> > In your trace, could you check if update_blocked_averages is called
> > during the tick ? and Is the current task idle task ?
>
> Here's a snapshot of the trace. However I didn't have the current task in my trace.
> You can tell the frequency that update_blocked_averages is called on
> cpu 2 by the jiffies value. They are quite close together (1 to 3 jiffies apart).
> When I have a chance to get on the machine, I'll take another look
> at the current task and whether we got to trigger_load_balance() from scheduler_tick().
>
>
> 3.505 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb731
> 4.505 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb732
> 6.484 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb733
> 6.506 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb734
> 9.503 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb737
> 11.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb739
> 11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
> 11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 11.642 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 11.645 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 11.977 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 12.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 12.015 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 12.043 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 12.567 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73a
> 13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
> 13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> 14.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> 14.159 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> 14.203 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> 14.223 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> 14.301 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> 14.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73c
> 14.637 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
> 14.666 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
> 15.059 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
> 15.083 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
> 15.100 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
> 15.103 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
> 15.150 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
> 15.227 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
> 15.248 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
> 15.311 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
> 15.503 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73d
> 16.140 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73d
> 16.185 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
> 16.224 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
> 16.340 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
> 16.384 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
> 16.503 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73e
> 16.993 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73e
> 17.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73f
> 17.630 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73f
> 17.830 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
> 18.015 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
> 18.031 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
> 18.036 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
> 18.040 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
> 18.502 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb740
>

I don't know exactly what you track with "next_balance=" in
probe:newidle_balance, but it always starts with the same value
0x1004fb76c in the future and finishes with a value 0x1004fb731 in the
past. This would mean that a load balance is needed during the next
tick, which explains why we can then see the
probe:update_blocked_averages for each tick.

Also, could you check if the tick is stopped when idle? When the
predicted idle time is short and the next wakeup is expected to happen
before the next tick, the tick is not stopped.

Vincent

> Thanks for taking a look.
>
> Tim
Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ
On 4/9/21 8:26 AM, Vincent Guittot wrote:

>>>>
>>>> I was expecting idle load balancer to be rate limited to 60 Hz, which
>>>
>>> Why 60Hz ?
>>>
>>
>> My thinking is we will trigger load balance only after rq->next_balance.
>>
>> void trigger_load_balance(struct rq *rq)
>> {
>> /* Don't need to rebalance while attached to NULL domain */
>> if (unlikely(on_null_domain(rq)))
>> return;
>>
>> if (time_after_eq(jiffies, rq->next_balance))
>> raise_softirq(SCHED_SOFTIRQ);
>>
>> nohz_balancer_kick(rq);
>> }
>>
>> And it seems like next_balance is set to be 60 Hz
>>
>> static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>> {
>> int continue_balancing = 1;
>> int cpu = rq->cpu;
>> int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
>> unsigned long interval;
>> struct sched_domain *sd;
>> /* Earliest time when we have to do rebalance again */
>> unsigned long next_balance = jiffies + 60*HZ;
>
> This doesn't mean 60 Hz period but 60*HZ with HZ being the number of
> jiffies per second. We init next_balance with now + 60 sec to make
> sure it's far later than the next balance of the sched_domains
>
> Then, update_next_balance() keeps track of 1st balance to happen next time
>

Thanks for pointing out my misread of the code. In this case the
balance frequency should be lower than I thought, as balances should be 60 sec
apart in theory.

>> Here's a snapshot of the trace. However I didn't have the current task in my trace.
>> You can tell the frequency that update_blocked_averages is called on
>> cpu 2 by the jiffies value. They are quite close together (1 to 3 jiffies apart).
>> When I have a chance to get on the machine, I'll take another look
>> at the current task and whether we got to trigger_load_balance() from scheduler_tick().
>>
>>
>> 3.505 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb731
>> 4.505 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb732
>> 6.484 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb733
>> 6.506 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb734
>> 9.503 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb737
>> 11.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb739
>> 11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
>> 11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 11.642 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 11.645 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 11.977 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 12.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 12.015 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 12.043 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>> 12.567 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73a
>> 13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
>> 13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>> 14.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>> 14.159 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>> 14.203 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>> 14.223 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>> 14.301 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>> 14.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73c
>> 14.637 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
>> 14.666 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>> 15.059 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>> 15.083 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>> 15.100 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>> 15.103 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>> 15.150 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>> 15.227 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>> 15.248 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>> 15.311 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>> 15.503 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73d
>> 16.140 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73d
>> 16.185 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
>> 16.224 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
>> 16.340 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
>> 16.384 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73d
>> 16.503 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73e
>> 16.993 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73e
>> 17.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73f
>> 17.630 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73f
>> 17.830 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>> 18.015 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>> 18.031 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>> 18.036 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>> 18.040 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73f
>> 18.502 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb740
>>
>
> I don't know exactly what you track with "next_balance=" in

It is the rq->next_balance value as we enter the newidle_balance function.

> probe:newidle_balance but it always starts with the same value
> 0x1004fb76c in the future to finish with a value 0x1004fb731 in the
> past.

This indeed is odd as the next_balance should move forward and not backward.

> This would mean that a load balance is needed during the next
> tick which explains why we can see then the
> probe:update_blocked_averages for each tick.

Will try to debug and find out why the next_balance has gone backwards
next time I get access to the test system.

>
> Also could you check if the tick is stopped when idle. When the
> predicted idle time is short and the next wake is expected to happen
> before the next tick, the tick is not stopped.
>

Will do.

Tim
Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ
On 4/9/21 10:59 AM, Tim Chen wrote:

>>> 11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
>>> 11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>> 11.642 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>> 11.645 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>> 11.977 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>> 12.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>> 12.015 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>> 12.043 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
>>> 12.567 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73a
>>> 13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
>>> 13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>>> 14.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>>> 14.159 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>>> 14.203 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>>> 14.223 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>>> 14.301 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
>>> 14.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73c

>> I don't know exactly what you track with "next_balance=" in
>
> It is the rq->next_balance value as we enter the newidle_balance function.
>
>> probe:newidle_balance but it always starts with the same value
>> 0x1004fb76c in the future to finish with a value 0x1004fb731 in the
>> past.
>
> This indeed is odd as the next_balance should move forward and not backward.


Vincent,

I found a bug in newidle_balance() that moved the next balance time
backward. I fixed it in patch 1 below. This corrects the
next_balance time update and we now have proper load balance rate limiting.

After putting in the other two changes previously discussed with you (patches 2 and 3 below),
I see a very good improvement (about +5%) in the database workload I was investigating.
The costly update_blocked_averages() function is called much less frequently and consumes
only 0.2% of CPU cycles instead of 2.6% before the changes.

Including all three patches here together for easier review. The patches
apply to the tip tree's sched/core branch.

Thanks.

Tim

--->8---

From 848eb8f45b53b45cacf70022c98f632daabefe77 Mon Sep 17 00:00:00 2001
Message-Id: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Fri, 7 May 2021 14:19:47 -0700
Subject: [PATCH 1/3] sched: Fix rq->next_balance time going backward

In traces on newidle_balance(), this_rq->next_balance
time goes backward from time to time, e.g.

11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.637 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
14.666 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c

This was due to newidle_balance() updating this_rq->next_balance
to an earlier time than its current value. The real intention
was to make sure the update moves this_rq->next_balance forward:

out:
        /* Move the next balance forward */
        if (time_after(this_rq->next_balance, next_balance))
                this_rq->next_balance = next_balance;

The actual outcome was moving this_rq->next_balance backward,
in the wrong direction.

Fix the incorrect check on next_balance causing the problem.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d75af1ecfb4..b0b5698b2184 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10681,7 +10681,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
 out:
         /* Move the next balance forward */
-        if (time_after(this_rq->next_balance, next_balance))
+        if (time_after(next_balance, this_rq->next_balance))
                 this_rq->next_balance = next_balance;
 
         if (pulled_task)
--
2.20.1


From f2c9af4af6438ad79076e1a664003dc01ad4fdf0 Mon Sep 17 00:00:00 2001
Message-Id: <f2c9af4af6438ad79076e1a664003dc01ad4fdf0.1620677280.git.tim.c.chen@linux.intel.com>
In-Reply-To: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
References: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
From: Vincent Guittot <vincent.guittot@linaro.org>
Date: Fri, 7 May 2021 14:38:10 -0700
Subject: [PATCH 2/3] sched: Skip update_blocked_averages if we are deferring
 load balance

In newidle_balance(), the scheduler skips the newly idle load balance when, for the
first sched_domain sd of this_rq,

        this_rq->avg_idle < sd->max_newidle_lb_cost

Doing a costly call to update_blocked_averages() is
not useful and simply adds overhead when this condition is true.

Check the condition early in newidle_balance() to skip update_blocked_averages()
when possible.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/fair.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b0b5698b2184..f828b75488a0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10612,17 +10612,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
          */
         rq_unpin_lock(this_rq, rf);
 
+        rcu_read_lock();
+        sd = rcu_dereference_check_sched_domain(this_rq->sd);
+
         if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-            !READ_ONCE(this_rq->rd->overload)) {
+            !READ_ONCE(this_rq->rd->overload) ||
+            (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
-                rcu_read_lock();
-                sd = rcu_dereference_check_sched_domain(this_rq->sd);
                 if (sd)
                         update_next_balance(sd, &next_balance);
                 rcu_read_unlock();
 
                 goto out;
         }
+        rcu_read_unlock();
 
         raw_spin_unlock(&this_rq->lock);

--
2.20.1


From c45d13c6156c3cdc340ef3ba523b8750642a9c50 Mon Sep 17 00:00:00 2001
Message-Id: <c45d13c6156c3cdc340ef3ba523b8750642a9c50.1620677280.git.tim.c.chen@linux.intel.com>
In-Reply-To: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
References: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Fri, 7 May 2021 14:54:54 -0700
Subject: [PATCH 3/3] sched: Rate limit load balance in newidle_balance()

Currently newidle_balance() can do load balancing even if the cpu is not
due for a load balance. Make newidle_balance() check the next_balance
time on the cpu's runqueue so it defers load balancing if the cpu is not
yet due for a balance.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/fair.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f828b75488a0..8e00e1fdd6e0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10617,6 +10617,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 
         if (this_rq->avg_idle < sysctl_sched_migration_cost ||
             !READ_ONCE(this_rq->rd->overload) ||
+            time_before(jiffies, this_rq->next_balance) ||
             (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
 
                 if (sd)
--
2.20.1
Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ
Hi Tim,

On Mon, 10 May 2021 at 23:59, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
>
>
> On 4/9/21 10:59 AM, Tim Chen wrote:
>
> >>> 11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
> >>> 11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> >>> 11.642 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> >>> 11.645 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> >>> 11.977 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> >>> 12.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> >>> 12.015 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> >>> 12.043 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> >>> 12.567 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73a
> >>> 13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
> >>> 13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> >>> 14.003 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> >>> 14.159 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> >>> 14.203 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> >>> 14.223 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> >>> 14.301 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> >>> 14.504 ( ): probe:update_blocked_averages:(ffffffff810cf070) cpu=2 jiffies=0x1004fb73c
>
> >> I don't know exactly what you track with "next_balance=" in
> >
> > It is the rq->next_balance value as we enter the newidle_balance function.
> >
> >> probe:newidle_balance but it always starts with the same value
> >> 0x1004fb76c in the future to finish with a value 0x1004fb731 in the
> >> past.
> >
> > This indeed is odd as the next_balance should move forward and not backward.
>
>
> Vincent,
>
> I found a bug in newidle_balance() that moved the next balance time
> backward. I fixed it in patch 1 below. This corrects the
> next_balance time update and we now have proper load balance rate limiting.
>
> After putting in the other two changes previously discussed with you (patches 2 and 3 below),
> I see a very good improvement (about +5%) in the database workload I was investigating.
> The costly update_blocked_averages() function is called much less frequently and consumes
> only 0.2% of cpu cycles instead of the 2.6% before the changes.

That's good news.

>
> Including all three patches here together for easier review. The patches
> apply to the tip tree's sched/core branch.
>
> Thanks.
>
> Tim
>
> --->8---
>
> From 848eb8f45b53b45cacf70022c98f632daabefe77 Mon Sep 17 00:00:00 2001
> Message-Id: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
> From: Tim Chen <tim.c.chen@linux.intel.com>
> Date: Fri, 7 May 2021 14:19:47 -0700
> Subject: [PATCH 1/3] sched: Fix rq->next_balance time going backward
>
> In traces on newidle_balance(), this_rq->next_balance
> time goes backward from time to time, e.g.
>
> 11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
> 11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
> 13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> 14.637 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
> 14.666 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>
> This was due to newidle_balance() updating this_rq->next_balance
> to an earlier time than its current value. The real intention
> was to make sure the update only moves this_rq->next_balance
> forward:

Sometimes, we want to set this_rq->next_balance backward compared to
its current value. When a CPU is busy, the balance interval is
multiplied by busy_factor which is set to 16 by default. On SMT2 sched
domain level, it means that the interval will be 32ms when busy
instead of 2ms. But if a busy load balance happens just before
becoming idle, the this_rq->next_balance will be set 32ms later
whereas it should go down to 2ms as the CPU is now idle. And this
becomes even worse when you have 128 CPUs at die sched_domain level
because the idle CPU will have to wait 2048ms instead of the correct
128ms interval.
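
For illustration, here is a minimal standalone sketch of that busy_factor
arithmetic (purely illustrative, not the kernel's actual
get_sd_balance_interval(), which also converts the interval to jiffies and
clamps it):

#include <stdio.h>

/* a busy CPU stretches its balance interval by busy_factor (16 by default) */
static unsigned long balance_interval_ms(unsigned long base_ms, int cpu_busy)
{
	return cpu_busy ? base_ms * 16 : base_ms;
}

int main(void)
{
	/* SMT2 level: 2ms base interval; die level with 128 CPUs: 128ms */
	printf("SMT2: idle %lums, busy %lums\n",
	       balance_interval_ms(2, 0), balance_interval_ms(2, 1));
	printf("die:  idle %lums, busy %lums\n",
	       balance_interval_ms(128, 0), balance_interval_ms(128, 1));
	return 0;
}

A CPU that ran a busy load balance right before going idle is left with a
next_balance computed from the stretched interval, which is why moving
next_balance "backward" to the shorter idle interval is the desired
behaviour here.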

>
> out:
> /* Move the next balance forward */
> if (time_after(this_rq->next_balance, next_balance))
> this_rq->next_balance = next_balance;
>
> The actual outcome was moving this_rq->next_balance backward,
> in the wrong direction.

As explained above, this is intentional.
You are facing a situation where the load balance of a sched_domain
level has not run for a while, so last_balance is quite old and
generates a next_balance that is still in the past.

typically:

CPU is busy
CPU runs busy LB at time T so last_balance = T
And next LB will not happen before T+16*Interval
At time T+15*Interval, CPU enters idle
newidle_balance updates next_balance to last_balance+Interval which is
in the past

>
> Fix the incorrect check on next_balance causing the problem.
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> kernel/sched/fair.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d75af1ecfb4..b0b5698b2184 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10681,7 +10681,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> out:
> /* Move the next balance forward */
> - if (time_after(this_rq->next_balance, next_balance))
> + if (time_after(next_balance, this_rq->next_balance))

The current comparison is correct but next_balance should not be in the past.

update_next_balance() is only used in newidle_balance() so we could
modify it to have:

next = max(jiffies+1, next = sd->last_balance + interval)


> this_rq->next_balance = next_balance;
>
> if (pulled_task)
> --
> 2.20.1
>
>
> From f2c9af4af6438ad79076e1a664003dc01ad4fdf0 Mon Sep 17 00:00:00 2001
> Message-Id: <f2c9af4af6438ad79076e1a664003dc01ad4fdf0.1620677280.git.tim.c.chen@linux.intel.com>
> In-Reply-To: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
> References: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
> From: Vincent Guittot <vincent.guittot@linaro.org>
> Date: Fri, 7 May 2021 14:38:10 -0700
> Subject: [PATCH 2/3] sched: Skip update_blocked_averages if we are deferring
> load balance
>
> In newidle_balance(), the scheduler skips load balance to the new idle cpu when the first sched domain (sd) of this_rq satisfies
>
> this_rq->avg_idle < sd->max_newidle_lb_cost
>
> Doing a costly call to update_blocked_averages() will
> not be useful and simply adds overhead when this condition is true.
>
> Check the condition early in newidle_balance() to skip update_blocked_averages()
> when possible.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> kernel/sched/fair.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b0b5698b2184..f828b75488a0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10612,17 +10612,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> */
> rq_unpin_lock(this_rq, rf);
>
> + rcu_read_lock();
> + sd = rcu_dereference_check_sched_domain(this_rq->sd);
> +
> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> - !READ_ONCE(this_rq->rd->overload)) {
> + !READ_ONCE(this_rq->rd->overload) ||
> + (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> - rcu_read_lock();
> - sd = rcu_dereference_check_sched_domain(this_rq->sd);
> if (sd)
> update_next_balance(sd, &next_balance);
> rcu_read_unlock();
>
> goto out;
> }
> + rcu_read_unlock();
>
> raw_spin_unlock(&this_rq->lock);
>
> --
> 2.20.1
>
>
> From c45d13c6156c3cdc340ef3ba523b8750642a9c50 Mon Sep 17 00:00:00 2001
> Message-Id: <c45d13c6156c3cdc340ef3ba523b8750642a9c50.1620677280.git.tim.c.chen@linux.intel.com>
> In-Reply-To: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
> References: <848eb8f45b53b45cacf70022c98f632daabefe77.1620677280.git.tim.c.chen@linux.intel.com>
> From: Tim Chen <tim.c.chen@linux.intel.com>
> Date: Fri, 7 May 2021 14:54:54 -0700
> Subject: [PATCH 3/3] sched: Rate limit load balance in newidle_balance()
>
> Currently newidle_balance() could do load balancing even if the cpu is not
> due for a load balance. Make newidle_balance() check the next_balance
> time on the cpu's runqueue so it defers load balancing when the balance
> is not yet due.
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> kernel/sched/fair.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f828b75488a0..8e00e1fdd6e0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10617,6 +10617,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> !READ_ONCE(this_rq->rd->overload) ||
> + time_before(jiffies, this_rq->next_balance) ||
> (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> if (sd)
> --
> 2.20.1
>
Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ [ In reply to ]
On 5/11/21 8:25 AM, Vincent Guittot wrote:
> Hi Tim,
>
> Sometimes, we want to set this_rq->next_balance backward compared to
> its current value. When a CPU is busy, the balance interval is
> multiplied by busy_factor which is set to 16 by default. On SMT2 sched
> domain level, it means that the interval will be 32ms when busy
> instead of 2ms. But if a busy load balance happens just before
> becoming idle, the this_rq->next_balance will be set 32ms later
> whereas it should go down to 2ms as the CPU is now idle. And this
> becomes even worse when you have 128 CPUs at die sched_domain level
> because the idle CPU will have to wait 2048ms instead of the correct
> 128ms interval.
>
>>
>> out:
>> /* Move the next balance forward */
>> - if (time_after(this_rq->next_balance, next_balance))
>> + if (time_after(next_balance, this_rq->next_balance))
>
> The current comparison is correct but next_balance should not be in the past.

I understand then that the intention is for this_rq->next_balance
to have a minimum value of jiffies+1 after the update. So
we will need

out:
/* Move the next balance forward */
+ this_rq->next_balance = max(jiffies+1, this_rq->next_balance);
if (time_after(this_rq->next_balance, next_balance))
this_rq->next_balance = next_balance;

as the computed next_balance will be at least jiffies+1 after your fix to
update_next_balance(). We still need to take care of the case where
this_rq->next_balance <= jiffies.

So combining with your suggestion on the fix to update_next_balance(),
the fix will be

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d75af1ecfb4..2dc471c1511c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9901,7 +9901,7 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)

/* used by idle balance, so cpu_busy = 0 */
interval = get_sd_balance_interval(sd, 0);
- next = sd->last_balance + interval;
+ next = max(jiffies+1, sd->last_balance + interval);

if (time_after(*next_balance, next))
*next_balance = next;
@@ -10681,6 +10681,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

out:
/* Move the next balance forward */
+ this_rq->next_balance = max(jiffies+1, this_rq->next_balance);
if (time_after(this_rq->next_balance, next_balance))
this_rq->next_balance = next_balance;


>
> update_next_balance() is only used in newidle_balance() so we could
> modify it to have:
>
> next = max(jiffies+1, next = sd->last_balance + interval)

Is the extra assignment "next = sd->last_balance + interval" needed?
This seems more straight forward:

next = max(jiffies+1, sd->last_balance + interval)

I will try to get the benchmark folks to do another run with this change.
Hopefully I'll get some bandwidth from them soon.

Thanks.

Tim
Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ [ In reply to ]
On Tue, 11 May 2021 at 19:25, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
>
>
> On 5/11/21 8:25 AM, Vincent Guittot wrote:
> > Hi Tim,
> >
> > Sometimes, we want to set this_rq->next_balance backward compared to
> > its current value. When a CPU is busy, the balance interval is
> > multiplied by busy_factor which is set to 16 by default. On SMT2 sched
> > domain level, it means that the interval will be 32ms when busy
> > instead of 2ms. But if a busy load balance happens just before
> > becoming idle, the this_rq->next_balance will be set 32ms later
> > whereas it should go down to 2ms as the CPU is now idle. And this
> > becomes even worse when you have 128 CPUs at die sched_domain level
> > because the idle CPU will have to wait 2048ms instead of the correct
> > 128ms interval.
> >
> >>
> >> out:
> >> /* Move the next balance forward */
> >> - if (time_after(this_rq->next_balance, next_balance))
> >> + if (time_after(next_balance, this_rq->next_balance))
> >
> > The current comparison is correct but next_balance should not be in the past.
>
> I understand then the intention is after the update,
> this_rq->next_balance should have a minimum value of jiffies+1. So
> we will need
>
> out:
> /* Move the next balance forward */
> + this_rq->next_balance = max(jiffies+1, this_rq->next_balance);
> if (time_after(this_rq->next_balance, next_balance))
> this_rq->next_balance = next_balance;
>
> as next_balance computed will be at least jiffies+1 after your fix to
> update_next_balance(). We still need to take care of the case when
> this_rq->next_balance <= jiffies.
>
> So combining with your suggestion on the fix to update_next_balance(),
> the fix will be
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d75af1ecfb4..2dc471c1511c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9901,7 +9901,7 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
>
> /* used by idle balance, so cpu_busy = 0 */
> interval = get_sd_balance_interval(sd, 0);
> - next = sd->last_balance + interval;
> + next = max(jiffies+1, sd->last_balance + interval);
>
> if (time_after(*next_balance, next))
> *next_balance = next;
> @@ -10681,6 +10681,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> out:
> /* Move the next balance forward */
> + this_rq->next_balance = max(jiffies+1, this_rq->next_balance);
> if (time_after(this_rq->next_balance, next_balance))
> this_rq->next_balance = next_balance;
>
>
> >
> > update_next_balance() is only used in newidle_balance() so we could
> > modify it to have:
> >
> > next = max(jiffies+1, next = sd->last_balance + interval)
>
> Is the extra assignment "next = sd->last_balance + interval" needed?

No, it's just a typo made while copy-pasting the line

> This seems more straight forward:
>
> next = max(jiffies+1, sd->last_balance + interval)
>
> I will try to get the benchmark folks to do another run with this change.
> Hopefully I'll get some bandwidth from them soon.
>
> Thanks.
>
> Tim
>
Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ [ In reply to ]
On 05/11/21 10:25, Tim Chen wrote:
> > update_next_balance() is only used in newidle_balance() so we could
> > modify it to have:
> >
> > next = max(jiffies+1, next = sd->last_balance + interval)
>
> Is the extra assignment "next = sd->last_balance + interval" needed?
> This seems more straight forward:
>
> next = max(jiffies+1, sd->last_balance + interval)

I haven't been following the whole conversation closely, but it's always
interesting when manipulating time in non time_*() functions.

Is this max() safe against wrapping?

Thanks

--
Qais Yousef
Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ [ In reply to ]
On 5/12/21 6:59 AM, Qais Yousef wrote:
> On 05/11/21 10:25, Tim Chen wrote:
>>> update_next_balance() is only used in newidle_balance() so we could
>>> modify it to have:
>>>
>>> next = max(jiffies+1, next = sd->last_balance + interval)
>>
>> Is the extra assignment "next = sd->last_balance + interval" needed?
>> This seems more straight forward:
>>
>> next = max(jiffies+1, sd->last_balance + interval)
>
> I haven't been following the whole conversation closely, but it's always
> interesting when manipulating time in non time_*() functions.
>
> Is this max() safe against wrapping?

Looking at the definition, it seems max() doesn't take care of wrapping.
#define max(a, b) \
({ \
typeof(a) __a = (a); \
typeof(b) __b = (b); \
MINMAX_ASSERT_COMPATIBLE(typeof(__a), typeof(__b)); \
__a > __b ? __a : __b; \
})


Probably need to do
next = time_after(jiffies+1, sd->last_balance + interval) ? jiffies+1 : sd->last_balance + interval;
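
As a quick sanity check of that ternary, here is a minimal userspace sketch
(not kernel code; the values are made up) showing why a plain max() picks the
wrong operand once the counter wraps, while a time_after()-style comparison
does not:

#include <stdio.h>
#include <limits.h>

/* same idea as the kernel's time_after(a, b): true if a is later than b */
#define time_after(a, b)  ((long)((b) - (a)) < 0)

int main(void)
{
	unsigned long jiffies = ULONG_MAX - 2;	/* counter about to wrap */
	unsigned long next = jiffies + 10;	/* 10 ticks later, wraps to 7 */

	/* plain max() picks the numerically larger value... */
	unsigned long plain = (jiffies + 1 > next) ? jiffies + 1 : next;
	/* ...while the time_after()-based pick follows time order */
	unsigned long wrapsafe = time_after(jiffies + 1, next) ? jiffies + 1 : next;

	printf("jiffies+1 = %lu, next = %lu\n", jiffies + 1, next);
	printf("plain max picks  %lu (the earlier time once wrapped)\n", plain);
	printf("time_after picks %lu (still in the future, as intended)\n", wrapsafe);
	return 0;
}

Across the wrap, the numerically larger value is the earlier time, which is
exactly the case max() gets wrong.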

Tim
Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ [ In reply to ]
On 05/13/21 11:45, Tim Chen wrote:
>
>
> On 5/12/21 6:59 AM, Qais Yousef wrote:
> > On 05/11/21 10:25, Tim Chen wrote:
> >>> update_next_balance() is only used in newidle_balance() so we could
> >>> modify it to have:
> >>>
> >>> next = max(jiffies+1, next = sd->last_balance + interval)
> >>
> >> Is the extra assignment "next = sd->last_balance + interval" needed?
> >> This seems more straight forward:
> >>
> >> next = max(jiffies+1, sd->last_balance + interval)
> >
> > I haven't been following the whole conversation closely, but it's always
> > interesting when manipulating time in non time_*() functions.
> >
> > Is this max() safe against wrapping?
>
> Looking at the definition, seems like max doesn't take care of wrapping.
> #define max(a, b) \
> ({ \
> typeof(a) __a = (a); \
> typeof(b) __b = (b); \
> MINMAX_ASSERT_COMPATIBLE(typeof(__a), typeof(__b)); \
> __a > __b ? __a : __b; \
> })
>
>
> Probably need to do
> next = time_after(jiffies+1, sd->last_balance + interval) ? jiffies+1 : sd->last_balance + interval;

Yep, that's what I thought it should look like. There's a small chance jiffies
would have changed between the 2 reads though. I can't see how this would cause
a problem, so we should be fine.
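
If one wanted to rule even that out, jiffies could be sampled once into a
local before the comparison; a purely illustrative variant of the line above,
nothing more:

	unsigned long now = jiffies;	/* sample jiffies once */

	next = time_after(now + 1, sd->last_balance + interval) ?
	       now + 1 : sd->last_balance + interval;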

Would it be more useful (and readable) to have time_min()/time_max() wrappers?
This type of usage is rare but it'll help to have a common way to handle this
scenario.

Naming might get controversial though :-); time_earliest()/time_latest() could
be another option.

The wrapper is nice to have for me, so feel free to ignore the suggestion.
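
For what it's worth, a hypothetical sketch of such wrappers (the names are
made up here; nothing like this is assumed to exist in the tree):

#define time_max(a, b)	(time_after((a), (b)) ? (a) : (b))
#define time_min(a, b)	(time_after((a), (b)) ? (b) : (a))

The clamp discussed above would then simply read
next = time_max(jiffies + 1, sd->last_balance + interval).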

Thanks

--
Qais Yousef
Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ [ In reply to ]
On 5/12/21 6:59 AM, Qais Yousef wrote:
> On 05/11/21 10:25, Tim Chen wrote:
>>> update_next_balance() is only used in newidle_balance() so we could
>>> modify it to have:
>>>
>>> next = max(jiffies+1, next = sd->last_balance + interval)
>>
>> Is the extra assignment "next = sd->last_balance + interval" needed?
>> This seems more straight forward:
>>
>> next = max(jiffies+1, sd->last_balance + interval)
>
> I haven't been following the whole conversation closely, but it's always
> interesting when manipulating time in non time_*() functions.
>
> Is this max() safe against wrapping?
>

Vincent,

Sorry I haven't gotten back to you sooner. I finally was able to get some test
time on the test system. The fix corrects the next_balance time
going backwards, but the frequency of balancing still remains the same,
so we don't see a performance improvement.

I incorporated Qais' suggestion to take care of the time wrap-around
(see patch 1) in the patches below. This patch by itself prevented
the next_balance from going backward. However, most of the time the
next_balance occurs immediately in the next jiffie after newidle_balance
occurred, and we still have the same amount of load balancing as the vanilla
kernel on the OLTP workload I was looking at. I didn't see a performance
improvement with just patch 1 and patch 2.

The current logic is that when a CPU becomes idle, its next_balance occurs
very shortly (usually in the next jiffie), since get_sd_balance_interval
returns the next jiffie for an idle CPU. However, in reality I saw most
CPUs are 95% busy on average for my workload, and a task will wake up on
an idle CPU shortly. So frequent idle balancing towards briefly idle CPUs
is counterproductive: it simply increases overhead and does not improve
performance.

I tried a patch (patch 3) in addition to the other patches. It improved
performance by 5%, which is quite significant for my OLTP workload.
The patch considers a CPU busy when its average utilization is more
than 80% when determining the next_balance interval. This tweak may
not be ideal for the case where a CPU becomes idle after a CPU-intensive
task that dominated it for a long time blocks for a while.

Hopefully we can find a way to make a good judgement on whether we have
a mostly busy CPU that becomes idle with a task likely to wake up on
it soon. In such a case, we should push out the next balance time. Such
logic is absent today in the idle load balance path, and this frequent
load balancing hurts performance when cgroup is turned on, because
computing update_blocked_averages() before load balancing becomes
expensive. For my OLTP workload, we lose 9% of performance when cgroup
is turned on.

Tim


----

From 2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3 Mon Sep 17 00:00:00 2001
Message-Id: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Tue, 11 May 2021 09:55:41 -0700
Subject: [PATCH 1/3] sched: Fix rq->next_balance time updated to
earlier than current time

In traces on newidle_balance(), this_rq->next_balance
time goes backward and earlier than current time jiffies, e.g.

11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
14.637 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
14.666 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c

It doesn't make sense to have a next_balance in the past.
Fix newidle_balance() and update_next_balance() so the next
balance time is at least jiffies+1.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/fair.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d75af1ecfb4..740a0572cbf1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9901,7 +9901,10 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)

/* used by idle balance, so cpu_busy = 0 */
interval = get_sd_balance_interval(sd, 0);
- next = sd->last_balance + interval;
+ if (time_after(jiffies+1, sd->last_balance + interval))
+ next = jiffies+1;
+ else
+ next = sd->last_balance + interval;

if (time_after(*next_balance, next))
*next_balance = next;
@@ -10681,6 +10684,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)

out:
/* Move the next balance forward */
+ if (time_after(jiffies+1, this_rq->next_balance))
+ this_rq->next_balance = jiffies+1;
if (time_after(this_rq->next_balance, next_balance))
this_rq->next_balance = next_balance;

--
2.20.1


From 59de98515bda38b8d6faec5f8c68e1c9ec18962e Mon Sep 17 00:00:00 2001
Message-Id: <59de98515bda38b8d6faec5f8c68e1c9ec18962e.1623433293.git.tim.c.chen@linux.intel.com>
In-Reply-To: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
References: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
From: Vincent Guittot <vincent.guittot@linaro.org>
Date: Fri, 7 May 2021 14:38:10 -0700
Subject: [PATCH 2/3] sched: Skip update_blocked_averages if we are deferring
load balance

In newidle_balance(), the scheduler skips load balance to the new idle cpu when the first sched domain (sd) of this_rq satisfies

this_rq->avg_idle < sd->max_newidle_lb_cost

Doing a costly call to update_blocked_averages() will
not be useful and simply adds overhead when this condition is true.

Check the condition early in newidle_balance() to skip update_blocked_averages()
when possible.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/fair.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 740a0572cbf1..a69bfc651e55 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10615,17 +10615,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
*/
rq_unpin_lock(this_rq, rf);

+ rcu_read_lock();
+ sd = rcu_dereference_check_sched_domain(this_rq->sd);
+
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
- !READ_ONCE(this_rq->rd->overload)) {
+ !READ_ONCE(this_rq->rd->overload) ||
+ (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

- rcu_read_lock();
- sd = rcu_dereference_check_sched_domain(this_rq->sd);
if (sd)
update_next_balance(sd, &next_balance);
rcu_read_unlock();

goto out;
}
+ rcu_read_unlock();

raw_spin_unlock(&this_rq->lock);

--
2.20.1


From 4622055d989a5feb446a7893a48fcd31305ec7a7 Mon Sep 17 00:00:00 2001
Message-Id: <4622055d989a5feb446a7893a48fcd31305ec7a7.1623433293.git.tim.c.chen@linux.intel.com>
In-Reply-To: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
References: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
From: Tim Chen <tim.c.chen@linux.intel.com>
Date: Mon, 24 May 2021 13:21:03 -0700
Subject: [PATCH 3/3] sched: Don't shorten the load balance interval of an 80%
or more busy CPU

For a CPU that's busy 80% or more on average, it is quite likely that a task
will wake up on it very soon. It is better not to shorten the load
balance interval as if the CPU were completely idle, to save on the load
balancing overhead.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/fair.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a69bfc651e55..7353395d8a3a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9895,12 +9895,11 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
}

static inline void
-update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
+update_next_balance(struct sched_domain *sd, unsigned long *next_balance, int cpu_busy)
{
unsigned long interval, next;

- /* used by idle balance, so cpu_busy = 0 */
- interval = get_sd_balance_interval(sd, 0);
+ interval = get_sd_balance_interval(sd, cpu_busy);
if (time_after(jiffies+1, sd->last_balance + interval))
next = jiffies+1;
else
@@ -10593,6 +10592,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
struct sched_domain *sd;
int pulled_task = 0;
u64 curr_cost = 0;
+ int cpu_busy = 0;

update_misfit_status(NULL, this_rq);
/*
@@ -10618,12 +10618,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
rcu_read_lock();
sd = rcu_dereference_check_sched_domain(this_rq->sd);

+ /*
+ * Consider the cpu busy if its average utilization is above 80%.
+ * Don't idle balance such a cpu as frequently, as a task may wake up on it soon.
+ */
+ if ((cpu_util(this_cpu) * 10 > capacity_orig_of(this_cpu) * 8))
+ cpu_busy = 1;
+
if (this_rq->avg_idle < sysctl_sched_migration_cost ||
!READ_ONCE(this_rq->rd->overload) ||
(sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {

if (sd)
- update_next_balance(sd, &next_balance);
+ update_next_balance(sd, &next_balance, cpu_busy);
+
rcu_read_unlock();

goto out;
@@ -10639,7 +10647,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
u64 t0, domain_cost;

if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
- update_next_balance(sd, &next_balance);
+ update_next_balance(sd, &next_balance, cpu_busy);
break;
}

@@ -10657,7 +10665,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
curr_cost += domain_cost;
}

- update_next_balance(sd, &next_balance);
+ update_next_balance(sd, &next_balance, cpu_busy);

/*
* Stop searching for tasks to pull if there are
--
2.20.1
Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ [ In reply to ]
On Fri, 11 Jun 2021 at 22:00, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
>
>
> On 5/12/21 6:59 AM, Qais Yousef wrote:
> > On 05/11/21 10:25, Tim Chen wrote:
> >>> update_next_balance() is only used in newidle_balance() so we could
> >>> modify it to have:
> >>>
> >>> next = max(jiffies+1, next = sd->last_balance + interval)
> >>
> >> Is the extra assignment "next = sd->last_balance + interval" needed?
> >> This seems more straight forward:
> >>
> >> next = max(jiffies+1, sd->last_balance + interval)
> >
> > I haven't been following the whole conversation closely, but it's always
> > interesting when manipulating time in non time_*() functions.
> >
> > Is this max() safe against wrapping?
> >
>
> Vincent,
>
> Sorry I haven't got back sooner. I finally was able to get some test
> time on the test system. The fix works to correct the next balance time
> going backwards but the frequency of balancing still remains the same,
> so we don't see performance improvement.
>
> I incorporated Qais' suggestion to take care of the wrap around time
> (see patch #1) in patches below. This patch by itself prevented
> the next_balance from going backward. However, most of the time the
> next_balance occurs immediately in the next jiffie after newidle_balance

Which is not really surprising as we don't want to keep a CPU idle if
another one is overloaded.

> occurred and we still have the same amount of load balancing as the vanilla
> kernel on the OLTP workload I was looking at. I didn't see performance
> improvement with just patch#1 and patch#2.
>
> The current logic is when a CPU becomes idle, next_balance occur very
> shortly (usually in the next jiffie) as get_sd_balance_interval returns
> the next_balance in the next jiffie if the CPU is idle. However, in
> reality, I saw most CPUs are 95% busy on average for my workload and
> a task will wake up on an idle CPU shortly. So having frequent idle
> balancing towards shortly idle CPUs is counter productive and simply
> increase overhead and does not improve performance.

Just to make sure that I understand your problem correctly: Your problem is:
- that we have an ilb happening on the idle CPU and consuming cycles
- or that the ilb will pull a task onto an idle CPU on which another task
will shortly wake up, which ends up with 2 tasks competing for the same CPU.

>
> I tried a patch (patch 3) in addition to the other patches. It improved
> performance by 5%, which is quite significant for my OLTP workload.
> The patch considers a CPU busy when average its utilization is more
> than 80% when determining the next_balance interval. This tweak may
> not be ideal for the case when CPU becomes idle after a CPU intensive
> task dominates a CPU for a long time and will block for a while.
>
> Hopefully we can find a way to make good judgement on whether we have
> a mostly busy CPU that becomes idle, and a task likely to wake up on
> it soon. For such case, we should push out the next balance time. Such
> logic is absent today in the idle load balance path. And such frequent
> load balancing hurt performance when cgroup is turned on. Computing
> update_blocked_averages before load balance becomes expensive. For my
> OLTP workload, we lose 9% of performance when cgroup is turned on.
>
> Tim
>
>
> ----
>
> From 2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3 Mon Sep 17 00:00:00 2001
> Message-Id: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
> From: Tim Chen <tim.c.chen@linux.intel.com>
> Date: Tue, 11 May 2021 09:55:41 -0700
> Subject: [PATCH 1/3] sched: Fix rq->next_balance time updated to
> earlier than current time
>
> In traces on newidle_balance(), this_rq->next_balance
> time goes backward and earlier than current time jiffies, e.g.
>
> 11.602 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb739
> 11.624 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb739
> 13.856 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73b
> 13.910 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73b
> 14.637 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb76c jiffies=0x1004fb73c
> 14.666 ( ): probe:newidle_balance:(ffffffff810d2470) this_rq=0xffff88fe7f8aae00 next_balance=0x1004fb731 jiffies=0x1004fb73c
>
> It doesn't make sense to have a next_balance in the past.
> Fix newidle_balance() and update_next_balance() so the next
> balance time is at least jiffies+1.
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> kernel/sched/fair.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d75af1ecfb4..740a0572cbf1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9901,7 +9901,10 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
>
> /* used by idle balance, so cpu_busy = 0 */
> interval = get_sd_balance_interval(sd, 0);
> - next = sd->last_balance + interval;
> + if (time_after(jiffies+1, sd->last_balance + interval))
> + next = jiffies+1;
> + else
> + next = sd->last_balance + interval;
>
> if (time_after(*next_balance, next))
> *next_balance = next;
> @@ -10681,6 +10684,8 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>
> out:
> /* Move the next balance forward */
> + if (time_after(jiffies+1, this_rq->next_balance))
> + this_rq->next_balance = jiffies+1;
> if (time_after(this_rq->next_balance, next_balance))
> this_rq->next_balance = next_balance;
>
> --
> 2.20.1
>
>
> From 59de98515bda38b8d6faec5f8c68e1c9ec18962e Mon Sep 17 00:00:00 2001
> Message-Id: <59de98515bda38b8d6faec5f8c68e1c9ec18962e.1623433293.git.tim.c.chen@linux.intel.com>
> In-Reply-To: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
> References: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
> From: Vincent Guittot <vincent.guittot@linaro.org>
> Date: Fri, 7 May 2021 14:38:10 -0700
> Subject: [PATCH 2/3] sched: Skip update_blocked_averages if we are deferring
> load balance
>
> In newidle_balance(), the scheduler skips load balance to the new idle cpu when the first sched domain (sd) of this_rq satisfies
>
> this_rq->avg_idle < sd->max_newidle_lb_cost
>
> Doing a costly call to update_blocked_averages() will
> not be useful and simply adds overhead when this condition is true.
>
> Check the condition early in newidle_balance() to skip update_blocked_averages()
> when possible.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> kernel/sched/fair.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 740a0572cbf1..a69bfc651e55 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10615,17 +10615,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> */
> rq_unpin_lock(this_rq, rf);
>
> + rcu_read_lock();
> + sd = rcu_dereference_check_sched_domain(this_rq->sd);
> +
> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> - !READ_ONCE(this_rq->rd->overload)) {
> + !READ_ONCE(this_rq->rd->overload) ||
> + (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> - rcu_read_lock();
> - sd = rcu_dereference_check_sched_domain(this_rq->sd);
> if (sd)
> update_next_balance(sd, &next_balance);
> rcu_read_unlock();
>
> goto out;
> }
> + rcu_read_unlock();
>
> raw_spin_unlock(&this_rq->lock);
>
> --
> 2.20.1
>
>
> From 4622055d989a5feb446a7893a48fcd31305ec7a7 Mon Sep 17 00:00:00 2001
> Message-Id: <4622055d989a5feb446a7893a48fcd31305ec7a7.1623433293.git.tim.c.chen@linux.intel.com>
> In-Reply-To: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
> References: <2a5ebdeabbfdf4584532ef0e27d37ed75ca7dbd3.1623433293.git.tim.c.chen@linux.intel.com>
> From: Tim Chen <tim.c.chen@linux.intel.com>
> Date: Mon, 24 May 2021 13:21:03 -0700
> Subject: [PATCH 3/3] sched: Don't shorten the load balance interval of an 80%
> or more busy CPU
>
> For a CPU that's busy 80% or more on average, it is quite likely that a task
> will wake up on it very soon. It is better not to shorten the load
> balance interval as if the CPU were completely idle, to save on the load
> balancing overhead.
>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> kernel/sched/fair.c | 20 ++++++++++++++------
> 1 file changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a69bfc651e55..7353395d8a3a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9895,12 +9895,11 @@ get_sd_balance_interval(struct sched_domain *sd, int cpu_busy)
> }
>
> static inline void
> -update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
> +update_next_balance(struct sched_domain *sd, unsigned long *next_balance, int cpu_busy)
> {
> unsigned long interval, next;
>
> - /* used by idle balance, so cpu_busy = 0 */
> - interval = get_sd_balance_interval(sd, 0);
> + interval = get_sd_balance_interval(sd, cpu_busy);
> if (time_after(jiffies+1, sd->last_balance + interval))
> next = jiffies+1;
> else
> @@ -10593,6 +10592,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> struct sched_domain *sd;
> int pulled_task = 0;
> u64 curr_cost = 0;
> + int cpu_busy = 0;
>
> update_misfit_status(NULL, this_rq);
> /*
> @@ -10618,12 +10618,20 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> rcu_read_lock();
> sd = rcu_dereference_check_sched_domain(this_rq->sd);
>
> + /*
> + * Consider the cpu busy if its average utilization is above 80%.
> + * Don't idle balance such a cpu as frequently, as a task may wake up on it soon.
> + */
> + if ((cpu_util(this_cpu) * 10 > capacity_orig_of(this_cpu) * 8))
> + cpu_busy = 1;
> +
> if (this_rq->avg_idle < sysctl_sched_migration_cost ||
> !READ_ONCE(this_rq->rd->overload) ||
> (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
>
> if (sd)
> - update_next_balance(sd, &next_balance);
> + update_next_balance(sd, &next_balance, cpu_busy);
> +
> rcu_read_unlock();
>
> goto out;
> @@ -10639,7 +10647,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> u64 t0, domain_cost;
>
> if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
> - update_next_balance(sd, &next_balance);
> + update_next_balance(sd, &next_balance, cpu_busy);
> break;
> }
>
> @@ -10657,7 +10665,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> curr_cost += domain_cost;
> }
>
> - update_next_balance(sd, &next_balance);
> + update_next_balance(sd, &next_balance, cpu_busy);
>
> /*
> * Stop searching for tasks to pull if there are
> --
> 2.20.1
>
Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ [ In reply to ]
On 6/18/21 3:28 AM, Vincent Guittot wrote:

>>
>> The current logic is when a CPU becomes idle, next_balance occur very
>> shortly (usually in the next jiffie) as get_sd_balance_interval returns
>> the next_balance in the next jiffie if the CPU is idle. However, in
>> reality, I saw most CPUs are 95% busy on average for my workload and
>> a task will wake up on an idle CPU shortly. So having frequent idle
>> balancing towards shortly idle CPUs is counter productive and simply
>> increase overhead and does not improve performance.
>
> Just to make sure that I understand your problem correctly: Your problem is:
> - that we have an ilb happening on the idle CPU and consume cycle

That's right. The cycles are consumed heavily in update_blocked_averages()
when cgroup is enabled.

> - or that the ilb will pull a task on an idle CPU on which a task will
> shortly wakeup which ends to 2 tasks competing for the same CPU.
>

For the OLTP workload I'm looking at, we have tasks that sleep
for a short while and wake again very shortly (i.e. the CPU is actually
~95% busy on average), so pulling tasks to such a CPU is really not
helpful for improving overall CPU utilization in the system. So my
intuition is that for such an almost fully busy CPU, we should defer
load balancing on it (see prototype patch 3).

Tim
Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ [ In reply to ]
On Fri, 18 Jun 2021 at 18:14, Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
>
>
> On 6/18/21 3:28 AM, Vincent Guittot wrote:
>
> >>
> >> The current logic is when a CPU becomes idle, next_balance occur very
> >> shortly (usually in the next jiffie) as get_sd_balance_interval returns
> >> the next_balance in the next jiffie if the CPU is idle. However, in
> >> reality, I saw most CPUs are 95% busy on average for my workload and
> >> a task will wake up on an idle CPU shortly. So having frequent idle
> >> balancing towards shortly idle CPUs is counter productive and simply
> >> increase overhead and does not improve performance.
> >
> > Just to make sure that I understand your problem correctly: Your problem is:
> > - that we have an ilb happening on the idle CPU and consume cycle
>
> That's right. The cycles are consumed heavily in update_blocked_averages()
> when cgroup is enabled.

But they are normally consumed on an idle CPU and the ILB checks
need_resched() before running load balance for the next idle CPU.

Does that mean your problem is coming from update_blocked_averages()
spending a long time under rq_lock_irqsave and increasing the wakeup
latency of your short-running tasks?

>
> > - or that the ilb will pull a task on an idle CPU on which a task will
> > shortly wakeup which ends to 2 tasks competing for the same CPU.
> >
>
> Because for the OLTP workload I'm looking at, we have tasks that sleep
> for a short while and wake again very shortly (i.e. the CPU actually
> is ~95% busy on average), pulling tasks to such a CPU is really not
> helpful to improve overall CPU utilization in the system. So my
> intuition is for such almost fully busy CPU, we should defer load
> balancing to it (see prototype patch 3).

Note that this is the opposite of what you said earlier:
"
Though in our test environment, sysctl_sched_migration_cost was kept
much lower (25000) compared to the default (500000), to encourage
migrations to idle cpu
and reduce latency.
"

But it will be quite hard to find a value that fits everybody's
requirements, and some will have use cases for which they want to pull
tasks even if the CPU is 95% busy; you can have 2ms of idle time while
utilization is above 95%, and an ILB inside a core or at the LLC level is
somewhat cheap and would take advantage of those 2ms.

>
> Tim
>
>
>
>
