Mailing List Archive: [PATCH 2 of 6 v2] xen: sched_credit: improve tickling of idle CPUs

Right now, when a VCPU wakes-up, we check whether it should preempt
what is running on the PCPU, and whether or not the waking VCPU can
be migrated (by tickling some idlers). However, this can result in
suboptimal or even wrong behaviour, as explained here:

http://lists.xen.org/archives/html/xen-devel/2012-10/msg01732.html

This change, instead, when deciding which PCPU(s) to tickle, upon
VCPU wake-up, considers both what it is likely to happen on the PCPU
where the wakeup occurs,and whether or not there are idlers where
the woken-up VCPU can run. In fact, if there are, we can avoid
interrupting the running VCPU. Only in case there aren't any of
these PCPUs, preemption and migration are the way to go.

This has been tested (on top of the previous change) by running
the following benchmarks inside 2, 6 and 10 VMs, concurrently, on
a shared host, each with 2 VCPUs and 960 MB of memory (host had 16
ways and 12 GB RAM).

1) All VMs had 'cpus="all"' in their config file.

$ sysbench --test=cpu ... (time, lower is better)
| VMs | w/o this change | w/ this change |
| 2 | 50.078467 +/- 1.6676162 | 49.673667 +/- 0.0094321 |
| 6 | 63.259472 +/- 0.1137586 | 61.680011 +/- 1.0208723 |
| 10 | 91.246797 +/- 0.1154008 | 90.396720 +/- 1.5900423 |
$ sysbench --test=memory ... (throughput, higher is better)
| VMs | w/o this change | w/ this change |
| 2 | 485.56333 +/- 6.0527356 | 487.83167 +/- 0.7602850 |
| 6 | 401.36278 +/- 1.9745916 | 409.96778 +/- 3.6761092 |
| 10 | 294.43933 +/- 0.8064945 | 302.49033 +/- 0.2343978 |
$ specjbb2005 ... (throughput, higher is better)
| VMs | w/o this change | w/ this change |
| 2 | 43150.63 +/- 1359.5616 | 43275.427 +/- 606.28185 |
| 6 | 29274.29 +/- 1024.4042 | 29716.189 +/- 1290.1878 |
| 10 | 19061.28 +/- 512.88561 | 19192.599 +/- 605.66058 |

2) All VMs had their VCPUs statically pinned to the host's PCPUs.

$ sysbench --test=cpu ... (time, lower is better)
| VMs | w/o this change | w/ this change |
| 2 | 47.8211 +/- 0.0215504 | 47.826900 +/- 0.0077872 |
| 6 | 62.689122 +/- 0.0877173 | 62.764539 +/- 0.3882493 |
| 10 | 90.321097 +/- 1.4803867 | 89.974570 +/- 1.1437566 |
$ sysbench --test=memory ... (throughput, higher is better)
| VMs | w/o this change | w/ this change |
| 2 | 550.97667 +/- 2.3512355 | 550.87000 +/- 0.8140792 |
| 6 | 443.15000 +/- 5.7471797 | 454.01056 +/- 8.4373466 |
| 10 | 313.89233 +/- 1.3237493 | 321.81167 +/- 0.3528418 |
$ specjbb2005 ... (throughput, higher is better)
| 2 | 49591.057 +/- 952.93384 | 49594.195 +/- 799.57976 |
| 6 | 33538.247 +/- 1089.2115 | 33671.758 +/- 1077.6806 |
| 10 | 21927.870 +/- 831.88742 | 21891.131 +/- 563.37929 |

Numbers show how the change has either no or very limited impact
(specjbb2005 case) or, when it does have some impact, that is a
real improvement in performances (sysbench-memory case).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v1:
* Rewritten as per George's suggestion, in order to improve readability.
* Killed some of the stats, with the only exception of `tickle_idlers_none'
and `tickle_idlers_some'. They don't make things look that terrible and
I think they could be useful.
* The preemption+migration of the currently running VCPU has been turned
into a migration request, instead than just tickling. I traced this
thing some more, and it looks like that is the way to go. Tickling is
not effective here, because the woken PCPU would expect cur to be out
of the scheduler tail, which is likely false (cur->is_running is
still set to 1).

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -133,6 +133,7 @@ struct csched_vcpu {
uint32_t state_idle;
uint32_t migrate_q;
uint32_t migrate_r;
+ uint32_t kicked_away;
} stats;
#endif
};
@@ -251,54 +252,67 @@ static inline void
struct csched_vcpu * const cur =
CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
- cpumask_t mask;
+ cpumask_t mask, idle_mask;
+ int idlers_empty;

ASSERT(cur);
cpumask_clear(&mask);

- /* If strictly higher priority than current VCPU, signal the CPU */
- if ( new->pri > cur->pri )
+ idlers_empty = cpumask_empty(prv->idlers);
+
+ /*
+ * If the pcpu is idle, or there are no idlers and the new
+ * vcpu is a higher priority than the old vcpu, run it here.
+ *
+ * If there are idle cpus, first try to find one suitable to run
+ * new, so we can avoid preempting cur. If we cannot find a
+ * suitable idler on which to run new, run it here, but try to
+ * find a suitable idler on which to run cur instead.
+ */
+ if ( cur->pri == CSCHED_PRI_IDLE
+ || (idlers_empty && new->pri > cur->pri) )
{
- if ( cur->pri == CSCHED_PRI_IDLE )
- SCHED_STAT_CRANK(tickle_local_idler);
- else if ( cur->pri == CSCHED_PRI_TS_OVER )
- SCHED_STAT_CRANK(tickle_local_over);
- else if ( cur->pri == CSCHED_PRI_TS_UNDER )
- SCHED_STAT_CRANK(tickle_local_under);
- else
- SCHED_STAT_CRANK(tickle_local_other);
-
+ if ( cur->pri != CSCHED_PRI_IDLE )
+ SCHED_STAT_CRANK(tickle_idlers_none);
cpumask_set_cpu(cpu, &mask);
}
+ else if ( !idlers_empty )
+ {
+ /* Check whether or not there are idlers that can run new */
+ cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);

- /*
- * If this CPU has at least two runnable VCPUs, we tickle any idlers to
- * let them know there is runnable work in the system...
- */
- if ( cur->pri > CSCHED_PRI_IDLE )
- {
- if ( cpumask_empty(prv->idlers) )
+ /*
+ * If there are no suitable idlers for new, and it's higher
+ * priority than cur, ask the scheduler to migrate cur away.
+ * We have to act like this (instead of just waking some of
+ * the idlers suitable for cur) because cur is running.
+ *
+ * If there are suitable idlers for new, no matter priorities,
+ * leave cur alone (as it is running and is, likely, cache-hot)
+ * and wake some of them (which is waking up and so is, likely,
+ * cache cold anyway).
+ */
+ if ( cpumask_empty(&idle_mask) && new->pri > cur->pri )
{
SCHED_STAT_CRANK(tickle_idlers_none);
+ SCHED_VCPU_STAT_CRANK(cur, kicked_away);
+ SCHED_VCPU_STAT_CRANK(cur, migrate_r);
+ SCHED_STAT_CRANK(migrate_kicked_away);
+ set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
+ cpumask_set_cpu(cpu, &mask);
}
- else
+ else if ( !cpumask_empty(&idle_mask) )
{
- cpumask_t idle_mask;
-
- cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
- if ( !cpumask_empty(&idle_mask) )
+ /* Which of the idlers suitable for new shall we wake up? */
+ SCHED_STAT_CRANK(tickle_idlers_some);
+ if ( opt_tickle_one_idle )
{
- SCHED_STAT_CRANK(tickle_idlers_some);
- if ( opt_tickle_one_idle )
- {
- this_cpu(last_tickle_cpu) =
- cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
- cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
- }
- else
- cpumask_or(&mask, &mask, &idle_mask);
+ this_cpu(last_tickle_cpu) =
+ cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
+ cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
}
- cpumask_and(&mask, &mask, new->vcpu->cpu_affinity);
+ else
+ cpumask_or(&mask, &mask, &idle_mask);
}
}

@@ -1456,13 +1470,14 @@ csched_dump_vcpu(struct csched_vcpu *svc
{
printk(" credit=%i [w=%u]", atomic_read(&svc->credit), sdom->weight);
#ifdef CSCHED_STATS
- printk(" (%d+%u) {a/i=%u/%u m=%u+%u}",
+ printk(" (%d+%u) {a/i=%u/%u m=%u+%u (k=%u)}",
svc->stats.credit_last,
svc->stats.credit_incr,
svc->stats.state_active,
svc->stats.state_idle,
svc->stats.migrate_q,
- svc->stats.migrate_r);
+ svc->stats.migrate_r,
+ svc->stats.kicked_away);
#endif
}

diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h
--- a/xen/include/xen/perfc_defn.h
+++ b/xen/include/xen/perfc_defn.h
@@ -39,10 +39,6 @@ PERFCOUNTER(vcpu_wake_runnable, "csc
PERFCOUNTER(vcpu_wake_not_runnable, "csched: vcpu_wake_not_runnable")
PERFCOUNTER(vcpu_park, "csched: vcpu_park")
PERFCOUNTER(vcpu_unpark, "csched: vcpu_unpark")
-PERFCOUNTER(tickle_local_idler, "csched: tickle_local_idler")
-PERFCOUNTER(tickle_local_over, "csched: tickle_local_over")
-PERFCOUNTER(tickle_local_under, "csched: tickle_local_under")
-PERFCOUNTER(tickle_local_other, "csched: tickle_local_other")
PERFCOUNTER(tickle_idlers_none, "csched: tickle_idlers_none")
PERFCOUNTER(tickle_idlers_some, "csched: tickle_idlers_some")
PERFCOUNTER(load_balance_idle, "csched: load_balance_idle")
@@ -52,6 +48,7 @@ PERFCOUNTER(steal_trylock_failed, "csc
PERFCOUNTER(steal_peer_idle, "csched: steal_peer_idle")
PERFCOUNTER(migrate_queued, "csched: migrate_queued")
PERFCOUNTER(migrate_running, "csched: migrate_running")
+PERFCOUNTER(migrate_kicked_away, "csched: migrate_kicked_away")
PERFCOUNTER(vcpu_hot, "csched: vcpu_hot")

PERFCOUNTER(need_flush_tlb_flush, "PG_need_flush tlb flushes")

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

On 12/12/12 02:52, Dario Faggioli wrote:
> Right now, when a VCPU wakes-up, we check whether it should preempt
> what is running on the PCPU, and whether or not the waking VCPU can
> be migrated (by tickling some idlers). However, this can result in
> suboptimal or even wrong behaviour, as explained here:
>
> http://lists.xen.org/archives/html/xen-devel/2012-10/msg01732.html
>
> This change, instead, when deciding which PCPU(s) to tickle, upon
> VCPU wake-up, considers both what it is likely to happen on the PCPU
> where the wakeup occurs,and whether or not there are idlers where
> the woken-up VCPU can run. In fact, if there are, we can avoid
> interrupting the running VCPU. Only in case there aren't any of
> these PCPUs, preemption and migration are the way to go.
>
> This has been tested (on top of the previous change) by running
> the following benchmarks inside 2, 6 and 10 VMs, concurrently, on
> a shared host, each with 2 VCPUs and 960 MB of memory (host had 16
> ways and 12 GB RAM).
>
> 1) All VMs had 'cpus="all"' in their config file.
>
> $ sysbench --test=cpu ... (time, lower is better)
> | VMs | w/o this change | w/ this change |
> | 2 | 50.078467 +/- 1.6676162 | 49.673667 +/- 0.0094321 |
> | 6 | 63.259472 +/- 0.1137586 | 61.680011 +/- 1.0208723 |
> | 10 | 91.246797 +/- 0.1154008 | 90.396720 +/- 1.5900423 |
> $ sysbench --test=memory ... (throughput, higher is better)
> | VMs | w/o this change | w/ this change |
> | 2 | 485.56333 +/- 6.0527356 | 487.83167 +/- 0.7602850 |
> | 6 | 401.36278 +/- 1.9745916 | 409.96778 +/- 3.6761092 |
> | 10 | 294.43933 +/- 0.8064945 | 302.49033 +/- 0.2343978 |
> $ specjbb2005 ... (throughput, higher is better)
> | VMs | w/o this change | w/ this change |
> | 2 | 43150.63 +/- 1359.5616 | 43275.427 +/- 606.28185 |
> | 6 | 29274.29 +/- 1024.4042 | 29716.189 +/- 1290.1878 |
> | 10 | 19061.28 +/- 512.88561 | 19192.599 +/- 605.66058 |
>
>
> 2) All VMs had their VCPUs statically pinned to the host's PCPUs.
>
> $ sysbench --test=cpu ... (time, lower is better)
> | VMs | w/o this change | w/ this change |
> | 2 | 47.8211 +/- 0.0215504 | 47.826900 +/- 0.0077872 |
> | 6 | 62.689122 +/- 0.0877173 | 62.764539 +/- 0.3882493 |
> | 10 | 90.321097 +/- 1.4803867 | 89.974570 +/- 1.1437566 |
> $ sysbench --test=memory ... (throughput, higher is better)
> | VMs | w/o this change | w/ this change |
> | 2 | 550.97667 +/- 2.3512355 | 550.87000 +/- 0.8140792 |
> | 6 | 443.15000 +/- 5.7471797 | 454.01056 +/- 8.4373466 |
> | 10 | 313.89233 +/- 1.3237493 | 321.81167 +/- 0.3528418 |
> $ specjbb2005 ... (throughput, higher is better)
> | 2 | 49591.057 +/- 952.93384 | 49594.195 +/- 799.57976 |
> | 6 | 33538.247 +/- 1089.2115 | 33671.758 +/- 1077.6806 |
> | 10 | 21927.870 +/- 831.88742 | 21891.131 +/- 563.37929 |
>
>
> Numbers show how the change has either no or very limited impact
> (specjbb2005 case) or, when it does have some impact, that is a
> real improvement in performances (sysbench-memory case).
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
> Changes from v1:
> * Rewritten as per George's suggestion, in order to improve readability.
> * Killed some of the stats, with the only exception of `tickle_idlers_none'
> and `tickle_idlers_some'. They don't make things look that terrible and
> I think they could be useful.
> * The preemption+migration of the currently running VCPU has been turned
> into a migration request, instead than just tickling. I traced this
> thing some more, and it looks like that is the way to go. Tickling is
> not effective here, because the woken PCPU would expect cur to be out
> of the scheduler tail, which is likely false (cur->is_running is
> still set to 1).

Ah, right. :-)

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -133,6 +133,7 @@ struct csched_vcpu {
> uint32_t state_idle;
> uint32_t migrate_q;
> uint32_t migrate_r;
> + uint32_t kicked_away;
> } stats;
> #endif
> };
> @@ -251,54 +252,67 @@ static inline void
> struct csched_vcpu * const cur =
> CSCHED_VCPU(per_cpu(schedule_data, cpu).curr);
> struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu));
> - cpumask_t mask;
> + cpumask_t mask, idle_mask;
> + int idlers_empty;
>
> ASSERT(cur);
> cpumask_clear(&mask);
>
> - /* If strictly higher priority than current VCPU, signal the CPU */
> - if ( new->pri > cur->pri )
> + idlers_empty = cpumask_empty(prv->idlers);
> +
> + /*
> + * If the pcpu is idle, or there are no idlers and the new
> + * vcpu is a higher priority than the old vcpu, run it here.
> + *
> + * If there are idle cpus, first try to find one suitable to run
> + * new, so we can avoid preempting cur. If we cannot find a
> + * suitable idler on which to run new, run it here, but try to
> + * find a suitable idler on which to run cur instead.
> + */
> + if ( cur->pri == CSCHED_PRI_IDLE
> + || (idlers_empty && new->pri > cur->pri) )
> {
> - if ( cur->pri == CSCHED_PRI_IDLE )
> - SCHED_STAT_CRANK(tickle_local_idler);
> - else if ( cur->pri == CSCHED_PRI_TS_OVER )
> - SCHED_STAT_CRANK(tickle_local_over);
> - else if ( cur->pri == CSCHED_PRI_TS_UNDER )
> - SCHED_STAT_CRANK(tickle_local_under);
> - else
> - SCHED_STAT_CRANK(tickle_local_other);
> -
> + if ( cur->pri != CSCHED_PRI_IDLE )
> + SCHED_STAT_CRANK(tickle_idlers_none);
> cpumask_set_cpu(cpu, &mask);
> }
> + else if ( !idlers_empty )
> + {
> + /* Check whether or not there are idlers that can run new */
> + cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
>
> - /*
> - * If this CPU has at least two runnable VCPUs, we tickle any idlers to
> - * let them know there is runnable work in the system...
> - */
> - if ( cur->pri > CSCHED_PRI_IDLE )
> - {
> - if ( cpumask_empty(prv->idlers) )
> + /*
> + * If there are no suitable idlers for new, and it's higher
> + * priority than cur, ask the scheduler to migrate cur away.
> + * We have to act like this (instead of just waking some of
> + * the idlers suitable for cur) because cur is running.
> + *
> + * If there are suitable idlers for new, no matter priorities,
> + * leave cur alone (as it is running and is, likely, cache-hot)
> + * and wake some of them (which is waking up and so is, likely,
> + * cache cold anyway).
> + */
> + if ( cpumask_empty(&idle_mask) && new->pri > cur->pri )
> {
> SCHED_STAT_CRANK(tickle_idlers_none);
> + SCHED_VCPU_STAT_CRANK(cur, kicked_away);
> + SCHED_VCPU_STAT_CRANK(cur, migrate_r);
> + SCHED_STAT_CRANK(migrate_kicked_away);
> + set_bit(_VPF_migrating, &cur->vcpu->pause_flags);
> + cpumask_set_cpu(cpu, &mask);
> }
> - else
> + else if ( !cpumask_empty(&idle_mask) )
> {
> - cpumask_t idle_mask;
> -
> - cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity);
> - if ( !cpumask_empty(&idle_mask) )
> + /* Which of the idlers suitable for new shall we wake up? */
> + SCHED_STAT_CRANK(tickle_idlers_some);
> + if ( opt_tickle_one_idle )
> {
> - SCHED_STAT_CRANK(tickle_idlers_some);
> - if ( opt_tickle_one_idle )
> - {
> - this_cpu(last_tickle_cpu) =
> - cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
> - cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
> - }
> - else
> - cpumask_or(&mask, &mask, &idle_mask);
> + this_cpu(last_tickle_cpu) =
> + cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask);
> + cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
> }
> - cpumask_and(&mask, &mask, new->vcpu->cpu_affinity);
> + else
> + cpumask_or(&mask, &mask, &idle_mask);
> }
> }
>
> @@ -1456,13 +1470,14 @@ csched_dump_vcpu(struct csched_vcpu *svc
> {
> printk(" credit=%i [w=%u]", atomic_read(&svc->credit), sdom->weight);
> #ifdef CSCHED_STATS
> - printk(" (%d+%u) {a/i=%u/%u m=%u+%u}",
> + printk(" (%d+%u) {a/i=%u/%u m=%u+%u (k=%u)}",
> svc->stats.credit_last,
> svc->stats.credit_incr,
> svc->stats.state_active,
> svc->stats.state_idle,
> svc->stats.migrate_q,
> - svc->stats.migrate_r);
> + svc->stats.migrate_r,
> + svc->stats.kicked_away);
> #endif
> }
>
> diff --git a/xen/include/xen/perfc_defn.h b/xen/include/xen/perfc_defn.h
> --- a/xen/include/xen/perfc_defn.h
> +++ b/xen/include/xen/perfc_defn.h
> @@ -39,10 +39,6 @@ PERFCOUNTER(vcpu_wake_runnable, "csc
> PERFCOUNTER(vcpu_wake_not_runnable, "csched: vcpu_wake_not_runnable")
> PERFCOUNTER(vcpu_park, "csched: vcpu_park")
> PERFCOUNTER(vcpu_unpark, "csched: vcpu_unpark")
> -PERFCOUNTER(tickle_local_idler, "csched: tickle_local_idler")
> -PERFCOUNTER(tickle_local_over, "csched: tickle_local_over")
> -PERFCOUNTER(tickle_local_under, "csched: tickle_local_under")
> -PERFCOUNTER(tickle_local_other, "csched: tickle_local_other")
> PERFCOUNTER(tickle_idlers_none, "csched: tickle_idlers_none")
> PERFCOUNTER(tickle_idlers_some, "csched: tickle_idlers_some")
> PERFCOUNTER(load_balance_idle, "csched: load_balance_idle")
> @@ -52,6 +48,7 @@ PERFCOUNTER(steal_trylock_failed, "csc
> PERFCOUNTER(steal_peer_idle, "csched: steal_peer_idle")
> PERFCOUNTER(migrate_queued, "csched: migrate_queued")
> PERFCOUNTER(migrate_running, "csched: migrate_running")
> +PERFCOUNTER(migrate_kicked_away, "csched: migrate_kicked_away")
> PERFCOUNTER(vcpu_hot, "csched: vcpu_hot")
>
> PERFCOUNTER(need_flush_tlb_flush, "PG_need_flush tlb flushes")

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel