
[PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
Hi Everyone,

Here is a patch series that instills some NUMA awareness into the Credit
scheduler.

The patches teach Xen's scheduler how to try to maximize performance on a
NUMA host, taking advantage of the information coming from the automatic
NUMA placement we have in libxl. Right now, the placement algorithm runs
and selects a node (or a set of nodes) on which it is best to put a new
domain. Then, all the memory for the new domain is allocated from those
node(s) and all the vCPUs of the new domain are pinned to the pCPUs of
those node(s). What we do here is, instead of statically pinning the
domain's vCPUs to the nodes' pCPUs, have the (Credit) scheduler _prefer_
running them there. That enables most of the performance benefits of
"real" pinning, but without its intrinsic lack of flexibility.

This is done by extending the scheduler's knowledge to include a domain's
node-affinity. We then ask it to first try to run the domain's vCPUs on one
of the nodes the domain has affinity with. Of course, if that turns out to
be impossible, it falls back to the old behaviour (i.e., considering
vcpu-affinity only).
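
To make this more concrete, here is a minimal, self-contained C sketch of
that two-step selection. It is only an illustration of the idea, not the
actual code from patch 3: the mask type, the helper names and the idle-pCPU
input are all made up for the example.

#include <stdio.h>
#include <stdint.h>

typedef uint64_t mask_t;              /* one bit per pCPU */

static int first_cpu(mask_t m)
{
    for (int c = 0; c < 64; c++)
        if (m & ((mask_t)1 << c))
            return c;
    return -1;                        /* no pCPU available */
}

/*
 * idle:          pCPUs that are currently idle
 * vcpu_aff:      the vCPU's (hard) vcpu-affinity
 * node_aff_cpus: pCPUs belonging to the nodes in the domain's node-affinity
 */
static int pick_cpu(mask_t idle, mask_t vcpu_aff, mask_t node_aff_cpus)
{
    mask_t preferred = idle & vcpu_aff & node_aff_cpus;

    if (preferred)                        /* step 1: prefer node-affine pCPUs */
        return first_cpu(preferred);

    return first_cpu(idle & vcpu_aff);    /* step 2: fall back to vcpu-affinity */
}

int main(void)
{
    /* pCPUs 0-3 are on node 0, pCPUs 4-7 on node 1; the vCPU may run
     * anywhere, but its domain has node-affinity with node 1 only. */
    mask_t idle     = 0x13;           /* pCPUs 0, 1 and 4 are idle */
    mask_t vcpu_aff = 0xff;           /* all 8 pCPUs allowed */
    mask_t node1    = 0xf0;           /* the pCPUs of node 1 */

    printf("picked pCPU %d\n", pick_cpu(idle, vcpu_aff, node1));  /* -> 4 */
    return 0;
}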

Allow me to mention that NUMA-aware scheduling is not only one of the items
on the NUMA roadmap I'm trying to maintain here:
http://wiki.xen.org/wiki/Xen_NUMA_Roadmap. It is also one of the features we
decided we want for Xen 4.3 (and thus it is part of the list of such features
that George is maintaining).

Up to now, I've been able to thoroughly test this only on my 2-node NUMA
test box, by running the SpecJBB2005 benchmark concurrently in multiple VMs,
and the results look really nice. The full set of results can be found in my
presentation from the last XenSummit, which is available here:

http://www.slideshare.net/xen_com_mgr/numa-and-virtualization-the-case-of-xen?ref=http://www.xen.org/xensummit/xs12na_talks/T9.html

However, I reran some of the tests over the last few days (since I changed
some bits of the implementation), and here's what I got:

-------------------------------------------------------
        SpecJBB2005 Total Aggregate Throughput
-------------------------------------------------------
 #VMs   No NUMA affinity   NUMA affinity &      +/- %
                           scheduling
-------------------------------------------------------
   2       34653.273          40243.015        +16.13%
   4       29883.057          35526.807        +18.88%
   6       23512.926          27015.786        +14.89%
   8       19120.243          21825.818        +14.15%
  10       15676.675          17701.472        +12.91%

Basically, the results are consistent with what is shown in the super-nice
graphs in the slides above! :-) As said, this looks nice to me, especially
considering that my test machine is quite small, i.e., its 2 nodes are very
close to each other from a latency point of view. I expect even more
improvement on bigger hardware, where the NUMA effect is much greater. Of
course, I will keep benchmarking myself (hopefully, on systems with more
than 2 nodes too), but should anyone want to run their own tests, that would
be great, so feel free to do so and report the results to me and/or to the
list!

A little bit more about the series:

1/8 xen, libxc: rename xenctl_cpumap to xenctl_bitmap
2/8 xen, libxc: introduce node maps and masks

Is some preparation work.

3/8 xen: let the (credit) scheduler know about `node affinity`

Is where the vcpu load balancing logic of the credit scheduler is modified to
support node-affinity.

4/8 xen: allow for explicitly specifying node-affinity
5/8 libxc: allow for explicitly specifying node-affinity
6/8 libxl: allow for explicitly specifying node-affinity
7/8 libxl: automatic placement deals with node-affinity

Is what wires the in-scheduler node-affinity support to the external world.
Please note that patch 4 touches XSM and Flask, which is the area where I
have the least experience and the least ability to test properly. So, if
Daniel and/or anyone interested in that could take a look and comment, that
would be awesome.

8/8 xl: report node-affinity for domains

Is just some small output enhancement.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)


Re: [PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Friday, October 05, 2012 8:08 AM
> To: xen-devel@lists.xen.org
> Cc: Andre Przywara; Ian Campbell; Anil Madhavapeddy; George Dunlap; Andrew Cooper; Juergen Gross; Ian
> Jackson; Jan Beulich; Marcus Granado; Daniel De Graaf; Matt Wilson
> Subject: [Xen-devel] [PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
>
> Hi Everyone,
>
> Here is a patch series that instills some NUMA awareness into the Credit
> scheduler.

Hi Dario --

Just wondering... is the NUMA information preserved on live migration?
I'm not saying that it necessarily should, but it may just work
due to the implementation (since migration is a form of domain creation).
In either case, it might be good to comment about live migration
on your wiki.

Thanks,
Dan

Re: [PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
On 05.10.2012 16:08, Dario Faggioli wrote:
> Hi Everyone,
>
> Here is a patch series that instills some NUMA awareness into the Credit
> scheduler.
>
> [...]
>
> 8/8 xl: report node-affinity for domains
>
> Is just some small output enhancement.

Apart from the minor comment to Patch 3:

Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>


--
Juergen Gross                    Principal Developer Operating Systems
PBG PDG ES&S SWE OS6             Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions     e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                    Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

Re: [PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
On Mon, 2012-10-08 at 12:43 -0700, Dan Magenheimer wrote:
> Just wondering... is the NUMA information preserved on live migration?
> I'm not saying that it necessarily should, but it may just work
> due to the implementation (since migration is a form of domain creation).
>
What could I say... yes, but "preserved" is not the right word. :-)

In fact, something does happen when you migrate a VM. As you said, migration
is a special case of domain creation, so the placement algorithm will trigger
as part of the process of creating the target VM (unless you override the
relevant options in the config file during the migration itself). That means
the target VM will be placed on one (or some) node(s) of the target host, and
its node-affinity will be set accordingly.

_However_, there is currently no guarantee that the final decision of the
placement algorithm on the target machine will be "compatible" with the one
made on the source machine at initial VM creation time. For instance, if your
VM fits in just one node and is placed there on machine A, it could well end
up being split across two or more nodes when migrated to machine B (and, of
course, vice versa).

Whether that is acceptable or not is of course debatable, and we have
already had a bit of this discussion (although no real conclusion has been
reached yet).
My take is that, right now, since we do not yet expose any virtual NUMA
topology to the VM itself, the behaviour described above is fine. As soon as
we have some guest NUMA awareness, it might be worthwhile to try to preserve
it, at least to some extent.

Oh, and BTW, I'm of course talking about migration with xl and libxl. If you
use other toolstacks, the hypervisor will default to its current (_without_
this series) behaviour, and it will all depend on who calls
xc_domain_node_setaffinity() (or, perhaps, XEN_DOMCTL_setnodeaffinity
directly), and when.
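
Just to make that concrete, here is a hypothetical sketch of a non-libxl
toolstack doing it through libxc. It assumes the interface introduced by this
series (a nodemap passed as a small bitmap, and an
xc_domain_node_setaffinity(xch, domid, nodemap) call); the exact type,
prototype and buffer sizing are assumptions made for the sake of the example,
so don't take it as a verbatim copy of the patches.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <xenctrl.h>

int main(int argc, char *argv[])
{
    uint32_t domid = (argc > 1) ? (uint32_t)atoi(argv[1]) : 1;
    uint8_t nodemap[1] = { 0x03 };   /* node-affinity = nodes 0 and 1 */
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    int rc;

    if (!xch) {
        fprintf(stderr, "cannot open the hypercall interface\n");
        return 1;
    }

    /* Tell Xen which nodes the domain prefers; the scheduler changes in
     * this series are what turn that into an actual scheduling preference. */
    rc = xc_domain_node_setaffinity(xch, domid, nodemap);
    if (rc)
        fprintf(stderr, "setting node-affinity for domain %u failed\n", domid);

    xc_interface_close(xch);
    return rc ? 1 : 0;
}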

> In either case, it might be good to comment about live migration
> on your wiki.
>
That is definitely a good point; I will put something there about
migration and the behaviour described above.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Re: [PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
On Tue, Oct 09, 2012 at 11:45:49AM +0100, Dario Faggioli wrote:
>
> Whether that is acceptable or not is of course debatable, and we have
> already had a bit of this discussion (although no real conclusion has been
> reached yet).
> My take is that, right now, since we do not yet expose any virtual NUMA
> topology to the VM itself, the behaviour described above is fine. As soon as
> we have some guest NUMA awareness, it might be worthwhile to try to preserve
> it, at least to some extent.

For what it's worth, under VMware all bets are off if a vNUMA-enabled
guest is migrated via vMotion. See "Performance Best Practices for
VMware vSphere 5.0" [1] page 40. There is also a good deal of
information in a paper published by VMware labs on HPC workloads [2]
and a blog post on NUMA load balancing [3].

Matt

[1] http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.0.pdf
[2] http://labs.vmware.com/publications/performance-evaluation-of-hpc-benchmarks-on-vmwares-esxi-server
[3] http://blogs.vmware.com/vsphere/2012/02/vspherenuma-loadbalancing.html




Re: [PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
On Fri, Oct 5, 2012 at 3:08 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> Hi Everyone,
>
> Here is a patch series that instills some NUMA awareness into the Credit
> scheduler.

Hey Dario -- I've looked through everything and acked everything I
felt I understood well enough / had the authority to ack. Thanks for
the good work!

-George

Re: [PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
On Wed, 2012-10-10 at 12:00 +0100, George Dunlap wrote:
> On Fri, Oct 5, 2012 at 3:08 PM, Dario Faggioli
> <dario.faggioli@citrix.com> wrote:
> > Hi Everyone,
> >
> > Here is a patch series that instills some NUMA awareness into the Credit
> > scheduler.
>
> Hey Dario --
>
Hi!

> I've looked through everything and acked everything I
> felt I understood well enough / had the authority to ack.
>
Yep, I've seen that. Thanks.

> Thanks for
> the good work!
>
Well, thanks to you for the good comments... And be prepared for the next
round! :-P

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Re: [PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
On Mon, 2012-10-08 at 12:43 -0700, Dan Magenheimer wrote:
> Just wondering... is the NUMA information preserved on live migration?
> I'm not saying that it necessarily should, but it may just work
> due to the implementation (since migration is a form of domain creation).
> In either case, it might be good to comment about live migration
> on your wiki.
>
FYI:

http://wiki.xen.org/wiki/Xen_NUMA_Introduction
http://wiki.xen.org/wiki?title=Xen_NUMA_Introduction&diff=5327&oldid=4598

As per the NUMA roadmap ( http://wiki.xen.org/wiki/Xen_NUMA_Roadmap ),
you can see here [1] that it was already there. :-)

Regards,
Dario

[1] http://wiki.xen.org/wiki/Xen_NUMA_Roadmap#Virtual_NUMA_topology_exposure_to_guests

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)