Mailing List Archive

Vanilla Xen total CPU %
Hi everyone,

I am interested in calculating the approximate total CPU runtime % of
a vanilla Xen project host (without any of the bells and whistles of
XCP-ng or Xen Server). What I have at my disposal is Ubuntu, Xen and
the xl tool stack.

I have been experimenting with writing a parser for xentop output in
batch mode. This is a fairly easy task, and I can see other attempts at
parsers across dead and dying GitHub projects... my issue around this
is the precise meaning of the 'CPU(sec)' metric given by xentop and how
I could use it to infer a total CPU time.

The docs for xentop say "CPU(sec) CPU time which the guest OS has
consumed(cumulated)".

My confusion is around how CPU 'seconds' actually relate to vCPUs, real
cores etc in this context. I can also see a couple of attempts at
figuring out total CPU %, but none look quite right.

If I were able to derive the CPU seconds for each domU in an
interval, the aggregate CPU seconds in that interval, and both the total
vCPUs and physical cores, what would be the correct formula for
approximating a total CPU runtime %?

Also if I am missing a trick and there is an easier way of calculating
this I would be extremely happy to hear it, as simple is nice :)

Thank you.
Re: Vanilla Xen total CPU % [ In reply to ]
On Thu, Jun 4, 2020 at 8:34 PM Nick Calvert <nick.calvert@gmail.com> wrote:

> Hi everyone,
>
> I am interested in calculating the approximate total CPU runtime % of
> a vanilla Xen project host (without any of the bells and whistles of
> XCP-ng or Xen Server). What I have at my disposal is Ubuntu, Xen and
> the xl tool stack.
>

Just to be clear: what you mean is that you want to add up the time all the
VMs are running? (i.e., if you have one VM at 150%, another at 25%, and
another at 25%, the total would be 200%?)


> I have been experimenting with writing a parser for xentop output in
> batch mode. This is a fairly easy task, and I can see other attempts at
> parsers across dead and dying GitHub projects... my issue around this
> is the precise meaning of the 'CPU(sec)' metric given by xentop and how
> I could use it to infer a total CPU time.
>

If you don't mind me asking, what are you (and the projects you mention)
using this information for?

Xen is an open-source project, so rather than having dozens of projects
trying to work around the fact that the core tools don't tell them what they
want to know, it seems like it would be better either to modify xentop to
tell you what you want to know or to add a new tool to do the same thing.


>
> The docs for xentop say "CPU(sec) CPU time which the guest OS has
> consumed(cumulated)".
>
> My confusion is around how CPU 'seconds' actually relate to vCPUs, real
> cores etc in this context. I can also see a couple of attempts at
> figuring out total CPU %, but none look quite right.
>
> If I were able to derive the CPU seconds for each domU in an
> interval, the aggregate CPU seconds in that interval, and both the total
> vCPUs and physical cores, what would be the correct formula for
> approximating a total CPU runtime %?
>
> Also if I am missing a trick and there is an easier way of calculating
> this I would be extremely happy to hear it, as simple is nice :)
>

I think if I were writing a program, I'd probably use libxl to get the raw
data rather than trying to parse xentop. libxl_list_domain() will return
a list of libxl_dominfo, which has a field "cpu_time", which is (I believe)
the number of nanoseconds of CPU time that the domain has ever consumed in
its lifetime.

So what you'd do is take a timestamp (t1), call libxl_list_domain(), and go
through the resulting list, adding up `cpu_time` (c1). Then at some point
later, take a timestamp (t2) and do another sum (c2). Your total host
utilization between t1 and t2 would then be (c2 - c1) / (t2 - t1).
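
In case it's useful, here's roughly what I mean as a minimal, untested
sketch (error handling mostly omitted, and the logger setup may need
tweaking for your Xen version):

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <xentoollog.h>
#include <libxl.h>

/* Sum cpu_time (nanoseconds consumed so far) over all domains. */
static uint64_t total_cpu_time_ns(libxl_ctx *ctx)
{
    int nb_domain = 0;
    libxl_dominfo *info = libxl_list_domain(ctx, &nb_domain);
    uint64_t sum = 0;

    for (int i = 0; i < nb_domain; i++)
        sum += info[i].cpu_time;

    libxl_dominfo_list_free(info, nb_domain);
    return sum;
}

int main(void)
{
    xentoollog_logger_stdiostream *logger =
        xtl_createlogger_stdiostream(stderr, XTL_ERROR, 0);
    libxl_ctx *ctx = NULL;
    struct timespec t1, t2;

    if (libxl_ctx_alloc(&ctx, LIBXL_VERSION, 0,
                        (xentoollog_logger *)logger)) {
        fprintf(stderr, "cannot connect to libxl\n");
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    uint64_t c1 = total_cpu_time_ns(ctx);

    sleep(5);                        /* sample interval */

    clock_gettime(CLOCK_MONOTONIC, &t2);
    uint64_t c2 = total_cpu_time_ns(ctx);

    double elapsed_ns = (t2.tv_sec - t1.tv_sec) * 1e9 +
                        (t2.tv_nsec - t1.tv_nsec);

    /* 100% == one physical CPU busy for the whole interval, so this
     * can go well above 100% on a multi-core host. */
    printf("host cpu usage: %.1f%%\n", 100.0 * (c2 - c1) / elapsed_ns);

    libxl_ctx_free(ctx);
    xtl_logger_destroy((xentoollog_logger *)logger);
    return 0;
}

Build it against your Xen's headers with something like
"gcc hostcpu.c -lxenlight -lxentoollog". Note the result is in units of one
CPU, so it can exceed 100%; if you want a 0-100% figure for the whole host
you'd divide by the number of logical CPUs (libxl_get_physinfo() should tell
you that). I believe the list also includes dom0, which you may or may not
want to count.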

If you're using 4.13 at least, you could use the golang bindings instead,
if you didn't want to use C.

-George
RE: Vanilla Xen total CPU % [ In reply to ]
Hi George,



Thank you very much for taking the time to respond to me.



What I have been trying to do (and I think it’s the same for the other abandoned projects I came across) is come up with an equivalent to something like the Hyper-V % Total Runtime counter, which gives a (probably not very precise) but useful account of the total CPU ‘load’ of a hypervisor.



This is the sort of metric which can be useful for spotting overall trends, or when the sum of all virtual machine CPU usage passes an alerting trigger. I appreciate that at this point it would probably be necessary to look at other metrics to determine what was actually happening.



What I was trying to do was stream some of the xentop counters into a time series database (influxdb) so I could graph this. Other people have attempted the same: for example, there are projects doing this with Graphite, and some people were using the old xm Python bindings to do basically exactly what you describe in your mail and return their own usage % for graphing.
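
For what it is worth, what I push per sample is just InfluxDB line protocol along these lines (the measurement and tag names are only illustrative):

xen_domu_cpu,host=xen01,domain=guest1 cpu_seconds=12345.6 1591623600000000000
xen_domu_cpu,host=xen01,domain=guest2 cpu_seconds=987.3 1591623600000000000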



Being able to do this with Go in 4.13 is very interesting and I did not know such bindings existed. I have some experience with the language so will look at this now. I am however saddled with some older versions of Xen and had got as far as building a parser and a.) pushing the CPU seconds value for each domU directly into a database and treating them as if they were network interface counters, and b.) taking samples in a time interval, performing a calculation just as you describe and inserting the results directly into a database with a timestamp.



One thing I was unsure on - and I think this is my ignorance of how such things are calculated - is how the total CPU capacity of the hypervisor impacts this. For instance, if I have a hyperthreaded hypervisor with 20 real cores and a 100 second interval, I guess as an oversimplification there are 4000 CPU seconds available for execution in that interval? That is when I started to get confused about how to determine a total % from the info in xentop.
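
In other words, the back-of-the-envelope formula I had in mind (purely as an illustration, assuming each hyperthread counts as a CPU) was:

capacity = logical CPUs * interval = 40 * 100 s = 4000 CPU-seconds
total %  = (sum of per-domU CPU-second deltas / capacity) * 100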



Many thanks.
Re: Vanilla Xen total CPU % [ In reply to ]
> On Jun 8, 2020, at 12:58 PM, Nick Calvert <nick.calvert@simplyhosting.cloud> wrote:
>
> Hi George,
>
> Thank you very much for taking the time to respond to me.
>
> What I have been trying to do (and I think it’s the same for the other abandoned projects I came across) is come up with an equivalent to something like the Hyper-V % Total Runtime counter, which gives a (probably not very precise) but useful account of the total CPU ‘load’ of a hypervisor.
>
> This is the sort of metric which can be useful for spotting overall trends, or when the sum of all virtual machine CPU usage passes an alerting trigger. I appreciate that at this point it would probably be necessary to look at other metrics to determine what was actually happening.
>
> What I was trying to do was stream some of the xentop counters into a time series database (influxdb) so I could graph this. Other people have attempted the same: for example, there are projects doing this with Graphite, and some people were using the old xm Python bindings to do basically exactly what you describe in your mail and return their own usage % for graphing.

There are a couple of simple things we could do to make this sort of thing easier. We could:

1. Add a ‘hypervisor utilization’ field which does this addition for you

2. Add an option for xentop to have a ‘json’ output format

3. Modify xentop to allow a “format” string, such that you could request it only output the “hypervisor utilization”.

> Being able to do this with Go in 4.13 is very interesting and I did not know such bindings existed. I have some experience with the language so will look at this now. I am however saddled with some older versions of Xen and had got as far as building a parser and a.) pushing the CPU seconds value for each domU directly into a database and treating them as if they were network interface counters, and b.) taking samples in a time interval, performing a calculation just as you describe and inserting the results directly into a database with a timestamp.

We haven’t made a big deal of the golang bindings yet, because they’re still labelled as “experimental”: there’s a lot of missing functionality, and we don’t yet promise not to break backwards compatibility. Xen 4.13 had only some very basic functions and structures defined, but ListDomain was one of them. For our upcoming 4.14 release (it should be out in a month or so), the functionality is greatly expanded but still not yet complete.

I think it’s *very* unlikely that the signature of that function is going to change significantly, so I think you’re reasonably safe using it.

One downside of the approach of writing your own binary is that libxl is tied to the particular version of the hypervisor you’re using, so every time you update Xen you have to recompile a new binary. (This is currently true for both C and golang.)

> One thing I was unsure on - and I think this is my ignorance of how such things are calculated - is how the total CPU capacity of the hypervisor impacts this. For instance, if I have a hyperthreaded hypervisor with 20 real cores and a 100 second interval, I guess as an oversimplification there are 4000 CPU seconds available for execution in that interval? That is when I started to get confused about how to determine a total % from the info in xentop.

I’d have to look into the code to be sure, but yes, that’s normally how things work: 1000 added to the “cpu time” corresponds to 1000ns of execution on a single CPU. So if you have 8 cores all executing in parallel for 1000ns, that would look like 8000ns. xentop would then normally do the calculation I described to you, presenting 8000ns / 1000ns * 100% => 800%.

So to answer your *original original* question, adding up all the percentages of the domains from xentop for a time period *should* give you the total utilization for that time period.
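
To put rough numbers on that (assuming each hyperthread counts as a CPU here, which I believe it does):

150% + 25% + 25% = 200%, i.e. the equivalent of two CPUs fully busy
200% / (40 CPUs * 100%) = 5% of total host capacity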

I can double-check and get back to you.

-George
Re: Vanilla Xen total CPU % [ In reply to ]
Hi,

> > On Jun 8, 2020, at 12:58 PM, Nick Calvert <nick.calvert@simplyhosting.cloud> wrote:
> > What I was trying to do was stream some of the xentop counters
> > into a time series database (influxdb) so I could graph this.

I wanted to do this too, but for Prometheus.

I wrote a libxl C program that prints out the CPU time used by each
guest in Prometheus's exposition format¹. That would be just the
cumulative time in seconds because Prometheus likes things in its
base units, and you can derive percentages and rates from that.
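
The output is just plain text, something along these lines (the metric
name is simply what I happened to pick for my program):

# HELP xen_domain_cpu_seconds_total CPU time consumed by the guest
# TYPE xen_domain_cpu_seconds_total counter
xen_domain_cpu_seconds_total{domain="guest1"} 12345.6
xen_domain_cpu_seconds_total{domain="guest2"} 987.3

Then rate(xen_domain_cpu_seconds_total[5m]) in PromQL gives you per-guest
usage in units of one CPU, and summing that across guests gives the host
total.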

I just trigger it by cron every 5 minutes and use it as a
node_exporter textfile collector², but I guess if I wanted it at a
shorter interval I'd make it into a proper exporter from a
standalone daemon.
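
Concretely, the cron side is just an entry along these lines (the binary
name and paths are whatever you have set up, and node_exporter's
--collector.textfile.directory has to point at the same directory):

*/5 * * * * root /usr/local/bin/xen-cpu-metrics > /var/lib/node_exporter/textfile/xen.prom.tmp && mv /var/lib/node_exporter/textfile/xen.prom.tmp /var/lib/node_exporter/textfile/xen.prom

The write-to-a-temp-file-then-mv is just so node_exporter never reads a
half-written file.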

If anyone would find it useful I can publish it, though my C is
terrible…

Cheers,
Andy

¹ https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md

² https://github.com/prometheus-community/node-exporter-textfile-collector-scripts
Re: Vanilla Xen total CPU % [ In reply to ]
On 2020-06-08 10:08, Andy Smith wrote:
> Hi,
>
>> > On Jun 8, 2020, at 12:58 PM, Nick Calvert <nick.calvert@simplyhosting.cloud> wrote:
>> > What I was trying to do was stream some of the xentop counters
>> > into a time series database (influxdb) so I could graph this.
>
> I wanted to do this too, but for Prometheus.
+1

I was going to approach it from a Crystal point of view, binding to the
libxl C library. Again, real life and time got to me. I would love to
see a high-quality Xen Prometheus exporter!

>
> I wrote a libxl C program that prints out the CPU time used by each
> guest in Prometheus's exposition format¹. That would be just the
> cumulative time in seconds because Prometheus likes things in its
> base units, and you can derive percentages and rates from that.
>
> I just trigger it by cron every 5 minutes and use it as a
> node_exporter textfile collector², but I guess if I wanted it at a
> shorter interval I'd make it into a proper exporter from a
> standalone daemon.
>
> If anyone would find it useful I can publish it, though my C is
> terrible…

I'd find it useful to look at :) Throw it on a git repo somewhere, or
something :)

-- David
Re: Vanilla Xen total CPU % [ In reply to ]
Hi!

FYI, we managed to contribute to a Netdata plugin using Xen stats (see
https://learn.netdata.cloud/docs/agent/collectors/xenstat.plugin/ ). You
can use it to stream metrics to a master Netdata instance, which can then
put them directly into a Prometheus DB (I'm doing that in my own production).

Best,


Olivier.

On Tue, Jun 9, 2020 at 4:41 AM David Kowis <david@kow.is> wrote:

> On 2020-06-08 10:08, Andy Smith wrote:
> > Hi,
> >
> >> > On Jun 8, 2020, at 12:58 PM, Nick Calvert
> <nick.calvert@simplyhosting.cloud> wrote:
> >> > What I was trying to do was stream some of the xentop counters
> >> > into a time series database (influxdb) so I could graph this.
> >
> > I wanted to do this too, but for Prometheus.
> +1
>
> I was going to approach it from a Crystal point of view, binding to the
> libxl C library. Again, real life and time got to me. I would love to
> see a high-quality Xen Prometheus exporter!
>
> >
> > I wrote a libxl C program that prints out the CPU time used by each
> > guest in Prometheus's exposition format¹. That would be just the
> > cumulative time in seconds because Prometheus likes things in its
> > base units, and you can derive percentages and rates from that.
> >
> > I just trigger it by cron every 5 minutes and use it as a
> > node_exporter textfile collector², but I guess if I wanted it at a
> > shorter interval I'd make it into a proper exporter from a
> > standalone daemon.
> >
> > If anyone would find it useful I can publish it, though my C is
> > terrible…
>
> I'd find it useful to look at :) Throw it on a git repo somewhere, or
> something :)
>
> -- David
>
>