Mailing List Archive

IRQ latency measurements in hypervisor
Hello community,

Recently I was experimenting with context switching latencies in
Xen. I want to share my findings and discuss some issues I have found.

My target was to ensure that no more than 30us passes between interrupt
generation and handling in a guest VM. The main goal was to determine
whether Xen is ready for soft RT and to identify the main issues that
prevent this.

TLDR: I was able to achieve a mean latency of 15us with a tuned RTDS
scheduler, the serial console disabled, and some hacks to the Xen
toolstack. This was done on a setup with 1 pCPU.

Test setup.

ARMv8 platform: the Renesas R-Car H3 SoC on a Salvator board.

To accurately determine latency, I employed one of the timer counter
units (TMUs) available on the SoC. This is a 32-bit timer with
auto-reload that can generate an interrupt on underflow. I fed it with a
33.275MHz clock, which gave me a resolution of about 30ns. I programmed
the timer to generate an interrupt every 10ms. My ISR then read the
current timer value and determined how much time had passed since the
last underflow. This gave me the time interval between IRQ generation
and ISR invocation. Those values were collected, and statistics were
calculated every 10 seconds. Here is an example of output from my Linux
driver:

[ 83.873162] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
[ 94.136632] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
[ 104.400098] rt_eval_tmu e6fc0000.tmu: Mean: 50 (1500 ns) stddev: 129 (3870 ns)
[ 114.663828] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
[ 124.927296] rt_eval_tmu e6fc0000.tmu: Mean: 56 (1680 ns) stddev: 183 (5490 ns)

This is baremetal Linux. And here is Dom0:

[ 237.431003] rt_eval_tmu e6fc0000.tmu: Mean: 306 (9180 ns) stddev: 25 (750 ns)
[ 247.694506] rt_eval_tmu e6fc0000.tmu: Mean: 302 (9060 ns) stddev: 17 (510 ns)

The driver outputs both the raw timer value (e.g. 44) and the same
value scaled to nanoseconds (e.g. 1320 ns). As you can see, the
baremetal setup is much faster. But experiments showed that Linux does
not provide consistent values, even when running baremetal. You can see
sporadic spikes in the "stddev" values.
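
For reference, the latency calculation in the ISR boils down to
something like this (a minimal sketch, not the actual driver code; it
assumes, as described above, a down-counter that is reloaded to its
period value on underflow):

#include <stdint.h>

/*
 * Minimal sketch (illustrative only): convert the TMU counter value
 * read in the ISR into an IRQ-to-ISR latency.  Assumes a down-counter
 * that is reloaded to `reload` when it underflows and raises the
 * interrupt, so the ticks elapsed since the underflow are simply
 * reload - current.
 */
#define TMU_CLK_HZ 33275000ULL  /* 33.275 MHz clock, ~30 ns per tick */

static uint64_t irq_latency_ns(uint32_t reload, uint32_t tcnt_now)
{
    uint32_t ticks = reload - tcnt_now;   /* ticks since the underflow */

    return ticks * 1000000000ULL / TMU_CLK_HZ;
}

/*
 * Example: a 10 ms period is ~332750 ticks; reading 332706 in the ISR
 * gives 44 ticks, i.e. ~1.3 us -- matching the "Mean: 44 (1320 ns)"
 * lines above (the driver simply scales ticks by 30 ns, hence 1320).
 */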

So my next step was to use a proper RTOS to do the measurements. I
chose Zephyr. My PR that adds Xen support to Zephyr can be found at
[1]. Support for R-Car Gen3 is not upstreamed, but is present on my
GitHub ([2]). At [3] you can find the source code for the application
that does the latency measurements. It behaves exactly like my Linux
driver, but provides a bit more information:

*** Booting Zephyr OS build zephyr-v2.4.0-2750-g0f2c858a39fc ***
RT Eval app

Counter freq is 33280000 Hz. Period is 30 ns
Set alarm in 0 sec (332800 ticks)
Mean: 600 (18000 ns) stddev: 3737 (112110 ns) above thr: 0% [265 (7950 ns) - 66955 (2008650 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
Mean: 388 (11640 ns) stddev: 2059 (61770 ns) above thr: 0% [266 (7980 ns) - 58830 (1764900 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
Mean: 358 (10740 ns) stddev: 1796 (53880 ns) above thr: 0% [265 (7950 ns) - 57780 (1733400 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
...

So there you can see: the mean time, the standard deviation, the
percentage of interrupts processed above the 30us threshold, the
minimum and maximum latency values for the current 10s run, and the
global minimum and maximum.
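
The per-batch statistics are straightforward to accumulate; here is a
minimal sketch of the idea (not the actual application code, which
lives at [3]; the names below are mine):

#include <math.h>
#include <stdint.h>

/* Illustrative accumulator for one 10 s batch of latency samples (in
 * timer ticks, ~30 ns each).  Zero-initialise before each batch. */
struct lat_stats {
    uint64_t n, sum, sum_sq;   /* for mean and standard deviation */
    uint32_t min, max;         /* batch minimum / maximum */
    uint64_t above_thr;        /* samples over the 30 us threshold */
};

#define THRESHOLD_TICKS 1000   /* 30 us at ~30 ns per tick */

static void lat_stats_add(struct lat_stats *s, uint32_t ticks)
{
    s->n++;
    s->sum += ticks;
    s->sum_sq += (uint64_t)ticks * ticks;
    if (s->n == 1 || ticks < s->min)
        s->min = ticks;
    if (ticks > s->max)
        s->max = ticks;
    if (ticks > THRESHOLD_TICKS)
        s->above_thr++;
}

static double lat_stats_mean(const struct lat_stats *s)
{
    return (double)s->sum / s->n;
}

static double lat_stats_stddev(const struct lat_stats *s)
{
    double mean = lat_stats_mean(s);
    double var = (double)s->sum_sq / s->n - mean * mean;

    return var > 0.0 ? sqrt(var) : 0.0;
}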

Zephyr running as baremetal showed very stable results (this is an
older build, so no extended statistics there):

## Starting application at 0x480803C0 ...
*** Booting Zephyr OS build zephyr-v2.4.0-1137-g5803ee1e8183 ***
RT Eval app

Counter freq is 33280000 Hz. Period is 30 ns
Mean: 31 (930 ns) stddev: 0 (0 ns)
Mean: 31 (930 ns) stddev: 0 (0 ns)
Mean: 31 (930 ns) stddev: 0 (0 ns)
Mean: 31 (930 ns) stddev: 0 (0 ns)
Mean: 31 (930 ns) stddev: 0 (0 ns)
Mean: 31 (930 ns) stddev: 0 (0 ns)
...

As Zephyr provided stable readouts with no jitter, I used it to do all
subsequent measurements.


IMPORTANT! All subsequent tests were conducted with only 1 CPU core
enabled. My goal was to ensure that the system can react in a timely
manner to an external interrupt even under load.


Test results and latency sources

As you can see, the baremetal OS provides a latency of ~0.9us without
any deviation. The same code running as a DomU on an idle system shows
a mean latency of 12us and a deviation of about 10us. Latencies within
a 10s batch can range from 8us to 25us. This fits into the required
30us threshold, so no big issue there.

But this worsens under certain conditions.

1. Serial console. The R-Car serial driver (scif) works in synchronous
mode, so any heavy serial console output leads to higher latency.
Tests show a mean latency of 1000us and a deviation of 1332us. 54% of
interrupts are handled outside of the 30us window. Real values may be
even higher, because in some cases the timer would complete a full
cycle, and instead of, say, 11ms the ISR would read 11 % 10 = 1ms of
latency. I tried to enable asynchronous mode for the serial console.
This made things better, but it led to gaps in the output, so I just
turned the serial console off completely.

2. RTDS scheduler. With the console disabled, things like "hexdump -v
/dev/zero" didn't affect the latency so badly, but still, sometimes I
got ~600us spikes. This is not a surprise, given the default RTDS
configuration. I changed the period for the DomU from the default 10ms
to 100us and things got better: with Dom0 burning CPU I rarely see a
max latency of about ~30us, with a mean latency of ~9us and a
deviation of ~0.5us. On the other hand, when I tried to set the period
to 30us, the max latency rose to ~60us.

3. Huge latency spike during domain creation. I conducted some
additional tests, including the use of PV drivers, but these didn't
affect the latency in my "real time" domain. However, an attempt to
create another domain with a relatively large memory size of 2GB led
to a huge spike in latency. Debugging led to this call path:

XENMEM_populate_physmap -> populate_physmap() ->
alloc_domheap_pages() -> alloc_heap_pages()-> huge
"for ( i = 0; i < (1 << order); i++ )" loop.

This loop handles struct page* for every one of the 262144 pages that
were allocated by calling populate_physmap(). This function is not
preemptible, and it takes about a second to complete on my setup.
Needless to say, this completely ruins any responsiveness of the
hypervisor.

I managed to overcome issue #3 by commenting out all calls to
populate_one_size() except the populate_one_size(PFN_4K_SHIFT) one in
xg_dom_arm.c. This lengthened domain construction, but my "RT" domain
no longer experienced such big latency issues. Apparently all other
hypercalls used during domain creation are either fast or preemptible.
No doubt my hack leads to page table inflation and an overall
performance drop.

Conclusions.

I wanted to:

1. Share my approach to IRQ latency measurements.

2. Share results of said measurements.

3. Ask for other potential scenarios that can affect IRQ latency, so I
can test them on my setup.

4. Discuss the issue with alloc_domheap_pages(). I brooded over ways
to make it preemptible, but it is a) located deep in the call chain
and b) used not only by hypercalls, so I can't see an easy way to do
it. The best solution would of course be to make the whole hypervisor
preemptible. I even heard about some work on that a long time ago...

5. Show that under certain conditions Xen can be used in a soft RT
system. Of course, additional tests are needed, but with some minor
tweaks it can provide ~15us IRQ latency on a typical workload.

[1] https://github.com/zephyrproject-rtos/zephyr/pull/30164
[2] https://github.com/lorc/zephyr/tree/renesas_salvator_x
[3] https://github.com/lorc/zephyr/blob/rt_eval/samples/rt_eval/src/main.c


--
Volodymyr Babchuk at EPAM
Re: IRQ latency measurements in hypervisor
+ Bertrand, Andrew (see comment on alloc_heap_pages())

Really great work, Volodymyr! Some comments below.


On Tue, 12 Jan 2021, Volodymyr Babchuk wrote:
> Hello community,
>
> Recently I was experimenting with context switching latencies in
> Xen. I want to share my findings and discuss some issues I have found.
>
> My target was to ensure that no more 30us passes between interrupt
> generation and handling in a guest VM. Main goal was to determine if
> Xen is ready for soft RT and to identify main issues that prevent
> this.
>
> TLDR: I was able to achieve mean 15us latency with tuned RTDS
> scheduler, disabled serial console and some hacks to the Xen
> toolstack. This was done on a setup with 1 PCPU.
>
> Test setup.
>
> ARMv8 platform. Namely Renesas Rcar H3 SoC on Salvator board.
>
> To accurately determine latency, I employed one of timer counter units
> (TMUs) available on the SoC. This is 32-bit timer with auto-reload,
> that can generate interrupt on underflow. I fed it with 33.275MHz
> clock, which gave me resolution of about 30ns. I programmed the timer
> to generate interrupt every 10ms. My ISR then read the current timer
> value and determined how much time passed since last underrun. This
> gave me time interval between IRQ generation and ISR invocation.
> Those values were collected and every 10 seconds statistics was
> calculated. There is an example of output from my Linux driver:

It looks like a solid approach to collect results, similar to the one we
used for the cache coloring work. Just make sure to collect very many
results.

A few questions: did you use a single physical CPU? Are you using
RTDS and scheduling 2 vCPUs on 1 pCPU? Is dom0 idle or busy? I take it
the results were measured in domU?


> [ 83.873162] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
> [ 94.136632] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
> [ 104.400098] rt_eval_tmu e6fc0000.tmu: Mean: 50 (1500 ns) stddev: 129 (3870 ns)
> [ 114.663828] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
> [ 124.927296] rt_eval_tmu e6fc0000.tmu: Mean: 56 (1680 ns) stddev: 183 (5490 ns)
>
> This is the baremetal Linux. And there is Dom0:
>
> [ 237.431003] rt_eval_tmu e6fc0000.tmu: Mean: 306 (9180 ns) stddev: 25 (750 ns)
> [ 247.694506] rt_eval_tmu e6fc0000.tmu: Mean: 302 (9060 ns) stddev: 17 (510 ns)
>
> Driver outputs both the raw timer value (eg. 4) and the same value
> scaled to nanoseconds (eg. 1320 ns). As you can see baremetal setup is
> much faster. But experiments showed that Linux does not provide
> consistent values, even when running in baremetal mode. You can see
> sporadic spikes in "std dev" values.

So baremetal IRQ latency is 1320-1680ns and Linux IRQ latency is
9060-9180ns. I am not surprised that Linux results are inconsistent but
I have a couple of observations:

- 9us is high for Linux
If the system is idle, the latency should be lower, around 2-3us. I
imagine you are actually running some sort of interference from dom0? Or
using RTDS and descheduling vCPUs?

- the stddev of 3870ns is high for baremetal
In the baremetal case the stddev should be minimal if the system is
idle.


> So my next step was to use proper RT OS to do the measurements. I
> chose Zephyr. My PR that adds Xen support to Zephyr can be found at
> [1]. Support for RCAR Gen3 is not upstreamed, but is present on my
> GitHub([2]). At [3] you can find the source code for application that
> does the latency measurements. It behaves exactly as my linux driver,
> but provides a bit more information:
>
> *** Booting Zephyr OS build zephyr-v2.4.0-2750-g0f2c858a39fc ***
> RT Eval app
>
> Counter freq is 33280000 Hz. Period is 30 ns
> Set alarm in 0 sec (332800 ticks)
> Mean: 600 (18000 ns) stddev: 3737 (112110 ns) above thr: 0% [265 (7950 ns) - 66955 (2008650 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
> Mean: 388 (11640 ns) stddev: 2059 (61770 ns) above thr: 0% [266 (7980 ns) - 58830 (1764900 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
> Mean: 358 (10740 ns) stddev: 1796 (53880 ns) above thr: 0% [265 (7950 ns) - 57780 (1733400 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
> ...
>
> So there you can see: mean time, standard deviation, % of interrupts
> that was processed above 30us threshold, minimum and maximum latency
> values for the current 10s run, global minimum and maximum.
>
> Zephyr running as baremetal showed very stable results (this is an
> older build, so no extended statistics there):
>
> ## Starting application at 0x480803C0 ...
> *** Booting Zephyr OS build zephyr-v2.4.0-1137-g5803ee1e8183 ***
> RT Eval app
>
> Counter freq is 33280000 Hz. Period is 30 ns
> Mean: 31 (930 ns) stddev: 0 (0 ns)
> Mean: 31 (930 ns) stddev: 0 (0 ns)
> Mean: 31 (930 ns) stddev: 0 (0 ns)
> Mean: 31 (930 ns) stddev: 0 (0 ns)
> Mean: 31 (930 ns) stddev: 0 (0 ns)
> Mean: 31 (930 ns) stddev: 0 (0 ns)
> ...
>
> As Zephyr provided stable readouts with no jitter, I used it to do all
> subsequent measurements.

I am a bit confused here. Looking at the numbers above the stddev is
112110 ns in the first instance. That is pretty high. Am I looking at
the wrong numbers?



> IMPORTANT! All subsequent tests was conducted with only 1 CPU core
> enabled. My goal was to ensure that system can timely react to an
> external interrupt even under load.

All right. FYI I have no frame of reference for 2 vCPUs on 1 pCPU; all
my tests were done with 1vCPU <-> 1pCPU and the null scheduler.


> Test results and latency sources
>
> As you can see, baremetal OS provides latency of ~0.9us without any
> deviations. The same code running as DomU on idle system shows mean
> latency of 12us and deviation of about 10us. Range of latencies in a
> 10s batch can vary from 8us to 25us. This fits into required 30us
> threshold, so no big issue there.
>
> But this worsens under certain conditions.
>
> 1. Serial console. RCAR serial driver (scif) works in synchronous
> mode, so any heavy serial console output leads to higher
> latency. Tests shows mean latency of 1000us and deviation of 1332
> us. 54% of interrupts are handled outside of 30us period. Real
> values may be even higher, because in some cases timer would do
> full cycle and instead of say 11ms ISR would read 11 % 10 = 1ms
> latency. I tried to enable asynchronous mode for the serial
> console. This made things better, but it lead to gaps in output, so
> I just turned the serial console off completely.

That's very interesting. I wonder if other serial drivers would cause
similar issues, e.g. PL011.


> 2. RTDS scheduler. With console disabled, things like "hexdump -v
> /dev/zero" didn't affected the latency so badly, but anyways,
> sometimes I got ~600us spikes. This is not a surprise, because of
> default RTDS configuration. I changed period for DomU from default
> 10ms to 100us and things got better: with Dom0 burning CPU I am
> rarely getting max latency of about ~30us with mean latency of ~9us
> and deviation of ~0.5us. On other hand, when I tried to set period
> to 30us, max latency rose up to ~60us.

This is very interesting too. Did you get any spikes with the period
set to 100us? It would be fantastic if there were none.


> 3. Huge latency spike during domain creation. I conducted some
> additional tests, including use of PV drivers, but this didn't
> affected the latency in my "real time" domain. But attempt to
> create another domain with relatively large memory size of 2GB led
> to huge spike in latency. Debugging led to this call path:
>
> XENMEM_populate_physmap -> populate_physmap() ->
> alloc_domheap_pages() -> alloc_heap_pages()-> huge
> "for ( i = 0; i < (1 << order); i++ )" loop.
>
> This loops handles struct page* for every one of 262144 pages that
> was allocated by calling populate_physmap(). This function is not
> preemptible and it takes about a second to complete on my
> setup. Needless to say that it completely screws any responsiveness
> of the hypervisor.
>
> I managed to overcome the issue #3 by commenting out all calls to
> populate_one_size() except the populate_one_size(PFN_4K_SHIFT) in
> xg_dom_arm.c. This lengthened domain construction, but my "RT" domain
> didn't experienced so big latency issues. Apparently all other
> hypercalls which are used during domain creation are either fast or
> preemptible. No doubts that my hack lead to page tables inflation and
> overall performance drop.

I think we need to follow this up and fix this. Maybe just by adding
a hypercall continuation to the loop.
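
For reference, the usual shape of such a preemption point, simplified
from the XENMEM_populate_physmap path in xen/common/memory.c (written
from memory, so the details may differ):

/*
 * Simplified illustration of the existing extent-level preemption: the
 * expensive per-page loop reported above is inside alloc_heap_pages(),
 * below this check, so a single huge extent still runs to completion
 * with no preemption point.
 */
for ( i = start_extent; i < nr_extents; i++ )
{
    if ( i != start_extent && hypercall_preempt_check() )
    {
        /* Encode how far we got into the hypercall and restart later. */
        rc = hypercall_create_continuation(
                 __HYPERVISOR_memory_op, "lh",
                 XENMEM_populate_physmap | (i << MEMOP_EXTENT_SHIFT),
                 arg);
        break;
    }

    /* ... allocate and map one extent of (1 << extent_order) pages ... */
}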



> Conclusions.
>
> I wanted to:
>
> 1. Share my approach to IRQ latency measurements.
>
> 2. Share results of said measurements.

Thank you very much. I asked a couple of questions above to better
understand your setup.


> 3. Ask for other potential scenarios, which can affect IRQ latency so I
> can test them on my setup.

I'll follow up once I understood the above a bit better.


> 4. Discuss the issue with alloc_domheap_pages(). I brooded over ways
> to make it preemptible. But it is a) located deep in call chain and
> b) used not only by hypercalls. So I can't see an easy way to make
> it preemptible. The best solution would be of course to make the
> whole hypervisor preemptible. I even heard about some work long
> time ago...
>
> 5. Show that under certain conditions Xen can be used in a soft RT
> system. Of course, additional tests are needed, but with some minor
> tweaks it can provide ~15us IRQ latency on a typical workload.
>
> [1] https://github.com/zephyrproject-rtos/zephyr/pull/30164
> [2] https://github.com/lorc/zephyr/tree/renesas_salvator_x
> [3] https://github.com/lorc/zephyr/blob/rt_eval/samples/rt_eval/src/main.c
Re: IRQ latency measurements in hypervisor
Hi Volodymyr, Stefano,

On 14/01/2021 23:33, Stefano Stabellini wrote:
> + Bertrand, Andrew (see comment on alloc_heap_pages())

Long running hypercalls are usually considered security issues.

In this case, only the control domain can issue a large memory
allocation (2GB at a time). A guest would only be able to allocate 2MB
at a time, so from the numbers below, it would only take 1ms max.

So I think we are fine here. Next time you find a large loop, please
provide an explanation of why it is not a security issue (e.g. it
cannot be used by guests), or send an email to the Security Team if in
doubt.

>> ARMv8 platform. Namely Renesas Rcar H3 SoC on Salvator board.

Which core is it?

>> To accurately determine latency, I employed one of timer counter units
>> (TMUs) available on the SoC. This is 32-bit timer with auto-reload,
>> that can generate interrupt on underflow. I fed it with 33.275MHz
>> clock, which gave me resolution of about 30ns. I programmed the timer
>> to generate interrupt every 10ms. My ISR then read the current timer
>> value and determined how much time passed since last underrun. This
>> gave me time interval between IRQ generation and ISR invocation.
>> Those values were collected and every 10 seconds statistics was
>> calculated. There is an example of output from my Linux driver:
>
> It looks like a solid approach to collect results, similar to the one we
> used for the cache coloring work. Just make sure to collect very many
> results.
>
> A few of questions: did you use a single physical CPU? Are you using
> RTDS and schedule 2 vCPU on 1 pCPU? Is dom0 idle or busy? I take the
> results were measured in domU?
>
>
>> [ 83.873162] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> [ 94.136632] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> [ 104.400098] rt_eval_tmu e6fc0000.tmu: Mean: 50 (1500 ns) stddev: 129 (3870 ns)
>> [ 114.663828] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> [ 124.927296] rt_eval_tmu e6fc0000.tmu: Mean: 56 (1680 ns) stddev: 183 (5490 ns)
>>
>> This is the baremetal Linux. And there is Dom0:
>>
>> [ 237.431003] rt_eval_tmu e6fc0000.tmu: Mean: 306 (9180 ns) stddev: 25 (750 ns)
>> [ 247.694506] rt_eval_tmu e6fc0000.tmu: Mean: 302 (9060 ns) stddev: 17 (510 ns)
>>
>> Driver outputs both the raw timer value (eg. 4) and the same value
>> scaled to nanoseconds (eg. 1320 ns). As you can see baremetal setup is
>> much faster. But experiments showed that Linux does not provide
>> consistent values, even when running in baremetal mode. You can see
>> sporadic spikes in "std dev" values.
>
> So baremetal IRQ latency is 1320-1680ns and Linux IRQ latency is
> 9060-9180ns. I am not surprised that Linux results are inconsistent but
> I have a couple of observations:
>
> - 9us is high for Linux
> If the system is idle, the latency should be lower, around 2-3us. I
> imagine you are actually running some sort of interference from dom0? Or
> using RTDS and descheduling vCPUs?
>
> - the stddev of 3870ns is high for baremetal
> In the baremetal case the stddev should be minimal if the system is
> idle.
>
>
>> So my next step was to use proper RT OS to do the measurements. I
>> chose Zephyr. My PR that adds Xen support to Zephyr can be found at
>> [1]. Support for RCAR Gen3 is not upstreamed, but is present on my
>> GitHub([2]). At [3] you can find the source code for application that
>> does the latency measurements. It behaves exactly as my linux driver,
>> but provides a bit more information:
>>
>> *** Booting Zephyr OS build zephyr-v2.4.0-2750-g0f2c858a39fc ***
>> RT Eval app
>>
>> Counter freq is 33280000 Hz. Period is 30 ns
>> Set alarm in 0 sec (332800 ticks)
>> Mean: 600 (18000 ns) stddev: 3737 (112110 ns) above thr: 0% [265 (7950 ns) - 66955 (2008650 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> Mean: 388 (11640 ns) stddev: 2059 (61770 ns) above thr: 0% [266 (7980 ns) - 58830 (1764900 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> Mean: 358 (10740 ns) stddev: 1796 (53880 ns) above thr: 0% [265 (7950 ns) - 57780 (1733400 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> ...
>>
>> So there you can see: mean time, standard deviation, % of interrupts
>> that was processed above 30us threshold, minimum and maximum latency
>> values for the current 10s run, global minimum and maximum.
>>
>> Zephyr running as baremetal showed very stable results (this is an
>> older build, so no extended statistics there):
>>
>> ## Starting application at 0x480803C0 ...
>> *** Booting Zephyr OS build zephyr-v2.4.0-1137-g5803ee1e8183 ***
>> RT Eval app
>>
>> Counter freq is 33280000 Hz. Period is 30 ns
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> ...
>>
>> As Zephyr provided stable readouts with no jitter, I used it to do all
>> subsequent measurements.
>
> I am a bit confused here. Looking at the numbers above the stddev is
> 112110 ns in the first instance. That is pretty high. Am I looking at
> the wrong numbers?
>
>
>
>> IMPORTANT! All subsequent tests was conducted with only 1 CPU core
>> enabled. My goal was to ensure that system can timely react to an
>> external interrupt even under load.
>
> All right. FYI I have no frame of reference for 2 vCPUs on 1 pCPUs, all
> my tests were done with 1vCPU <-> 1pCPU and the null scheduler.
>
>
>> Test results and latency sources
>>
>> As you can see, baremetal OS provides latency of ~0.9us without any
>> deviations. The same code running as DomU on idle system shows mean
>> latency of 12us and deviation of about 10us. Range of latencies in a
>> 10s batch can vary from 8us to 25us. This fits into required 30us
>> threshold, so no big issue there.
>>
>> But this worsens under certain conditions.
>>
>> 1. Serial console. RCAR serial driver (scif) works in synchronous
>> mode, so any heavy serial console output leads to higher
>> latency. Tests shows mean latency of 1000us and deviation of 1332
>> us. 54% of interrupts are handled outside of 30us period. Real
>> values may be even higher, because in some cases timer would do
>> full cycle and instead of say 11ms ISR would read 11 % 10 = 1ms
>> latency. I tried to enable asynchronous mode for the serial
>> console. This made things better, but it lead to gaps in output, so
>> I just turned the serial console off completely.
>
> That's very interesting. I wonder if other serial drivers would cause
> similar issues, e.g. PL011.
>
>
>> 2. RTDS scheduler. With console disabled, things like "hexdump -v
>> /dev/zero" didn't affected the latency so badly, but anyways,
>> sometimes I got ~600us spikes. This is not a surprise, because of
>> default RTDS configuration. I changed period for DomU from default
>> 10ms to 100us and things got better: with Dom0 burning CPU I am
>> rarely getting max latency of about ~30us with mean latency of ~9us
>> and deviation of ~0.5us. On other hand, when I tried to set period
>> to 30us, max latency rose up to ~60us.
On a related topic, I am not entirely sure that all the hypercalls
would be able to fit in the 100us slice. In particular, the ones which
touch the P2M and do memory allocation.

>
> This is very interestingi too. Did you get any spikes with the period
> set to 100us? It would be fantastic if there were none.
>
>
>> 3. Huge latency spike during domain creation. I conducted some
>> additional tests, including use of PV drivers, but this didn't
>> affected the latency in my "real time" domain. But attempt to
>> create another domain with relatively large memory size of 2GB led
>> to huge spike in latency. Debugging led to this call path:
>>
>> XENMEM_populate_physmap -> populate_physmap() ->
>> alloc_domheap_pages() -> alloc_heap_pages()-> huge
>> "for ( i = 0; i < (1 << order); i++ )" loop.

There are two for loops in alloc_heap_pages() using this syntax. Which
one are you referring to?

>>
>> This loops handles struct page* for every one of 262144 pages that
>> was allocated by calling populate_physmap().

Looking at the domain creation code, 2GB will be split into two
extents of 1GB. This means there will be at least a preemption point
between the allocation of the two extents.

That said, this would only halve the time. So there might be more
optimization to do...

>> This function is not
>> preemptible and it takes about a second to complete on my
>> setup. Needless to say that it completely screws any responsiveness
>> of the hypervisor.
>>
>> I managed to overcome the issue #3 by commenting out all calls to
>> populate_one_size() except the populate_one_size(PFN_4K_SHIFT) in
>> xg_dom_arm.c. This lengthened domain construction, but my "RT" domain
>> didn't experienced so big latency issues. Apparently all other
>> hypercalls which are used during domain creation are either fast or
>> preemptible. No doubts that my hack lead to page tables inflation and
>> overall performance drop.
>
> I think we need to follow this up and fix this. Maybe just by adding
> a hypercall continuation to the loop.

When I read "hypercall continuation", I read we will return to the guest
context so it can process interrupts and potentially switch to another task.

This means that the guest could issue a second populate_physmap() from
the vCPU. Therefore any restart information should be part of the
hypercall parameters. So far, I don't see how this would be possible.

Even if we overcome that part, this can be easily abused by a guest,
as the memory is not yet accounted to the domain. Imagine a guest that
never requests the continuation of the populate_physmap(). So we would
need to block the vCPU until the allocation is finished.

I think the first step is we need to figure out which part of the
allocation is slow (see my question above). From there, we can figure
out if there is a way to reduce the impact.

Cheers,

--
Julien Grall
Re: IRQ latency measurements in hypervisor
Hi Stefano,

Stefano Stabellini writes:

[...]

>>
>> ARMv8 platform. Namely Renesas Rcar H3 SoC on Salvator board.
>>
>> To accurately determine latency, I employed one of timer counter units
>> (TMUs) available on the SoC. This is 32-bit timer with auto-reload,
>> that can generate interrupt on underflow. I fed it with 33.275MHz
>> clock, which gave me resolution of about 30ns. I programmed the timer
>> to generate interrupt every 10ms. My ISR then read the current timer
>> value and determined how much time passed since last underrun. This
>> gave me time interval between IRQ generation and ISR invocation.
>> Those values were collected and every 10 seconds statistics was
>> calculated. There is an example of output from my Linux driver:
>
> It looks like a solid approach to collect results, similar to the one we
> used for the cache coloring work. Just make sure to collect very many
> results.
>
> A few of questions: did you use a single physical CPU? Are you using
> RTDS and schedule 2 vCPU on 1 pCPU? Is dom0 idle or busy? I take the
> results were measured in domU?
Yes, I used a single pCPU, which ran 1-2 vCPUs depending on the setup.
At first this was only Dom0 with 1 vCPU, but in most cases it was Dom0
+ a Zephyr DomU, each with 1 vCPU.

I quickly determined that the 1 vCPU per 1 pCPU case meets my
requirements. On the other hand, our real use cases will always have
more vCPUs than pCPUs, so I was interested in the 2 vCPUs - 1 pCPU
case.

>
>> [ 83.873162] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> [ 94.136632] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> [ 104.400098] rt_eval_tmu e6fc0000.tmu: Mean: 50 (1500 ns) stddev: 129 (3870 ns)
>> [ 114.663828] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> [ 124.927296] rt_eval_tmu e6fc0000.tmu: Mean: 56 (1680 ns) stddev: 183 (5490 ns)
>>
>> This is the baremetal Linux. And there is Dom0:
>>
>> [ 237.431003] rt_eval_tmu e6fc0000.tmu: Mean: 306 (9180 ns) stddev: 25 (750 ns)
>> [ 247.694506] rt_eval_tmu e6fc0000.tmu: Mean: 302 (9060 ns) stddev: 17 (510 ns)
>>
>> Driver outputs both the raw timer value (eg. 4) and the same value
>> scaled to nanoseconds (eg. 1320 ns). As you can see baremetal setup is
>> much faster. But experiments showed that Linux does not provide
>> consistent values, even when running in baremetal mode. You can see
>> sporadic spikes in "std dev" values.
>
> So baremetal IRQ latency is 1320-1680ns and Linux IRQ latency is
> 9060-9180ns. I am not surprised that Linux results are inconsistent but
> I have a couple of observations:
>
> - 9us is high for Linux
> If the system is idle, the latency should be lower, around 2-3us. I
> imagine you are actually running some sort of interference from dom0? Or
> using RTDS and descheduling vCPUs?

9us was in the idle state. Interestingly enough, I got a latency of
3us while Dom0 was doing some CPU-intensive tasks. So under load the
latency is lower than in the idle state. I didn't investigate this, so
I can't tell you what causes this behavior.

>
> - the stddev of 3870ns is high for baremetal
> In the baremetal case the stddev should be minimal if the system is
> idle.

This is what I expected too. But nevertheless there were spikes. I
didn't investigate this either, so I can only speculate. My rootfs is
on NFS, so maybe the network driver caused these spikes.

>
>
>> So my next step was to use proper RT OS to do the measurements. I
>> chose Zephyr. My PR that adds Xen support to Zephyr can be found at
>> [1]. Support for RCAR Gen3 is not upstreamed, but is present on my
>> GitHub([2]). At [3] you can find the source code for application that
>> does the latency measurements. It behaves exactly as my linux driver,
>> but provides a bit more information:
>>
>> *** Booting Zephyr OS build zephyr-v2.4.0-2750-g0f2c858a39fc ***
>> RT Eval app
>>
>> Counter freq is 33280000 Hz. Period is 30 ns
>> Set alarm in 0 sec (332800 ticks)
>> Mean: 600 (18000 ns) stddev: 3737 (112110 ns) above thr: 0% [265 (7950 ns) - 66955 (2008650 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> Mean: 388 (11640 ns) stddev: 2059 (61770 ns) above thr: 0% [266 (7980 ns) - 58830 (1764900 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> Mean: 358 (10740 ns) stddev: 1796 (53880 ns) above thr: 0% [265 (7950 ns) - 57780 (1733400 ns)] global [265 (7950 ns) 66955 (2008650 ns)]

This is Zephyr running as DomU.

>> ...
>>
>> So there you can see: mean time, standard deviation, % of interrupts
>> that was processed above 30us threshold, minimum and maximum latency
>> values for the current 10s run, global minimum and maximum.
>>
>> Zephyr running as baremetal showed very stable results (this is an
>> older build, so no extended statistics there):
>>
>> ## Starting application at 0x480803C0 ...
>> *** Booting Zephyr OS build zephyr-v2.4.0-1137-g5803ee1e8183 ***
>> RT Eval app
>>
>> Counter freq is 33280000 Hz. Period is 30 ns
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)

And this is Zephyr running as baremetal.

>> ...
>>
>> As Zephyr provided stable readouts with no jitter, I used it to do all
>> subsequent measurements.
>
> I am a bit confused here. Looking at the numbers above the stddev is
> 112110 ns in the first instance. That is pretty high. Am I looking at
> the wrong numbers?

I added some clarification above. As for the 112110ns in the very
first instance - I always ignored the first instance, assuming that
things need to settle after the domain is created.

But your comment is actually correct: what exactly should "settle"?
The domain is already created. Everything should run smoothly. So this
is worth investigating.

>> IMPORTANT! All subsequent tests was conducted with only 1 CPU core
>> enabled. My goal was to ensure that system can timely react to an
>> external interrupt even under load.
>
> All right. FYI I have no frame of reference for 2 vCPUs on 1 pCPUs, all
> my tests were done with 1vCPU <-> 1pCPU and the null scheduler.

As I said, I had no issues with 1vCPU <-> 1pCPU setup, so I quickly
moved to cases which had issues.

>> Test results and latency sources
>>
>> As you can see, baremetal OS provides latency of ~0.9us without any
>> deviations. The same code running as DomU on idle system shows mean
>> latency of 12us and deviation of about 10us. Range of latencies in a
>> 10s batch can vary from 8us to 25us. This fits into required 30us
>> threshold, so no big issue there.
>>
>> But this worsens under certain conditions.
>>
>> 1. Serial console. RCAR serial driver (scif) works in synchronous
>> mode, so any heavy serial console output leads to higher
>> latency. Tests shows mean latency of 1000us and deviation of 1332
>> us. 54% of interrupts are handled outside of 30us period. Real
>> values may be even higher, because in some cases timer would do
>> full cycle and instead of say 11ms ISR would read 11 % 10 = 1ms
>> latency. I tried to enable asynchronous mode for the serial
>> console. This made things better, but it lead to gaps in output, so
>> I just turned the serial console off completely.
>
> That's very interesting. I wonder if other serial drivers would cause
> similar issues, e.g. PL011.

They should. This behavior is programmed in serial.c. A driver can
enable async mode by calling serial_async_transmit(). As far as I can
see, only the ns16550 driver does this.

Maybe you didn't see problems there because you had more pCPUs enabled
and Xen used some other pCPU to do the UART operations.

>> 2. RTDS scheduler. With console disabled, things like "hexdump -v
>> /dev/zero" didn't affected the latency so badly, but anyways,
>> sometimes I got ~600us spikes. This is not a surprise, because of
>> default RTDS configuration. I changed period for DomU from default
>> 10ms to 100us and things got better: with Dom0 burning CPU I am
>> rarely getting max latency of about ~30us with mean latency of ~9us
>> and deviation of ~0.5us. On other hand, when I tried to set period
>> to 30us, max latency rose up to ~60us.
>
> This is very interestingi too. Did you get any spikes with the period
> set to 100us? It would be fantastic if there were none.

As far as I can remember, there weren't any. At least in my setup, and
with the workaround for the alloc_heap_pages() issue. Of course, I
didn't cover all the possible use cases. But in the settled state,
when all domains are created and are just doing their own tasks, there
were no spikes.

>
>> 3. Huge latency spike during domain creation. I conducted some
>> additional tests, including use of PV drivers, but this didn't
>> affected the latency in my "real time" domain. But attempt to
>> create another domain with relatively large memory size of 2GB led
>> to huge spike in latency. Debugging led to this call path:
>>
>> XENMEM_populate_physmap -> populate_physmap() ->
>> alloc_domheap_pages() -> alloc_heap_pages()-> huge
>> "for ( i = 0; i < (1 << order); i++ )" loop.
>>
>> This loops handles struct page* for every one of 262144 pages that
>> was allocated by calling populate_physmap(). This function is not
>> preemptible and it takes about a second to complete on my
>> setup. Needless to say that it completely screws any responsiveness
>> of the hypervisor.
>>
>> I managed to overcome the issue #3 by commenting out all calls to
>> populate_one_size() except the populate_one_size(PFN_4K_SHIFT) in
>> xg_dom_arm.c. This lengthened domain construction, but my "RT" domain
>> didn't experienced so big latency issues. Apparently all other
>> hypercalls which are used during domain creation are either fast or
>> preemptible. No doubts that my hack lead to page tables inflation and
>> overall performance drop.
>
> I think we need to follow this up and fix this. Maybe just by adding
> a hypercall continuation to the loop.

As Julien pointed out, that is hardly possible with the current
code. But we need to find some way to fix this.

[...]

--
Volodymyr Babchuk at EPAM
Re: IRQ latency measurements in hypervisor
Hi Julien,

Julien Grall writes:

> Hi Volodymyr, Stefano,
>
> On 14/01/2021 23:33, Stefano Stabellini wrote:
>> + Bertrand, Andrew (see comment on alloc_heap_pages())
>
> Long running hypercalls are usually considered security issues.
>
> In this case, only the control domain can issue large memory
> allocation (2GB at a time). Guest, would only be able to allocate 2MB
> at the time, so from the numbers below, it would only take 1ms max.
>
> So I think we are fine here. Next time, you find a large loop, please
> provide an explanation why they are not security issues (e.g. cannot
> be used by guests) or send an email to the Security Team in doubt.

Sure. In this case I took into account that only the control domain
can issue this call; I just didn't state this explicitly. Next time I
will.

>>> ARMv8 platform. Namely Renesas Rcar H3 SoC on Salvator board.
>
> Which core is it?

Cortex A57

[...]

>>> 2. RTDS scheduler. With console disabled, things like "hexdump -v
>>> /dev/zero" didn't affected the latency so badly, but anyways,
>>> sometimes I got ~600us spikes. This is not a surprise, because of
>>> default RTDS configuration. I changed period for DomU from default
>>> 10ms to 100us and things got better: with Dom0 burning CPU I am
>>> rarely getting max latency of about ~30us with mean latency of ~9us
>>> and deviation of ~0.5us. On other hand, when I tried to set period
>>> to 30us, max latency rose up to ~60us.
> In a related topic, I am not entirely sure that all the hypercalls
> would be able to fit in the 100us slice. In particular, the one which
> are touching the P2M and do memory allocation.

I agree with you. In my experiments I didn't find a case with a
long-running hypercall (apart from the mentioned populate_physmap),
but of course there could be cases with such calls.

>
>> This is very interestingi too. Did you get any spikes with the
>> period
>> set to 100us? It would be fantastic if there were none.
>>
>>> 3. Huge latency spike during domain creation. I conducted some
>>> additional tests, including use of PV drivers, but this didn't
>>> affected the latency in my "real time" domain. But attempt to
>>> create another domain with relatively large memory size of 2GB led
>>> to huge spike in latency. Debugging led to this call path:
>>>
>>> XENMEM_populate_physmap -> populate_physmap() ->
>>> alloc_domheap_pages() -> alloc_heap_pages()-> huge
>>> "for ( i = 0; i < (1 << order); i++ )" loop.
>
> There are two for loops in alloc_heap_pages() using this syntax. Which
> one are your referring to?

I did some tracing with Lauterbach. It pointed to the first loop, and
especially to the flush_page_to_ram() call, if I remember correctly.

>>>
>>> This loops handles struct page* for every one of 262144 pages that
>>> was allocated by calling populate_physmap().
>
> Looking at the domain creation code, 2GB will be split in two extents
> of 1GB. This means, there will be at least a preemption point between
> the allocation of the two extents.

Yes. 1GB is exactly 262144 4KB pages.

[...]

>>> I managed to overcome the issue #3 by commenting out all calls to
>>> populate_one_size() except the populate_one_size(PFN_4K_SHIFT) in
>>> xg_dom_arm.c. This lengthened domain construction, but my "RT" domain
>>> didn't experienced so big latency issues. Apparently all other
>>> hypercalls which are used during domain creation are either fast or
>>> preemptible. No doubts that my hack lead to page tables inflation and
>>> overall performance drop.
>> I think we need to follow this up and fix this. Maybe just by adding
>> a hypercall continuation to the loop.
>
> When I read "hypercall continuation", I read we will return to the
> guest context so it can process interrupts and potentially switch to
> another task.
>
> This means that the guest could issue a second populate_physmap() from
> the vCPU. Therefore any restart information should be part of the
> hypercall parameters. So far, I don't see how this would be possible.
>
> Even if we overcome that part, this can be easily abuse by a guest as
> the memory is not yet accounted to the domain. Imagine a guest that
> never request the continuation of the populate_physmap(). So we would
> need to block the vCPU until the allocation is finished.

Moreover, most of alloc_heap_pages() sits under a spinlock, so the
first step would be to split this function into smaller atomic parts.

> I think the first step is we need to figure out which part of the
> allocation is slow (see my question above). From there, we can figure
> out if there is a way to reduce the impact.

I'll do more tracing and will return with more accurate numbers. But
as far as I can see, any loop over 262144 pages will take some time...

--
Volodymyr Babchuk at EPAM
Re: IRQ latency measurements in hypervisor
On 15/01/2021 15:45, Volodymyr Babchuk wrote:
>
> Hi Julien,
>
> Julien Grall writes:
>
>> Hi Volodymyr, Stefano,
>>
>> On 14/01/2021 23:33, Stefano Stabellini wrote:
>>> + Bertrand, Andrew (see comment on alloc_heap_pages())
>>
>> Long running hypercalls are usually considered security issues.
>>
>> In this case, only the control domain can issue large memory
>> allocation (2GB at a time). Guest, would only be able to allocate 2MB
>> at the time, so from the numbers below, it would only take 1ms max.
>>
>> So I think we are fine here. Next time, you find a large loop, please
>> provide an explanation why they are not security issues (e.g. cannot
>> be used by guests) or send an email to the Security Team in doubt.
>
> Sure. In this case I took into account that only control domain can
> issue this call, I just didn't stated this explicitly. Next time will
> do.

I am afraid that's not correct. A guest can request to populate a
region. This is used, for instance, in the ballooning case.

The main difference is that a non-privileged guest will not be able to
do an allocation larger than 2MB.

[...]

>>> This is very interestingi too. Did you get any spikes with the
>>> period
>>> set to 100us? It would be fantastic if there were none.
>>>
>>>> 3. Huge latency spike during domain creation. I conducted some
>>>> additional tests, including use of PV drivers, but this didn't
>>>> affected the latency in my "real time" domain. But attempt to
>>>> create another domain with relatively large memory size of 2GB led
>>>> to huge spike in latency. Debugging led to this call path:
>>>>
>>>> XENMEM_populate_physmap -> populate_physmap() ->
>>>> alloc_domheap_pages() -> alloc_heap_pages()-> huge
>>>> "for ( i = 0; i < (1 << order); i++ )" loop.
>>
>> There are two for loops in alloc_heap_pages() using this syntax. Which
>> one are your referring to?
>
> I did some tracing with Lautrebach. It pointed to the first loop and
> especially to flush_page_to_ram() call if I remember correctly.

Thanks, I am not entirely surprised, because we are cleaning and
invalidating the region line by line and across all the CPUs.

If we assume a 128-byte cacheline, we will need to issue 32 cache
instructions per page. This is going to involve quite a bit of traffic
on the system.
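
A quick back-of-the-envelope calculation (assuming 4KB pages and the
128-byte line size above):

/* ~8.4 million clean+invalidate operations for a single 1GB extent,
 * all issued back to back from one non-preemptible loop. */
#include <stdio.h>

int main(void)
{
    unsigned long pages_per_gb = 1UL << 18;      /* 262144 4KB pages */
    unsigned long lines_per_page = 4096 / 128;   /* 32 cache lines */

    printf("%lu cache maintenance ops per GB\n",
           pages_per_gb * lines_per_page);
    return 0;
}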

One possibility would be to defer the cache flush until after the
domain is created and use the hypercall XEN_DOMCTL_cacheflush to issue
the flush.

Note that XEN_DOMCTL_cacheflush would need some modification to be
preemptible. But at least it will work on GFNs, which are easier to
track.

>>>> I managed to overcome the issue #3 by commenting out all calls to
>>>> populate_one_size() except the populate_one_size(PFN_4K_SHIFT) in
>>>> xg_dom_arm.c. This lengthened domain construction, but my "RT" domain
>>>> didn't experienced so big latency issues. Apparently all other
>>>> hypercalls which are used during domain creation are either fast or
>>>> preemptible. No doubts that my hack lead to page tables inflation and
>>>> overall performance drop.
>>> I think we need to follow this up and fix this. Maybe just by adding
>>> a hypercall continuation to the loop.
>>
>> When I read "hypercall continuation", I read we will return to the
>> guest context so it can process interrupts and potentially switch to
>> another task.
>>
>> This means that the guest could issue a second populate_physmap() from
>> the vCPU. Therefore any restart information should be part of the
>> hypercall parameters. So far, I don't see how this would be possible.
>>
>> Even if we overcome that part, this can be easily abuse by a guest as
>> the memory is not yet accounted to the domain. Imagine a guest that
>> never request the continuation of the populate_physmap(). So we would
>> need to block the vCPU until the allocation is finished.
>
> Moreover, most of the alloc_heap_pages() sits under spinlock, so first
> step would be to split this function into smaller atomic parts.

Do you have any suggestions on how to split it?

>
>> I think the first step is we need to figure out which part of the
>> allocation is slow (see my question above). From there, we can figure
>> out if there is a way to reduce the impact.
>
> I'll do more tracing and will return with more accurate numbers.
> But as far as I can see, any loop on 262144 pages will take some time...

It really depends on the content of the loop. On any modern processor,
you are very likely not going to notice a loop that just updates a
flag.

However, you are likely going to see an impact if your loop is going
to clean & invalidate the cache for each page.

Cheers,

--
Julien Grall
Re: IRQ latency measurements in hypervisor
On Fri, 15 Jan 2021, Volodymyr Babchuk wrote:
> >> ARMv8 platform. Namely Renesas Rcar H3 SoC on Salvator board.
> >>
> >> To accurately determine latency, I employed one of timer counter units
> >> (TMUs) available on the SoC. This is 32-bit timer with auto-reload,
> >> that can generate interrupt on underflow. I fed it with 33.275MHz
> >> clock, which gave me resolution of about 30ns. I programmed the timer
> >> to generate interrupt every 10ms. My ISR then read the current timer
> >> value and determined how much time passed since last underrun. This
> >> gave me time interval between IRQ generation and ISR invocation.
> >> Those values were collected and every 10 seconds statistics was
> >> calculated. There is an example of output from my Linux driver:
> >
> > It looks like a solid approach to collect results, similar to the one we
> > used for the cache coloring work. Just make sure to collect very many
> > results.
> >
> > A few of questions: did you use a single physical CPU? Are you using
> > RTDS and schedule 2 vCPU on 1 pCPU? Is dom0 idle or busy? I take the
> > results were measured in domU?
> Yes, I used single pCPU, which ran 1-2 vCPUs depending on setup. At
> first, this was only Dom0 with 1 vCPU, but in most cases it was Dom0 +
> Zephyr DomU each with 1 vCPU.
>
> I quickly determined that 1 vCPU per 1 pCPU case meets my
> requirements. On other hand, our real cases will always have more vCPUs
> than pCPUs, so I was interested in 2 vCPUs - 1 pCPU case.
>
> >
> >> [ 83.873162] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
> >> [ 94.136632] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
> >> [ 104.400098] rt_eval_tmu e6fc0000.tmu: Mean: 50 (1500 ns) stddev: 129 (3870 ns)
> >> [ 114.663828] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
> >> [ 124.927296] rt_eval_tmu e6fc0000.tmu: Mean: 56 (1680 ns) stddev: 183 (5490 ns)
> >>
> >> This is the baremetal Linux. And there is Dom0:
> >>
> >> [ 237.431003] rt_eval_tmu e6fc0000.tmu: Mean: 306 (9180 ns) stddev: 25 (750 ns)
> >> [ 247.694506] rt_eval_tmu e6fc0000.tmu: Mean: 302 (9060 ns) stddev: 17 (510 ns)
> >>
> >> Driver outputs both the raw timer value (eg. 4) and the same value
> >> scaled to nanoseconds (eg. 1320 ns). As you can see baremetal setup is
> >> much faster. But experiments showed that Linux does not provide
> >> consistent values, even when running in baremetal mode. You can see
> >> sporadic spikes in "std dev" values.
> >
> > So baremetal IRQ latency is 1320-1680ns and Linux IRQ latency is
> > 9060-9180ns. I am not surprised that Linux results are inconsistent but
> > I have a couple of observations:
> >
> > - 9us is high for Linux
> > If the system is idle, the latency should be lower, around 2-3us. I
> > imagine you are actually running some sort of interference from dom0? Or
> > using RTDS and descheduling vCPUs?
>
> 9us was in idle state. Interestingly enough, I got latency if 3us while
> Dom0 was doing some CPU-intensive tasks. So, under load latency is lower
> than in idle state. I didn't investigated this, so I can't tell you what
> causes this behavior.

Did you use vwfi=native? You should definitely be able to see ~3us
without interference and with 1vCPU <-> 1pCPU


> > - the stddev of 3870ns is high for baremetal
> > In the baremetal case the stddev should be minimal if the system is
> > idle.
>
> This is what I expected too. But nevertheless there was spikes. I didn't
> investigated this as well, so I can only speculate there. My rootfs is
> on NFS, so maybe network driver caused this spikes.

Yeah, maybe it would be best to reduce the sources of possible spikes
and get rid of NFS.


> >
> >
> >> So my next step was to use proper RT OS to do the measurements. I
> >> chose Zephyr. My PR that adds Xen support to Zephyr can be found at
> >> [1]. Support for RCAR Gen3 is not upstreamed, but is present on my
> >> GitHub([2]). At [3] you can find the source code for application that
> >> does the latency measurements. It behaves exactly as my linux driver,
> >> but provides a bit more information:
> >>
> >> *** Booting Zephyr OS build zephyr-v2.4.0-2750-g0f2c858a39fc ***
> >> RT Eval app
> >>
> >> Counter freq is 33280000 Hz. Period is 30 ns
> >> Set alarm in 0 sec (332800 ticks)
> >> Mean: 600 (18000 ns) stddev: 3737 (112110 ns) above thr: 0% [265 (7950 ns) - 66955 (2008650 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
> >> Mean: 388 (11640 ns) stddev: 2059 (61770 ns) above thr: 0% [266 (7980 ns) - 58830 (1764900 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
> >> Mean: 358 (10740 ns) stddev: 1796 (53880 ns) above thr: 0% [265 (7950 ns) - 57780 (1733400 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>
> This is Zephyr running as DomU.
>
> >> ...
> >>
> >> So there you can see: mean time, standard deviation, % of interrupts
> >> that was processed above 30us threshold, minimum and maximum latency
> >> values for the current 10s run, global minimum and maximum.
> >>
> >> Zephyr running as baremetal showed very stable results (this is an
> >> older build, so no extended statistics there):
> >>
> >> ## Starting application at 0x480803C0 ...
> >> *** Booting Zephyr OS build zephyr-v2.4.0-1137-g5803ee1e8183 ***
> >> RT Eval app
> >>
> >> Counter freq is 33280000 Hz. Period is 30 ns
> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>
> And this is Zephyr is running as baremetal.
>
> >> ...
> >>
> >> As Zephyr provided stable readouts with no jitter, I used it to do all
> >> subsequent measurements.
> >
> > I am a bit confused here. Looking at the numbers above the stddev is
> > 112110 ns in the first instance. That is pretty high. Am I looking at
> > the wrong numbers?
>
> I added some clarification above. As for 112110ns in the very first instance
> - I always ignored the first instance, assuming that things need to
> settle after domain being created.
>
> But your comment is actually correct: what exacelt should "settle"?
> Domain is already created. All should run smoothly. So, this is worth
> investigating.

It is fair to ignore the first 2 measurements.

However, the numbers above have a high stddev even at the 3rd
measurement: 53us is high and above the target 30us. I take it you
hadn't yet applied the alloc_heap_pages and serial workarounds?


> >> IMPORTANT! All subsequent tests was conducted with only 1 CPU core
> >> enabled. My goal was to ensure that system can timely react to an
> >> external interrupt even under load.
> >
> > All right. FYI I have no frame of reference for 2 vCPUs on 1 pCPUs, all
> > my tests were done with 1vCPU <-> 1pCPU and the null scheduler.
>
> As I said, I had no issues with 1vCPU <-> 1pCPU setup, so I quickly
> moved to cases which had issues.
>
> >> Test results and latency sources
> >>
> >> As you can see, baremetal OS provides latency of ~0.9us without any
> >> deviations. The same code running as DomU on idle system shows mean
> >> latency of 12us and deviation of about 10us. Range of latencies in a
> >> 10s batch can vary from 8us to 25us. This fits into required 30us
> >> threshold, so no big issue there.
> >>
> >> But this worsens under certain conditions.
> >>
> >> 1. Serial console. RCAR serial driver (scif) works in synchronous
> >> mode, so any heavy serial console output leads to higher
> >> latency. Tests shows mean latency of 1000us and deviation of 1332
> >> us. 54% of interrupts are handled outside of 30us period. Real
> >> values may be even higher, because in some cases timer would do
> >> full cycle and instead of say 11ms ISR would read 11 % 10 = 1ms
> >> latency. I tried to enable asynchronous mode for the serial
> >> console. This made things better, but it lead to gaps in output, so
> >> I just turned the serial console off completely.
> >
> > That's very interesting. I wonder if other serial drivers would cause
> > similar issues, e.g. PL011.
>
> They should. This behavior is programmed in serial.c. Driver can enable
> async mode calling serial_async_transmit(). As I can see, only ns16550
> driver does this.
>
> Maybe you didn't saw problems there because you had more pCPU enabled
> and Xen used some other pCPU to do UART operations.

Good to know. A workaround would be to let Dom0/DomU have direct access
to the UART, not giving any UART to Xen at all.


> >> 2. RTDS scheduler. With console disabled, things like "hexdump -v
> >> /dev/zero" didn't affected the latency so badly, but anyways,
> >> sometimes I got ~600us spikes. This is not a surprise, because of
> >> default RTDS configuration. I changed period for DomU from default
> >> 10ms to 100us and things got better: with Dom0 burning CPU I am
> >> rarely getting max latency of about ~30us with mean latency of ~9us
> >> and deviation of ~0.5us. On other hand, when I tried to set period
> >> to 30us, max latency rose up to ~60us.
> >
> > This is very interestingi too. Did you get any spikes with the period
> > set to 100us? It would be fantastic if there were none.
>
> As far as I can remember, there wasn't any. At least in my setup. And
> with workaround for alloc_heap_pages() issue. Of course, I didn't
> covered all the possible use cases. But in settled state, when all
> domains are created and they are just doing owns tasks there was no
> spikes.

That's very good to hear.
Re: IRQ latency measurements in hypervisor
On Fri, 15 Jan 2021, Julien Grall wrote:
> On 15/01/2021 15:45, Volodymyr Babchuk wrote:
> >
> > Hi Julien,
> >
> > Julien Grall writes:
> >
> > > Hi Volodymyr, Stefano,
> > >
> > > On 14/01/2021 23:33, Stefano Stabellini wrote:
> > > > + Bertrand, Andrew (see comment on alloc_heap_pages())
> > >
> > > Long running hypercalls are usually considered security issues.
> > >
> > > In this case, only the control domain can issue large memory
> > > allocation (2GB at a time). Guest, would only be able to allocate 2MB
> > > at the time, so from the numbers below, it would only take 1ms max.
> > >
> > > So I think we are fine here. Next time, you find a large loop, please
> > > provide an explanation why they are not security issues (e.g. cannot
> > > be used by guests) or send an email to the Security Team in doubt.
> >
> > Sure. In this case I took into account that only control domain can
> > issue this call, I just didn't stated this explicitly. Next time will
> > do.
>
> I am afraid that's not correct. The guest can request to populate a region.
> This is used for instance in the ballooning case.
>
> The main difference is a non-privileged guest will not be able to do
> allocation larger than 2MB.
>
> [...]
>
> > > > This is very interestingi too. Did you get any spikes with the
> > > > period
> > > > set to 100us? It would be fantastic if there were none.
> > > >
> > > > > 3. Huge latency spike during domain creation. I conducted some
> > > > > additional tests, including use of PV drivers, but this didn't
> > > > > affected the latency in my "real time" domain. But attempt to
> > > > > create another domain with relatively large memory size of 2GB
> > > > > led
> > > > > to huge spike in latency. Debugging led to this call path:
> > > > >
> > > > > XENMEM_populate_physmap -> populate_physmap() ->
> > > > > alloc_domheap_pages() -> alloc_heap_pages()-> huge
> > > > > "for ( i = 0; i < (1 << order); i++ )" loop.
> > >
> > > There are two for loops in alloc_heap_pages() using this syntax. Which
> > > one are your referring to?
> >
> > I did some tracing with Lautrebach. It pointed to the first loop and
> > especially to flush_page_to_ram() call if I remember correctly.
>
> Thanks, I am not entirely surprised because we are clean and invalidating the
> region line by line and across all the CPUs.
>
> If we are assuming 128 bytes cacheline, we will need to issue 32 cache
> instructions per page. This going to involve quite a bit of traffic on the
> system.

I think Julien is most likely right. It would be good to verify this
with an experiment. For instance, you could remove the
flush_page_to_ram() call for one test and check whether the latency
problems go away.


> One possibility would be to defer the cache flush when the domain is created
> and use the hypercall XEN_DOMCTL_cacheflush to issue the flush.
>
> Note that XEN_DOMCTL_cacheflush would need some modification to be
> preemptible. But at least, it will work on a GFN which is easier to track.

This looks like a solid suggestion. XEN_DOMCTL_cacheflush is already
used by the toolstack in a few places.

I am also wondering if we can get away with fewer flush_page_to_ram()
calls from alloc_heap_pages() for memory allocations done at boot time
soon after global boot memory scrubbing.
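
To make that idea concrete, the change would be something like the sketch
below inside the per-page loop of alloc_heap_pages(). This is only an
illustration: whether the assumption in the comment actually holds on Arm
is exactly the open question.

        /* Hypothetical: skip the flush for allocations made during early
         * boot, on the assumption that nothing a guest can observe has
         * touched these lines since the global boot scrub. */
        if ( system_state >= SYS_STATE_active )
            flush_page_to_ram(mfn_x(page_to_mfn(&pg[i])),
                              !(memflags & MEMF_no_icache_flush));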


> > > > > I managed to overcome the issue #3 by commenting out all calls to
> > > > > populate_one_size() except the populate_one_size(PFN_4K_SHIFT) in
> > > > > xg_dom_arm.c. This lengthened domain construction, but my "RT" domain
> > > > > didn't experienced so big latency issues. Apparently all other
> > > > > hypercalls which are used during domain creation are either fast or
> > > > > preemptible. No doubts that my hack lead to page tables inflation and
> > > > > overall performance drop.
> > > > I think we need to follow this up and fix this. Maybe just by adding
> > > > a hypercall continuation to the loop.
> > >
> > > When I read "hypercall continuation", I read we will return to the
> > > guest context so it can process interrupts and potentially switch to
> > > another task.
> > >
> > > This means that the guest could issue a second populate_physmap() from
> > > the vCPU. Therefore any restart information should be part of the
> > > hypercall parameters. So far, I don't see how this would be possible.
> > >
> > > Even if we overcome that part, this can be easily abuse by a guest as
> > > the memory is not yet accounted to the domain. Imagine a guest that
> > > never request the continuation of the populate_physmap(). So we would
> > > need to block the vCPU until the allocation is finished.
> >
> > Moreover, most of the alloc_heap_pages() sits under spinlock, so first
> > step would be to split this function into smaller atomic parts.
>
> Do you have any suggestion how to split it?
>
> >
> > > I think the first step is we need to figure out which part of the
> > > allocation is slow (see my question above). From there, we can figure
> > > out if there is a way to reduce the impact.
> >
> > I'll do more tracing and will return with more accurate numbers. But as far
> > as I can see, any loop on 262144 pages will take some time..
> .
>
> It really depends on the content of the loop. On any modern processors, you
> are very likely not going to notice a loop that update just a flag.
>
> However, you are likely going to be see an impact if your loop is going to
> clean & invalidate the cache for each page.
Re: IRQ latency measurements in hypervisor [ In reply to ]
Hi Stefano,

On 15/01/2021 23:17, Stefano Stabellini wrote:
> On Fri, 15 Jan 2021, Volodymyr Babchuk wrote:
>> 9us was in idle state. Interestingly enough, I got latency if 3us while
>> Dom0 was doing some CPU-intensive tasks. So, under load latency is lower
>> than in idle state. I didn't investigated this, so I can't tell you what
>> causes this behavior.
>
> Did you use vwfi=native? You should definitely be able to see ~3us
> without interference and with 1vCPU <-> 1pCPU

vwfi=native works well in a scenario where each pCPU is dedicated to a
vCPU. However, if you start to oversubscribe the pCPUs, performance is
going to degrade quite a lot because there will be fewer opportunities
to schedule a different vCPU.

You will also waste quite a bit of pCPU cycles because the vCPU will sit
doing nothing until the end of its time slice.

Cheers,

--
Julien Grall
Re: IRQ latency measurements in hypervisor [ In reply to ]
On 15/01/2021 23:41, Stefano Stabellini wrote:
>>>>> This is very interestingi too. Did you get any spikes with the
>>>>> period
>>>>> set to 100us? It would be fantastic if there were none.
>>>>>
>>>>>> 3. Huge latency spike during domain creation. I conducted some
>>>>>> additional tests, including use of PV drivers, but this didn't
>>>>>> affected the latency in my "real time" domain. But attempt to
>>>>>> create another domain with relatively large memory size of 2GB
>>>>>> led
>>>>>> to huge spike in latency. Debugging led to this call path:
>>>>>>
>>>>>> XENMEM_populate_physmap -> populate_physmap() ->
>>>>>> alloc_domheap_pages() -> alloc_heap_pages()-> huge
>>>>>> "for ( i = 0; i < (1 << order); i++ )" loop.
>>>> There are two for loops in alloc_heap_pages() using this syntax. Which
>>>> one are your referring to?
>>> I did some tracing with Lautrebach. It pointed to the first loop and
>>> especially to flush_page_to_ram() call if I remember correctly.
>> Thanks, I am not entirely surprised because we are clean and invalidating the
>> region line by line and across all the CPUs.
>>
>> If we are assuming 128 bytes cacheline, we will need to issue 32 cache
>> instructions per page. This going to involve quite a bit of traffic on the
>> system.
> I think Julien is most likely right. It would be good to verify this
> with an experiment. For instance, you could remove the
> flush_page_to_ram() call for one test and see if you see any latency
> problems.
>
>
>> One possibility would be to defer the cache flush when the domain is created
>> and use the hypercall XEN_DOMCTL_cacheflush to issue the flush.
>>
>> Note that XEN_DOMCTL_cacheflush would need some modification to be
>> preemptible. But at least, it will work on a GFN which is easier to track.
>
> This looks like a solid suggestion. XEN_DOMCTL_cacheflush is already
> used by the toolstack in a few places.
>
> I am also wondering if we can get away with fewer flush_page_to_ram()
> calls from alloc_heap_pages() for memory allocations done at boot time
> soon after global boot memory scrubbing.

I'm pretty sure there is room to improve Xen's behaviour in general, by
not scrubbing pages already known to be zero.

As far as I'm aware, there are improvements which never got completed
when lazy scrubbing was added, and I think it is giving us a hit on x86,
where we don't even have to do any cache maintenance on the side.

~Andrew
Re: IRQ latency measurements in hypervisor [ In reply to ]
On 12/01/2021 23:48, Volodymyr Babchuk wrote:
> Hello community,
>
> Recently I was experimenting with context switching latencies in
> Xen. I want to share my findings and discuss some issues I have found.
>
> My target was to ensure that no more 30us passes between interrupt
> generation and handling in a guest VM. Main goal was to determine if
> Xen is ready for soft RT and to identify main issues that prevent
> this.
>
> TLDR: I was able to achieve mean 15us latency with tuned RTDS
> scheduler, disabled serial console and some hacks to the Xen
> toolstack. This was done on a setup with 1 PCPU.
>
> Test setup.
>
> ARMv8 platform. Namely Renesas Rcar H3 SoC on Salvator board.
>
> To accurately determine latency, I employed one of timer counter units
> (TMUs) available on the SoC. This is 32-bit timer with auto-reload,
> that can generate interrupt on underflow. I fed it with 33.275MHz
> clock, which gave me resolution of about 30ns. I programmed the timer
> to generate interrupt every 10ms. My ISR then read the current timer
> value and determined how much time passed since last underrun. This
> gave me time interval between IRQ generation and ISR invocation.
> Those values were collected and every 10 seconds statistics was
> calculated. There is an example of output from my Linux driver:
>
> [ 83.873162] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
> [ 94.136632] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
> [ 104.400098] rt_eval_tmu e6fc0000.tmu: Mean: 50 (1500 ns) stddev: 129 (3870 ns)
> [ 114.663828] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
> [ 124.927296] rt_eval_tmu e6fc0000.tmu: Mean: 56 (1680 ns) stddev: 183 (5490 ns)
>
> This is the baremetal Linux. And there is Dom0:
>
> [ 237.431003] rt_eval_tmu e6fc0000.tmu: Mean: 306 (9180 ns) stddev: 25 (750 ns)
> [ 247.694506] rt_eval_tmu e6fc0000.tmu: Mean: 302 (9060 ns) stddev: 17 (510 ns)
>
> Driver outputs both the raw timer value (eg. 4) and the same value
> scaled to nanoseconds (eg. 1320 ns). As you can see baremetal setup is
> much faster. But experiments showed that Linux does not provide
> consistent values, even when running in baremetal mode. You can see
> sporadic spikes in "std dev" values.
>
> So my next step was to use proper RT OS to do the measurements. I
> chose Zephyr. My PR that adds Xen support to Zephyr can be found at
> [1]. Support for RCAR Gen3 is not upstreamed, but is present on my
> GitHub([2]). At [3] you can find the source code for application that
> does the latency measurements. It behaves exactly as my linux driver,
> but provides a bit more information:
>
> *** Booting Zephyr OS build zephyr-v2.4.0-2750-g0f2c858a39fc ***
> RT Eval app
>
> Counter freq is 33280000 Hz. Period is 30 ns
> Set alarm in 0 sec (332800 ticks)
> Mean: 600 (18000 ns) stddev: 3737 (112110 ns) above thr: 0% [265 (7950 ns) - 66955 (2008650 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
> Mean: 388 (11640 ns) stddev: 2059 (61770 ns) above thr: 0% [266 (7980 ns) - 58830 (1764900 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
> Mean: 358 (10740 ns) stddev: 1796 (53880 ns) above thr: 0% [265 (7950 ns) - 57780 (1733400 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
> ...
>
> So there you can see: mean time, standard deviation, % of interrupts
> that was processed above 30us threshold, minimum and maximum latency
> values for the current 10s run, global minimum and maximum.
>
> Zephyr running as baremetal showed very stable results (this is an
> older build, so no extended statistics there):
>
> ## Starting application at 0x480803C0 ...
> *** Booting Zephyr OS build zephyr-v2.4.0-1137-g5803ee1e8183 ***
> RT Eval app
>
> Counter freq is 33280000 Hz. Period is 30 ns
> Mean: 31 (930 ns) stddev: 0 (0 ns)
> Mean: 31 (930 ns) stddev: 0 (0 ns)
> Mean: 31 (930 ns) stddev: 0 (0 ns)
> Mean: 31 (930 ns) stddev: 0 (0 ns)
> Mean: 31 (930 ns) stddev: 0 (0 ns)
> Mean: 31 (930 ns) stddev: 0 (0 ns)
> ...
>
> As Zephyr provided stable readouts with no jitter, I used it to do all
> subsequent measurements.
>
>
> IMPORTANT! All subsequent tests was conducted with only 1 CPU core
> enabled. My goal was to ensure that system can timely react to an
> external interrupt even under load.
>
>
> Test results and latency sources
>
> As you can see, baremetal OS provides latency of ~0.9us without any
> deviations. The same code running as DomU on idle system shows mean
> latency of 12us and deviation of about 10us. Range of latencies in a
> 10s batch can vary from 8us to 25us. This fits into required 30us
> threshold, so no big issue there.

This is all very interesting work - thank you for sharing.

However, there is one set of measurements which I think is missing, and
is needed to describe the full picture.

You've got measurements for a native kernel taking interrupts, with
excellent results for Zephyr, and good results for Linux.

You've also got measurements for putting Xen into the mix, and the
results could do with some work.

Adding Xen totally changes the structure of the system.  Conceptually
(if the Xen model were expressed as native), what you've got now is a
single piece of userspace (pid1, no fork()ing), which is single threaded
(but doing its own private task switching between "threads"), where some
threads are making blocking system calls (all Xen hypercalls are fully
synchronous) while other threads are measuring the latency for a signal
to hit.

I think it would be incredibly valuable to mock something like that up
as a 3rd set of measurements, because it allows for a fairer comparison
of the rough structure of a virtualised system, with and without Xen
being involved.
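
A very rough approximation of such a mock is sketched below. It cheats by
using real threads instead of the private task switching described above,
and the "blocking work" is just a placeholder; the point is only to show
the shape of the experiment, not to be a calibrated benchmark.

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

static int64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Stand-in for a vCPU stuck in long, fully synchronous "hypercalls":
 * replace the body with whatever blocking system calls you want to test. */
static void *blocking_worker(void *arg)
{
    for ( ;; )
        usleep(100);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    int64_t deadline;

    pthread_create(&tid, NULL, blocking_worker, NULL);

    /* "ISR" thread: wake on an absolute deadline every 10ms and record
     * how far past the deadline the wake-up actually happened. */
    deadline = now_ns() + 10 * 1000 * 1000;
    for ( ;; )
    {
        struct timespec ts = { deadline / 1000000000LL,
                               deadline % 1000000000LL };

        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
        printf("latency: %lld ns\n", (long long)(now_ns() - deadline));
        deadline += 10 * 1000 * 1000;
    }
    return 0;
}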


It would also be very valuable to see where the latency is in the
system.  Would it be possible to set up a shared page, and have the
various moving parts write the counter value they see into it, so the
final stats can do a breakdown?

You probably want counter samples from:
1) When Xen has positively identified this single interrupt
2) When the interrupt has been marked pending for the guest
3) When the guest kernel has positively identified this single interrupt
4) When the interrupt handler has run (this is the datapoint you've got,
AFAICT).

Maybe for point 2, you also want some metadata as to what context was
interrupted (EL0..2).
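
Something as simple as the structure below would be enough for the guest
application to compute the per-stage deltas afterwards. All names are made
up for illustration; nothing like this exists in Xen today.

#include <stdint.h>

/* One record per measured interrupt, written into a page shared between
 * Xen and the guest; each stage stores the free-running TMU counter. */
struct latency_breakdown {
    uint32_t tmu_at_xen_irq;       /* 1) Xen identified the interrupt       */
    uint32_t tmu_at_vgic_pending;  /* 2) interrupt marked pending for guest */
    uint32_t tmu_at_guest_irq;     /* 3) guest kernel identified it         */
    uint32_t tmu_at_handler;       /* 4) guest ISR ran (current datapoint)  */
    uint8_t  interrupted_el;       /* EL0/EL1/EL2 context that was running  */
};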

You ought to see reasonably consistent results when interrupting EL0,
because that should transition immediately to EL1.  For interrupting
EL1, you'll perhaps see variation depending on whether other higher
priority interrupts are being serviced.

For interrupting EL2, this is where I'd expect you to see all the
problems with hypercalls, or potentially other work Xen is doing. 
Fundamentally, Xen can't just drop what it's doing and resume guest
context for the interrupt to be delivered.

We get around this with the hypercall continuation mechanism, which is
conceptually similar to voluntary preemption (we periodically check to
see if we should stop doing what we're doing, unwind state safely, and
get back out into guest context).

However, even using hypercall continuations, there will be a minimum
unit of work, necessary to guarantee forward progress, which cannot be
subdivided further.  There is also a cost to having finer grain
continuation checks, but hopefully that is confined to the error margins.
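
The pattern is roughly the sketch below. It is only an illustration:
process_one_unit(), CHECK_PERIOD and encode_restart_point() are made-up
names, while hypercall_preempt_check() is the existing helper Xen uses to
ask whether it should get back out to guest context.

    for ( i = start; i < nr_units; i++ )
    {
        process_one_unit(i);    /* the minimum, indivisible unit of work */

        /* Every CHECK_PERIOD units, check whether we should stop, unwind
         * state safely, and arrange to restart from unit i + 1 later. */
        if ( !((i + 1) % CHECK_PERIOD) && hypercall_preempt_check() )
            return encode_restart_point(i + 1);
    }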


Alternatively, if we still have problems, something to consider might be
the task domain concept which has been discussed before.  This was an
idea to try and bring something more like normal tasks to Xen, by having
a hypervisor context that the control domain could queue work with,
whose execution wouldn't be a synchronous hypercall from the guest's
perspective.

Such a task domain would require a mechanism to ensure it got scheduled
with a high enough priority not to get starved, and would probably
benefit if we restructured Xen to have fewer irq-off areas (which would
be a great thing irrespective - the scheduler lock in particular needs
to become irqs-off).  In other words, the task domain concept is a lot
of work to do, with only a general hope of it resulting in an
improvement - it shouldn't be jumped upon lightly to solve perf
problems.

~Andrew
Re: IRQ latency measurements in hypervisor [ In reply to ]
On Thu, 2021-01-14 at 15:33 -0800, Stefano Stabellini wrote:
> On Tue, 12 Jan 2021, Volodymyr Babchuk wrote:
> > 2. RTDS scheduler. With console disabled, things like "hexdump -v
> >    /dev/zero" didn't affected the latency so badly, but anyways,
> >    sometimes I got ~600us spikes. This is not a surprise, because
> > of
> >    default RTDS configuration. I changed period for DomU from
> > default
> >    10ms to 100us and things got better: with Dom0 burning CPU I am
> >    rarely getting max latency of about ~30us with mean latency of
> > ~9us
> >    and deviation of ~0.5us. On other hand, when I tried to set
> > period
> >    to 30us, max latency rose up to ~60us.
>
> This is very interestingi too. Did you get any spikes with the period
> set to 100us? It would be fantastic if there were none.
>
This *probably* makes some sense. The *probably* comes from the fact
that all the following reasoning assumes that what I recall about
real-time scheduling theory is correct, on which I would not bet.

Perhaps Stefano can ask my good old friend Marko Bertogna, from the
University of Modena, what he and his team think, as they're collaborating
on cache-coloring. He was already much better than me at these things
back in the days of the Ph.D., so for sure he's much better than me
now! :-)

Anyway, as I was saying, having a latency which is ~2x your period is
ok, and it is something you should expect (and account for) when you
size the period. In fact,
let's say that your budget is Xus, and your period is 30us. This means
that you get to execute for Xus every 30us. So, basically, at time t0
you are given a budget of Xus and you are guaranteed to be able to use
it all within time t1=t0+30us. At that time (t1=t0+30us) you are given
another Xus amount of budget, and you are guaranteed to be able to use
it all within t2=t1+30us=t0+60us.

Now, with a period as small as 30us, your budget is also going to be
pretty small (how much was that? If it was in your mail, I must have
missed it). Are you sure that the vCPU is able to wake up and run to
the point where your driver has done all the latency measurements in
_just_one_ instance (i.e., that all of this takes less than the budget)?

In fact, let's say your budget is 10us, and the vCPU needs 15us for
waking up and doing the measurements. At time t0 the vCPU is scheduled,
and let's say that the latency at that time is exactly 0. The vCPU
starts to run but, at time t0+10us (where 10us is the budget), it is
stopped. At time t1=t0+30us, the vCPU receives a budget replenishment,
but that does not mean that it will start to run immediately, if the
system is busy.

In fact, what RTDS guarantees is that the vCPU will be able to execute
for 10us within time t2=t1+30us. So, in theory, it can start to run as
late as t2-10us, without violating any guarantee.

If that is, in fact, what happens, i.e., the vCPU starts to run only at
t2-10us, then it is only at that point that it manages to finish
computing and recording the latency (after running for the remaining
5us, as we said the whole thing takes 15us).

What it will therefore record is a latency of t2 - 5us, which in fact is:

t2 - 5us = t1 + 30us - 5us = t0 + 30us + 30us - 5us =
         = 0 + 60us - 5us = 55us ~= 60us

So... May this be the case?
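
(As a sanity check of the arithmetic above, the same worst case can be
written as a tiny self-contained program. This is only an illustration of
the reasoning in this mail, not anything taken from the RTDS code, and
the helper name is made up.)

#include <stdio.h>

/* Worst-case completion time, relative to the interrupt at t0, for work
 * needing W us of CPU under a reservation of Q us every P us, assuming
 * Q <= W <= 2*Q so the work spans exactly two periods. */
static unsigned int worst_case_finish_us(unsigned int P, unsigned int Q,
                                         unsigned int W)
{
    /* The first Q us are guaranteed within [t0, t0+P). The remaining
     * W - Q us may be scheduled as late as t0 + 2*P - Q. */
    return 2 * P - Q + (W - Q);
}

int main(void)
{
    /* The example above: period 30us, budget 10us, 15us of work. */
    printf("%u us\n", worst_case_finish_us(30, 10, 15));   /* prints 55 us */
    return 0;
}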

Thanks and Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Re: IRQ latency measurements in hypervisor [ In reply to ]
Hi Julien,


Julien Grall writes:

> On 15/01/2021 15:45, Volodymyr Babchuk wrote:
>> Hi Julien,
>> Julien Grall writes:
>>
>>> Hi Volodymyr, Stefano,
>>>
>>> On 14/01/2021 23:33, Stefano Stabellini wrote:
>>>> + Bertrand, Andrew (see comment on alloc_heap_pages())
>>>
>>> Long running hypercalls are usually considered security issues.
>>>
>>> In this case, only the control domain can issue large memory
>>> allocation (2GB at a time). Guest, would only be able to allocate 2MB
>>> at the time, so from the numbers below, it would only take 1ms max.
>>>
>>> So I think we are fine here. Next time, you find a large loop, please
>>> provide an explanation why they are not security issues (e.g. cannot
>>> be used by guests) or send an email to the Security Team in doubt.
>> Sure. In this case I took into account that only control domain can
>> issue this call, I just didn't stated this explicitly. Next time will
>> do.
>
> I am afraid that's not correct. The guest can request to populate a
> region. This is used for instance in the ballooning case.
>
> The main difference is a non-privileged guest will not be able to do
> allocation larger than 2MB.

I did some measurements. According to them, an allocation of order 9 takes
about 265us on my HW. I cover this in detail at the end of the email.

>>>> This is very interestingi too. Did you get any spikes with the
>>>> period
>>>> set to 100us? It would be fantastic if there were none.
>>>>
>>>>> 3. Huge latency spike during domain creation. I conducted some
>>>>> additional tests, including use of PV drivers, but this didn't
>>>>> affected the latency in my "real time" domain. But attempt to
>>>>> create another domain with relatively large memory size of 2GB led
>>>>> to huge spike in latency. Debugging led to this call path:
>>>>>
>>>>> XENMEM_populate_physmap -> populate_physmap() ->
>>>>> alloc_domheap_pages() -> alloc_heap_pages()-> huge
>>>>> "for ( i = 0; i < (1 << order); i++ )" loop.
>>>
>>> There are two for loops in alloc_heap_pages() using this syntax. Which
>>> one are your referring to?
>> I did some tracing with Lautrebach. It pointed to the first loop and
>> especially to flush_page_to_ram() call if I remember correctly.
>
> Thanks, I am not entirely surprised because we are clean and
> invalidating the region line by line and across all the CPUs.
>
> If we are assuming 128 bytes cacheline, we will need to issue 32 cache
> instructions per page. This going to involve quite a bit of traffic on
> the system.
>
> One possibility would be to defer the cache flush when the domain is
> created and use the hypercall XEN_DOMCTL_cacheflush to issue the
> flush.

Can we flush caches on the first access to a page instead? What I mean is:
do not populate the stage 2 tables with the allocated memory right away;
flush the caches in the abort handler and only then populate the tables.

> Note that XEN_DOMCTL_cacheflush would need some modification to be
> preemptible. But at least, it will work on a GFN which is easier to
> track.
>


>>>>> I managed to overcome the issue #3 by commenting out all calls to
>>>>> populate_one_size() except the populate_one_size(PFN_4K_SHIFT) in
>>>>> xg_dom_arm.c. This lengthened domain construction, but my "RT" domain
>>>>> didn't experienced so big latency issues. Apparently all other
>>>>> hypercalls which are used during domain creation are either fast or
>>>>> preemptible. No doubts that my hack lead to page tables inflation and
>>>>> overall performance drop.
>>>> I think we need to follow this up and fix this. Maybe just by adding
>>>> a hypercall continuation to the loop.
>>>
>>> When I read "hypercall continuation", I read we will return to the
>>> guest context so it can process interrupts and potentially switch to
>>> another task.
>>>
>>> This means that the guest could issue a second populate_physmap() from
>>> the vCPU. Therefore any restart information should be part of the
>>> hypercall parameters. So far, I don't see how this would be possible.
>>>
>>> Even if we overcome that part, this can be easily abuse by a guest as
>>> the memory is not yet accounted to the domain. Imagine a guest that
>>> never request the continuation of the populate_physmap(). So we would
>>> need to block the vCPU until the allocation is finished.
>> Moreover, most of the alloc_heap_pages() sits under spinlock, so
>> first
>> step would be to split this function into smaller atomic parts.
>
> Do you have any suggestion how to split it?
>

Well, it is quite a complex function and I can't tell right away.
At this time I don't quite understand why spin_unlock() is called only
after the first (1 << order) loop, for instance.

Also, this function does many different things despite its simple name.

>>
>>> I think the first step is we need to figure out which part of the
>>> allocation is slow (see my question above). From there, we can figure
>>> out if there is a way to reduce the impact.
>> I'll do more tracing and will return with more accurate numbers.
>> But as far as I can see, any loop on 262144 pages will take some time..
> .
>
> It really depends on the content of the loop. On any modern
> processors, you are very likely not going to notice a loop that update
> just a flag.
>
> However, you are likely going to be see an impact if your loop is
> going to clean & invalidate the cache for each page.
>

Totally agree. I used the Xen tracing subsystem to do the measurements and I
can confirm that the call to flush_page_to_ram() causes most of the impact.


Here are the details:


I added a number of tracing points to the function:

static struct page_info *alloc_heap_pages(
    unsigned int zone_lo, unsigned int zone_hi,
    unsigned int order, unsigned int memflags,
    struct domain *d)
{
    nodeid_t node;
    unsigned int i, buddy_order, zone, first_dirty;
    unsigned long request = 1UL << order;
    struct page_info *pg;
    bool need_tlbflush = false;
    uint32_t tlbflush_timestamp = 0;
    unsigned int dirty_cnt = 0;

    /* Make sure there are enough bits in memflags for nodeID. */
    BUILD_BUG_ON((_MEMF_bits - _MEMF_node) < (8 * sizeof(nodeid_t)));

    ASSERT(zone_lo <= zone_hi);
    ASSERT(zone_hi < NR_ZONES);

    if ( unlikely(order > MAX_ORDER) )
        return NULL;

    spin_lock(&heap_lock);

    TRACE_1D(TRC_PGALLOC_PT1, order); // <=================================

    /*
     * Claimed memory is considered unavailable unless the request
     * is made by a domain with sufficient unclaimed pages.
     */
    if ( (outstanding_claims + request > total_avail_pages) &&
         ((memflags & MEMF_no_refcount) ||
          !d || d->outstanding_pages < request) )
    {
        spin_unlock(&heap_lock);
        return NULL;
    }

    pg = get_free_buddy(zone_lo, zone_hi, order, memflags, d);
    /* Try getting a dirty buddy if we couldn't get a clean one. */
    if ( !pg && !(memflags & MEMF_no_scrub) )
        pg = get_free_buddy(zone_lo, zone_hi, order,
                            memflags | MEMF_no_scrub, d);
    if ( !pg )
    {
        /* No suitable memory blocks. Fail the request. */
        spin_unlock(&heap_lock);
        return NULL;
    }

    TRACE_0D(TRC_PGALLOC_PT2); // <=================================

    node = phys_to_nid(page_to_maddr(pg));
    zone = page_to_zone(pg);
    buddy_order = PFN_ORDER(pg);

    first_dirty = pg->u.free.first_dirty;

    /* We may have to halve the chunk a number of times. */
    while ( buddy_order != order )
    {
        buddy_order--;
        page_list_add_scrub(pg, node, zone, buddy_order,
                            (1U << buddy_order) > first_dirty ?
                            first_dirty : INVALID_DIRTY_IDX);
        pg += 1U << buddy_order;

        if ( first_dirty != INVALID_DIRTY_IDX )
        {
            /* Adjust first_dirty */
            if ( first_dirty >= 1U << buddy_order )
                first_dirty -= 1U << buddy_order;
            else
                first_dirty = 0; /* We've moved past original first_dirty */
        }
    }

    TRACE_0D(TRC_PGALLOC_PT3); // <=================================

    ASSERT(avail[node][zone] >= request);
    avail[node][zone] -= request;
    total_avail_pages -= request;
    ASSERT(total_avail_pages >= 0);

    check_low_mem_virq();

    if ( d != NULL )
        d->last_alloc_node = node;

    for ( i = 0; i < (1 << order); i++ )
    {
        /* Reference count must continuously be zero for free pages. */
        if ( (pg[i].count_info & ~PGC_need_scrub) != PGC_state_free )
        {
            printk(XENLOG_ERR
                   "pg[%u] MFN %"PRI_mfn" c=%#lx o=%u v=%#lx t=%#x\n",
                   i, mfn_x(page_to_mfn(pg + i)),
                   pg[i].count_info, pg[i].v.free.order,
                   pg[i].u.free.val, pg[i].tlbflush_timestamp);
            BUG();
        }

        /* PGC_need_scrub can only be set if first_dirty is valid */
        ASSERT(first_dirty != INVALID_DIRTY_IDX || !(pg[i].count_info & PGC_need_scrub));

        /* Preserve PGC_need_scrub so we can check it after lock is dropped. */
        pg[i].count_info = PGC_state_inuse | (pg[i].count_info & PGC_need_scrub);

        if ( !(memflags & MEMF_no_tlbflush) )
            accumulate_tlbflush(&need_tlbflush, &pg[i],
                                &tlbflush_timestamp);

        /* Initialise fields which have other uses for free pages. */
        pg[i].u.inuse.type_info = 0;
        page_set_owner(&pg[i], NULL);

        /* Ensure cache and RAM are consistent for platforms where the
         * guest can control its own visibility of/through the cache.
         */
        flush_page_to_ram(mfn_x(page_to_mfn(&pg[i])),
                          !(memflags & MEMF_no_icache_flush));
    }

    TRACE_0D(TRC_PGALLOC_PT4); // <=================================
    spin_unlock(&heap_lock);

    if ( first_dirty != INVALID_DIRTY_IDX ||
         (scrub_debug && !(memflags & MEMF_no_scrub)) )
    {
        for ( i = 0; i < (1U << order); i++ )
        {
            if ( test_bit(_PGC_need_scrub, &pg[i].count_info) )
            {
                if ( !(memflags & MEMF_no_scrub) )
                    scrub_one_page(&pg[i]);

                dirty_cnt++;

                spin_lock(&heap_lock);
                pg[i].count_info &= ~PGC_need_scrub;
                spin_unlock(&heap_lock);
            }
            else if ( !(memflags & MEMF_no_scrub) )
                check_one_page(&pg[i]);
        }

        if ( dirty_cnt )
        {
            spin_lock(&heap_lock);
            node_need_scrub[node] -= dirty_cnt;
            spin_unlock(&heap_lock);
        }
    }

    TRACE_0D(TRC_PGALLOC_PT5); // <=================================
    if ( need_tlbflush )
        filtered_flush_tlb_mask(tlbflush_timestamp);

    TRACE_0D(TRC_PGALLOC_PT6); // <=================================

    return pg;
}


And I wrote a simple Python script that parses the output of
xentrace. Here are the results for different orders:

46.842032: page_alloc trace point 1. Order: 18
46.842035: page_alloc trace point 2 (+ 0.000003)
46.842035: page_alloc trace point 3 (+ 0.000000)
46.975105: page_alloc trace point 4 (+ 0.133069)
46.975106: page_alloc trace point 5 (+ 0.000001)
46.975106: page_alloc trace point 6 (+ 0.000000): total: 0.133074

46.998536: page_alloc trace point 1. Order: 9
46.998538: page_alloc trace point 2 (+ 0.000002)
46.998540: page_alloc trace point 3 (+ 0.000001)
46.998799: page_alloc trace point 4 (+ 0.000259)
46.998800: page_alloc trace point 5 (+ 0.000000)
46.998800: page_alloc trace point 6 (+ 0.000000): total: 0.000264

46.835802: page_alloc trace point 1. Order: 3
46.835803: page_alloc trace point 2 (+ 0.000000)
46.835803: page_alloc trace point 3 (+ 0.000000)
46.835812: page_alloc trace point 4 (+ 0.000009)
46.835813: page_alloc trace point 5 (+ 0.000000)
46.835813: page_alloc trace point 6 (+ 0.000001): total: 0.000011

46.998815: page_alloc trace point 1. Order: 0
46.998816: page_alloc trace point 2 (+ 0.000002)
46.998817: page_alloc trace point 3 (+ 0.000000)
46.998818: page_alloc trace point 4 (+ 0.000002)
46.998819: page_alloc trace point 5 (+ 0.000001)
46.998819: page_alloc trace point 6 (+ 0.000000): total: 0.000005

Then I commented out the call to flush_page_to_ram() and got the following
results:

149.561902: page_alloc trace point 1. Order: 18
149.561905: page_alloc trace point 2 (+ 0.000003)
149.561905: page_alloc trace point 3 (+ 0.000000)
149.569450: page_alloc trace point 4 (+ 0.007545)
149.569451: page_alloc trace point 5 (+ 0.000001)
149.569452: page_alloc trace point 6 (+ 0.000000): total: 0.007550

149.592624: page_alloc trace point 1. Order: 9
149.592626: page_alloc trace point 2 (+ 0.000003)
149.592627: page_alloc trace point 3 (+ 0.000001)
149.592639: page_alloc trace point 4 (+ 0.000012)
149.592639: page_alloc trace point 5 (+ 0.000000)
149.592640: page_alloc trace point 6 (+ 0.000000): total: 0.000016

All time units are seconds, by the way.
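
A quick back-of-the-envelope check of these numbers, using the order-18
deltas between trace points 3 and 4 above. The snippet only does the
arithmetic; the 128-byte cacheline and 4K page size are the same
assumptions as earlier in the thread.

#include <stdio.h>

int main(void)
{
    const double pages      = 1 << 18;    /* order-18 allocation: 262144 pages */
    const double with_flush = 0.133069;   /* seconds, point 3 -> 4, flush on   */
    const double without    = 0.007545;   /* seconds, point 3 -> 4, flush off  */

    printf("flush cost per page: ~%.2f us\n",
           (with_flush - without) / pages * 1e6);          /* ~0.48 us/page */
    printf("cache maintenance ops per page: %d\n", 4096 / 128);   /* 32     */
    return 0;
}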


--
Volodymyr Babchuk at EPAM
Re: IRQ latency measurements in hypervisor [ In reply to ]
Hi Stefano,

Stefano Stabellini writes:

> On Fri, 15 Jan 2021, Julien Grall wrote:
>> On 15/01/2021 15:45, Volodymyr Babchuk wrote:
>> >
>> > Hi Julien,
>> >
>> > Julien Grall writes:
>> >
>> > > Hi Volodymyr, Stefano,
>> > >
>> > > On 14/01/2021 23:33, Stefano Stabellini wrote:
>> > > > + Bertrand, Andrew (see comment on alloc_heap_pages())
>> > >
>> > > Long running hypercalls are usually considered security issues.
>> > >
>> > > In this case, only the control domain can issue large memory
>> > > allocation (2GB at a time). Guest, would only be able to allocate 2MB
>> > > at the time, so from the numbers below, it would only take 1ms max.
>> > >
>> > > So I think we are fine here. Next time, you find a large loop, please
>> > > provide an explanation why they are not security issues (e.g. cannot
>> > > be used by guests) or send an email to the Security Team in doubt.
>> >
>> > Sure. In this case I took into account that only control domain can
>> > issue this call, I just didn't stated this explicitly. Next time will
>> > do.
>>
>> I am afraid that's not correct. The guest can request to populate a region.
>> This is used for instance in the ballooning case.
>>
>> The main difference is a non-privileged guest will not be able to do
>> allocation larger than 2MB.
>>
>> [...]
>>
>> > > > This is very interestingi too. Did you get any spikes with the
>> > > > period
>> > > > set to 100us? It would be fantastic if there were none.
>> > > >
>> > > > > 3. Huge latency spike during domain creation. I conducted some
>> > > > > additional tests, including use of PV drivers, but this didn't
>> > > > > affected the latency in my "real time" domain. But attempt to
>> > > > > create another domain with relatively large memory size of 2GB
>> > > > > led
>> > > > > to huge spike in latency. Debugging led to this call path:
>> > > > >
>> > > > > XENMEM_populate_physmap -> populate_physmap() ->
>> > > > > alloc_domheap_pages() -> alloc_heap_pages()-> huge
>> > > > > "for ( i = 0; i < (1 << order); i++ )" loop.
>> > >
>> > > There are two for loops in alloc_heap_pages() using this syntax. Which
>> > > one are your referring to?
>> >
>> > I did some tracing with Lautrebach. It pointed to the first loop and
>> > especially to flush_page_to_ram() call if I remember correctly.
>>
>> Thanks, I am not entirely surprised because we are clean and invalidating the
>> region line by line and across all the CPUs.
>>
>> If we are assuming 128 bytes cacheline, we will need to issue 32 cache
>> instructions per page. This going to involve quite a bit of traffic on the
>> system.
>
> I think Julien is most likely right. It would be good to verify this
> with an experiment. For instance, you could remove the
> flush_page_to_ram() call for one test and see if you see any latency
> problems.

Yes, I did exactly this and shared results in my reply to Julien.


>> One possibility would be to defer the cache flush when the domain is created
>> and use the hypercall XEN_DOMCTL_cacheflush to issue the flush.
>>
>> Note that XEN_DOMCTL_cacheflush would need some modification to be
>> preemptible. But at least, it will work on a GFN which is easier to track.
>
> This looks like a solid suggestion. XEN_DOMCTL_cacheflush is already
> used by the toolstack in a few places.
>
> I am also wondering if we can get away with fewer flush_page_to_ram()
> calls from alloc_heap_pages() for memory allocations done at boot time
> soon after global boot memory scrubbing.

This is doable, if you are trying to optimize boot time. On the other
hand, it would be one more check in an already quite complex function.

BTW, I briefly looked at Xen boot time and saw that Dom0 construction takes
a considerable amount of time.

--
Volodymyr Babchuk at EPAM
Re: IRQ latency measurements in hypervisor [ In reply to ]
Stefano Stabellini writes:

> On Fri, 15 Jan 2021, Volodymyr Babchuk wrote:
[...]
>> >> [ 83.873162] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> >> [ 94.136632] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> >> [ 104.400098] rt_eval_tmu e6fc0000.tmu: Mean: 50 (1500 ns) stddev: 129 (3870 ns)
>> >> [ 114.663828] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> >> [ 124.927296] rt_eval_tmu e6fc0000.tmu: Mean: 56 (1680 ns) stddev: 183 (5490 ns)
>> >>
>> >> This is the baremetal Linux. And there is Dom0:
>> >>
>> >> [ 237.431003] rt_eval_tmu e6fc0000.tmu: Mean: 306 (9180 ns) stddev: 25 (750 ns)
>> >> [ 247.694506] rt_eval_tmu e6fc0000.tmu: Mean: 302 (9060 ns) stddev: 17 (510 ns)
>> >>
>> >> Driver outputs both the raw timer value (eg. 4) and the same value
>> >> scaled to nanoseconds (eg. 1320 ns). As you can see baremetal setup is
>> >> much faster. But experiments showed that Linux does not provide
>> >> consistent values, even when running in baremetal mode. You can see
>> >> sporadic spikes in "std dev" values.
>> >
>> > So baremetal IRQ latency is 1320-1680ns and Linux IRQ latency is
>> > 9060-9180ns. I am not surprised that Linux results are inconsistent but
>> > I have a couple of observations:
>> >
>> > - 9us is high for Linux
>> > If the system is idle, the latency should be lower, around 2-3us. I
>> > imagine you are actually running some sort of interference from dom0? Or
>> > using RTDS and descheduling vCPUs?
>>
>> 9us was in idle state. Interestingly enough, I got latency if 3us while
>> Dom0 was doing some CPU-intensive tasks. So, under load latency is lower
>> than in idle state. I didn't investigated this, so I can't tell you what
>> causes this behavior.
>
> Did you use vwfi=native? You should definitely be able to see ~3us
> without interference and with 1vCPU <-> 1pCPU

With vwfi=native things in Dom0 got better, but it is still not 3us:

[ 41.563904] rt_eval_tmu e6fc0000.tmu: Mean: 173 (5190 ns) stddev: 15 (450 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)
[ 51.820403] rt_eval_tmu e6fc0000.tmu: Mean: 172 (5160 ns) stddev: 8 (240 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)
[ 62.076898] rt_eval_tmu e6fc0000.tmu: Mean: 172 (5160 ns) stddev: 9 (270 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)
[ 72.333377] rt_eval_tmu e6fc0000.tmu: Mean: 171 (5130 ns) stddev: 9 (270 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)
[ 82.589852] rt_eval_tmu e6fc0000.tmu: Mean: 172 (5160 ns) stddev: 10 (300 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)

And DomU (as Julien suggested) is extremely unhappy:

Mean: 282 (8460 ns) stddev: 412 (12360 ns) above thr: 4% [109 (3270 ns) - 2245 (67350 ns)] global [108 (3240 ns) 74261 (2227830 ns)]
Mean: 958 (28740 ns) stddev: 860 (25800 ns) above thr: 50% [120 (3600 ns) - 2253 (67590 ns)] global [108 (3240 ns) 74261 (2227830 ns)]
Mean: 713 (21390 ns) stddev: 533 (15990 ns) above thr: 37% [114 (3420 ns) - 2186 (65580 ns)] global [108 (3240 ns) 74261 (2227830 ns)]

>
>> > - the stddev of 3870ns is high for baremetal
>> > In the baremetal case the stddev should be minimal if the system is
>> > idle.
>>
>> This is what I expected too. But nevertheless there was spikes. I didn't
>> investigated this as well, so I can only speculate there. My rootfs is
>> on NFS, so maybe network driver caused this spikes.
>
> Yeah, maybe it would be best to reduce the sources of possible spikes
> and get rid of NFS.

I agree there, but I am working remotely, so NFS is easier. I'll try
with eMMC later.

>
>> >
>> >
>> >> So my next step was to use proper RT OS to do the measurements. I
>> >> chose Zephyr. My PR that adds Xen support to Zephyr can be found at
>> >> [1]. Support for RCAR Gen3 is not upstreamed, but is present on my
>> >> GitHub([2]). At [3] you can find the source code for application that
>> >> does the latency measurements. It behaves exactly as my linux driver,
>> >> but provides a bit more information:
>> >>
>> >> *** Booting Zephyr OS build zephyr-v2.4.0-2750-g0f2c858a39fc ***
>> >> RT Eval app
>> >>
>> >> Counter freq is 33280000 Hz. Period is 30 ns
>> >> Set alarm in 0 sec (332800 ticks)
>> >> Mean: 600 (18000 ns) stddev: 3737 (112110 ns) above thr: 0% [265 (7950 ns) - 66955 (2008650 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> >> Mean: 388 (11640 ns) stddev: 2059 (61770 ns) above thr: 0% [266 (7980 ns) - 58830 (1764900 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> >> Mean: 358 (10740 ns) stddev: 1796 (53880 ns) above thr: 0% [265 (7950 ns) - 57780 (1733400 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>>
>> This is Zephyr running as DomU.
>>
>> >> ...
>> >>
>> >> So there you can see: mean time, standard deviation, % of interrupts
>> >> that was processed above 30us threshold, minimum and maximum latency
>> >> values for the current 10s run, global minimum and maximum.
>> >>
>> >> Zephyr running as baremetal showed very stable results (this is an
>> >> older build, so no extended statistics there):
>> >>
>> >> ## Starting application at 0x480803C0 ...
>> >> *** Booting Zephyr OS build zephyr-v2.4.0-1137-g5803ee1e8183 ***
>> >> RT Eval app
>> >>
>> >> Counter freq is 33280000 Hz. Period is 30 ns
>> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
>>
>> And this is Zephyr is running as baremetal.
>>
>> >> ...
>> >>
>> >> As Zephyr provided stable readouts with no jitter, I used it to do all
>> >> subsequent measurements.
>> >
>> > I am a bit confused here. Looking at the numbers above the stddev is
>> > 112110 ns in the first instance. That is pretty high. Am I looking at
>> > the wrong numbers?
>>
>> I added some clarification above. As for 112110ns in the very first instance
>> - I always ignored the first instance, assuming that things need to
>> settle after domain being created.
>>
>> But your comment is actually correct: what exacelt should "settle"?
>> Domain is already created. All should run smoothly. So, this is worth
>> investigating.
>
> It is fair to ignore the first 2 measurements.
>
> However, the numbers above have high stddev even at the 3rd measurement:
> 53us is high and above the target 30us. I take you didn't apply yet the
> alloc_heap_pages and the serial workarounds?

Yes, this was the very first try, without the workarounds and proper
scheduler configuration.

With all fixes it is much better:

Mean: 307 (9210 ns) stddev: 4 (120 ns) above thr: 0% [265 (7950 ns) - 376 (11280 ns)] global [265 (7950 ns) 376 (11280 ns)]
Mean: 306 (9180 ns) stddev: 3 (90 ns) above thr: 0% [263 (7890 ns) - 360 (10800 ns)] global [263 (7890 ns) 376 (11280 ns)]
Mean: 306 (9180 ns) stddev: 3 (90 ns) above thr: 0% [298 (8940 ns) - 373 (11190 ns)] global [263 (7890 ns) 376 (11280 ns)]
Mean: 307 (9210 ns) stddev: 17 (510 ns) above thr: 0% [265 (7950 ns) - 856 (25680 ns)] global [263 (7890 ns) 1251 (37530 ns)]
Mean: 304 (9120 ns) stddev: 7 (210 ns) above thr: 0% [286 (8580 ns) - 486 (14580 ns)] global [263 (7890 ns) 1251 (37530 ns)]
Mean: 307 (9210 ns) stddev: 47 (1410 ns) above thr: 0% [260 (7800 ns) - 1299 (38970 ns)] global [260 (7800 ns) 1299 (38970 ns)]
Mean: 291 (8730 ns) stddev: 41 (1230 ns) above thr: 0% [250 (7500 ns) - 1324 (39720 ns)] global [250 (7500 ns) 1324 (39720 ns)]

>
>> >> IMPORTANT! All subsequent tests was conducted with only 1 CPU core
>> >> enabled. My goal was to ensure that system can timely react to an
>> >> external interrupt even under load.
>> >
>> > All right. FYI I have no frame of reference for 2 vCPUs on 1 pCPUs, all
>> > my tests were done with 1vCPU <-> 1pCPU and the null scheduler.
>>
>> As I said, I had no issues with 1vCPU <-> 1pCPU setup, so I quickly
>> moved to cases which had issues.
>>
>> >> Test results and latency sources
>> >>
>> >> As you can see, baremetal OS provides latency of ~0.9us without any
>> >> deviations. The same code running as DomU on idle system shows mean
>> >> latency of 12us and deviation of about 10us. Range of latencies in a
>> >> 10s batch can vary from 8us to 25us. This fits into required 30us
>> >> threshold, so no big issue there.
>> >>
>> >> But this worsens under certain conditions.
>> >>
>> >> 1. Serial console. RCAR serial driver (scif) works in synchronous
>> >> mode, so any heavy serial console output leads to higher
>> >> latency. Tests shows mean latency of 1000us and deviation of 1332
>> >> us. 54% of interrupts are handled outside of 30us period. Real
>> >> values may be even higher, because in some cases timer would do
>> >> full cycle and instead of say 11ms ISR would read 11 % 10 = 1ms
>> >> latency. I tried to enable asynchronous mode for the serial
>> >> console. This made things better, but it lead to gaps in output, so
>> >> I just turned the serial console off completely.
>> >
>> > That's very interesting. I wonder if other serial drivers would cause
>> > similar issues, e.g. PL011.
>>
>> They should. This behavior is programmed in serial.c. Driver can enable
>> async mode calling serial_async_transmit(). As I can see, only ns16550
>> driver does this.
>>
>> Maybe you didn't saw problems there because you had more pCPU enabled
>> and Xen used some other pCPU to do UART operations.
>
> Good to know. A workaround would be to let Dom0/DomU have direct access
> to the UART, not giving any UART to Xen at all.

Yes. Or use SSH. I chose the latter.

[...]

--
Volodymyr Babchuk at EPAM
Re: IRQ latency measurements in hypervisor [ In reply to ]
On Wed, 20 Jan 2021, Volodymyr Babchuk wrote:
> > On 15/01/2021 15:45, Volodymyr Babchuk wrote:
> >> Hi Julien,
> >> Julien Grall writes:
> >>
> >>> Hi Volodymyr, Stefano,
> >>>
> >>> On 14/01/2021 23:33, Stefano Stabellini wrote:
> >>>> + Bertrand, Andrew (see comment on alloc_heap_pages())
> >>>
> >>> Long running hypercalls are usually considered security issues.
> >>>
> >>> In this case, only the control domain can issue large memory
> >>> allocation (2GB at a time). Guest, would only be able to allocate 2MB
> >>> at the time, so from the numbers below, it would only take 1ms max.
> >>>
> >>> So I think we are fine here. Next time, you find a large loop, please
> >>> provide an explanation why they are not security issues (e.g. cannot
> >>> be used by guests) or send an email to the Security Team in doubt.
> >> Sure. In this case I took into account that only control domain can
> >> issue this call, I just didn't stated this explicitly. Next time will
> >> do.
> >
> > I am afraid that's not correct. The guest can request to populate a
> > region. This is used for instance in the ballooning case.
> >
> > The main difference is a non-privileged guest will not be able to do
> > allocation larger than 2MB.
>
> I did some measurements. According to them allocation of order 9 takes
> about 265us on my HW. I covered this in detail at the end of email.

That's really a lot! :-(


> >>>> This is very interestingi too. Did you get any spikes with the
> >>>> period
> >>>> set to 100us? It would be fantastic if there were none.
> >>>>
> >>>>> 3. Huge latency spike during domain creation. I conducted some
> >>>>> additional tests, including use of PV drivers, but this didn't
> >>>>> affected the latency in my "real time" domain. But attempt to
> >>>>> create another domain with relatively large memory size of 2GB led
> >>>>> to huge spike in latency. Debugging led to this call path:
> >>>>>
> >>>>> XENMEM_populate_physmap -> populate_physmap() ->
> >>>>> alloc_domheap_pages() -> alloc_heap_pages()-> huge
> >>>>> "for ( i = 0; i < (1 << order); i++ )" loop.
> >>>
> >>> There are two for loops in alloc_heap_pages() using this syntax. Which
> >>> one are your referring to?
> >> I did some tracing with Lautrebach. It pointed to the first loop and
> >> especially to flush_page_to_ram() call if I remember correctly.
> >
> > Thanks, I am not entirely surprised because we are clean and
> > invalidating the region line by line and across all the CPUs.
> >
> > If we are assuming 128 bytes cacheline, we will need to issue 32 cache
> > instructions per page. This going to involve quite a bit of traffic on
> > the system.
> >
> > One possibility would be to defer the cache flush when the domain is
> > created and use the hypercall XEN_DOMCTL_cacheflush to issue the
> > flush.
>
> Can we flush caches on first access to a page? What I mean - do not
> populate stage 2 tables with allocated memory. Flush caches in abort
> handler and then populate them.

Yes, it should be possible with populate on demand. I'll let Julien
reply as he has direct experience with it on ARM.


> > Note that XEN_DOMCTL_cacheflush would need some modification to be
> > preemptible. But at least, it will work on a GFN which is easier to
> > track.

XEN_DOMCTL_cacheflush is a very viable option. The memory allocation is
requested by the toolstack, right? So adding one more call to do the
cacheflush after the memory allocation should be easy. Then, we could
turn the one massive XEN_DOMCTL_cacheflush hypercall into multiple
smaller calls to solve the latency issue.
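
A minimal sketch of what the toolstack side could look like, assuming
libxenctrl's existing xc_domain_cacheflush(); the chunk size and the
wrapper name are arbitrary, and error handling is kept to the bare minimum.

#include <xenctrl.h>

#define CACHEFLUSH_CHUNK_PAGES 512   /* 2MB per call with 4K pages; arbitrary */

/* Flush the newly populated region in small pieces, so no single
 * hypercall keeps Xen busy for too long. */
static int flush_new_memory(xc_interface *xch, uint32_t domid,
                            xen_pfn_t start_pfn, xen_pfn_t nr_pfns)
{
    while ( nr_pfns )
    {
        xen_pfn_t n = nr_pfns < CACHEFLUSH_CHUNK_PAGES
                      ? nr_pfns : CACHEFLUSH_CHUNK_PAGES;
        int rc = xc_domain_cacheflush(xch, domid, start_pfn, n);

        if ( rc )
            return rc;

        start_pfn += n;
        nr_pfns   -= n;
    }

    return 0;
}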


> >>>>> I managed to overcome the issue #3 by commenting out all calls to
> >>>>> populate_one_size() except the populate_one_size(PFN_4K_SHIFT) in
> >>>>> xg_dom_arm.c. This lengthened domain construction, but my "RT" domain
> >>>>> didn't experienced so big latency issues. Apparently all other
> >>>>> hypercalls which are used during domain creation are either fast or
> >>>>> preemptible. No doubts that my hack lead to page tables inflation and
> >>>>> overall performance drop.
> >>>> I think we need to follow this up and fix this. Maybe just by adding
> >>>> a hypercall continuation to the loop.
> >>>
> >>> When I read "hypercall continuation", I read we will return to the
> >>> guest context so it can process interrupts and potentially switch to
> >>> another task.
> >>>
> >>> This means that the guest could issue a second populate_physmap() from
> >>> the vCPU. Therefore any restart information should be part of the
> >>> hypercall parameters. So far, I don't see how this would be possible.
> >>>
> >>> Even if we overcome that part, this can be easily abuse by a guest as
> >>> the memory is not yet accounted to the domain. Imagine a guest that
> >>> never request the continuation of the populate_physmap(). So we would
> >>> need to block the vCPU until the allocation is finished.
> >> Moreover, most of the alloc_heap_pages() sits under spinlock, so
> >> first
> >> step would be to split this function into smaller atomic parts.
> >
> > Do you have any suggestion how to split it?
> >
>
> Well, it is quite complex function and I can't tell right away.
> At this time I don't quite understand why spin_unlock() is called after
> the first (1 << order) loop for instance.
>
> Also, this function does many different things for its simple name.
>
> >>
> >>> I think the first step is we need to figure out which part of the
> >>> allocation is slow (see my question above). From there, we can figure
> >>> out if there is a way to reduce the impact.
> >> I'll do more tracing and will return with more accurate numbers.
> >> But as far as I can see, any loop on 262144 pages will take some time..
> > .
> >
> > It really depends on the content of the loop. On any modern
> > processors, you are very likely not going to notice a loop that update
> > just a flag.
> >
> > However, you are likely going to be see an impact if your loop is
> > going to clean & invalidate the cache for each page.
> >
>
> Totally agree. I used Xen tracing subsystem to do the measurements and I
> can confirm that call to flush_page_to_ram() causes most of the impact.

Thanks for verifying it, this is very useful. Now we know the biggest
issue to tackle.



> There is the details:
>
>
> I added number of tracing points to the function:
>
> static struct page_info *alloc_heap_pages(
> unsigned int zone_lo, unsigned int zone_hi,
> unsigned int order, unsigned int memflags,
> struct domain *d)
> {
> nodeid_t node;
> unsigned int i, buddy_order, zone, first_dirty;
> unsigned long request = 1UL << order;
> struct page_info *pg;
> bool need_tlbflush = false;
> uint32_t tlbflush_timestamp = 0;
> unsigned int dirty_cnt = 0;
>
> /* Make sure there are enough bits in memflags for nodeID. */
> BUILD_BUG_ON((_MEMF_bits - _MEMF_node) < (8 * sizeof(nodeid_t)));
>
> ASSERT(zone_lo <= zone_hi);
> ASSERT(zone_hi < NR_ZONES);
>
> if ( unlikely(order > MAX_ORDER) )
> return NULL;
>
> spin_lock(&heap_lock);
>
> TRACE_1D(TRC_PGALLOC_PT1, order); // <=================================
>
> /*
> * Claimed memory is considered unavailable unless the request
> * is made by a domain with sufficient unclaimed pages.
> */
> if ( (outstanding_claims + request > total_avail_pages) &&
> ((memflags & MEMF_no_refcount) ||
> !d || d->outstanding_pages < request) )
> {
> spin_unlock(&heap_lock);
> return NULL;
> }
>
> pg = get_free_buddy(zone_lo, zone_hi, order, memflags, d);
> /* Try getting a dirty buddy if we couldn't get a clean one. */
> if ( !pg && !(memflags & MEMF_no_scrub) )
> pg = get_free_buddy(zone_lo, zone_hi, order,
> memflags | MEMF_no_scrub, d);
> if ( !pg )
> {
> /* No suitable memory blocks. Fail the request. */
> spin_unlock(&heap_lock);
> return NULL;
> }
>
> TRACE_0D(TRC_PGALLOC_PT2); // <=================================
>
> node = phys_to_nid(page_to_maddr(pg));
> zone = page_to_zone(pg);
> buddy_order = PFN_ORDER(pg);
>
> first_dirty = pg->u.free.first_dirty;
>
> /* We may have to halve the chunk a number of times. */
> while ( buddy_order != order )
> {
> buddy_order--;
> page_list_add_scrub(pg, node, zone, buddy_order,
> (1U << buddy_order) > first_dirty ?
> first_dirty : INVALID_DIRTY_IDX);
> pg += 1U << buddy_order;
>
> if ( first_dirty != INVALID_DIRTY_IDX )
> {
> /* Adjust first_dirty */
> if ( first_dirty >= 1U << buddy_order )
> first_dirty -= 1U << buddy_order;
> else
> first_dirty = 0; /* We've moved past original first_dirty */
> }
> }
>
> TRACE_0D(TRC_PGALLOC_PT3); // <=================================
>
> ASSERT(avail[node][zone] >= request);
> avail[node][zone] -= request;
> total_avail_pages -= request;
> ASSERT(total_avail_pages >= 0);
>
> check_low_mem_virq();
>
> if ( d != NULL )
> d->last_alloc_node = node;
>
> for ( i = 0; i < (1 << order); i++ )
> {
> /* Reference count must continuously be zero for free pages. */
> if ( (pg[i].count_info & ~PGC_need_scrub) != PGC_state_free )
> {
> printk(XENLOG_ERR
> "pg[%u] MFN %"PRI_mfn" c=%#lx o=%u v=%#lx t=%#x\n",
> i, mfn_x(page_to_mfn(pg + i)),
> pg[i].count_info, pg[i].v.free.order,
> pg[i].u.free.val, pg[i].tlbflush_timestamp);
> BUG();
> }
>
> /* PGC_need_scrub can only be set if first_dirty is valid */
> ASSERT(first_dirty != INVALID_DIRTY_IDX || !(pg[i].count_info & PGC_need_scrub));
>
> /* Preserve PGC_need_scrub so we can check it after lock is dropped. */
> pg[i].count_info = PGC_state_inuse | (pg[i].count_info & PGC_need_scrub);
>
> if ( !(memflags & MEMF_no_tlbflush) )
> accumulate_tlbflush(&need_tlbflush, &pg[i],
> &tlbflush_timestamp);
>
> /* Initialise fields which have other uses for free pages. */
> pg[i].u.inuse.type_info = 0;
> page_set_owner(&pg[i], NULL);
>
> /* Ensure cache and RAM are consistent for platforms where the
> * guest can control its own visibility of/through the cache.
> */
> flush_page_to_ram(mfn_x(page_to_mfn(&pg[i])),
> !(memflags & MEMF_no_icache_flush));
> }
>
> TRACE_0D(TRC_PGALLOC_PT4); // <=================================
> spin_unlock(&heap_lock);
>
> if ( first_dirty != INVALID_DIRTY_IDX ||
> (scrub_debug && !(memflags & MEMF_no_scrub)) )
> {
> for ( i = 0; i < (1U << order); i++ )
> {
> if ( test_bit(_PGC_need_scrub, &pg[i].count_info) )
> {
> if ( !(memflags & MEMF_no_scrub) )
> scrub_one_page(&pg[i]);
>
> dirty_cnt++;
>
> spin_lock(&heap_lock);
> pg[i].count_info &= ~PGC_need_scrub;
> spin_unlock(&heap_lock);
> }
> else if ( !(memflags & MEMF_no_scrub) )
> check_one_page(&pg[i]);
> }
>
> if ( dirty_cnt )
> {
> spin_lock(&heap_lock);
> node_need_scrub[node] -= dirty_cnt;
> spin_unlock(&heap_lock);
> }
> }
>
> TRACE_0D(TRC_PGALLOC_PT5); // <=================================
> if ( need_tlbflush )
> filtered_flush_tlb_mask(tlbflush_timestamp);
>
> TRACE_0D(TRC_PGALLOC_PT6); // <=================================
>
> return pg;
> }
>
>
> And wrote a simple Python scripts that parses the output of
> xentrace. There are results for different orders:
>
> 46.842032: page_alloc trace point 1. Order: 18
> 46.842035: page_alloc trace point 2 (+ 0.000003)
> 46.842035: page_alloc trace point 3 (+ 0.000000)
> 46.975105: page_alloc trace point 4 (+ 0.133069)
> 46.975106: page_alloc trace point 5 (+ 0.000001)
> 46.975106: page_alloc trace point 6 (+ 0.000000): total: 0.133074
>
> 46.998536: page_alloc trace point 1. Order: 9
> 46.998538: page_alloc trace point 2 (+ 0.000002)
> 46.998540: page_alloc trace point 3 (+ 0.000001)
> 46.998799: page_alloc trace point 4 (+ 0.000259)
> 46.998800: page_alloc trace point 5 (+ 0.000000)
> 46.998800: page_alloc trace point 6 (+ 0.000000): total: 0.000264
>
> 46.835802: page_alloc trace point 1. Order: 3
> 46.835803: page_alloc trace point 2 (+ 0.000000)
> 46.835803: page_alloc trace point 3 (+ 0.000000)
> 46.835812: page_alloc trace point 4 (+ 0.000009)
> 46.835813: page_alloc trace point 5 (+ 0.000000)
> 46.835813: page_alloc trace point 6 (+ 0.000001): total: 0.000011
>
> 46.998815: page_alloc trace point 1. Order: 0
> 46.998816: page_alloc trace point 2 (+ 0.000002)
> 46.998817: page_alloc trace point 3 (+ 0.000000)
> 46.998818: page_alloc trace point 4 (+ 0.000002)
> 46.998819: page_alloc trace point 5 (+ 0.000001)
> 46.998819: page_alloc trace point 6 (+ 0.000000): total: 0.000005
>
> Then I commented out the call to flush_page_to_ram() and got the following
> results:
>
> 149.561902: page_alloc trace point 1. Order: 18
> 149.561905: page_alloc trace point 2 (+ 0.000003)
> 149.561905: page_alloc trace point 3 (+ 0.000000)
> 149.569450: page_alloc trace point 4 (+ 0.007545)
> 149.569451: page_alloc trace point 5 (+ 0.000001)
> 149.569452: page_alloc trace point 6 (+ 0.000000): total: 0.007550
>
> 149.592624: page_alloc trace point 1. Order: 9
> 149.592626: page_alloc trace point 2 (+ 0.000003)
> 149.592627: page_alloc trace point 3 (+ 0.000001)
> 149.592639: page_alloc trace point 4 (+ 0.000012)
> 149.592639: page_alloc trace point 5 (+ 0.000000)
> 149.592640: page_alloc trace point 6 (+ 0.000000): total: 0.000016
>
> All time units are seconds, by the way.
>
>
> --
> Volodymyr Babchuk at EPAM
Re: IRQ latency measurements in hypervisor
On Thu, 21 Jan 2021, Volodymyr Babchuk wrote:
> > On Fri, 15 Jan 2021, Volodymyr Babchuk wrote:
> [...]
> >> >> [ 83.873162] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
> >> >> [ 94.136632] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
> >> >> [ 104.400098] rt_eval_tmu e6fc0000.tmu: Mean: 50 (1500 ns) stddev: 129 (3870 ns)
> >> >> [ 114.663828] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
> >> >> [ 124.927296] rt_eval_tmu e6fc0000.tmu: Mean: 56 (1680 ns) stddev: 183 (5490 ns)
> >> >>
> >> >> This is the baremetal Linux. And there is Dom0:
> >> >>
> >> >> [ 237.431003] rt_eval_tmu e6fc0000.tmu: Mean: 306 (9180 ns) stddev: 25 (750 ns)
> >> >> [ 247.694506] rt_eval_tmu e6fc0000.tmu: Mean: 302 (9060 ns) stddev: 17 (510 ns)
> >> >>
> >> >> Driver outputs both the raw timer value (eg. 4) and the same value
> >> >> scaled to nanoseconds (eg. 1320 ns). As you can see baremetal setup is
> >> >> much faster. But experiments showed that Linux does not provide
> >> >> consistent values, even when running in baremetal mode. You can see
> >> >> sporadic spikes in "std dev" values.
> >> >
> >> > So baremetal IRQ latency is 1320-1680ns and Linux IRQ latency is
> >> > 9060-9180ns. I am not surprised that Linux results are inconsistent but
> >> > I have a couple of observations:
> >> >
> >> > - 9us is high for Linux
> >> > If the system is idle, the latency should be lower, around 2-3us. I
> >> > imagine you are actually running some sort of interference from dom0? Or
> >> > using RTDS and descheduling vCPUs?
> >>
> >> 9us was in the idle state. Interestingly enough, I got a latency of 3us while
> >> Dom0 was doing some CPU-intensive tasks. So, under load the latency is lower
> >> than in the idle state. I didn't investigate this, so I can't tell you what
> >> causes this behavior.
> >
> > Did you use vwfi=native? You should definitely be able to see ~3us
> > without interference and with 1vCPU <-> 1pCPU
>
> With vwfi=native things in Dom0 got better, but it is still not 3us:
>
> [ 41.563904] rt_eval_tmu e6fc0000.tmu: Mean: 173 (5190 ns) stddev: 15 (450 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)
> [ 51.820403] rt_eval_tmu e6fc0000.tmu: Mean: 172 (5160 ns) stddev: 8 (240 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)
> [ 62.076898] rt_eval_tmu e6fc0000.tmu: Mean: 172 (5160 ns) stddev: 9 (270 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)
> [ 72.333377] rt_eval_tmu e6fc0000.tmu: Mean: 171 (5130 ns) stddev: 9 (270 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)
> [ 82.589852] rt_eval_tmu e6fc0000.tmu: Mean: 172 (5160 ns) stddev: 10 (300 ns) min: 125 (3750 ns) max: 276102 (8283060 ns)
>
> And DomU (as Julien suggested) is extremely unhappy:
>
> Mean: 282 (8460 ns) stddev: 412 (12360 ns) above thr: 4% [109 (3270 ns) - 2245 (67350 ns)] global [108 (3240 ns) 74261 (2227830 ns)]
> Mean: 958 (28740 ns) stddev: 860 (25800 ns) above thr: 50% [120 (3600 ns) - 2253 (67590 ns)] global [108 (3240 ns) 74261 (2227830 ns)]
> Mean: 713 (21390 ns) stddev: 533 (15990 ns) above thr: 37% [114 (3420 ns) - 2186 (65580 ns)] global [108 (3240 ns) 74261 (2227830 ns)]

I should have clarified "without oversubscription". In other words, with
every vcpu running on a different pcpu. I realize this is not the
scenario you care about, but it was mostly to calculate the best latency
overhead we can achieve on your hardware platform, as a reference.


> >> >> So my next step was to use proper RT OS to do the measurements. I
> >> >> chose Zephyr. My PR that adds Xen support to Zephyr can be found at
> >> >> [1]. Support for RCAR Gen3 is not upstreamed, but is present on my
> >> >> GitHub([2]). At [3] you can find the source code for application that
> >> >> does the latency measurements. It behaves exactly as my linux driver,
> >> >> but provides a bit more information:
> >> >>
> >> >> *** Booting Zephyr OS build zephyr-v2.4.0-2750-g0f2c858a39fc ***
> >> >> RT Eval app
> >> >>
> >> >> Counter freq is 33280000 Hz. Period is 30 ns
> >> >> Set alarm in 0 sec (332800 ticks)
> >> >> Mean: 600 (18000 ns) stddev: 3737 (112110 ns) above thr: 0% [265 (7950 ns) - 66955 (2008650 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
> >> >> Mean: 388 (11640 ns) stddev: 2059 (61770 ns) above thr: 0% [266 (7980 ns) - 58830 (1764900 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
> >> >> Mean: 358 (10740 ns) stddev: 1796 (53880 ns) above thr: 0% [265 (7950 ns) - 57780 (1733400 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
> >>
> >> This is Zephyr running as DomU.
> >>
> >> >> ...
> >> >>
> >> >> So there you can see: mean time, standard deviation, % of interrupts
> >> >> that was processed above 30us threshold, minimum and maximum latency
> >> >> values for the current 10s run, global minimum and maximum.
> >> >>
> >> >> Zephyr running as baremetal showed very stable results (this is an
> >> >> older build, so no extended statistics there):
> >> >>
> >> >> ## Starting application at 0x480803C0 ...
> >> >> *** Booting Zephyr OS build zephyr-v2.4.0-1137-g5803ee1e8183 ***
> >> >> RT Eval app
> >> >>
> >> >> Counter freq is 33280000 Hz. Period is 30 ns
> >> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
> >> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
> >> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
> >> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
> >> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
> >> >> Mean: 31 (930 ns) stddev: 0 (0 ns)
> >>
>> And this is Zephyr running as baremetal.
> >>
> >> >> ...
> >> >>
> >> >> As Zephyr provided stable readouts with no jitter, I used it to do all
> >> >> subsequent measurements.
> >> >
> >> > I am a bit confused here. Looking at the numbers above the stddev is
> >> > 112110 ns in the first instance. That is pretty high. Am I looking at
> >> > the wrong numbers?
> >>
> >> I added some clarification above. As for the 112110ns in the very first instance
> >> - I always ignored the first instance, assuming that things need to
> >> settle after the domain is created.
> >>
> >> But your comment is actually correct: what exactly should "settle"?
> >> The domain is already created. Everything should run smoothly. So, this is worth
> >> investigating.
> >
> > It is fair to ignore the first 2 measurements.
> >
> > However, the numbers above have high stddev even at the 3rd measurement:
> > 53us is high and above the target 30us. I take it you didn't yet apply the
> > alloc_heap_pages and serial workarounds?
>
> Yes, this is the very first try, without workarounds or proper
> scheduler configuration.
>
> With all fixes it is much better:
>
> Mean: 307 (9210 ns) stddev: 4 (120 ns) above thr: 0% [265 (7950 ns) - 376 (11280 ns)] global [265 (7950 ns) 376 (11280 ns)]
> Mean: 306 (9180 ns) stddev: 3 (90 ns) above thr: 0% [263 (7890 ns) - 360 (10800 ns)] global [263 (7890 ns) 376 (11280 ns)]
> Mean: 306 (9180 ns) stddev: 3 (90 ns) above thr: 0% [298 (8940 ns) - 373 (11190 ns)] global [263 (7890 ns) 376 (11280 ns)]
> Mean: 307 (9210 ns) stddev: 17 (510 ns) above thr: 0% [265 (7950 ns) - 856 (25680 ns)] global [263 (7890 ns) 1251 (37530 ns)]
> Mean: 304 (9120 ns) stddev: 7 (210 ns) above thr: 0% [286 (8580 ns) - 486 (14580 ns)] global [263 (7890 ns) 1251 (37530 ns)]
> Mean: 307 (9210 ns) stddev: 47 (1410 ns) above thr: 0% [260 (7800 ns) - 1299 (38970 ns)] global [260 (7800 ns) 1299 (38970 ns)]
> Mean: 291 (8730 ns) stddev: 41 (1230 ns) above thr: 0% [250 (7500 ns) - 1324 (39720 ns)] global [250 (7500 ns) 1324 (39720 ns)]

OK, this is much better as a baseline.
Re: IRQ latency measurements in hypervisor
Hi Dario,

Dario Faggioli writes:

> On Thu, 2021-01-14 at 15:33 -0800, Stefano Stabellini wrote:
>> On Tue, 12 Jan 2021, Volodymyr Babchuk wrote:
>> > 2. RTDS scheduler. With console disabled, things like "hexdump -v
>> > /dev/zero" didn't affect the latency so badly, but anyway, sometimes
>> > I got ~600us spikes. This is not a surprise, because of the default
>> > RTDS configuration. I changed the period for DomU from the default
>> > 10ms to 100us and things got better: with Dom0 burning CPU I am
>> > rarely getting a max latency of about ~30us with a mean latency of
>> > ~9us and a deviation of ~0.5us. On the other hand, when I tried to
>> > set the period to 30us, the max latency rose up to ~60us.
>>
>> This is very interesting too. Did you get any spikes with the period
>> set to 100us? It would be fantastic if there were none.
>>
> This *probably* makes some sense. The *probably* comes from the
> fact that all the following reasoning assumes that what I recall about
> real-time scheduling theory is correct, on which I would not bet.
>
> Perhaps Stefano can ask my good old friend Marko Bertogna, from the
> University of Modena, as they're collaborating on cache-coloring, what
> he and his team think. He was already much better than me with these things,
> back in the days of the Ph.D... So for sure he's much better than me
> now! :-)
>
> Anyway, as I was saying, having a latency which is ~ 2x of your period
> is ok, and it should be expected (when you size the period). In fact,
> let's say that your budget is Xus, and your period is 30us. This means
> that you get to execute for Xus every 30us. So, basically, at time t0
> you are given a budget of Xus and you are guaranteed to be able to use
> it all within time t1=t0+30us. At that time (t1=t0+30us) you are given
> another Xus amount of budget, and you are guaranteed to be able to use
> it all within t2=t1+30us=t0+60us.

Well, I'm not sure if I got you right. Are you saying that unused budget
is preserved?

If I understood it correctly, any unused budget is lost. So, basically,
RTDS guarantees that your vcpu will be able to run Xus in total every Yus,
where X is the budget and Y is the period. Also, it does not guarantee
that the vCPU will be scheduled right away, so for a period of 100us and
a budget of 30us it is perfectly fine to have a latency of 70+us:

If at t=0 a new period begins and at the same time an IRQ fires, but RTDS
decides to run another task, it is possible that your vCPU will be
scheduled only at t=70us.

> Now, with a period as small as 30us, your budget is also going to be
> pretty small (how much was that? If it was in your mail, I must have
> missed it). Are you sure that the vCPU is able to wake up and run until
> the point that your driver has done all the latency measurements in
> _just_one_ instance (i.e., all this takes less than the budget)?

> In fact, let's say your budget is 10us, and the vCPU needs 15us for
> waking up and doing the measurements. At time t0 the vCPU is scheduled,
> and let's say that the latency at that time is exactly 0. The vCPU
> starts to run but, at time t0+10us (where 10us is the budget) it is
> stopped. At time t1=t0+30us, the vCPU receives a budget replenishment
> but that does not mean that it will start to run immediately, if the
> system is busy.
>
> In fact, what RTDS guarantees is that the vCPU will be able to execute
> for 10us within time t2=t1+30us. So, in theory, it can start to run as
> late as t2-10us, without violating any guarantee.
>
> Suppose that is, in fact, what happens: the vCPU starts to run only at
> t2-10us, and it is only then that it manages to finish computing and
> recording the latency (after running for 5us more, as we said it takes
> 15us).
>
> What it will therefore record would be a latency to t2-5us, which in
> fact is:
>
> t1 + 30us - 5us = t0 + 30us + 30us - 5us =
> = 0 + 60us - 5us = 55us ~= 60us
>
> So... May this be the case?

Yes, probably this is the issue. I used a budget of 15us in this
case. But, taking into account that the minimal observed latency is 3us and
the typical one is ~10us, it is quite possible that the budget will be emptied
before the IRQ handler has a chance to complete.

It would be great to have priorities in RTDS, so a more critical task can
preempt a less critical one. I believe I have seen corresponding patches
somewhere...

--
Volodymyr Babchuk at EPAM
Re: IRQ latency measurements in hypervisor
Hi Andrew,

Andrew Cooper writes:

> On 12/01/2021 23:48, Volodymyr Babchuk wrote:
>> Hello community,
>>
>> Recently I was experimenting with context switching latencies in
>> Xen. I want to share my findings and discuss some issues I have found.
>>
>> My target was to ensure that no more 30us passes between interrupt
>> generation and handling in a guest VM. Main goal was to determine if
>> Xen is ready for soft RT and to identify main issues that prevent
>> this.
>>
>> TLDR: I was able to achieve mean 15us latency with tuned RTDS
>> scheduler, disabled serial console and some hacks to the Xen
>> toolstack. This was done on a setup with 1 PCPU.
>>
>> Test setup.
>>
>> ARMv8 platform. Namely Renesas Rcar H3 SoC on Salvator board.
>>
>> To accurately determine latency, I employed one of timer counter units
>> (TMUs) available on the SoC. This is 32-bit timer with auto-reload,
>> that can generate interrupt on underflow. I fed it with 33.275MHz
>> clock, which gave me resolution of about 30ns. I programmed the timer
>> to generate interrupt every 10ms. My ISR then read the current timer
>> value and determined how much time passed since last underrun. This
>> gave me time interval between IRQ generation and ISR invocation.
>> Those values were collected and every 10 seconds statistics was
>> calculated. There is an example of output from my Linux driver:
>>
>> [ 83.873162] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> [ 94.136632] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> [ 104.400098] rt_eval_tmu e6fc0000.tmu: Mean: 50 (1500 ns) stddev: 129 (3870 ns)
>> [ 114.663828] rt_eval_tmu e6fc0000.tmu: Mean: 44 (1320 ns) stddev: 8 (240 ns)
>> [ 124.927296] rt_eval_tmu e6fc0000.tmu: Mean: 56 (1680 ns) stddev: 183 (5490 ns)
>>
>> This is the baremetal Linux. And there is Dom0:
>>
>> [ 237.431003] rt_eval_tmu e6fc0000.tmu: Mean: 306 (9180 ns) stddev: 25 (750 ns)
>> [ 247.694506] rt_eval_tmu e6fc0000.tmu: Mean: 302 (9060 ns) stddev: 17 (510 ns)
>>
>> Driver outputs both the raw timer value (eg. 4) and the same value
>> scaled to nanoseconds (eg. 1320 ns). As you can see baremetal setup is
>> much faster. But experiments showed that Linux does not provide
>> consistent values, even when running in baremetal mode. You can see
>> sporadic spikes in "std dev" values.
>>
>> So my next step was to use proper RT OS to do the measurements. I
>> chose Zephyr. My PR that adds Xen support to Zephyr can be found at
>> [1]. Support for RCAR Gen3 is not upstreamed, but is present on my
>> GitHub([2]). At [3] you can find the source code for application that
>> does the latency measurements. It behaves exactly as my linux driver,
>> but provides a bit more information:
>>
>> *** Booting Zephyr OS build zephyr-v2.4.0-2750-g0f2c858a39fc ***
>> RT Eval app
>>
>> Counter freq is 33280000 Hz. Period is 30 ns
>> Set alarm in 0 sec (332800 ticks)
>> Mean: 600 (18000 ns) stddev: 3737 (112110 ns) above thr: 0% [265 (7950 ns) - 66955 (2008650 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> Mean: 388 (11640 ns) stddev: 2059 (61770 ns) above thr: 0% [266 (7980 ns) - 58830 (1764900 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> Mean: 358 (10740 ns) stddev: 1796 (53880 ns) above thr: 0% [265 (7950 ns) - 57780 (1733400 ns)] global [265 (7950 ns) 66955 (2008650 ns)]
>> ...
>>
>> So there you can see: mean time, standard deviation, % of interrupts
>> that was processed above 30us threshold, minimum and maximum latency
>> values for the current 10s run, global minimum and maximum.
>>
>> Zephyr running as baremetal showed very stable results (this is an
>> older build, so no extended statistics there):
>>
>> ## Starting application at 0x480803C0 ...
>> *** Booting Zephyr OS build zephyr-v2.4.0-1137-g5803ee1e8183 ***
>> RT Eval app
>>
>> Counter freq is 33280000 Hz. Period is 30 ns
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> Mean: 31 (930 ns) stddev: 0 (0 ns)
>> ...
>>
>> As Zephyr provided stable readouts with no jitter, I used it to do all
>> subsequent measurements.
>>
>>
>> IMPORTANT! All subsequent tests were conducted with only 1 CPU core
>> enabled. My goal was to ensure that the system can react in a timely manner
>> to an external interrupt even under load.
>>
>>
>> Test results and latency sources
>>
>> As you can see, the baremetal OS provides a latency of ~0.9us without any
>> deviation. The same code running as a DomU on an idle system shows a mean
>> latency of 12us and a deviation of about 10us. The range of latencies in a
>> 10s batch can vary from 8us to 25us. This fits into the required 30us
>> threshold, so no big issue there.
>
> This is all very interesting work - thank you for sharing.
>
> However, there is one set of measurements which I think is missing, and
> is needed to describe the full picture.
>
> You've got measurements for a native kernel taking interrupts, with
> excellent results for Zephyr, and good results for Linux.
>
> You've also got measurements for putting Xen into the mix, and the
> results could do with some work.
>
> Adding Xen totally changes the structure of the system. Conceptually
> (if the Xen model were expressed as native), what you've got now is a
> single piece of userspace (pid1, no fork()ing), which is single threaded
> (but doing its own private task switching between "threads"), where some
> threads are making blocking system calls (all Xen hypercalls are fully
> synchronous) while other threads are measuring the latency for a signal
> to hit.

Not quite. I used Zephyr to do most of the measurements, and
it ran as a DomU. So, conceptually it was pid1 (Dom0) with a single
thread (vCPU) and pid2 (Zephyr) with a single thread. Pid1 is doing some
blocking calls and pid2 is just waiting for its signal so it can measure
latency. Thus, my setup more resembles a system with two processes rather
than one.

> I think it would be incredibly valuable to mock something like that up
> as a 3rd set of measurements, because it allows for a fairer comparison
> of the rough structure of a virtualised system, with and without Xen
> being involved.

Well, I probably can set up something similar in Linux. This will allow me to
compare Linux and Xen. Of course, it would be interesting, but I'm not
sure that this comparison will provide any new information. I am
examining Xen to see if it can fit some arbitrary requirements for
soft RT. I doubt that Linux will behave better in this scenario, but
even if it does - Linux and Xen are quite different, so it is not easy
to "port" something scheduling-related from one to the other.

>
> It would also be very valuable to see where the latency is in the
> system. Would it be possible to set up a shared page, and have the
> various moving parts write the counter value they see into it, so the
> final stats can do a breakdown?
>
> You probably want counter samples from:
> 1) When Xen has positively identified this single interrupt
> 2) When the interrupt has been marked pending for the guest
> 3) When the guest kernel has positively identified this single interrupt
> 4) When the interrupt handler has run (this is the datapoint you've got,
> AFAICT).

I am using the tracing infrastructure already available in Xen to do such
measurements. I described how I am doing this in my reply to Julien. Of
course, it covers only the hypervisor part, so it is impossible to cover
points 3 and 4. On the other hand, Zephyr is quite consistent in its own
latency. Basically, it does nothing other than wait for that single
interrupt. I assume it is safe to take the baremetal measurements as the
delay caused by Zephyr itself.

> Maybe for point 2, you also want some metadata as to what context was
> interrupted (EL0..2).
>
> You ought to see reasonably consistent results when interrupting EL0,
> because that should transition immediately to EL1. For interrupting
> EL1, you'll perhaps see variation depending on whether other higher
> priority interrupts are being serviced.

Yes, I am planning to do that sort of measurement. One downside is that
you can use the TRACE_XD macros only from C code, so it would not yield the
most accurate result.

On the other hand, I am experimenting with hardware tracing. It is somewhat
hard to configure, but I hope I'll be able to find a tracing configuration
that does not cause trace buffer overflows. In that case, I will get the
most accurate timing data possible.

> For interrupting EL2, this is where I'd expect you to see all the
> problems with hypercalls, or potentially other work Xen is doing.
> Fundamentally, Xen can't just drop what it's doing and resume guest
> context for the interrupt to be delivered.
>
> We get around this with the hypercall continuation mechanism, which is
> conceptually similar to voluntary preemption (we periodically check to
> see if we should stop doing what we're doing, unwind state safely, and
> get back out into guest context).
>
> However, even using hypercall continuations, there will be a minimum
> unit of work, necessary to guarantee forward progress, which cannot be
> subdivided further. There is also a cost to having finer grain
> continuation checks, but hopefully that is confined to the error margins.
>

I see. Another problem with hypercall continuations is that there is a
single point of preemption. If a hypercall invokes some long-running
function that can't be split into smaller pieces, like the already mentioned
alloc_heap_pages(), it is impossible to make it preemptible.
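
For reference, the pattern looks roughly like this (a sketch written from
memory, loosely modeled on how XENMEM_populate_physmap resumes itself; the
names and the continuation encoding are illustrative, not verbatim Xen code):

static long sketch_populate(unsigned int op,
                            XEN_GUEST_HANDLE_PARAM(void) arg,
                            unsigned long nr_extents,
                            unsigned long start_extent)
{
    unsigned long i;

    for ( i = start_extent; i < nr_extents; i++ )
    {
        /* Between extents, check whether we should yield the pCPU. */
        if ( i != start_extent && hypercall_preempt_check() )
            /* Stash the progress made so far into the re-issued arguments. */
            return hypercall_create_continuation(__HYPERVISOR_memory_op, "lh",
                                                 op | (i << MEMOP_EXTENT_SHIFT),
                                                 arg);

        /*
         * One extent worth of work goes here.  A single order-18 extent means
         * one long, non-preemptible alloc_heap_pages() call, which the check
         * above cannot interrupt.
         */
    }

    return 0;
}

So the preemption granularity is bounded from below by the largest indivisible
piece of work inside the loop.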

>
> Alternatively, if we still have problems, something to consider might be
> the task domain concept which has been discussed before. This was an
> idea to try and bring something more like normal tasks to Xen, by having
> a hypervisor context that the control domain could queue work with,
> who's running wouldn't be a synchronous hypercall from from the guest's
> perspective.
>
> Such a task domain would require a mechanism to ensure it got scheduled
> with a high enough priority not to get starved, and would probably
> benefit if we restructured Xen to have fewer irq-off areas (which would
> be a great thing irrespective - the scheduler lock in particular needs
> to become irqs-off). In other words, the task domain concept is a lot
> of work to do, with only a general hope of it resulting in an
> improvement - it shouldn't be jumped upon lightly to solve perf
> problems.

A task domain seems like an extra entity. How is it better than just having
native Xen threads? E.g. the Linux kernel does not have a separate process to
hold its own threads.

As I see it, the only reason why a hypercall can't be preempted is the
absence of a place where its context can be stored. Hence all this black
magic of using hypercall arguments as context storage.

I'm sure that I am oversimplifying and that I'm missing some key aspect,
but how hard would it be to just allocate a Xen stack for every vCPU and store
its context there? All Xen code is already made thread-safe because of
SMP. So why not just allow preemption at any point outside of atomic context?
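
Conceptually, I imagine something along these lines (purely illustrative, not
a patch and not how Xen is structured today):

#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical per-vCPU hypervisor context that would make a hypercall
 * preemptible at (almost) any point: instead of encoding progress into the
 * hypercall arguments, the whole in-hypervisor execution state is parked
 * here and resumed later.
 */
struct vcpu_hyp_context {
    void     *hyp_stack;       /* dedicated per-vCPU hypervisor stack */
    uint64_t  saved_sp;        /* stack pointer at the preemption point */
    uint64_t  saved_pc;        /* where to resume the interrupted hypercall */
    uint64_t  saved_regs[31];  /* general purpose registers */
    bool      in_hypercall;    /* true if preempted in the middle of one */
};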

--
Volodymyr Babchuk at EPAM
Re: IRQ latency measurements in hypervisor
On Thu, 2021-01-21 at 01:20 +0000, Volodymyr Babchuk wrote:
> Hi Dario,
>
Hi :-)

> Dario Faggioli writes:
> > Anyway, as I was saying, having a latency which is ~ 2x of your period
> > is ok, and it should be expected (when you size the period). In fact,
> > let's say that your budget is Xus, and your period is 30us. This means
> > that you get to execute for Xus every 30us. So, basically, at time t0
> > you are given a budget of Xus and you are guaranteed to be able to use
> > it all within time t1=t0+30us. At that time (t1=t0+30us) you are given
> > another Xus amount of budget, and you are guaranteed to be able to use
> > it all within t2=t1+30us=t0+60us.
>
> Well, I'm not sure if I got you right. Are you saying that unused
> budget
> is preserved?
>
No, it's not preserved and that's not what I meant... sorry if I did
not manage to make myself clear.

> If I understood it correctly, any unused budget is lost. So, basically,
> RTDS guarantees that your vcpu will be able to run Xus in total every
> Yus,
> where X is the budget and Y is the period. 
>
Yep. Every Y, you are given X and you can use it. When, after Y, a new
replenishment happens, you are given X again, so if you did not use all
of X during the previous period, the amount you didn't use is lost.

> Also, it does not guarantee
> that vCPU will be scheduled right away, so for period of 100us and
> budget of 30us it will be perfectly fine to have latency of 70+us:
>
Exactly, that is more or less what I was trying to say myself.

In fact, as you say, with period 100 and budget 30, it is perfectly
"legal" for the vcpu to start executing at 70 after the replenishment
(i.e., after the beginning of a new period).

And that is also why, theoretically, the maximum latency between two
consecutive executions of a vcpu could be, in the worst possible case,
2*P-2*B (where, this time, P is the period and B is the budget).
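
To put numbers on it, here is a tiny, purely illustrative program that
computes both bounds for the two configurations we have been discussing
(all values in microseconds):

#include <stdio.h>

int main(void)
{
    /* { period P, budget B } in microseconds */
    const unsigned int cases[][2] = { { 100, 30 }, { 30, 10 } };

    for ( unsigned int i = 0; i < sizeof(cases) / sizeof(cases[0]); i++ )
    {
        unsigned int P = cases[i][0], B = cases[i][1];

        /*
         * P - B is the latest "legal" start after a replenishment;
         * 2*P - 2*B is the worst-case gap between two consecutive runs.
         */
        printf("P=%uus B=%uus: worst start = %uus, worst gap = %uus\n",
               P, B, P - B, 2 * P - 2 * B);
    }

    return 0;
}

For P=100us and B=30us this prints a worst start of 70us and a worst gap of
140us, which matches the 70+us you describe below as being perfectly "legal".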

> If at t=0 a new period begins and at the same time an IRQ fires, but RTDS
> decides to run another task, it is possible that your vCPU will be
> scheduled at only t=70us.
>
Exactly.

> > What it will therefore record would be a latency to t2-5us, which in
> > fact is:
> >
> >   t1 + 30us - 5us = t0 + 30us + 30us - 5us = 0 + 60us - 5us = 55us ~= 60us
> >
> > So... May this be the case?
>
> Yes, probably this is the issue. I used budget of 15us in this
> case. But, taking into account that minimal observed latency is 3us
> and
> typical is ~10us, it is quite possible that budget will be emptied
> before IRQ handler will have a chance to complete.
>
Indeed.

> It would be great to have priorities in RTDS, so more critical task
> can
> preempt less critical one. 
>
Well, I see where this comes from, but it's a dynamic scheduler, so the
priority is given by the (absolute) deadline. The challenge, therefore,
is to determine the proper period and budget to be assigned to each
vcpu, given your workload and your requirements.

With fixed priority schedulers, this "modelling" part could be easier,
but then the schedulers themselves usually have worse performance and
only allow you to utilize a fraction of the resources that you have
available.

And adding some kind of priority on top of a dynamic scheduler such as
RTDS --although it may sound appealing when thinking about it-- would
mess up both the theoretical analysis and the implementation of the
algorithm itself, believe me. :-)
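
Just to illustrate what "the priority is given by the (absolute) deadline"
means, here is a conceptual sketch of the EDF idea behind RTDS (this is *not*
the actual RTDS code, only the ordering rule it relies on):

#include <stdbool.h>
#include <stdint.h>

struct edf_vcpu {
    uint64_t cur_deadline;  /* absolute time at which the current period ends */
    uint64_t cur_budget;    /* budget left in the current period */
};

/*
 * Under EDF the "priority" is dynamic: among the runnable vcpus, the one
 * whose absolute deadline expires first is the one that should run.
 */
static bool edf_higher_prio(const struct edf_vcpu *a, const struct edf_vcpu *b)
{
    return a->cur_deadline < b->cur_deadline;
}

A static priority bolted on top would break exactly this ordering, which is
what the schedulability analysis is built on.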

> I believe, I have seen corresponding patches
> somewhere...
>
Mmm... not sure. Maybe you're referring to how "extratime" is handled?
Because there is a concept of priority in the way it's handled, but
it's mostly an implementation detail.

That being said, was extratime enabled, in your experiments, for the
various vcpus?

Of course, it's not that you can rely on a particular vcpu to be able
to take advantage of it, because that's not how it works. But at the same
time it does not disrupt the real-time guarantees so it's safe to
use... It may be interesting to give it a try.

Regards
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)
Re: IRQ latency measurements in hypervisor
Hi Volodymyr,

On 20/01/2021 23:03, Volodymyr Babchuk wrote:
> Julien Grall writes:
>>>>> This is very interesting too. Did you get any spikes with the
>>>>> period
>>>>> set to 100us? It would be fantastic if there were none.
>>>>>
>>>>>> 3. Huge latency spike during domain creation. I conducted some
>>>>>> additional tests, including use of PV drivers, but this didn't
>>>>>> affect the latency in my "real time" domain. But an attempt to
>>>>>> create another domain with a relatively large memory size of 2GB led
>>>>>> to a huge spike in latency. Debugging led to this call path:
>>>>>>
>>>>>> XENMEM_populate_physmap -> populate_physmap() ->
>>>>>> alloc_domheap_pages() -> alloc_heap_pages()-> huge
>>>>>> "for ( i = 0; i < (1 << order); i++ )" loop.
>>>>
>>>> There are two for loops in alloc_heap_pages() using this syntax. Which
>>>> one are you referring to?
>>> I did some tracing with Lauterbach. It pointed to the first loop and
>>> especially to the flush_page_to_ram() call, if I remember correctly.
>>
>> Thanks, I am not entirely surprised because we are cleaning and
>> invalidating the region line by line and across all the CPUs.
>>
>> If we assume a 128-byte cacheline, we will need to issue 32 cache
>> instructions per page. This is going to involve quite a bit of traffic on
>> the system.
>>
>> One possibility would be to defer the cache flush when the domain is
>> created and use the hypercall XEN_DOMCTL_cacheflush to issue the
>> flush.
>
> Can we flush caches on the first access to a page? What I mean is: do not
> populate stage 2 tables with the allocated memory. Flush caches in the abort
> handler and then populate them.

We are using a similar approach for implementing set/way but it only
works as long as you don't assign a device (with or without an IOMMU).

Currently, it is quite uncommon to have devices that are able to restart
a DMA transaction after faulting. So we would not be able to share the
P2M with the IOMMU if we populate the P2M on the first access (a device may
be the first to access the memory).

Even if we decided to unshare the P2M, there would still be a trust
problem if the device is non-coherent (i.e. it cannot snoop the cache).

The same can be said if you are not protecting the device with an IOMMU
(you may have an MPU or something different on the system).

The only case where we could possibly disable the flush is when your
memory is statically partitioned, as the guest would always receive the
same pages.

[...]

>>
>> Do you have any suggestion how to split it?
>>
>
> Well, it is quite a complex function and I can't tell right away.
> At this time I don't quite understand why spin_unlock() is called after
> the first (1 << order) loop, for instance.

The part after the lock is dropped will check whether the pages are scrubbed.
The content of the pages cannot be touched by anyone else (we made
sure the scrub task removed them from its list), therefore the
operation is fine to be done without the heap_lock held.

>
> Also, this function does many different things for its simple name.

Everything in this function needs to happen before the page can safely
be handed out to another part of Xen or a guest. AFAICT, the only thing
that can possibly be split out is the call to flush_page_to_ram().

>
>>>
>>>> I think the first step is we need to figure out which part of the
>>>> allocation is slow (see my question above). From there, we can figure
>>>> out if there is a way to reduce the impact.
>>> I'll do more tracing and will return with more accurate numbers.
>>> But as far as I can see, any loop on 262144 pages will take some time..
>> .
>>
>> It really depends on the content of the loop. On any modern
>> processor, you are very likely not going to notice a loop that updates
>> just a flag.
>>
>> However, you are likely going to see an impact if your loop is
>> going to clean & invalidate the cache for each page.
>>
>
> Totally agree. I used the Xen tracing subsystem to do the measurements and I
> can confirm that the call to flush_page_to_ram() causes most of the impact.
>
>
> Here are the details:
>
>
> I added a number of tracing points to the function:
>
> static struct page_info *alloc_heap_pages(
> unsigned int zone_lo, unsigned int zone_hi,
> unsigned int order, unsigned int memflags,
> struct domain *d)
> {
> nodeid_t node;
> unsigned int i, buddy_order, zone, first_dirty;
> unsigned long request = 1UL << order;
> struct page_info *pg;
> bool need_tlbflush = false;
> uint32_t tlbflush_timestamp = 0;
> unsigned int dirty_cnt = 0;
>
> /* Make sure there are enough bits in memflags for nodeID. */
> BUILD_BUG_ON((_MEMF_bits - _MEMF_node) < (8 * sizeof(nodeid_t)));
>
> ASSERT(zone_lo <= zone_hi);
> ASSERT(zone_hi < NR_ZONES);
>
> if ( unlikely(order > MAX_ORDER) )
> return NULL;
>
> spin_lock(&heap_lock);
>
> TRACE_1D(TRC_PGALLOC_PT1, order); // <=================================
>
> /*
> * Claimed memory is considered unavailable unless the request
> * is made by a domain with sufficient unclaimed pages.
> */
> if ( (outstanding_claims + request > total_avail_pages) &&
> ((memflags & MEMF_no_refcount) ||
> !d || d->outstanding_pages < request) )
> {
> spin_unlock(&heap_lock);
> return NULL;
> }
>
> pg = get_free_buddy(zone_lo, zone_hi, order, memflags, d);
> /* Try getting a dirty buddy if we couldn't get a clean one. */
> if ( !pg && !(memflags & MEMF_no_scrub) )
> pg = get_free_buddy(zone_lo, zone_hi, order,
> memflags | MEMF_no_scrub, d);
> if ( !pg )
> {
> /* No suitable memory blocks. Fail the request. */
> spin_unlock(&heap_lock);
> return NULL;
> }
>
> TRACE_0D(TRC_PGALLOC_PT2); // <=================================
>
> node = phys_to_nid(page_to_maddr(pg));
> zone = page_to_zone(pg);
> buddy_order = PFN_ORDER(pg);
>
> first_dirty = pg->u.free.first_dirty;
>
> /* We may have to halve the chunk a number of times. */
> while ( buddy_order != order )
> {
> buddy_order--;
> page_list_add_scrub(pg, node, zone, buddy_order,
> (1U << buddy_order) > first_dirty ?
> first_dirty : INVALID_DIRTY_IDX);
> pg += 1U << buddy_order;
>
> if ( first_dirty != INVALID_DIRTY_IDX )
> {
> /* Adjust first_dirty */
> if ( first_dirty >= 1U << buddy_order )
> first_dirty -= 1U << buddy_order;
> else
> first_dirty = 0; /* We've moved past original first_dirty */
> }
> }
>
> TRACE_0D(TRC_PGALLOC_PT3); // <=================================
>
> ASSERT(avail[node][zone] >= request);
> avail[node][zone] -= request;
> total_avail_pages -= request;
> ASSERT(total_avail_pages >= 0);
>
> check_low_mem_virq();
>
> if ( d != NULL )
> d->last_alloc_node = node;
>
> for ( i = 0; i < (1 << order); i++ )
> {
> /* Reference count must continuously be zero for free pages. */
> if ( (pg[i].count_info & ~PGC_need_scrub) != PGC_state_free )
> {
> printk(XENLOG_ERR
> "pg[%u] MFN %"PRI_mfn" c=%#lx o=%u v=%#lx t=%#x\n",
> i, mfn_x(page_to_mfn(pg + i)),
> pg[i].count_info, pg[i].v.free.order,
> pg[i].u.free.val, pg[i].tlbflush_timestamp);
> BUG();
> }
>
> /* PGC_need_scrub can only be set if first_dirty is valid */
> ASSERT(first_dirty != INVALID_DIRTY_IDX || !(pg[i].count_info & PGC_need_scrub));
>
> /* Preserve PGC_need_scrub so we can check it after lock is dropped. */
> pg[i].count_info = PGC_state_inuse | (pg[i].count_info & PGC_need_scrub);
>
> if ( !(memflags & MEMF_no_tlbflush) )
> accumulate_tlbflush(&need_tlbflush, &pg[i],
> &tlbflush_timestamp);
>
> /* Initialise fields which have other uses for free pages. */
> pg[i].u.inuse.type_info = 0;
> page_set_owner(&pg[i], NULL);
>
> /* Ensure cache and RAM are consistent for platforms where the
> * guest can control its own visibility of/through the cache.
> */
> flush_page_to_ram(mfn_x(page_to_mfn(&pg[i])),
> !(memflags & MEMF_no_icache_flush));
> }
>
> TRACE_0D(TRC_PGALLOC_PT4); // <=================================
> spin_unlock(&heap_lock);
>
> if ( first_dirty != INVALID_DIRTY_IDX ||
> (scrub_debug && !(memflags & MEMF_no_scrub)) )
> {
> for ( i = 0; i < (1U << order); i++ )
> {
> if ( test_bit(_PGC_need_scrub, &pg[i].count_info) )
> {
> if ( !(memflags & MEMF_no_scrub) )
> scrub_one_page(&pg[i]);
>
> dirty_cnt++;
>
> spin_lock(&heap_lock);
> pg[i].count_info &= ~PGC_need_scrub;
> spin_unlock(&heap_lock);
> }
> else if ( !(memflags & MEMF_no_scrub) )
> check_one_page(&pg[i]);
> }
>
> if ( dirty_cnt )
> {
> spin_lock(&heap_lock);
> node_need_scrub[node] -= dirty_cnt;
> spin_unlock(&heap_lock);
> }
> }
>
> TRACE_0D(TRC_PGALLOC_PT5); // <=================================
> if ( need_tlbflush )
> filtered_flush_tlb_mask(tlbflush_timestamp);
>
> TRACE_0D(TRC_PGALLOC_PT6); // <=================================
>
> return pg;
> }
>
>
> And I wrote a simple Python script that parses the output of
> xentrace. Here are the results for different orders:
>
> 46.842032: page_alloc trace point 1. Order: 18
> 46.842035: page_alloc trace point 2 (+ 0.000003)
> 46.842035: page_alloc trace point 3 (+ 0.000000)
> 46.975105: page_alloc trace point 4 (+ 0.133069)
> 46.975106: page_alloc trace point 5 (+ 0.000001)
> 46.975106: page_alloc trace point 6 (+ 0.000000): total: 0.133074
>
> 46.998536: page_alloc trace point 1. Order: 9
> 46.998538: page_alloc trace point 2 (+ 0.000002)
> 46.998540: page_alloc trace point 3 (+ 0.000001)
> 46.998799: page_alloc trace point 4 (+ 0.000259)
> 46.998800: page_alloc trace point 5 (+ 0.000000)
> 46.998800: page_alloc trace point 6 (+ 0.000000): total: 0.000264
>
> 46.835802: page_alloc trace point 1. Order: 3
> 46.835803: page_alloc trace point 2 (+ 0.000000)
> 46.835803: page_alloc trace point 3 (+ 0.000000)
> 46.835812: page_alloc trace point 4 (+ 0.000009)
> 46.835813: page_alloc trace point 5 (+ 0.000000)
> 46.835813: page_alloc trace point 6 (+ 0.000001): total: 0.000011
>
> 46.998815: page_alloc trace point 1. Order: 0
> 46.998816: page_alloc trace point 2 (+ 0.000002)
> 46.998817: page_alloc trace point 3 (+ 0.000000)
> 46.998818: page_alloc trace point 4 (+ 0.000002)
> 46.998819: page_alloc trace point 5 (+ 0.000001)
> 46.998819: page_alloc trace point 6 (+ 0.000000): total: 0.000005
>
> Then I commented out the call to flush_page_to_ram() and got the following
> results:
>
> 149.561902: page_alloc trace point 1. Order: 18
> 149.561905: page_alloc trace point 2 (+ 0.000003)
> 149.561905: page_alloc trace point 3 (+ 0.000000)
> 149.569450: page_alloc trace point 4 (+ 0.007545)
> 149.569451: page_alloc trace point 5 (+ 0.000001)
> 149.569452: page_alloc trace point 6 (+ 0.000000): total: 0.007550
>
> 149.592624: page_alloc trace point 1. Order: 9
> 149.592626: page_alloc trace point 2 (+ 0.000003)
> 149.592627: page_alloc trace point 3 (+ 0.000001)
> 149.592639: page_alloc trace point 4 (+ 0.000012)
> 149.592639: page_alloc trace point 5 (+ 0.000000)
> 149.592640: page_alloc trace point 6 (+ 0.000000): total: 0.000016
>
> All time units are seconds, by the way.

That's quite a big improvement! I would definitely consider deferring the
flush and using the domctl.
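
Something along these lines on the toolstack side, I would think (just a
sketch; the xc_domain_cacheflush() signature is written from memory and the
exact integration point in libxl is not shown):

#include <xenctrl.h>

/*
 * Instead of flushing page by page inside alloc_heap_pages(), ask Xen to
 * clean & invalidate the whole populated range in one XEN_DOMCTL_cacheflush
 * operation once the memory has been allocated for the new domain.
 */
static int flush_new_domain_ram(xc_interface *xch, uint32_t domid,
                                xen_pfn_t start_gfn, xen_pfn_t nr_gfns)
{
    return xc_domain_cacheflush(xch, domid, start_gfn, nr_gfns);
}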

However, I am a bit confused by the result. In an earlier email, you wrote
that the operation would take a second. The numbers for order=18 only show
133ms spent in alloc_heap_pages(). Would you be able to clarify
where the rest of the time is spent?

Would it be possible that updating the p2m also takes some time?

Cheers,

--
Julien Grall