Mailing List Archive

Enabling VT-d PI by default
Hello, Jan.

As you know, with VT-d PI enabled, hardware can deliver external
interrupts directly to a guest without any VMM intervention. This
reduces overall interrupt latency to the guest and the overheads
otherwise incurred by the VMM for virtualizing interrupts. In my mind,
it's an important feature for interrupt virtualization.

But the VT-d PI feature is disabled by default on Xen because of some
corner cases and bugs. Based on Feng's work, we have fixed those corner
cases related to VT-d PI. Do you think it is time to enable VT-d PI by
default? If not, could you list your concerns so that we can resolve them?

Thanks
Chao

Re: Enabling VT-d PI by default
>>> On 11.04.17 at 02:59, <chao.gao@intel.com> wrote:
> As you know, with VT-d PI enabled, hardware can directly deliver external
> interrupts to guest without any VMM intervention. It will reduces overall
> interrupt latency to guest and reduces overheads otherwise incurred by the
> VMM for virtualizing interrupts. In my mind, it's an important feature to
> interrupt virtualization.
>
> But VT-d PI feature is disabled by default on Xen for some corner
> cases and bugs. Based on Feng's work, we have fixed those corner
> cases related to VT-d PI. Do you think it is a time to enable VT-d PI by
> default. If no, could you list your concerns so that we can resolve them?

I don't recall you addressing the main issue (blocked vCPU-s list
length; see the comment next to the iommu_intpost definition).

Jan


Re: Enabling VT-d PI by default
On Tue, Apr 11, 2017 at 02:21:07AM -0600, Jan Beulich wrote:
>>>> On 11.04.17 at 02:59, <chao.gao@intel.com> wrote:
>> As you know, with VT-d PI enabled, hardware can directly deliver external
>> interrupts to guest without any VMM intervention. It will reduces overall
>> interrupt latency to guest and reduces overheads otherwise incurred by the
>> VMM for virtualizing interrupts. In my mind, it's an important feature to
>> interrupt virtualization.
>>
>> But VT-d PI feature is disabled by default on Xen for some corner
>> cases and bugs. Based on Feng's work, we have fixed those corner
>> cases related to VT-d PI. Do you think it is a time to enable VT-d PI by
>> default. If no, could you list your concerns so that we can resolve them?
>
>I don't recall you addressing the main issue (blocked vCPU-s list
>length; see the comment next to the iommu_intpost definition).
>

Indeed. I have gone through the discussion that happened in April 2016 [1, 2].
[1] https://lists.gt.net/xen/devel/422661?search_string=VT-d%20posted-interrupt%20core%20logic%20handling;#422661
[2] https://lists.gt.net/xen/devel/422567?search_string=%20The%20length%20of%20the%20list%20depends;#422567.

First of all, I admit this is an issue in extreme cases and we should
come up with a solution.

The problem we are facing is:
There is a per-cpu list used to maintain all the blocked vCPUs on a
pCPU. When a wakeup interrupt comes, the interrupt handler traverses
the list to wake the vCPUs whose pi_desc indicates an interrupt has
been posted. There is no policy to restrict the size of the list, so
in some extreme cases the list can become long enough to cause
problems (the most obvious one being interrupt latency).

The theoretical maximum number of entries in the list is 4M, as one
host can have 32k domains and every domain can have 128 vCPUs
(32k * 128 = 4M). If all the vCPUs are blocked on one list, the list
reaches that theoretical maximum.

The root cause of this issue, I think, is that the wakeup interrupt
vector is shared by all the vCPUs on one pCPU. Lacking enough
information (such as which device sent the interrupt or which IRTE
translated it), there is no effective way to identify the interrupt's
destination vCPU except by traversing this list. Right? So we can
only mitigate this issue by decreasing or limiting the maximum number
of entries on one list.
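
To make the shape of the problem concrete, the code in question looks
roughly like the sketch below (simplified Xen-style C; the identifiers
are illustrative rather than the exact ones in
xen/arch/x86/hvm/vmx/vmx.c):

struct pi_blocking_entry {
    struct list_head list;
    struct vcpu *v;            /* blocked vCPU parked on this pCPU */
    struct pi_desc *pi_desc;   /* its posted-interrupt descriptor  */
};

/* One unbounded list per pCPU; today every blocked vCPU of a domain
 * with assigned devices ends up on the list of the pCPU it blocked on. */
static DEFINE_PER_CPU(struct list_head, pi_blocked_vcpu);

/* Handler for the shared wakeup vector. The interrupt carries no hint
 * about which IRTE (and hence which vCPU) it was posted for, so the
 * whole per-CPU list has to be walked on every wakeup interrupt. */
static void pi_wakeup_interrupt(void)
{
    struct pi_blocking_entry *e, *tmp;

    list_for_each_entry_safe ( e, tmp, &this_cpu(pi_blocked_vcpu), list )
        if ( pi_test_on(e->pi_desc) )   /* an interrupt was posted here */
            vcpu_unblock(e->v);         /* cost is O(list length)       */
}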

Several methods we can take to mitigate this issue:
1. According to your earlier discussion, evenly distributing all the
blocked vCPUs among all pCPUs can mitigate this issue. With this
approach, we avoid having all vCPUs blocked on a single list, which
reduces the maximum number of entries per list by a factor of N
(N being the number of pCPUs). See the sketch after this list.

2. Don't put blocked vCPUs that can't be woken by the wakeup
interrupt into the per-cpu list. Currently, we put every blocked vCPU
belonging to a domain with assigned devices onto the list. But if a
blocked vCPU of such a domain is not the destination of any
posted-format IRTE, it needn't be added to the per-cpu list; it will
be woken by IPIs or other virtual interrupts instead. In this way, we
can reduce the number of entries on the per-cpu list.

3. Like what we do in struct irq_guest_action_t, we could limit the
maximum number of entries we support in the list. With this approach,
during domain creation we calculate the available entries and compare
them with the domain's vCPU count to decide whether the domain can use
VT-d PI. This method would impose a strict limit on the maximum number
of entries in one list, but it may affect vCPU hotplug.
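
As a rough illustration of method 1 (a hypothetical helper, not
existing Xen code): when a vCPU blocks, we could pick the online pCPU
whose blocking list is currently shortest instead of always using the
local pCPU, so that no single list accumulates all blocked vCPUs:

static DEFINE_PER_CPU(unsigned int, pi_blocked_cnt);  /* list length */

static unsigned int pi_pick_blocking_cpu(void)
{
    unsigned int cpu, best = smp_processor_id();

    for_each_online_cpu ( cpu )
        if ( per_cpu(pi_blocked_cnt, cpu) < per_cpu(pi_blocked_cnt, best) )
            best = cpu;

    /* The vCPU is then added to 'best's list and its pi_desc NDST is
     * pointed at 'best', so the wakeup notification is delivered there. */
    return best;
}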

According to your intuition, which of these methods are feasible and
acceptable? I will attempt to mitigate this issue per your advice.

Thanks
Chao

Re: Enabling VT-d PI by default
On Tue, Apr 18, 2017 at 02:24:05PM +0800, Tian, Kevin wrote:
>> From: Gao, Chao
>> Sent: Monday, April 17, 2017 4:14 AM
>>
>> On Tue, Apr 11, 2017 at 02:21:07AM -0600, Jan Beulich wrote:
>> >>>> On 11.04.17 at 02:59, <chao.gao@intel.com> wrote:
>> 3. Like what we do in struct irq_guest_action_t, can we limit the
>> maximum of entry we support in the list. With this approach, during
>> domain creation, we calculate the available entries and compare with
>> the domain's vCPU number to decide whether the domain can use VT-d PI.
>
>VT-d PI is global instead of per-domain. I guess you actually mean
>failing device assignment operation if counting new domain's #VCPUs
>exceeds the limitation.

I almost agree. But I think device assignment should still be allowed
in that case; we would just prevent the newly created domain from
using VT-d PI.

>
>> This method will pose a strict restriction to the maximum of entry in
>> one list. But it may affect vCPU hotplug.
>>
>> According to your intuition, which methods are feasible and
>> acceptable? I will attempt to mitigate this issue per your advices.
>>
>
>My understanding is that we need them all. #1 is the baseline,
>with #2/#3 as further optimization. :-)

Thanks for your input. I will have a try.

Thanks
Chao

Re: Enabling VT-d PI by default
On Tue, Apr 18, 2017 at 02:13:36AM -0600, Jan Beulich wrote:
>>>> On 16.04.17 at 22:13, <chao.gao@intel.com> wrote:
>> 3. Like what we do in struct irq_guest_action_t, can we limit the
>> maximum of entry we support in the list. With this approach, during
>> domain creation, we calculate the available entries and compare with
>> the domain's vCPU number to decide whether the domain can use VT-d PI.
>> This method will pose a strict restriction to the maximum of entry in
>> one list. But it may affect vCPU hotplug.
>
>I don't view this as really suitable - irq_guest_action is quite different,
>as one can reasonably place expectations on how many devices may
>share an interrupt line. If someone really hit this boundary, (s)he
>could likely re-configure their system by moving expansion cards
>between slots. Neither of this is comparable with the PI situation, as
>it looks to me.
>
>Furthermore, whether a guest would be able to start / use PI would
>be quite hard to tell for an admin as it seems, again as opposed to
>the case with the shared interrupt lines.
>

Indeed. It would annoy the admin. What's your opinion on the first
and second methods? Do you think we still need such a policy to
restrict the number of entries in the list even with the first two
methods in place?

Thanks
Chao

Re: Enabling VT-d PI by default
> From: Gao, Chao
> Sent: Monday, April 17, 2017 4:14 AM
>
> On Tue, Apr 11, 2017 at 02:21:07AM -0600, Jan Beulich wrote:
> >>>> On 11.04.17 at 02:59, <chao.gao@intel.com> wrote:
> >> As you know, with VT-d PI enabled, hardware can directly deliver external
> >> interrupts to guest without any VMM intervention. It will reduces overall
> >> interrupt latency to guest and reduces overheads otherwise incurred by
> the
> >> VMM for virtualizing interrupts. In my mind, it's an important feature to
> >> interrupt virtualization.
> >>
> >> But VT-d PI feature is disabled by default on Xen for some corner
> >> cases and bugs. Based on Feng's work, we have fixed those corner
> >> cases related to VT-d PI. Do you think it is a time to enable VT-d PI by
> >> default. If no, could you list your concerns so that we can resolve them?
> >
> >I don't recall you addressing the main issue (blocked vCPU-s list
> >length; see the comment next to the iommu_intpost definition).
> >
>
> Indeed. I have gone through the discussion happened in April 2016[1, 2].
> [1] https://lists.gt.net/xen/devel/422661?search_string=VT-d%20posted-
> interrupt%20core%20logic%20handling;#422661
> [2]
> https://lists.gt.net/xen/devel/422567?search_string=%20The%20length%20o
> f%20the%20list%20depends;#422567.
>
> First of all, I admit this is an issue in extreme case and we should
> come up with a solution.
>
> The problem we are facing is:
> There is a per-cpu list used to maintain all the blocked vCPU on a
> pCPU. When a wakeup interrupt comes, the interrupt handler travels
> the list to wake the vCPUs whose pi_desc indicates an interrupt has
> been posted. There is no policy to restrict the size of the list such
> that in some extreme case, the list can be too long to cause some
> issues (the most obvious issue is about interrupt latency).
>
> The theoretical max number of entry in the list is 4M as one host can
> have 32k domains and every domain can have 128vCPU. If all the vCPUs
> are blocked in one list, the list gets its theoretical maximum.
>
> The root cause of this issue, I think, is that the wakeup interrupt
> vector is shared by all the vCPUs on one pCPU. Lacking of enough
> information (such as which device sends or which IRTE translates this
> interrupt), there is no effective method to distinguish the
> interrupt's destination vCPU except traveling this list. Right? So we
> only can mitigate this issue through decreasing or limiting the
> entry's maximum in one list.
>
> Several methods we can take to mitigate this issue:
> 1. According to your discussions, evenly distributing all the blocked
> vCPUs among all pCPUs can mitigate this issue. With this approach, all
> vCPUs are blocked in one list can be avoided. It can decrease the
> entry's maximum in one list by N times (N is the number of pCPU).
>
> 2. Don't put the blocked vCPUs which won't be woken by the wakeup
> interrupt into the per-cpu list. Currently, we put the blocked vCPUs
> belong to domains who have assigned devices into the list. But if one
> blocked vCPU of such domain is not a destination of every posted
> format IRTE, it needn't be added to the per-cpu list. The blocked vCPU
> will be woken by IPIs or other virtual interrupts. From this aspect, we
> can decrease the entries in the per-cpu list.
>
> 3. Like what we do in struct irq_guest_action_t, can we limit the
> maximum of entry we support in the list. With this approach, during
> domain creation, we calculate the available entries and compare with
> the domain's vCPU number to decide whether the domain can use VT-d PI.

VT-d PI is global rather than per-domain. I guess you actually mean
failing the device assignment operation if counting in the new domain's
#vCPUs would exceed the limit.

> This method will pose a strict restriction to the maximum of entry in
> one list. But it may affect vCPU hotplug.
>
> According to your intuition, which methods are feasible and
> acceptable? I will attempt to mitigate this issue per your advices.
>

My understanding is that we need them all. #1 is the baseline,
with #2/#3 as further optimization. :-)

Thanks
Kevin

Re: Enabling VT-d PI by default
>>> On 16.04.17 at 22:13, <chao.gao@intel.com> wrote:
> 3. Like what we do in struct irq_guest_action_t, can we limit the
> maximum of entry we support in the list. With this approach, during
> domain creation, we calculate the available entries and compare with
> the domain's vCPU number to decide whether the domain can use VT-d PI.
> This method will pose a strict restriction to the maximum of entry in
> one list. But it may affect vCPU hotplug.

I don't view this as really suitable - irq_guest_action is quite different,
as one can reasonably place expectations on how many devices may
share an interrupt line. If someone really hit this boundary, (s)he
could likely re-configure their system by moving expansion cards
between slots. Neither of these is comparable with the PI situation, as
it looks to me.

Furthermore, whether a guest would be able to start / use PI would
be quite hard to tell for an admin as it seems, again as opposed to
the case with the shared interrupt lines.

Jan


Re: Enabling VT-d PI by default
>>> On 18.04.17 at 05:41, <chao.gao@intel.com> wrote:
> On Tue, Apr 18, 2017 at 02:13:36AM -0600, Jan Beulich wrote:
>>>>> On 16.04.17 at 22:13, <chao.gao@intel.com> wrote:
>>> 3. Like what we do in struct irq_guest_action_t, can we limit the
>>> maximum of entry we support in the list. With this approach, during
>>> domain creation, we calculate the available entries and compare with
>>> the domain's vCPU number to decide whether the domain can use VT-d PI.
>>> This method will pose a strict restriction to the maximum of entry in
>>> one list. But it may affect vCPU hotplug.
>>
>>I don't view this as really suitable - irq_guest_action is quite different,
>>as one can reasonably place expectations on how many devices may
>>share an interrupt line. If someone really hit this boundary, (s)he
>>could likely re-configure their system by moving expansion cards
>>between slots. Neither of this is comparable with the PI situation, as
>>it looks to me.
>>
>>Furthermore, whether a guest would be able to start / use PI would
>>be quite hard to tell for an admin as it seems, again as opposed to
>>the case with the shared interrupt lines.
>
> Indeed. It would annoy the admin. What's your opinion on the
> first and second methods? Do you think we need such policy to
> restrict the #entry in the list even with the first two methods?

Well, I'm in agreement with Kevin that all reasonable approaches
should be made use of here, so I'd like to defer a decision on a
forced limit until we see what effects can be achieved by the other
two methods.

Jan


Re: Enabling VT-d PI by default
On 18/04/17 07:24, Tian, Kevin wrote:
>> From: Gao, Chao
>> Sent: Monday, April 17, 2017 4:14 AM
>>
>> On Tue, Apr 11, 2017 at 02:21:07AM -0600, Jan Beulich wrote:
>>>>>> On 11.04.17 at 02:59, <chao.gao@intel.com> wrote:
>>>> As you know, with VT-d PI enabled, hardware can directly deliver external
>>>> interrupts to guest without any VMM intervention. It will reduces overall
>>>> interrupt latency to guest and reduces overheads otherwise incurred by
>> the
>>>> VMM for virtualizing interrupts. In my mind, it's an important feature to
>>>> interrupt virtualization.
>>>>
>>>> But VT-d PI feature is disabled by default on Xen for some corner
>>>> cases and bugs. Based on Feng's work, we have fixed those corner
>>>> cases related to VT-d PI. Do you think it is a time to enable VT-d PI by
>>>> default. If no, could you list your concerns so that we can resolve them?
>>>
>>> I don't recall you addressing the main issue (blocked vCPU-s list
>>> length; see the comment next to the iommu_intpost definition).
>>>
>>
>> Indeed. I have gone through the discussion happened in April 2016[1, 2].
>> [1] https://lists.gt.net/xen/devel/422661?search_string=VT-d%20posted-
>> interrupt%20core%20logic%20handling;#422661
>> [2]
>> https://lists.gt.net/xen/devel/422567?search_string=%20The%20length%20o
>> f%20the%20list%20depends;#422567.
>>
>> First of all, I admit this is an issue in extreme case and we should
>> come up with a solution.
>>
>> The problem we are facing is:
>> There is a per-cpu list used to maintain all the blocked vCPU on a
>> pCPU. When a wakeup interrupt comes, the interrupt handler travels
>> the list to wake the vCPUs whose pi_desc indicates an interrupt has
>> been posted. There is no policy to restrict the size of the list such
>> that in some extreme case, the list can be too long to cause some
>> issues (the most obvious issue is about interrupt latency).
>>
>> The theoretical max number of entry in the list is 4M as one host can
>> have 32k domains and every domain can have 128vCPU. If all the vCPUs
>> are blocked in one list, the list gets its theoretical maximum.
>>
>> The root cause of this issue, I think, is that the wakeup interrupt
>> vector is shared by all the vCPUs on one pCPU. Lacking of enough
>> information (such as which device sends or which IRTE translates this
>> interrupt), there is no effective method to distinguish the
>> interrupt's destination vCPU except traveling this list. Right? So we
>> only can mitigate this issue through decreasing or limiting the
>> entry's maximum in one list.
>>
>> Several methods we can take to mitigate this issue:
>> 1. According to your discussions, evenly distributing all the blocked
>> vCPUs among all pCPUs can mitigate this issue. With this approach, all
>> vCPUs are blocked in one list can be avoided. It can decrease the
>> entry's maximum in one list by N times (N is the number of pCPU).
>>
>> 2. Don't put the blocked vCPUs which won't be woken by the wakeup
>> interrupt into the per-cpu list. Currently, we put the blocked vCPUs
>> belong to domains who have assigned devices into the list. But if one
>> blocked vCPU of such domain is not a destination of every posted
>> format IRTE, it needn't be added to the per-cpu list. The blocked vCPU
>> will be woken by IPIs or other virtual interrupts. From this aspect, we
>> can decrease the entries in the per-cpu list.
>>
>> 3. Like what we do in struct irq_guest_action_t, can we limit the
>> maximum of entry we support in the list. With this approach, during
>> domain creation, we calculate the available entries and compare with
>> the domain's vCPU number to decide whether the domain can use VT-d PI.
>
> VT-d PI is global instead of per-domain. I guess you actually mean
> failing device assignment operation if counting new domain's #VCPUs
> exceeds the limitation.
>
>> This method will pose a strict restriction to the maximum of entry in
>> one list. But it may affect vCPU hotplug.
>>
>> According to your intuition, which methods are feasible and
>> acceptable? I will attempt to mitigate this issue per your advices.
>>
>
> My understanding is that we need them all. #1 is the baseline,
> with #2/#3 as further optimization. :-)

Actually, regarding #2, is that the case?

If we do reference counting (as in patches 3 and 4 of Chao Gao's recent
series), then we are guaranteed never to have more vcpus on any given
wakeup list than there are machine IRQs on the system. Are we ever
going to have a system with so many IRQs that going through such a list
would be problematic?
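
To spell out the counting argument: a vCPU only needs to be on a
wakeup list while at least one posted-format IRTE targets it, and each
IRTE targets a single vCPU, so the total number of listed vCPUs is
bounded by the number of posted IRTEs, i.e. by the number of machine
IRQs. A minimal sketch of the idea, using hypothetical names rather
than the actual helpers from that series:

struct pi_ref {
    atomic_t posted_irtes;  /* # posted-format IRTEs targeting this vCPU */
};

/* Called when an assigned-device interrupt is bound to / unbound from
 * a vCPU in posted format. */
void pi_ref_get(struct pi_ref *r) { atomic_inc(&r->posted_irtes); }
void pi_ref_put(struct pi_ref *r) { atomic_dec(&r->posted_irtes); }

/* Only vCPUs that some posted IRTE can actually wake ever go onto the
 * per-CPU wakeup list; everything else is woken through the normal
 * IPI / virtual interrupt path and need not be listed. */
bool pi_needs_wakeup_list(const struct pi_ref *r)
{
    return atomic_read(&r->posted_irtes) != 0;
}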

-George

Re: Enabling VT-d PI by default
>>> On 26.04.17 at 19:11, <george.dunlap@citrix.com> wrote:
> On 18/04/17 07:24, Tian, Kevin wrote:
>>> From: Gao, Chao
>>> Sent: Monday, April 17, 2017 4:14 AM
>>>
>>> On Tue, Apr 11, 2017 at 02:21:07AM -0600, Jan Beulich wrote:
>>>>>>> On 11.04.17 at 02:59, <chao.gao@intel.com> wrote:
>>>>> As you know, with VT-d PI enabled, hardware can directly deliver external
>>>>> interrupts to guest without any VMM intervention. It will reduces overall
>>>>> interrupt latency to guest and reduces overheads otherwise incurred by
>>> the
>>>>> VMM for virtualizing interrupts. In my mind, it's an important feature to
>>>>> interrupt virtualization.
>>>>>
>>>>> But VT-d PI feature is disabled by default on Xen for some corner
>>>>> cases and bugs. Based on Feng's work, we have fixed those corner
>>>>> cases related to VT-d PI. Do you think it is a time to enable VT-d PI by
>>>>> default. If no, could you list your concerns so that we can resolve them?
>>>>
>>>> I don't recall you addressing the main issue (blocked vCPU-s list
>>>> length; see the comment next to the iommu_intpost definition).
>>>>
>>>
>>> Indeed. I have gone through the discussion happened in April 2016[1, 2].
>>> [1] https://lists.gt.net/xen/devel/422661?search_string=VT-d%20posted-
>>> interrupt%20core%20logic%20handling;#422661
>>> [2]
>>> https://lists.gt.net/xen/devel/422567?search_string=%20The%20length%20o
>>> f%20the%20list%20depends;#422567.
>>>
>>> First of all, I admit this is an issue in extreme case and we should
>>> come up with a solution.
>>>
>>> The problem we are facing is:
>>> There is a per-cpu list used to maintain all the blocked vCPU on a
>>> pCPU. When a wakeup interrupt comes, the interrupt handler travels
>>> the list to wake the vCPUs whose pi_desc indicates an interrupt has
>>> been posted. There is no policy to restrict the size of the list such
>>> that in some extreme case, the list can be too long to cause some
>>> issues (the most obvious issue is about interrupt latency).
>>>
>>> The theoretical max number of entry in the list is 4M as one host can
>>> have 32k domains and every domain can have 128vCPU. If all the vCPUs
>>> are blocked in one list, the list gets its theoretical maximum.
>>>
>>> The root cause of this issue, I think, is that the wakeup interrupt
>>> vector is shared by all the vCPUs on one pCPU. Lacking of enough
>>> information (such as which device sends or which IRTE translates this
>>> interrupt), there is no effective method to distinguish the
>>> interrupt's destination vCPU except traveling this list. Right? So we
>>> only can mitigate this issue through decreasing or limiting the
>>> entry's maximum in one list.
>>>
>>> Several methods we can take to mitigate this issue:
>>> 1. According to your discussions, evenly distributing all the blocked
>>> vCPUs among all pCPUs can mitigate this issue. With this approach, all
>>> vCPUs are blocked in one list can be avoided. It can decrease the
>>> entry's maximum in one list by N times (N is the number of pCPU).
>>>
>>> 2. Don't put the blocked vCPUs which won't be woken by the wakeup
>>> interrupt into the per-cpu list. Currently, we put the blocked vCPUs
>>> belong to domains who have assigned devices into the list. But if one
>>> blocked vCPU of such domain is not a destination of every posted
>>> format IRTE, it needn't be added to the per-cpu list. The blocked vCPU
>>> will be woken by IPIs or other virtual interrupts. From this aspect, we
>>> can decrease the entries in the per-cpu list.
>>>
>>> 3. Like what we do in struct irq_guest_action_t, can we limit the
>>> maximum of entry we support in the list. With this approach, during
>>> domain creation, we calculate the available entries and compare with
>>> the domain's vCPU number to decide whether the domain can use VT-d PI.
>>
>> VT-d PI is global instead of per-domain. I guess you actually mean
>> failing device assignment operation if counting new domain's #VCPUs
>> exceeds the limitation.
>>
>>> This method will pose a strict restriction to the maximum of entry in
>>> one list. But it may affect vCPU hotplug.
>>>
>>> According to your intuition, which methods are feasible and
>>> acceptable? I will attempt to mitigate this issue per your advices.
>>>
>>
>> My understanding is that we need them all. #1 is the baseline,
>> with #2/#3 as further optimization. :-)
>
> Actually, regarding #2, is that the case?
>
> If we do reference counting (as in patches 3 and 4 of Chao Gao's recent
> series), then we are guaranteed never to have more vcpus on any given
> wakeup list than there are machine IRQs on the system. Are we ever
> going to have a system with so many IRQs that going through such a list
> would be problematic?

I'm afraid this is not impossible, considering that people have already
run into the interrupt vector limitation coming from there only being
about 200 vectors per CPU (and there not being, in physical mode,
any sharing of vectors between multiple CPUs, iirc). Devices, notably
MSI-X capable ones, can use an awful lot of vectors. Perhaps Andrew
remembers numbers observed on actual systems here...

Jan

Re: Enabling VT-d PI by default
On 27/04/17 08:08, Jan Beulich wrote:
>>>> On 26.04.17 at 19:11, <george.dunlap@citrix.com> wrote:
>> On 18/04/17 07:24, Tian, Kevin wrote:
>>>> From: Gao, Chao
>>>> Sent: Monday, April 17, 2017 4:14 AM
>>>>
>>>> On Tue, Apr 11, 2017 at 02:21:07AM -0600, Jan Beulich wrote:
>>>>>>>> On 11.04.17 at 02:59, <chao.gao@intel.com> wrote:
>>>>>> As you know, with VT-d PI enabled, hardware can directly deliver external
>>>>>> interrupts to guest without any VMM intervention. It will reduces overall
>>>>>> interrupt latency to guest and reduces overheads otherwise incurred by
>>>> the
>>>>>> VMM for virtualizing interrupts. In my mind, it's an important feature to
>>>>>> interrupt virtualization.
>>>>>>
>>>>>> But VT-d PI feature is disabled by default on Xen for some corner
>>>>>> cases and bugs. Based on Feng's work, we have fixed those corner
>>>>>> cases related to VT-d PI. Do you think it is a time to enable VT-d PI by
>>>>>> default. If no, could you list your concerns so that we can resolve them?
>>>>> I don't recall you addressing the main issue (blocked vCPU-s list
>>>>> length; see the comment next to the iommu_intpost definition).
>>>>>
>>>> Indeed. I have gone through the discussion happened in April 2016[1, 2].
>>>> [1] https://lists.gt.net/xen/devel/422661?search_string=VT-d%20posted-
>>>> interrupt%20core%20logic%20handling;#422661
>>>> [2]
>>>> https://lists.gt.net/xen/devel/422567?search_string=%20The%20length%20o
>>>> f%20the%20list%20depends;#422567.
>>>>
>>>> First of all, I admit this is an issue in extreme case and we should
>>>> come up with a solution.
>>>>
>>>> The problem we are facing is:
>>>> There is a per-cpu list used to maintain all the blocked vCPU on a
>>>> pCPU. When a wakeup interrupt comes, the interrupt handler travels
>>>> the list to wake the vCPUs whose pi_desc indicates an interrupt has
>>>> been posted. There is no policy to restrict the size of the list such
>>>> that in some extreme case, the list can be too long to cause some
>>>> issues (the most obvious issue is about interrupt latency).
>>>>
>>>> The theoretical max number of entry in the list is 4M as one host can
>>>> have 32k domains and every domain can have 128vCPU. If all the vCPUs
>>>> are blocked in one list, the list gets its theoretical maximum.
>>>>
>>>> The root cause of this issue, I think, is that the wakeup interrupt
>>>> vector is shared by all the vCPUs on one pCPU. Lacking of enough
>>>> information (such as which device sends or which IRTE translates this
>>>> interrupt), there is no effective method to distinguish the
>>>> interrupt's destination vCPU except traveling this list. Right? So we
>>>> only can mitigate this issue through decreasing or limiting the
>>>> entry's maximum in one list.
>>>>
>>>> Several methods we can take to mitigate this issue:
>>>> 1. According to your discussions, evenly distributing all the blocked
>>>> vCPUs among all pCPUs can mitigate this issue. With this approach, all
>>>> vCPUs are blocked in one list can be avoided. It can decrease the
>>>> entry's maximum in one list by N times (N is the number of pCPU).
>>>>
>>>> 2. Don't put the blocked vCPUs which won't be woken by the wakeup
>>>> interrupt into the per-cpu list. Currently, we put the blocked vCPUs
>>>> belong to domains who have assigned devices into the list. But if one
>>>> blocked vCPU of such domain is not a destination of every posted
>>>> format IRTE, it needn't be added to the per-cpu list. The blocked vCPU
>>>> will be woken by IPIs or other virtual interrupts. From this aspect, we
>>>> can decrease the entries in the per-cpu list.
>>>>
>>>> 3. Like what we do in struct irq_guest_action_t, can we limit the
>>>> maximum of entry we support in the list. With this approach, during
>>>> domain creation, we calculate the available entries and compare with
>>>> the domain's vCPU number to decide whether the domain can use VT-d PI.
>>> VT-d PI is global instead of per-domain. I guess you actually mean
>>> failing device assignment operation if counting new domain's #VCPUs
>>> exceeds the limitation.
>>>
>>>> This method will pose a strict restriction to the maximum of entry in
>>>> one list. But it may affect vCPU hotplug.
>>>>
>>>> According to your intuition, which methods are feasible and
>>>> acceptable? I will attempt to mitigate this issue per your advices.
>>>>
>>> My understanding is that we need them all. #1 is the baseline,
>>> with #2/#3 as further optimization. :-)
>> Actually, regarding #2, is that the case?
>>
>> If we do reference counting (as in patches 3 and 4 of Chao Gao's recent
>> series), then we are guaranteed never to have more vcpus on any given
>> wakeup list than there are machine IRQs on the system. Are we ever
>> going to have a system with so many IRQs that going through such a list
>> would be problematic?
> I'm afraid this is not impossible, considering that people have already
> run into the interrupt vector limitation coming from there only being
> about 200 vectors per CPU (and there not being, in physical mode,
> any sharing of vectors between multiple CPUs, iirc). Devices using
> namely MSI-X can use an awful lot of vectors. Perhaps Andrew
> remembers numbers observed on actual systems here...

Citrix NetScaler SDX boxes have more MSI-X interrupts than fit in the
cumulative IDTs of a top-end dual-socket Xeon server system. Some of
the device drivers are purposefully modelled to use fewer interrupts
than they otherwise would want to.

Using PI is the proper solution longterm, because doing so would remove
any need to allocate IDT vectors for the interrupts; the IOMMU could be
programmed to dump device vectors straight into the PI block without
them ever going through Xen's IDT.

However, fixing that requires rewriting Xen's interrupt remapping
handling so it doesn't rewrite the cpu/vector in every interrupt source,
and only rewrites the interrupt remapping table.
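
The retargeting flow this implies looks roughly like the following
(an illustrative sketch only, with assumed names; not the current Xen
code): the device's MSI-X entry and the posted-format IRTE are set up
once, and moving the target vCPU afterwards only touches the PI
descriptor's notification destination, never the per-source
cpu/vector:

struct pi_desc_min {
    uint32_t ndst;   /* notification destination (APIC ID of the pCPU) */
    /* ... PIR, ON, SN, NV, ... as per the VT-d/VMX spec ... */
};

static void pi_retarget_vcpu(struct pi_desc_min *pi, uint32_t new_apic_id)
{
    /* A single store here, versus today's multi-step rewrite of the
     * cpu/vector in every interrupt source. Real code would need an
     * atomic update of the descriptor shared with hardware. */
    pi->ndst = new_apic_id;
}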

~Andrew

Re: Enabling VT-d PI by default
On Fri, May 12, 2017 at 12:05 PM, Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
> Citrix Netscalar SDX boxes have more MSI-X interrupts than fit in the
> cumulative IDTs of a top end dual-socket Xeon server systems. Some of
> the device drivers are purposefully modelled to use fewer interrupts
> than they otherwise would want to.
>
> Using PI is the proper solution longterm, because doing so would remove
> any need to allocate IDT vectors for the interrupts; the IOMMU could be
> programmed to dump device vectors straight into the PI block without
> them ever going through Xen's IDT.

I wouldn't necessarily call that a "proper" solution. With PI, instead
of an interrupt telling you exactly which VM to wake up and/or which
routine you need to run, you have to search through (potentially)
thousands of entries to see which vcpu the interrupt you received
wanted to wake up; and you need to do that on every single interrupt.
(Obviously it does have the advantage that if the vcpu happens to be
running, Xen doesn't get an interrupt at all.)
-George

Re: Enabling VT-d PI by default
On 15/05/17 11:27, George Dunlap wrote:
> On Fri, May 12, 2017 at 12:05 PM, Andrew Cooper
> <andrew.cooper3@citrix.com> wrote:
>> Citrix Netscalar SDX boxes have more MSI-X interrupts than fit in the
>> cumulative IDTs of a top end dual-socket Xeon server systems. Some of
>> the device drivers are purposefully modelled to use fewer interrupts
>> than they otherwise would want to.
>>
>> Using PI is the proper solution longterm, because doing so would remove
>> any need to allocate IDT vectors for the interrupts; the IOMMU could be
>> programmed to dump device vectors straight into the PI block without
>> them ever going through Xen's IDT.
> I wouldn't necessarily call that a "proper" solution. With PI, instead
> of an interrupt telling you exactly which VM to wake up and/or which
> routine you need to run, instead you have to search through
> (potentially) thousands of entries to see which vcpu the interrupt you
> received wanted to wake up; and you need to do that on every single
> interrupt. (Obviously it does have the advantage that if the vcpu
> happens to be running Xen doesn't get an interrupt at all.)

Having spoken to the PI architects, I can say this is not how the
technology was designed to be used.

On systems with this number of in-flight interrupts, trying to track
"who got what interrupt" for priority boosting purposes is a waste of
time, as we spend ages taking vmexits to process interrupt notifications
for out-of-context vcpus.

The way the PI architects envisaged the technology being used is that
Suppress Notification is set at all points other than executing in
non-root mode for the vcpu in question (there is a small race window
around clearing SN on vmentry), and that the scheduler uses Outstanding
Notification on each of the PI blocks when it rebalances credit to see
which vcpus have had interrupts in the last 30ms.
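
The envisaged model, in sketch form (illustrative C only, not Xen
code; updates to a pi_desc shared with hardware must of course be
atomic in real code):

struct pi_desc_bits {
    unsigned int on:1;   /* Outstanding Notification: interrupt posted  */
    unsigned int sn:1;   /* Suppress Notification: no notification IRQs */
};

/* SN is cleared only while the vCPU runs in non-root mode ...          */
static void pi_on_vmentry(struct pi_desc_bits *pi) { pi->sn = 0; }
static void pi_on_vmexit(struct pi_desc_bits *pi)  { pi->sn = 1; }

/* ... and the scheduler samples ON when it rebalances, instead of a
 * wakeup-vector handler walking a blocked-vCPU list on every interrupt. */
static int pi_saw_interrupts(struct pi_desc_bits *pi)
{
    int on = pi->on;

    pi->on = 0;          /* real code: atomic test-and-clear */
    return on;
}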

This current behaviour of leaving SN clear until an interrupt arrives is
devastating for performance, especially in combination with the 3-step
mechanism Xen uses to rewrite the interrupt source information, which
pretty much guarantees that interrupts arrive on the wrong pcpu (unless
strict pinning is in effect).

~Andrew

Re: Enabling VT-d PI by default
On Mon, May 15, 2017 at 2:35 PM, Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
> On 15/05/17 11:27, George Dunlap wrote:
>> On Fri, May 12, 2017 at 12:05 PM, Andrew Cooper
>> <andrew.cooper3@citrix.com> wrote:
>>> Citrix Netscalar SDX boxes have more MSI-X interrupts than fit in the
>>> cumulative IDTs of a top end dual-socket Xeon server systems. Some of
>>> the device drivers are purposefully modelled to use fewer interrupts
>>> than they otherwise would want to.
>>>
>>> Using PI is the proper solution longterm, because doing so would remove
>>> any need to allocate IDT vectors for the interrupts; the IOMMU could be
>>> programmed to dump device vectors straight into the PI block without
>>> them ever going through Xen's IDT.
>> I wouldn't necessarily call that a "proper" solution. With PI, instead
>> of an interrupt telling you exactly which VM to wake up and/or which
>> routine you need to run, instead you have to search through
>> (potentially) thousands of entries to see which vcpu the interrupt you
>> received wanted to wake up; and you need to do that on every single
>> interrupt. (Obviously it does have the advantage that if the vcpu
>> happens to be running Xen doesn't get an interrupt at all.)
>
> Having spoken to the PI architects, this is not how the technology was
> designed to be used.
>
> On systems with this number of in-flight interrupts, trying to track
> "who got what interrupt" for priority boosting purposes is a waste of
> time, as we spend ages taking vmexits to process interrupt notifications
> for out-of-context vcpus.
>
> The way the PI architects envisaged the technology being used is that
> Suppress Notification is set at all points other than executing in
> non-root mode for the vcpu in question (there is a small race window
> around clearing SN on vmentry), and that the scheduler uses Outstanding
> Notification on each of the PI blocks when it rebalances credit to see
> which vcpus have had interrupts in the last 30ms.

It sounds like they may have made the mistake that the Credit1
designers made, in analyzing only a system that was overloaded; and
one where all workloads were identical, as opposed to analyzing a
system that was at least sometimes partially loaded, and where
workloads were very different.

You're right that if you weren't going to preempt the currently
running vcpu anyway, there's no need for Xen to get the interrupt.

But it should be obvious that on a system that's idle (even for a
relatively short amount of time) we want to get the interrupt and
wake up the appropriate vcpu immediately. It should also be obvious
that in a mixed workload, where one vcpu is doing tons of computation
and another is mainly handling interrupts quickly and going to sleep
again, we would want Xen to check at regular intervals whether it
should run the vcpu that's mostly handling interrupts. We
generally wouldn't want to delay waking up the lower-priority vcpu
for longer than 1ms.

In both cases, waiting 30ms to see if we should wake somebody up is
far too long.

-George

Re: Enabling VT-d PI by default
On Mon, 2017-05-15 at 15:32 +0100, George Dunlap wrote:
> On Mon, May 15, 2017 at 2:35 PM, Andrew Cooper
> <andrew.cooper3@citrix.com> wrote:
> > On systems with this number of in-flight interrupts, trying to
> > track
> > "who got what interrupt" for priority
> > boosting purposes is a waste of
> > time, as we spend ages taking vmexits to process interrupt
> > notifications
> > for out-of-context vcpus.
> >
> > The way the PI architects envisaged the technology being used is
> > that
> > Suppress Notification is set at all points other than executing in
> > non-root mode for the vcpu in question (there is a small race
> > window
> > around clearing SN on vmentry), and that the scheduler uses
> > Outstanding
> > Notification on each of the PI blocks when it rebalances credit to
> > see
> > which vcpus have had interrupts in the last 30ms.
>
> It sounds like they may have made the mistake that the Credit1
> designers made, in analyzing only a system that was overloaded; and
> one where all workloads were identical, as opposed to analyzing a
> system that was at least sometimes partially loaded, and where
> workloads were very different.
>
Totally agree.

Also, I'm not sure I follow why the PI architects would base hardware
design on specific characteristics of a particular Xen scheduler. E.g.,
in Linux --which I'd think they also had in mind when envisioning uses
of the technology-- there is no such thing as a 30ms timeslice, nor
credit redistribution.

And AFAICU, what you seem to suggest (not delivering a notification /
not waking anyone up at the time the interrupt happens) means there
must be some kind of list_for_each_vcpu() anyway, to check which vCPUs
have pending notifications. Hence the problem we're discussing here
would just be moved between subsystems rather than going away.

And, finally, I don't get what you mean when you say that we're trying
to use PI "for priority boosting purposes". I don't think we do that.

FTR, I've quickly checked how this is done in Linux, and the solution
pushed there looks really similar to the one that has been pushed to
Xen as well. E.g., there too, the handler scans the blocked vCPU
list:
http://elixir.free-electrons.com/linux/latest/source/arch/x86/kvm/vmx.c#L6464
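
For reference, that handler looks roughly like the following at the
time of writing (paraphrased from memory, so the exact names may
differ): it walks the per-CPU blocked-vCPU list and kicks every vCPU
whose PI descriptor has ON set, i.e. the same walk whose cost grows
with the list length:

static void wakeup_handler(void)
{
    struct kvm_vcpu *vcpu;
    int cpu = smp_processor_id();

    spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
    list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
                        blocked_vcpu_list) {
        struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);

        /* Wake the vCPU the posted interrupt was aimed at. */
        if (pi_test_on(pi_desc))
            kvm_vcpu_kick(vcpu);
    }
    spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
}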

> In both cases, waiting 30ms to see if we should wake somebody up is
> far too long.
>
Absolutely!

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)