Mailing List Archive

Re: vmx: VT-d posted-interrupt core logic handling
>>> On 10.03.16 at 06:09, <kevin.tian@intel.com> wrote:
> It's always good to have a clear definition to which extend a performance
> issue would become a security risk. I saw 200us/500us used as example
> in this thread, however no one can give an accrual criteria. In that case,
> how do we call it a problem even when Feng collected some data? Based
> on mindset from all maintainers?

I think I've already made clear in previous comments that such
measurements won't lead anywhere. What we need is a
guarantee (by way of enforcement in source code) that the
lists can't grow overly large, compared to the total load placed
on the system.

> I think a good way of looking at this is based on which capability is
> impacted.
> In this specific case the directly impacted metric is the interrupt delivery
> latency. However today Xen is not RT-capable. Xen doesn't commit to
> deliver a worst-case 10us interrupt latency. The whole interrupt delivery
> path
> (from Xen into Guest) has not been optimized yet, then there could be other
> reasons impacting latency too beside the concern on this specific list walk.
> There is no baseline worst-case data w/o PI. There is no final goal to hit.
> There is no test case to measure.
>
> Then why blocking this feature due to this unmeasurable concern and why
> not enabling it and then improving it later when it becomes a measurable
> concern when Xen will commit a clear interrupt latency goal will be
> committed
> by Xen (at that time people working on that effort will have to identify all
> kinds
> of problems impacting interrupt latency and then can optimize together)?
> People should understand possibly bad interrupt latency in extreme cases
> like discussed in this thread (w/ or w/o PI), since Xen doesn't commit
> anything
> here.

I've never made any reference to this being an interrupt latency
issue; I think it was George who somehow implied this from earlier
comments. Interrupt latency, at least generally, isn't a security
concern (generally because of course latency can get so high that
it might become a concern). All my previous remarks regarding the
issue are solely from the common perspective of long running
operations (which we've been dealing with outside of interrupt
context in a variety of cases, as you may recall). Hence the purely
theoretical basis for some sort of measurement would be to
determine how long a worst case list traversal would take. With
"worst case" being derived from the theoretical limits the
hypervisor implementation so far implies: 128 vCPU-s per domain
(a limit which we sooner or later will need to lift, i.e. taking into
consideration a larger value - like the 8k for PV guests - wouldn't
hurt) by 32k domains per host, totaling to 4M possible list entries.
Yes, it is obvious that this limit won't be reachable in practice, but
no, any lower limit can't be guaranteed to be good enough.

But I'm just now noticing this is the wrong thread to have this
discussion in - George specifically branched off the thread with
the new topic to separate the general discussion from the
specific case of the criteria for default enabling VT-d PI. So let's
please move this back to the other sub-thread (and I've
changed the subject back to express this).

Jan


Re: vmx: VT-d posted-interrupt core logic handling
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, March 10, 2016 4:07 PM
>
> >>> On 10.03.16 at 06:09, <kevin.tian@intel.com> wrote:
> > It's always good to have a clear definition to which extend a performance
> > issue would become a security risk. I saw 200us/500us used as example
> > in this thread, however no one can give an accrual criteria. In that case,
> > how do we call it a problem even when Feng collected some data? Based
> > on mindset from all maintainers?
>
> I think I've already made clear in previous comments that such
> measurements won't lead anywhere. What we need is a
> guarantee (by way of enforcement in source code) that the
> lists can't grow overly large, compared to the total load placed
> on the system.

Thanks for the clarity here.

>
> > I think a good way of looking at this is based on which capability is
> > impacted.
> > In this specific case the directly impacted metric is the interrupt delivery
> > latency. However today Xen is not RT-capable. Xen doesn't commit to
> > deliver a worst-case 10us interrupt latency. The whole interrupt delivery
> > path
> > (from Xen into Guest) has not been optimized yet, then there could be other
> > reasons impacting latency too beside the concern on this specific list walk.
> > There is no baseline worst-case data w/o PI. There is no final goal to hit.
> > There is no test case to measure.
> >
> > Then why blocking this feature due to this unmeasurable concern and why
> > not enabling it and then improving it later when it becomes a measurable
> > concern when Xen will commit a clear interrupt latency goal will be
> > committed
> > by Xen (at that time people working on that effort will have to identify all
> > kinds
> > of problems impacting interrupt latency and then can optimize together)?
> > People should understand possibly bad interrupt latency in extreme cases
> > like discussed in this thread (w/ or w/o PI), since Xen doesn't commit
> > anything
> > here.
>
> I've never made any reference to this being an interrupt latency
> issue; I think it was George who somehow implied this from earlier
> comments. Interrupt latency, at least generally, isn't a security
> concern (generally because of course latency can get so high that
> it might become a concern). All my previous remarks regarding the
> issue are solely from the common perspective of long running
> operations (which we've been dealing with outside of interrupt
> context in a variety of cases, as you may recall). Hence the purely

Yes, that concern makes sense.

> theoretical basis for some sort of measurement would be to
> determine how long a worst case list traversal would take. With
> "worst case" being derived from the theoretical limits the
> hypervisor implementation so far implies: 128 vCPU-s per domain
> (a limit which we sooner or later will need to lift, i.e. taking into
> consideration a larger value - like the 8k for PV guests - wouldn't
> hurt) by 32k domains per host, totaling to 4M possible list entries.
> Yes, it is obvious that this limit won't be reachable in practice, but
> no, any lower limit can't be guaranteed to be good enough.

Here, do you think '4M' possible entries is already 'overly large',
so that we must have some enforcement in code, or are experiments
still required to verify that '4M' really is a problem (since the
total overhead depends on what we do with each entry)? If the latter,
what's the criterion for calling it a problem (e.g. 200us in total)?

There are many linked-list usages in the Xen hypervisor today, each
with a different theoretical maximum length. The closest one to PI
might be the usage in tmem (pool->share_list), which is page based
and so could grow 'overly large'. Other examples are orders of
magnitude lower, e.g. s->ioreq_vcpu_list in the ioreq server (which
could be 8K in the above example), and d->arch.hvm_domain.msixtbl_list
in MSI-X virtualization (which could be 2^11 per the spec). Do we
also want to create some artificial scenarios to examine them, since
depending on what is actually done per entry, lists with thousands
of entries may also become a problem?

I just want to figure out how best we can deal with all related
linked-list usages in the current hypervisor.

>
> But I'm just now noticing this is the wrong thread to have this
> discussion in - George specifically branched off the thread with
> the new topic to separate the general discussion from the
> specific case of the criteria for default enabling VT-d PI. So let's
> please move this back to the other sub-thread (and I've
> changed to subject back to express this).
>

Sorry for cross-posting.

Thanks
Kevin

Re: vmx: VT-d posted-interrupt core logic handling
>>> On 10.03.16 at 09:43, <kevin.tian@intel.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Thursday, March 10, 2016 4:07 PM
>>
>> theoretical basis for some sort of measurement would be to
>> determine how long a worst case list traversal would take. With
>> "worst case" being derived from the theoretical limits the
>> hypervisor implementation so far implies: 128 vCPU-s per domain
>> (a limit which we sooner or later will need to lift, i.e. taking into
>> consideration a larger value - like the 8k for PV guests - wouldn't
>> hurt) by 32k domains per host, totaling to 4M possible list entries.
>> Yes, it is obvious that this limit won't be reachable in practice, but
>> no, any lower limit can't be guaranteed to be good enough.
>
> Here do you think whether '4M' possible entries are 'overly large'
> so we must have some enforcement in code, or still some experiments
> required to verify '4M' does been a problem (since total overhead
> depends on what we do with each entry)? If the latter what's the
> criteria to define it as a problem (e.g. 200us in total)?

Well, even with a single loop iteration taking just 1ns, 4M entries
already make 4ms. Anything reaching the order of the minimum
scheduler time slice is potentially problematic. Anything reaching
the order of 1s is known to be actively bad outside of interrupt
context; within interrupt context you need to also consider
interrupt rate of course, so 4ms likely would already open the
potential of a CPU not making any forward progress anymore.
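
Just to spell that arithmetic out (a back-of-the-envelope sketch
only, not Xen code, and the 1ns per iteration is of course just an
assumed, optimistic figure):

    #include <stdio.h>

    int main(void)
    {
        unsigned long vcpus_per_dom = 128;        /* current HVM limit */
        unsigned long doms_per_host = 32 * 1024;  /* 32k domains per host */
        unsigned long entries = vcpus_per_dom * doms_per_host;
        double ns_per_entry = 1.0;                /* assumed cost per iteration */

        /* 4194304 entries at 1ns each -> roughly 4.2ms per full traversal */
        printf("%lu entries -> %.2f ms\n",
               entries, entries * ns_per_entry / 1e6);
        return 0;
    }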

> There are many linked list usages today in Xen hypervisor, which
> have different theoretical maximum possible number. The closest
> one to PI might be the usage in tmem (pool->share_list) which is
> page based so could grow 'overly large'. Other examples are
> magnitude lower, e.g. s->ioreq_vcpu_list in ioreq server (which
> could be 8K in above example), and d->arch.hvm_domain.msixtbl_list
> in MSI-x virtualization (which could be 2^11 per spec). Do we
> also want to create some artificial scenarios to examine them
> since based on actual operation K-level entries may also become
> a problem?
>
> Just want to figure out how best we can solve all related linked-list
> usages in current hypervisor.

As you say, those are (perhaps with the exception of tmem, which
isn't supported anyway due to XSA-15, and which therefore also
isn't on by default) in the order of a few thousand list elements.
And as mentioned above, different bounds apply for lists traversed
in interrupt context vs. those traversed only in "normal" context.

Jan


Re: vmx: VT-d posted-interrupt core logic handling
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, March 10, 2016 5:06 PM
>
>
> > There are many linked list usages today in Xen hypervisor, which
> > have different theoretical maximum possible number. The closest
> > one to PI might be the usage in tmem (pool->share_list) which is
> > page based so could grow 'overly large'. Other examples are
> > magnitude lower, e.g. s->ioreq_vcpu_list in ioreq server (which
> > could be 8K in above example), and d->arch.hvm_domain.msixtbl_list
> > in MSI-x virtualization (which could be 2^11 per spec). Do we
> > also want to create some artificial scenarios to examine them
> > since based on actual operation K-level entries may also become
> > a problem?
> >
> > Just want to figure out how best we can solve all related linked-list
> > usages in current hypervisor.
>
> As you say, those are (perhaps with the exception of tmem, which
> isn't supported anyway due to XSA-15, and which therefore also
> isn't on by default) in the order of a few thousand list elements.
> And as mentioned above, different bounds apply for lists traversed
> in interrupt context vs such traversed only in "normal" context.
>

That's a good point. Interrupt context should have more restrictions.

Thanks
Kevin

Re: vmx: VT-d posted-interrupt core logic handling
> From: Tian, Kevin
> Sent: Thursday, March 10, 2016 5:20 PM
>
> > From: Jan Beulich [mailto:JBeulich@suse.com]
> > Sent: Thursday, March 10, 2016 5:06 PM
> >
> >
> > > There are many linked list usages today in Xen hypervisor, which
> > > have different theoretical maximum possible number. The closest
> > > one to PI might be the usage in tmem (pool->share_list) which is
> > > page based so could grow 'overly large'. Other examples are
> > > magnitude lower, e.g. s->ioreq_vcpu_list in ioreq server (which
> > > could be 8K in above example), and d->arch.hvm_domain.msixtbl_list
> > > in MSI-x virtualization (which could be 2^11 per spec). Do we
> > > also want to create some artificial scenarios to examine them
> > > since based on actual operation K-level entries may also become
> > > a problem?
> > >
> > > Just want to figure out how best we can solve all related linked-list
> > > usages in current hypervisor.
> >
> > As you say, those are (perhaps with the exception of tmem, which
> > isn't supported anyway due to XSA-15, and which therefore also
> > isn't on by default) in the order of a few thousand list elements.
> > And as mentioned above, different bounds apply for lists traversed
> > in interrupt context vs such traversed only in "normal" context.
> >
>
> That's a good point. Interrupt context should have more restrictions.

Hi, Jan,

I'm thinking about your earlier idea of an evenly distributed list:

--
Ah, right, I think that limitation was named before, yet I've
forgotten about it again. But that only slightly alters the
suggestion: To distribute vCPU-s evenly would then require to
change their placement on the pCPU in the course of entering
blocked state.
--

Actually, after more thinking, there is no hard requirement that
the vcpu must block on the pcpu which is configured in 'NDST'
of that vcpu's PI descriptor. What really matters is that the
vcpu is added to the linked list of that very pcpu; then, when a PI
notification comes, we can always find the vcpu struct on that
pcpu's linked list. Of course one drawback of such placement is
the additional IPI incurred in the wakeup path.

Then one possible optimized policy within vmx_vcpu_block could
be:

(Say PCPU1 is the pcpu which VCPU1 is currently blocked on)
- As long as the #vcpus in the linked list on PCPU1 is below a
threshold (say 16), add VCPU1 to the list. NDST set to PCPU1;
Upon PI notification on PCPU1, local linked list is searched to
find VCPU1 and then VCPU1 will be unblocked on PCPU1;

- Otherwise, add VCPU1 to PCPU2 based on a simple distribution
algorithm (based on vcpu_id/vm_id). VCPU1 still blocks on PCPU1
but NDST set to PCPU2. Upon notification on PCPU2, local linked
list is searched to find VCPU1 and then an IPI is sent to PCPU1 to
unblock VCPU1;
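
Roughly, what I have in mind would look something like the below
(an untested sketch only -- the per-pCPU structure, the list field
and the pi_* helper names are invented here for illustration and
don't match the actual patch series):

    #define PI_LIST_LIMIT 16    /* the threshold discussed above */

    /* Untested sketch; all pi_* names are made up for this mail. */
    static void vmx_pi_add_to_list(struct vcpu *v)
    {
        unsigned int cpu = v->processor;               /* "PCPU1" */
        struct pi_blocking_list *pl = &per_cpu(pi_blocking, cpu);

        if ( pl->count >= PI_LIST_LIMIT )
        {
            /*
             * Spill over to another pCPU ("PCPU2") picked by a simple
             * hash (ignoring sparse/offline CPU numbering here); the
             * vCPU itself still blocks on PCPU1.
             */
            cpu = (v->vcpu_id + v->domain->domain_id) % num_online_cpus();
            pl = &per_cpu(pi_blocking, cpu);
        }

        spin_lock(&pl->lock);
        list_add_tail(&v->pi_blocking_entry, &pl->list);
        pl->count++;
        spin_unlock(&pl->lock);

        /*
         * NDST now points at 'cpu', so the notification and the list
         * walk happen there; if that isn't PCPU1, the wakeup handler
         * simply ends up IPI-ing PCPU1 via the normal vcpu_unblock()
         * path.
         */
        pi_set_ndst(&v->pi_desc, cpu);   /* hypothetical helper */
    }

The hash doesn't need to be clever, only stable -- the point is just
to spread entries out once one pCPU's list gets long.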

Feng, do you see anything I've overlooked here? :-)

Thanks
Kevin

Re: vmx: VT-d posted-interrupt core logic handling
>>> On 10.03.16 at 11:05, <kevin.tian@intel.com> wrote:
>> From: Tian, Kevin
>> Sent: Thursday, March 10, 2016 5:20 PM
>>
>> > From: Jan Beulich [mailto:JBeulich@suse.com]
>> > Sent: Thursday, March 10, 2016 5:06 PM
>> >
>> >
>> > > There are many linked list usages today in Xen hypervisor, which
>> > > have different theoretical maximum possible number. The closest
>> > > one to PI might be the usage in tmem (pool->share_list) which is
>> > > page based so could grow 'overly large'. Other examples are
>> > > magnitude lower, e.g. s->ioreq_vcpu_list in ioreq server (which
>> > > could be 8K in above example), and d->arch.hvm_domain.msixtbl_list
>> > > in MSI-x virtualization (which could be 2^11 per spec). Do we
>> > > also want to create some artificial scenarios to examine them
>> > > since based on actual operation K-level entries may also become
>> > > a problem?
>> > >
>> > > Just want to figure out how best we can solve all related linked-list
>> > > usages in current hypervisor.
>> >
>> > As you say, those are (perhaps with the exception of tmem, which
>> > isn't supported anyway due to XSA-15, and which therefore also
>> > isn't on by default) in the order of a few thousand list elements.
>> > And as mentioned above, different bounds apply for lists traversed
>> > in interrupt context vs such traversed only in "normal" context.
>> >
>>
>> That's a good point. Interrupt context should have more restrictions.
>
> Hi, Jan,
>
> I'm thinking your earlier idea about evenly distributed list:
>
> --
> Ah, right, I think that limitation was named before, yet I've
> forgotten about it again. But that only slightly alters the
> suggestion: To distribute vCPU-s evenly would then require to
> change their placement on the pCPU in the course of entering
> blocked state.
> --
>
> Actually after more thinking, there is no hard requirement that
> the vcpu must block on the pcpu which is configured in 'NDST'
> of that vcpu's PI descriptor. What really matters, is that the
> vcpu is added to the linked list of the very pcpu, then when PI
> notification comes we can always find out the vcpu struct from
> that pcpu's linked list. Of course one drawback of such placement
> is additional IPI incurred in wake up path.
>
> Then one possible optimized policy within vmx_vcpu_block could
> be:
>
> (Say PCPU1 which VCPU1 is currently blocked on)
> - As long as the #vcpus in the linked list on PCPU1 is below a
> threshold (say 16), add VCPU1 to the list. NDST set to PCPU1;
> Upon PI notification on PCPU1, local linked list is searched to
> find VCPU1 and then VCPU1 will be unblocked on PCPU1;
>
> - Otherwise, add VCPU1 to PCPU2 based on a simple distribution
> algorithm (based on vcpu_id/vm_id). VCPU1 still blocks on PCPU1
> but NDST set to PCPU2. Upon notification on PCPU2, local linked
> list is searched to find VCPU1 and then an IPI is sent to PCPU1 to
> unblock VCPU1;

Sounds possible, if the lock handling can be got right. But of
course there can't be any hard limit like 16, at least not alone
(on systems with extremely many mostly idle vCPU-s we'd
need to allow larger counts - see my earlier explanations in this
regard).

Jan


Re: vmx: VT-d posted-interrupt core logic handling
On 10/03/16 10:18, Jan Beulich wrote:
>>>> On 10.03.16 at 11:05, <kevin.tian@intel.com> wrote:
>>> From: Tian, Kevin
>>> Sent: Thursday, March 10, 2016 5:20 PM
>>>
>>>> From: Jan Beulich [mailto:JBeulich@suse.com]
>>>> Sent: Thursday, March 10, 2016 5:06 PM
>>>>
>>>>
>>>>> There are many linked list usages today in Xen hypervisor, which
>>>>> have different theoretical maximum possible number. The closest
>>>>> one to PI might be the usage in tmem (pool->share_list) which is
>>>>> page based so could grow 'overly large'. Other examples are
>>>>> magnitude lower, e.g. s->ioreq_vcpu_list in ioreq server (which
>>>>> could be 8K in above example), and d->arch.hvm_domain.msixtbl_list
>>>>> in MSI-x virtualization (which could be 2^11 per spec). Do we
>>>>> also want to create some artificial scenarios to examine them
>>>>> since based on actual operation K-level entries may also become
>>>>> a problem?
>>>>>
>>>>> Just want to figure out how best we can solve all related linked-list
>>>>> usages in current hypervisor.
>>>>
>>>> As you say, those are (perhaps with the exception of tmem, which
>>>> isn't supported anyway due to XSA-15, and which therefore also
>>>> isn't on by default) in the order of a few thousand list elements.
>>>> And as mentioned above, different bounds apply for lists traversed
>>>> in interrupt context vs such traversed only in "normal" context.
>>>>
>>>
>>> That's a good point. Interrupt context should have more restrictions.
>>
>> Hi, Jan,
>>
>> I'm thinking your earlier idea about evenly distributed list:
>>
>> --
>> Ah, right, I think that limitation was named before, yet I've
>> forgotten about it again. But that only slightly alters the
>> suggestion: To distribute vCPU-s evenly would then require to
>> change their placement on the pCPU in the course of entering
>> blocked state.
>> --
>>
>> Actually after more thinking, there is no hard requirement that
>> the vcpu must block on the pcpu which is configured in 'NDST'
>> of that vcpu's PI descriptor. What really matters, is that the
>> vcpu is added to the linked list of the very pcpu, then when PI
>> notification comes we can always find out the vcpu struct from
>> that pcpu's linked list. Of course one drawback of such placement
>> is additional IPI incurred in wake up path.
>>
>> Then one possible optimized policy within vmx_vcpu_block could
>> be:
>>
>> (Say PCPU1 which VCPU1 is currently blocked on)
>> - As long as the #vcpus in the linked list on PCPU1 is below a
>> threshold (say 16), add VCPU1 to the list. NDST set to PCPU1;
>> Upon PI notification on PCPU1, local linked list is searched to
>> find VCPU1 and then VCPU1 will be unblocked on PCPU1;
>>
>> - Otherwise, add VCPU1 to PCPU2 based on a simple distribution
>> algorithm (based on vcpu_id/vm_id). VCPU1 still blocks on PCPU1
>> but NDST set to PCPU2. Upon notification on PCPU2, local linked
>> list is searched to find VCPU1 and then an IPI is sent to PCPU1 to
>> unblock VCPU1;
>
> Sounds possible, if the lock handling can be got right. But of
> course there can't be any hard limit like 16, at least not alone
> (on a systems with extremely many mostly idle vCPU-s we'd
> need to allow larger counts - see my earlier explanations in this
> regard).

You could also consider only waking the first N VCPUs and just making
the rest runnable. If you wake more VCPUs than PCPUs at the same time
most of them won't actually be scheduled.

N would be some measure of how many VCPUs could be run immediately (with
N <= number of PCPUs).

David


Re: vmx: VT-d posted-interrupt core logic handling
On 10/03/16 08:07, Jan Beulich wrote:
>>>> On 10.03.16 at 06:09, <kevin.tian@intel.com> wrote:
>> It's always good to have a clear definition to which extend a performance
>> issue would become a security risk. I saw 200us/500us used as example
>> in this thread, however no one can give an accrual criteria. In that case,
>> how do we call it a problem even when Feng collected some data? Based
>> on mindset from all maintainers?
>
> I think I've already made clear in previous comments that such
> measurements won't lead anywhere. What we need is a
> guarantee (by way of enforcement in source code) that the
> lists can't grow overly large, compared to the total load placed
> on the system.
>
>> I think a good way of looking at this is based on which capability is
>> impacted.
>> In this specific case the directly impacted metric is the interrupt delivery
>> latency. However today Xen is not RT-capable. Xen doesn't commit to
>> deliver a worst-case 10us interrupt latency. The whole interrupt delivery
>> path
>> (from Xen into Guest) has not been optimized yet, then there could be other
>> reasons impacting latency too beside the concern on this specific list walk.
>> There is no baseline worst-case data w/o PI. There is no final goal to hit.
>> There is no test case to measure.
>>
>> Then why blocking this feature due to this unmeasurable concern and why
>> not enabling it and then improving it later when it becomes a measurable
>> concern when Xen will commit a clear interrupt latency goal will be
>> committed
>> by Xen (at that time people working on that effort will have to identify all
>> kinds
>> of problems impacting interrupt latency and then can optimize together)?
>> People should understand possibly bad interrupt latency in extreme cases
>> like discussed in this thread (w/ or w/o PI), since Xen doesn't commit
>> anything
>> here.
>
> I've never made any reference to this being an interrupt latency
> issue; I think it was George who somehow implied this from earlier
> comments. Interrupt latency, at least generally, isn't a security
> concern (generally because of course latency can get so high that
> it might become a concern). All my previous remarks regarding the
> issue are solely from the common perspective of long running
> operations (which we've been dealing with outside of interrupt
> context in a variety of cases, as you may recall). Hence the purely
> theoretical basis for some sort of measurement would be to
> determine how long a worst case list traversal would take. With
> "worst case" being derived from the theoretical limits the
> hypervisor implementation so far implies: 128 vCPU-s per domain
> (a limit which we sooner or later will need to lift, i.e. taking into
> consideration a larger value - like the 8k for PV guests - wouldn't
> hurt) by 32k domains per host, totaling to 4M possible list entries.
> Yes, it is obvious that this limit won't be reachable in practice, but
> no, any lower limit can't be guaranteed to be good enough.

Can I suggest we suspend the discussion of what would or would not be
reasonable and come back to it next week? I definitely feel myself
digging my heels in here, so it might be good to go away and come back
to the discussion with a bit of distance.

(Potential technical solutions are still fair game, I think.)

-George

Re: vmx: VT-d posted-interrupt core logic handling
On 10/03/16 10:35, David Vrabel wrote:
> On 10/03/16 10:18, Jan Beulich wrote:
>>>>> On 10.03.16 at 11:05, <kevin.tian@intel.com> wrote:
>>>> From: Tian, Kevin
>>>> Sent: Thursday, March 10, 2016 5:20 PM
>>>>
>>>>> From: Jan Beulich [mailto:JBeulich@suse.com]
>>>>> Sent: Thursday, March 10, 2016 5:06 PM
>>>>>
>>>>>
>>>>>> There are many linked list usages today in Xen hypervisor, which
>>>>>> have different theoretical maximum possible number. The closest
>>>>>> one to PI might be the usage in tmem (pool->share_list) which is
>>>>>> page based so could grow 'overly large'. Other examples are
>>>>>> magnitude lower, e.g. s->ioreq_vcpu_list in ioreq server (which
>>>>>> could be 8K in above example), and d->arch.hvm_domain.msixtbl_list
>>>>>> in MSI-x virtualization (which could be 2^11 per spec). Do we
>>>>>> also want to create some artificial scenarios to examine them
>>>>>> since based on actual operation K-level entries may also become
>>>>>> a problem?
>>>>>>
>>>>>> Just want to figure out how best we can solve all related linked-list
>>>>>> usages in current hypervisor.
>>>>>
>>>>> As you say, those are (perhaps with the exception of tmem, which
>>>>> isn't supported anyway due to XSA-15, and which therefore also
>>>>> isn't on by default) in the order of a few thousand list elements.
>>>>> And as mentioned above, different bounds apply for lists traversed
>>>>> in interrupt context vs such traversed only in "normal" context.
>>>>>
>>>>
>>>> That's a good point. Interrupt context should have more restrictions.
>>>
>>> Hi, Jan,
>>>
>>> I'm thinking your earlier idea about evenly distributed list:
>>>
>>> --
>>> Ah, right, I think that limitation was named before, yet I've
>>> forgotten about it again. But that only slightly alters the
>>> suggestion: To distribute vCPU-s evenly would then require to
>>> change their placement on the pCPU in the course of entering
>>> blocked state.
>>> --
>>>
>>> Actually after more thinking, there is no hard requirement that
>>> the vcpu must block on the pcpu which is configured in 'NDST'
>>> of that vcpu's PI descriptor. What really matters, is that the
>>> vcpu is added to the linked list of the very pcpu, then when PI
>>> notification comes we can always find out the vcpu struct from
>>> that pcpu's linked list. Of course one drawback of such placement
>>> is additional IPI incurred in wake up path.
>>>
>>> Then one possible optimized policy within vmx_vcpu_block could
>>> be:
>>>
>>> (Say PCPU1 which VCPU1 is currently blocked on)
>>> - As long as the #vcpus in the linked list on PCPU1 is below a
>>> threshold (say 16), add VCPU1 to the list. NDST set to PCPU1;
>>> Upon PI notification on PCPU1, local linked list is searched to
>>> find VCPU1 and then VCPU1 will be unblocked on PCPU1;
>>>
>>> - Otherwise, add VCPU1 to PCPU2 based on a simple distribution
>>> algorithm (based on vcpu_id/vm_id). VCPU1 still blocks on PCPU1
>>> but NDST set to PCPU2. Upon notification on PCPU2, local linked
>>> list is searched to find VCPU1 and then an IPI is sent to PCPU1 to
>>> unblock VCPU1;
>>
>> Sounds possible, if the lock handling can be got right. But of
>> course there can't be any hard limit like 16, at least not alone
>> (on a systems with extremely many mostly idle vCPU-s we'd
>> need to allow larger counts - see my earlier explanations in this
>> regard).
>
> You could also consider only waking the first N VCPUs and just making
> the rest runnable. If you wake more VCPUs than PCPUs at the same time
> most of them won't actually be scheduled.

"Waking" a vcpu means "changing from blocked to runnable", so those two
things are the same. And I can't figure out what you mean instead --
can you elaborate?

Waking up 1000 vcpus is going to take strictly more time than checking
whether there's a PI interrupt pending on 1000 vcpus to see if they need
to be woken up.

-George

Re: vmx: VT-d posted-interrupt core logic handling
On 10/03/16 10:18, Jan Beulich wrote:
>>>> On 10.03.16 at 11:05, <kevin.tian@intel.com> wrote:
>>> From: Tian, Kevin
>>> Sent: Thursday, March 10, 2016 5:20 PM
>>>
>>>> From: Jan Beulich [mailto:JBeulich@suse.com]
>>>> Sent: Thursday, March 10, 2016 5:06 PM
>>>>
>>>>
>>>>> There are many linked list usages today in Xen hypervisor, which
>>>>> have different theoretical maximum possible number. The closest
>>>>> one to PI might be the usage in tmem (pool->share_list) which is
>>>>> page based so could grow 'overly large'. Other examples are
>>>>> magnitude lower, e.g. s->ioreq_vcpu_list in ioreq server (which
>>>>> could be 8K in above example), and d->arch.hvm_domain.msixtbl_list
>>>>> in MSI-x virtualization (which could be 2^11 per spec). Do we
>>>>> also want to create some artificial scenarios to examine them
>>>>> since based on actual operation K-level entries may also become
>>>>> a problem?
>>>>>
>>>>> Just want to figure out how best we can solve all related linked-list
>>>>> usages in current hypervisor.
>>>>
>>>> As you say, those are (perhaps with the exception of tmem, which
>>>> isn't supported anyway due to XSA-15, and which therefore also
>>>> isn't on by default) in the order of a few thousand list elements.
>>>> And as mentioned above, different bounds apply for lists traversed
>>>> in interrupt context vs such traversed only in "normal" context.
>>>>
>>>
>>> That's a good point. Interrupt context should have more restrictions.
>>
>> Hi, Jan,
>>
>> I'm thinking your earlier idea about evenly distributed list:
>>
>> --
>> Ah, right, I think that limitation was named before, yet I've
>> forgotten about it again. But that only slightly alters the
>> suggestion: To distribute vCPU-s evenly would then require to
>> change their placement on the pCPU in the course of entering
>> blocked state.
>> --
>>
>> Actually after more thinking, there is no hard requirement that
>> the vcpu must block on the pcpu which is configured in 'NDST'
>> of that vcpu's PI descriptor. What really matters, is that the
>> vcpu is added to the linked list of the very pcpu, then when PI
>> notification comes we can always find out the vcpu struct from
>> that pcpu's linked list. Of course one drawback of such placement
>> is additional IPI incurred in wake up path.
>>
>> Then one possible optimized policy within vmx_vcpu_block could
>> be:
>>
>> (Say PCPU1 which VCPU1 is currently blocked on)
>> - As long as the #vcpus in the linked list on PCPU1 is below a
>> threshold (say 16), add VCPU1 to the list. NDST set to PCPU1;
>> Upon PI notification on PCPU1, local linked list is searched to
>> find VCPU1 and then VCPU1 will be unblocked on PCPU1;
>>
>> - Otherwise, add VCPU1 to PCPU2 based on a simple distribution
>> algorithm (based on vcpu_id/vm_id). VCPU1 still blocks on PCPU1
>> but NDST set to PCPU2. Upon notification on PCPU2, local linked
>> list is searched to find VCPU1 and then an IPI is sent to PCPU1 to
>> unblock VCPU1;
>
> Sounds possible, if the lock handling can be got right. But of
> course there can't be any hard limit like 16, at least not alone
> (on a systems with extremely many mostly idle vCPU-s we'd
> need to allow larger counts - see my earlier explanations in this
> regard).

A lot of the scheduling code uses spin_trylock() to just skip over pcpus
that are busy when doing this sort of load-balancing. Using a hash to
choose a default and then cycling through pcpus until you find one whose
lock you can grab should be reasonably efficient.
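
Something along these lines, for instance (just a sketch of the
shape; the per-cpu "pi_blocking" structure and its fields are
placeholders rather than real names):

    /* Sketch only; ignores offline / sparse CPU numbering. */
    static unsigned int pi_pick_blocking_cpu(const struct vcpu *v)
    {
        unsigned int nr = num_online_cpus();
        unsigned int start = v->vcpu_id % nr;      /* hashed default */
        unsigned int i, cpu;

        for ( i = 0; i < nr; i++ )
        {
            cpu = (start + i) % nr;
            if ( spin_trylock(&per_cpu(pi_blocking, cpu).lock) )
                return cpu;        /* caller inserts, then unlocks */
        }

        /* Everyone busy: just wait for the hashed default. */
        spin_lock(&per_cpu(pi_blocking, start).lock);
        return start;
    }

Whether the fallback waits on the hashed lock or does something
smarter is a detail; the main point is that a busy list is simply
skipped rather than waited for.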

Re "an IPI is sent to PCPU1", all that should be transparent to the PI
code -- it already calls vcpu_unblock(), which will call vcpu_wake(),
which calls the scheduling wake code, which will DTRT.

FWIW I have much less objection to this sort of solution if it were
confined to the PI arch_block() callback, rather than something that
required changes to the schedulers.

-George

Re: vmx: VT-d posted-interrupt core logic handling
On 10/03/16 10:46, George Dunlap wrote:
> On 10/03/16 10:35, David Vrabel wrote:
>> On 10/03/16 10:18, Jan Beulich wrote:
>>>>>> On 10.03.16 at 11:05, <kevin.tian@intel.com> wrote:
>>>>> From: Tian, Kevin
>>>>> Sent: Thursday, March 10, 2016 5:20 PM
>>>>>
>>>>>> From: Jan Beulich [mailto:JBeulich@suse.com]
>>>>>> Sent: Thursday, March 10, 2016 5:06 PM
>>>>>>
>>>>>>
>>>>>>> There are many linked list usages today in Xen hypervisor, which
>>>>>>> have different theoretical maximum possible number. The closest
>>>>>>> one to PI might be the usage in tmem (pool->share_list) which is
>>>>>>> page based so could grow 'overly large'. Other examples are
>>>>>>> magnitude lower, e.g. s->ioreq_vcpu_list in ioreq server (which
>>>>>>> could be 8K in above example), and d->arch.hvm_domain.msixtbl_list
>>>>>>> in MSI-x virtualization (which could be 2^11 per spec). Do we
>>>>>>> also want to create some artificial scenarios to examine them
>>>>>>> since based on actual operation K-level entries may also become
>>>>>>> a problem?
>>>>>>>
>>>>>>> Just want to figure out how best we can solve all related linked-list
>>>>>>> usages in current hypervisor.
>>>>>>
>>>>>> As you say, those are (perhaps with the exception of tmem, which
>>>>>> isn't supported anyway due to XSA-15, and which therefore also
>>>>>> isn't on by default) in the order of a few thousand list elements.
>>>>>> And as mentioned above, different bounds apply for lists traversed
>>>>>> in interrupt context vs such traversed only in "normal" context.
>>>>>>
>>>>>
>>>>> That's a good point. Interrupt context should have more restrictions.
>>>>
>>>> Hi, Jan,
>>>>
>>>> I'm thinking your earlier idea about evenly distributed list:
>>>>
>>>> --
>>>> Ah, right, I think that limitation was named before, yet I've
>>>> forgotten about it again. But that only slightly alters the
>>>> suggestion: To distribute vCPU-s evenly would then require to
>>>> change their placement on the pCPU in the course of entering
>>>> blocked state.
>>>> --
>>>>
>>>> Actually after more thinking, there is no hard requirement that
>>>> the vcpu must block on the pcpu which is configured in 'NDST'
>>>> of that vcpu's PI descriptor. What really matters, is that the
>>>> vcpu is added to the linked list of the very pcpu, then when PI
>>>> notification comes we can always find out the vcpu struct from
>>>> that pcpu's linked list. Of course one drawback of such placement
>>>> is additional IPI incurred in wake up path.
>>>>
>>>> Then one possible optimized policy within vmx_vcpu_block could
>>>> be:
>>>>
>>>> (Say PCPU1 which VCPU1 is currently blocked on)
>>>> - As long as the #vcpus in the linked list on PCPU1 is below a
>>>> threshold (say 16), add VCPU1 to the list. NDST set to PCPU1;
>>>> Upon PI notification on PCPU1, local linked list is searched to
>>>> find VCPU1 and then VCPU1 will be unblocked on PCPU1;
>>>>
>>>> - Otherwise, add VCPU1 to PCPU2 based on a simple distribution
>>>> algorithm (based on vcpu_id/vm_id). VCPU1 still blocks on PCPU1
>>>> but NDST set to PCPU2. Upon notification on PCPU2, local linked
>>>> list is searched to find VCPU1 and then an IPI is sent to PCPU1 to
>>>> unblock VCPU1;
>>>
>>> Sounds possible, if the lock handling can be got right. But of
>>> course there can't be any hard limit like 16, at least not alone
>>> (on a systems with extremely many mostly idle vCPU-s we'd
>>> need to allow larger counts - see my earlier explanations in this
>>> regard).
>>
>> You could also consider only waking the first N VCPUs and just making
>> the rest runnable. If you wake more VCPUs than PCPUs at the same time
>> most of them won't actually be scheduled.
>
> "Waking" a vcpu means "changing from blocked to runnable", so those two
> things are the same. And I can't figure out what you mean instead --
> can you elaborate?
>
> Waking up 1000 vcpus is going to take strictly more time than checking
> whether there's a PI interrupt pending on 1000 vcpus to see if they need
> to be woken up.

Waking means making it runnable /and/ attempting to make it run.

So I mean, for the > N'th VCPU don't call __runq_tickle(), only call
__runq_insert().

David

Re: vmx: VT-d posted-interrupt core logic handling
On Thu, 2016-03-10 at 11:00 +0000, George Dunlap wrote:
> > > > > On 10.03.16 at 11:05, <kevin.tian@intel.com> wrote:
> > > >  From: Tian, Kevin
> > > > Sent: Thursday, March 10, 2016 5:20 PM
> > > > 
> > > Actually after more thinking, there is no hard requirement that
> > > the vcpu must block on the pcpu which is configured in 'NDST'
> > > of that vcpu's PI descriptor. What really matters, is that the
> > > vcpu is added to the linked list of the very pcpu, then when PI
> > > notification comes we can always find out the vcpu struct from
> > > that pcpu's linked list. Of course one drawback of such placement
> > > is additional IPI incurred in wake up path.
> > >
> > > 
> Re "an IPI is sent to PCPU1", all that should be transparent to the
> PI
> code -- it already calls vcpu_unblock(), which will call vcpu_wake(),
> which calls the scheduling wake code, which will DTRT.
>
Exactly. In fact, whether there will be any IPI involved is under
control of the scheduler, rather than of PI code, even right now.

In fact, no matter which pCPU's blocked list a vCPU is on, it is the
'tickling' logic (for all of Credit, Credit2 and RTDS) that really decides
on which pCPU the vCPU should wake up, and sends the IPI if that is not
the pCPU we're running on.

It can be argued that having a vCPU in the blocked list of the pCPU
where it was running when it blocked could be a good thing, because it
may then be able to restart running there when waking, which may have
positive cache effects, etc.
But that is not at all guaranteed. In fact, it could well be the case
in fairly idle systems, but under high load and/or if hard and soft
affinity are in use (which may well be the case, .e.g, on large NUMA
servers), that isn't necessarily true, and we should not base on such
assumption.

> FWIW I have much less objection to this sort of solution if it were
> confined to the PI arch_block() callback, rather than something that
> required changes to the schedulers.
>
Same here, and I think this can well be done in such a way... Worth a
shot, IMO.

Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Re: vmx: VT-d posted-interrupt core logic handling
On 10/03/16 11:16, David Vrabel wrote:
> On 10/03/16 10:46, George Dunlap wrote:
>> On 10/03/16 10:35, David Vrabel wrote:
>>> On 10/03/16 10:18, Jan Beulich wrote:
>>>>>>> On 10.03.16 at 11:05, <kevin.tian@intel.com> wrote:
>>>>>> From: Tian, Kevin
>>>>>> Sent: Thursday, March 10, 2016 5:20 PM
>>>>>>
>>>>>>> From: Jan Beulich [mailto:JBeulich@suse.com]
>>>>>>> Sent: Thursday, March 10, 2016 5:06 PM
>>>>>>>
>>>>>>>
>>>>>>>> There are many linked list usages today in Xen hypervisor, which
>>>>>>>> have different theoretical maximum possible number. The closest
>>>>>>>> one to PI might be the usage in tmem (pool->share_list) which is
>>>>>>>> page based so could grow 'overly large'. Other examples are
>>>>>>>> magnitude lower, e.g. s->ioreq_vcpu_list in ioreq server (which
>>>>>>>> could be 8K in above example), and d->arch.hvm_domain.msixtbl_list
>>>>>>>> in MSI-x virtualization (which could be 2^11 per spec). Do we
>>>>>>>> also want to create some artificial scenarios to examine them
>>>>>>>> since based on actual operation K-level entries may also become
>>>>>>>> a problem?
>>>>>>>>
>>>>>>>> Just want to figure out how best we can solve all related linked-list
>>>>>>>> usages in current hypervisor.
>>>>>>>
>>>>>>> As you say, those are (perhaps with the exception of tmem, which
>>>>>>> isn't supported anyway due to XSA-15, and which therefore also
>>>>>>> isn't on by default) in the order of a few thousand list elements.
>>>>>>> And as mentioned above, different bounds apply for lists traversed
>>>>>>> in interrupt context vs such traversed only in "normal" context.
>>>>>>>
>>>>>>
>>>>>> That's a good point. Interrupt context should have more restrictions.
>>>>>
>>>>> Hi, Jan,
>>>>>
>>>>> I'm thinking your earlier idea about evenly distributed list:
>>>>>
>>>>> --
>>>>> Ah, right, I think that limitation was named before, yet I've
>>>>> forgotten about it again. But that only slightly alters the
>>>>> suggestion: To distribute vCPU-s evenly would then require to
>>>>> change their placement on the pCPU in the course of entering
>>>>> blocked state.
>>>>> --
>>>>>
>>>>> Actually after more thinking, there is no hard requirement that
>>>>> the vcpu must block on the pcpu which is configured in 'NDST'
>>>>> of that vcpu's PI descriptor. What really matters, is that the
>>>>> vcpu is added to the linked list of the very pcpu, then when PI
>>>>> notification comes we can always find out the vcpu struct from
>>>>> that pcpu's linked list. Of course one drawback of such placement
>>>>> is additional IPI incurred in wake up path.
>>>>>
>>>>> Then one possible optimized policy within vmx_vcpu_block could
>>>>> be:
>>>>>
>>>>> (Say PCPU1 which VCPU1 is currently blocked on)
>>>>> - As long as the #vcpus in the linked list on PCPU1 is below a
>>>>> threshold (say 16), add VCPU1 to the list. NDST set to PCPU1;
>>>>> Upon PI notification on PCPU1, local linked list is searched to
>>>>> find VCPU1 and then VCPU1 will be unblocked on PCPU1;
>>>>>
>>>>> - Otherwise, add VCPU1 to PCPU2 based on a simple distribution
>>>>> algorithm (based on vcpu_id/vm_id). VCPU1 still blocks on PCPU1
>>>>> but NDST set to PCPU2. Upon notification on PCPU2, local linked
>>>>> list is searched to find VCPU1 and then an IPI is sent to PCPU1 to
>>>>> unblock VCPU1;
>>>>
>>>> Sounds possible, if the lock handling can be got right. But of
>>>> course there can't be any hard limit like 16, at least not alone
>>>> (on a systems with extremely many mostly idle vCPU-s we'd
>>>> need to allow larger counts - see my earlier explanations in this
>>>> regard).
>>>
>>> You could also consider only waking the first N VCPUs and just making
>>> the rest runnable. If you wake more VCPUs than PCPUs at the same time
>>> most of them won't actually be scheduled.
>>
>> "Waking" a vcpu means "changing from blocked to runnable", so those two
>> things are the same. And I can't figure out what you mean instead --
>> can you elaborate?
>>
>> Waking up 1000 vcpus is going to take strictly more time than checking
>> whether there's a PI interrupt pending on 1000 vcpus to see if they need
>> to be woken up.
>
> Waking means making it runnable /and/ attempt to make it running.
>
> So I mean, for the > N'th VCPU don't call __runq_tickle(), only call
> __runq_insert().

I'm not sure that would satisfy Jan; inserting 1000 vcpus into the
runqueue (much less inserting 4 million vcpus) is still going to take
quite a while, even without looking for a place to run them.

-George


Re: vmx: VT-d posted-interrupt core logic handling
>>> On 10.03.16 at 12:16, <david.vrabel@citrix.com> wrote:
> On 10/03/16 10:46, George Dunlap wrote:
>> On 10/03/16 10:35, David Vrabel wrote:
>>> You could also consider only waking the first N VCPUs and just making
>>> the rest runnable. If you wake more VCPUs than PCPUs at the same time
>>> most of them won't actually be scheduled.
>>
>> "Waking" a vcpu means "changing from blocked to runnable", so those two
>> things are the same. And I can't figure out what you mean instead --
>> can you elaborate?
>>
>> Waking up 1000 vcpus is going to take strictly more time than checking
>> whether there's a PI interrupt pending on 1000 vcpus to see if they need
>> to be woken up.
>
> Waking means making it runnable /and/ attempt to make it running.
>
> So I mean, for the > N'th VCPU don't call __runq_tickle(), only call
> __runq_insert().

We expect only some (hopefully small) percentage of the vCPU-s
on the list to actually need unblocking anyway. As George said,
list traversal alone can be an issue here, and we can't stop going
through the list half way. The case where a large portion of the
vCPU-s on the list actually need waking up would be even more
worrying.
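
I.e. the handler unavoidably has this shape (a much simplified
sketch; the per-CPU list and the entry / field names are
placeholders, not the ones from the actual series):

    /* Simplified sketch of the notification handler's list walk. */
    static void pi_wakeup_interrupt(struct cpu_user_regs *regs)
    {
        struct pi_blocking_list *pl = &this_cpu(pi_blocking);
        struct vcpu *v, *tmp;

        spin_lock(&pl->lock);
        /* The whole list has to be walked - we can't stop half way. */
        list_for_each_entry_safe ( v, tmp, &pl->list, pi_blocking_entry )
        {
            /* Only entries with a pending notification need waking. */
            if ( pi_test_on(&v->pi_desc) )
            {
                list_del(&v->pi_blocking_entry);
                vcpu_unblock(v);   /* may IPI whichever pCPU is involved */
            }
        }
        spin_unlock(&pl->lock);

        ack_APIC_irq();
    }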

Jan


Re: vmx: VT-d posted-interrupt core logic handling
> -----Original Message-----
> From: Tian, Kevin
> Sent: Thursday, March 10, 2016 6:06 PM
> To: Jan Beulich <JBeulich@suse.com>
> Cc: Andrew Cooper <andrew.cooper3@citrix.com>; Dario Faggioli
> <dario.faggioli@citrix.com>; David Vrabel <david.vrabel@citrix.com>;
> GeorgeDunlap <george.dunlap@citrix.com>; Lars Kurth <lars.kurth@citrix.com>;
> George Dunlap <George.Dunlap@eu.citrix.com>; Ian Jackson
> <Ian.Jackson@eu.citrix.com>; Wu, Feng <feng.wu@intel.com>; xen-
> devel@lists.xen.org; Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Subject: RE: vmx: VT-d posted-interrupt core logic handling
>
>
> Hi, Jan,
>
> I'm thinking your earlier idea about evenly distributed list:
>
> --
> Ah, right, I think that limitation was named before, yet I've
> forgotten about it again. But that only slightly alters the
> suggestion: To distribute vCPU-s evenly would then require to
> change their placement on the pCPU in the course of entering
> blocked state.
> --
>
> Actually after more thinking, there is no hard requirement that
> the vcpu must block on the pcpu which is configured in 'NDST'
> of that vcpu's PI descriptor. What really matters, is that the
> vcpu is added to the linked list of the very pcpu, then when PI
> notification comes we can always find out the vcpu struct from
> that pcpu's linked list. Of course one drawback of such placement
> is additional IPI incurred in wake up path.
>
> Then one possible optimized policy within vmx_vcpu_block could
> be:
>
> (Say PCPU1 which VCPU1 is currently blocked on)
> - As long as the #vcpus in the linked list on PCPU1 is below a
> threshold (say 16), add VCPU1 to the list. NDST set to PCPU1;
> Upon PI notification on PCPU1, local linked list is searched to
> find VCPU1 and then VCPU1 will be unblocked on PCPU1;
>
> - Otherwise, add VCPU1 to PCPU2 based on a simple distribution
> algorithm (based on vcpu_id/vm_id). VCPU1 still blocks on PCPU1
> but NDST set to PCPU2. Upon notification on PCPU2, local linked
> list is searched to find VCPU1 and then an IPI is sent to PCPU1 to
> unblock VCPU1;
>
> Feng, do you see any overlook here? :-)

Kevin, thanks for the suggestion; it sounds like a good idea. I will
think about it a bit more and do some trials based on it.

Thanks,
Feng

>
> Thanks
> Kevin
