Mailing List Archive

[PATCH v2] x86/hvm: re-work viridian APIC assist code
It appears there is a case where Windows enables the APIC assist
enlightenment[1] but does not use it. This scenario is perfectly valid
according to the documentation, but causes the state machine in Xen to
become confused, leading to a domain_crash() such as the following:

(XEN) d4: VIRIDIAN GUEST_OS_ID: vendor: 1 os: 4 major: 6 minor: 1 sp: 0
build: 1db0
(XEN) d4: VIRIDIAN HYPERCALL: enabled: 1 pfn: 3ffff
(XEN) d4v0: VIRIDIAN VP_ASSIST_PAGE: enabled: 1 pfn: 3fffe
(XEN) domain_crash called from viridian.c:452
(XEN) Domain 4 (vcpu#0) crashed on cpu#1:

The following sequence of events is an example of how this can happen:

- On return to guest vlapic_has_pending_irq() finds a bit set in the IRR.
- vlapic_ack_pending_irq() calls viridian_start_apic_assist() which latches
the vector, sets the bit in the ISR and clears it from the IRR.
- The guest then processes the interrupt but EOIs it normally, therefore
clearing the bit in the ISR.
- On next return to guest vlapic_has_pending_irq() calls
viridian_complete_apic_assist(), which discovers the assist bit still set
in the shared page and therefore leaves the latched vector in place, but
also finds another bit set in the IRR.
- vlapic_ack_pending_irq() is then called but, because the ISR was
cleared by the EOI, another call is made to viridian_start_apic_assist()
and this then calls domain_crash() because it finds the latched vector
has not been cleared (see the sketch below).
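
The following condensed sketch of the pre-patch latching logic (taken
from the '-' lines in the diff below) shows where the sequence above
ends up:

void viridian_start_apic_assist(struct vcpu *v, int vector)
{
    uint32_t *va = v->arch.hvm_vcpu.viridian.vp_assist.va;

    if ( !va || vector < 0x10 )
        return;

    /*
     * The old state machine assumed a still-latched vector could only
     * mean a bug. But in the sequence above the guest's normal EOI
     * cleared the ISR without clearing this latch, so the second call
     * lands here and takes the domain down.
     */
    if ( v->arch.hvm_vcpu.viridian.vp_assist.vector )
        domain_crash(v->domain);

    v->arch.hvm_vcpu.viridian.vp_assist.vector = vector;
    *va |= 1u; /* set "No EOI required" in the shared page */
}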

Having revisited the code, I also conclude that Xen's implementation of the
enlightenment is currently wrong: we are not properly following the
specification.

The specification says:

"The hypervisor sets the “No EOI required” bit when it injects a virtual
interrupt if the following conditions are satisfied:

- The virtual interrupt is edge-triggered, and
- There are no lower priority interrupts pending.

If, at a later time, a lower priority interrupt is requested, the
hypervisor clears the “No EOI required” bit such that a subsequent EOI causes
an intercept.
In case of nested interrupts, the EOI intercept is avoided only for the
highest priority interrupt. This is necessary since no count is maintained
for the number of EOIs performed by the OS. Therefore only the first EOI
can be avoided and since the first EOI clears the “No EOI Required” bit,
the next EOI generates an intercept."

Thus it is quite legitimate to set the "No EOI required" bit and then
subsequently take a higher priority interrupt without clearing the bit.
The avoided EOI will then relate to that subsequent interrupt rather than
to the interrupt that was highest priority when the bit was set. Hence
latching the vector when setting the bit is not particularly useful, and
somewhat misleading.
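
To illustrate the guest's side of the protocol, here is a sketch based
on the spec wording above (not actual Windows or Xen code; bit 0 of the
first 32-bit word of the VP assist page is the "No EOI required" bit,
and the helper names are invented):

/* Guest interrupt handler epilogue (sketch). */
volatile uint32_t *assist = vp_assist_page; /* registered via the VP assist MSR */

if ( *assist & 1u )
    *assist &= ~1u;   /* consume the hint: the EOI write is skipped */
else
    apic_write_eoi(); /* no hint: EOI the local APIC normally */

Nothing stops the guest performing the EOI anyway and leaving the bit
set, which is exactly the case that confused the old state machine.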

This patch re-works the APIC assist code to simply track when the "No EOI
required" bit is set and test if it has been cleared by the guest (i.e.
'completing' the APIC assist), thus indicating a 'missed EOI'. Missed EOIs
need to be dealt with in two places:

- In vlapic_has_pending_irq(), to avoid comparing the IRR against a stale
ISR, and
- In vlapic_EOI_set() because a missed EOI for a higher priority vector
should be dealt with before the actual EOI for the lower priority
vector.

Furthermore, because the guest is at liberty to ignore the "No EOI required"
bit (which led to the crash detailed above), vlapic_EOI_set() must also make
sure the bit is cleared to avoid confusing the state machine.

Lastly the previous code did not properly emulate an EOI if a missed EOI
was discovered in vlapic_has_pending_irq(); it merely cleared the bit in
the ISR. The new code instead calls vlapic_EOI_set().

[1] See section 10.3.5 of Microsoft's "Hypervisor Top Level Functional
Specification v5.0b".

NOTE: The changes to the save/restore code are safe because the layout
of struct hvm_viridian_vcpu_context is unchanged and the new
interpretation of the (previously so named) vp_assist_vector field
as the boolean pending flag maintains the correct semantics.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
---
Cc: Jan Beulich <jbeulich@suse.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>

v2:
- Extend patch to completely re-work APIC assist code
---
xen/arch/x86/hvm/viridian.c | 36 ++++++++--------
xen/arch/x86/hvm/vlapic.c | 75 ++++++++++++++++++++++++----------
xen/include/asm-x86/hvm/viridian.h | 8 ++--
xen/include/public/arch-x86/hvm/save.h | 2 +-
4 files changed, 76 insertions(+), 45 deletions(-)

diff --git a/xen/arch/x86/hvm/viridian.c b/xen/arch/x86/hvm/viridian.c
index f0fa59d7d5..cc40af1fb5 100644
--- a/xen/arch/x86/hvm/viridian.c
+++ b/xen/arch/x86/hvm/viridian.c
@@ -433,46 +433,44 @@ static void teardown_vp_assist(struct vcpu *v)
put_page_and_type(page);
}

-void viridian_start_apic_assist(struct vcpu *v, int vector)
+void viridian_set_apic_assist(struct vcpu *v)
{
uint32_t *va = v->arch.hvm_vcpu.viridian.vp_assist.va;

if ( !va )
return;

- if ( vector < 0x10 )
- return;
-
/*
* If there is already an assist pending then something has gone
* wrong and the VM will most likely hang so force a crash now
* to make the problem clear.
*/
- if ( v->arch.hvm_vcpu.viridian.vp_assist.vector )
+ if ( v->arch.hvm_vcpu.viridian.vp_assist.pending )
domain_crash(v->domain);

- v->arch.hvm_vcpu.viridian.vp_assist.vector = vector;
+ v->arch.hvm_vcpu.viridian.vp_assist.pending = true;
*va |= 1u;
}

-int viridian_complete_apic_assist(struct vcpu *v)
+bool viridian_apic_assist_completed(struct vcpu *v)
{
uint32_t *va = v->arch.hvm_vcpu.viridian.vp_assist.va;
- int vector;

if ( !va )
- return 0;
+ return false;

- if ( *va & 1u )
- return 0; /* Interrupt not yet processed by the guest. */
-
- vector = v->arch.hvm_vcpu.viridian.vp_assist.vector;
- v->arch.hvm_vcpu.viridian.vp_assist.vector = 0;
+ if ( v->arch.hvm_vcpu.viridian.vp_assist.pending &&
+ !(*va & 1u) )
+ {
+ /* An EOI has been avoided */
+ v->arch.hvm_vcpu.viridian.vp_assist.pending = false;
+ return true;
+ }

- return vector;
+ return false;
}

-void viridian_abort_apic_assist(struct vcpu *v)
+void viridian_clear_apic_assist(struct vcpu *v)
{
uint32_t *va = v->arch.hvm_vcpu.viridian.vp_assist.va;

@@ -480,7 +478,7 @@ void viridian_abort_apic_assist(struct vcpu *v)
return;

*va &= ~1u;
- v->arch.hvm_vcpu.viridian.vp_assist.vector = 0;
+ v->arch.hvm_vcpu.viridian.vp_assist.pending = false;
}

static void update_reference_tsc(struct domain *d, bool_t initialize)
@@ -1040,7 +1038,7 @@ static int viridian_save_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
for_each_vcpu( d, v ) {
struct hvm_viridian_vcpu_context ctxt = {
.vp_assist_msr = v->arch.hvm_vcpu.viridian.vp_assist.msr.raw,
- .vp_assist_vector = v->arch.hvm_vcpu.viridian.vp_assist.vector,
+ .vp_assist_pending = v->arch.hvm_vcpu.viridian.vp_assist.pending,
};

if ( hvm_save_entry(VIRIDIAN_VCPU, v->vcpu_id, h, &ctxt) != 0 )
@@ -1075,7 +1073,7 @@ static int viridian_load_vcpu_ctxt(struct domain *d, hvm_domain_context_t *h)
!v->arch.hvm_vcpu.viridian.vp_assist.va )
initialize_vp_assist(v);

- v->arch.hvm_vcpu.viridian.vp_assist.vector = ctxt.vp_assist_vector;
+ v->arch.hvm_vcpu.viridian.vp_assist.pending = !!ctxt.vp_assist_pending;

return 0;
}
diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 50f53bdaef..a6ace4fb21 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -416,18 +416,45 @@ struct vlapic *vlapic_lowest_prio(

void vlapic_EOI_set(struct vlapic *vlapic)
{
- int vector = vlapic_find_highest_isr(vlapic);
+ struct vcpu *v = vlapic_vcpu(vlapic);
+ /*
+ * If APIC assist was set then an EOI may have been avoided and
+ * hence this EOI actually relates to a lower priority vector.
+ * Thus it is necessary to first emulate the EOI for the higher
+ * priority vector and then recurse to handle the lower priority
+ * vector.
+ */
+ bool missed_eoi = viridian_apic_assist_completed(v);
+ int vector;
+
+ again:
+ vector = vlapic_find_highest_isr(vlapic);

/* Some EOI writes may not have a matching to an in-service interrupt. */
if ( vector == -1 )
return;

+ /*
+ * If APIC assist was set but the guest chose to EOI anyway then
+ * we need to clean up state.
+ * NOTE: It is harmless to call viridian_clear_apic_assist() on a
+ * recursion, even though it is not necessary.
+ */
+ if ( !missed_eoi )
+ viridian_clear_apic_assist(v);
+
vlapic_clear_vector(vector, &vlapic->regs->data[APIC_ISR]);

if ( hvm_funcs.handle_eoi )
hvm_funcs.handle_eoi(vector);

vlapic_handle_EOI(vlapic, vector);
+
+ if ( missed_eoi )
+ {
+ missed_eoi = false;
+ goto again;
+ }
}

void vlapic_handle_EOI(struct vlapic *vlapic, u8 vector)
@@ -1239,7 +1266,7 @@ int vlapic_virtual_intr_delivery_enabled(void)
int vlapic_has_pending_irq(struct vcpu *v)
{
struct vlapic *vlapic = vcpu_vlapic(v);
- int irr, vector, isr;
+ int irr, isr;

if ( !vlapic_enabled(vlapic) )
return -1;
@@ -1252,27 +1279,32 @@ int vlapic_has_pending_irq(struct vcpu *v)
!nestedhvm_vcpu_in_guestmode(v) )
return irr;

+ isr = vlapic_find_highest_isr(vlapic);
+
/*
- * If APIC assist was used then there may have been no EOI so
- * we need to clear the requisite bit from the ISR here, before
- * comparing with the IRR.
+ * If APIC assist was set then an EOI may have been avoided.
+ * If so, we need to emulate the EOI here before comparing ISR
+ * with IRR.
*/
- vector = viridian_complete_apic_assist(v);
- if ( vector )
- vlapic_clear_vector(vector, &vlapic->regs->data[APIC_ISR]);
-
- isr = vlapic_find_highest_isr(vlapic);
- if ( isr == -1 )
- return irr;
+ if ( viridian_apic_assist_completed(v) )
+ {
+ vlapic_EOI_set(vlapic);
+ isr = vlapic_find_highest_isr(vlapic);
+ }

/*
- * A vector is pending in the ISR so, regardless of whether the new
- * vector in the IRR is lower or higher in priority, any pending
- * APIC assist must be aborted to ensure an EOI.
+ * The specification says that if APIC assist is set and a
+ * subsequent interrupt of lower priority occurs then APIC assist
+ * needs to be cleared.
*/
- viridian_abort_apic_assist(v);
+ if ( isr >= 0 &&
+ (irr & 0xf0) <= (isr & 0xf0) )
+ {
+ viridian_clear_apic_assist(v);
+ return -1;
+ }

- return ((isr & 0xf0) < (irr & 0xf0)) ? irr : -1;
+ return irr;
}

int vlapic_ack_pending_irq(struct vcpu *v, int vector, bool_t force_ack)
@@ -1290,13 +1322,14 @@ int vlapic_ack_pending_irq(struct vcpu *v, int vector, bool_t force_ack)
goto done;

isr = vlapic_find_highest_isr(vlapic);
- if ( isr == -1 )
+ if ( isr == -1 && vector > 0x10 )
{
/*
- * This vector is edge triggered and no other vectors are pending
- * in the ISR so we can use APIC assist to avoid exiting for EOI.
+ * This vector is edge triggered, not in the legacy range, and no
+ * lower priority vectors are pending in the ISR. Thus we can set
+ * APIC assist to avoid exiting for EOI.
*/
- viridian_start_apic_assist(v, vector);
+ viridian_set_apic_assist(v);
}

done:
diff --git a/xen/include/asm-x86/hvm/viridian.h b/xen/include/asm-x86/hvm/viridian.h
index 30259e91b0..7e3a08c73d 100644
--- a/xen/include/asm-x86/hvm/viridian.h
+++ b/xen/include/asm-x86/hvm/viridian.h
@@ -24,7 +24,7 @@ struct viridian_vcpu
struct {
union viridian_vp_assist msr;
void *va;
- int vector;
+ bool pending;
} vp_assist;
uint64_t crash_param[5];
};
@@ -120,9 +120,9 @@ void viridian_time_ref_count_thaw(struct domain *d);
void viridian_vcpu_deinit(struct vcpu *v);
void viridian_domain_deinit(struct domain *d);

-void viridian_start_apic_assist(struct vcpu *v, int vector);
-int viridian_complete_apic_assist(struct vcpu *v);
-void viridian_abort_apic_assist(struct vcpu *v);
+void viridian_set_apic_assist(struct vcpu *v);
+bool viridian_apic_assist_completed(struct vcpu *v);
+void viridian_clear_apic_assist(struct vcpu *v);

#endif /* __ASM_X86_HVM_VIRIDIAN_H__ */

diff --git a/xen/include/public/arch-x86/hvm/save.h b/xen/include/public/arch-x86/hvm/save.h
index fd7bf3fb38..4691d4d4aa 100644
--- a/xen/include/public/arch-x86/hvm/save.h
+++ b/xen/include/public/arch-x86/hvm/save.h
@@ -600,7 +600,7 @@ DECLARE_HVM_SAVE_TYPE(VIRIDIAN_DOMAIN, 15, struct hvm_viridian_domain_context);

struct hvm_viridian_vcpu_context {
uint64_t vp_assist_msr;
- uint8_t vp_assist_vector;
+ uint8_t vp_assist_pending;
uint8_t _pad[7];
};

--
2.11.0


Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code
>>> On 18.01.18 at 16:10, <paul.durrant@citrix.com> wrote:
> -int viridian_complete_apic_assist(struct vcpu *v)
> +bool viridian_apic_assist_completed(struct vcpu *v)
> {
> uint32_t *va = v->arch.hvm_vcpu.viridian.vp_assist.va;
> - int vector;
>
> if ( !va )
> - return 0;
> + return false;
>
> - if ( *va & 1u )
> - return 0; /* Interrupt not yet processed by the guest. */
> -
> - vector = v->arch.hvm_vcpu.viridian.vp_assist.vector;
> - v->arch.hvm_vcpu.viridian.vp_assist.vector = 0;
> + if ( v->arch.hvm_vcpu.viridian.vp_assist.pending &&
> + !(*va & 1u) )
> + {
> + /* An EOI has been avoided */
> + v->arch.hvm_vcpu.viridian.vp_assist.pending = false;
> + return true;
> + }

Especially with such a non-atomic update, please remind me: Despite
them having struct vcpu * parameters, these functions are all only
ever called with v == current? If the answer is yes, then
Reviewed-by: Jan Beulich <jbeulich@suse.com>
with one more cosmetic remark:

> @@ -120,9 +120,9 @@ void viridian_time_ref_count_thaw(struct domain *d);
> void viridian_vcpu_deinit(struct vcpu *v);
> void viridian_domain_deinit(struct domain *d);
>
> -void viridian_start_apic_assist(struct vcpu *v, int vector);
> -int viridian_complete_apic_assist(struct vcpu *v);
> -void viridian_abort_apic_assist(struct vcpu *v);
> +void viridian_set_apic_assist(struct vcpu *v);
> +bool viridian_apic_assist_completed(struct vcpu *v);
> +void viridian_clear_apic_assist(struct vcpu *v);

Could I talk you into naming them all viridian_apic_assist_...()?

Jan


Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code
> -----Original Message-----
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: 18 January 2018 16:21
> To: Paul Durrant <Paul.Durrant@citrix.com>
> Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>; xen-
> devel@lists.xenproject.org
> Subject: Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code
>
> >>> On 18.01.18 at 16:10, <paul.durrant@citrix.com> wrote:
> > -int viridian_complete_apic_assist(struct vcpu *v)
> > +bool viridian_apic_assist_completed(struct vcpu *v)
> > {
> > uint32_t *va = v->arch.hvm_vcpu.viridian.vp_assist.va;
> > - int vector;
> >
> > if ( !va )
> > - return 0;
> > + return false;
> >
> > - if ( *va & 1u )
> > - return 0; /* Interrupt not yet processed by the guest. */
> > -
> > - vector = v->arch.hvm_vcpu.viridian.vp_assist.vector;
> > - v->arch.hvm_vcpu.viridian.vp_assist.vector = 0;
> > + if ( v->arch.hvm_vcpu.viridian.vp_assist.pending &&
> > + !(*va & 1u) )
> > + {
> > + /* An EOI has been avoided */
> > + v->arch.hvm_vcpu.viridian.vp_assist.pending = false;
> > + return true;
> > + }
>
> Especially with such a non-atomic update, please remind me: Despite
> them having struct vcpu * parameters, these functions are all only
> ever called with v == current? If the answer is yes, then
> Reviewed-by: Jan Beulich <jbeulich@suse.com>

Thanks. The answer is yes. They are only called from the vlapic code for the current vlapic AFAICT.

> with one more cosmetic remark:
>
> > @@ -120,9 +120,9 @@ void viridian_time_ref_count_thaw(struct domain
> *d);
> > void viridian_vcpu_deinit(struct vcpu *v);
> > void viridian_domain_deinit(struct domain *d);
> >
> > -void viridian_start_apic_assist(struct vcpu *v, int vector);
> > -int viridian_complete_apic_assist(struct vcpu *v);
> > -void viridian_abort_apic_assist(struct vcpu *v);
> > +void viridian_set_apic_assist(struct vcpu *v);
> > +bool viridian_apic_assist_completed(struct vcpu *v);
> > +void viridian_clear_apic_assist(struct vcpu *v);
>
> Could I talk you into naming them all viridian_apic_assist_...()?
>

Sure. I'll send a v3 with that fixed and your R-b.

Paul

> Jan


Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code
On Thu, 2018-01-18 at 10:10 -0500, Paul Durrant wrote:
> Lastly the previous code did not properly emulate an EOI if a missed EOI
> was discovered in vlapic_has_pending_irq(); it merely cleared the bit in
> the ISR. The new code instead calls vlapic_EOI_set().

Hm, this *halves* my observed performance running a 32-thread
'diskspd.exe' on a Windows box with attached NVME devices, which makes
me sad.

It's the call to hvm_dpci_msi_eoi() that does it.

Commenting out the call to pt_pirq_iterate() and leaving *just* the
domain-global spinlock bouncing cache lines between all my CPUs, it's
already down to 1.6M IOPS from 2.2M on my test box before it does
*anything* at all.

Calling an *inline* version of pt_pirq_iterate so no retpoline for the
indirect calls, and I'm down to 1.1M even when I've nopped out the
whole of the _hvm_dpci_msi_eoi function that it's calling. Put it all
back, and I'm down to about 1.0M. So it's worse than halved.

And what's all this for? The code here is making my eyes bleed but I
believe it's for unmaskable MSIs, and these aren't unmaskable.

Tempted to make it all go away by having a per-domain bitmap of vectors
for which all this work is actually required, and bypassing the whole
bloody lot in hvm_dpci_msi_eoi() if the corresponding bit in that
bitmap isn't set.

The hackish version of that (which seems to work, but would probably
want testing with an actual unmaskable MSI in the system, and I have
absolutely no confidence I understand what's going on here) looks
something like this:

diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index bab3aa3..24df008 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -24,6 +24,7 @@
 #include <asm/hvm/irq.h>
 #include <asm/hvm/support.h>
 #include <asm/io_apic.h>
+#include <asm/msi.h>
 
 static DEFINE_PER_CPU(struct list_head, dpci_list);
 
@@ -282,6 +283,7 @@ int pt_irq_create_bind(
     struct hvm_pirq_dpci *pirq_dpci;
     struct pirq *info;
     int rc, pirq = pt_irq_bind->machine_irq;
+    irq_desc_t *desc;
 
     if ( pirq < 0 || pirq >= d->nr_pirqs )
         return -EINVAL;
@@ -422,6 +425,13 @@ int pt_irq_create_bind(
 
         dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
         pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
+        BUG_ON(!local_irq_is_enabled());
+        desc = pirq_spin_lock_irq_desc(info, NULL);
+        if ( desc && desc->msi_desc && !msi_maskable_irq(desc->msi_desc) )
+            set_bit(pirq_dpci->gmsi.gvec,
+                    hvm_domain_irq(d)->unmaskable_msi_vecs);
+        spin_unlock_irq(&desc->lock);
+
         spin_unlock(&d->event_lock);
 
         pirq_dpci->gmsi.posted = false;
@@ -869,7 +874,8 @@ static int _hvm_dpci_msi_eoi(struct domain *d,
 
 void hvm_dpci_msi_eoi(struct domain *d, int vector)
 {
-    if ( !iommu_enabled || !hvm_domain_irq(d)->dpci )
+    if ( !iommu_enabled || !hvm_domain_irq(d)->dpci ||
+         !test_bit(vector, hvm_domain_irq(d)->unmaskable_msi_vecs) )
        return;
 
     spin_lock(&d->event_lock);
diff --git a/xen/include/asm-x86/hvm/irq.h b/xen/include/asm-x86/hvm/irq.h
index 8a43cb9..d9d4652 100644
--- a/xen/include/asm-x86/hvm/irq.h
+++ b/xen/include/asm-x86/hvm/irq.h
@@ -78,6 +78,7 @@ struct hvm_irq {
     u8 round_robin_prev_vcpu;
 
     struct hvm_irq_dpci *dpci;
+    DECLARE_BITMAP(unmaskable_msi_vecs, 256);
 
     /*
      * Number of wires asserting each GSI.
Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code
> -----Original Message-----
> From: David Woodhouse [mailto:dwmw2@infradead.org]
> Sent: 25 August 2018 00:38
> To: Paul Durrant <Paul.Durrant@citrix.com>; xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>; Jan Beulich
> <jbeulich@suse.com>; Eslam Elnikety <elnikety@amazon.de>; Shan Haitao
> <haitao.shan@intel.com>
> Subject: Re: [Xen-devel] [PATCH v2] x86/hvm: re-work viridian APIC assist
> code
>
> On Thu, 2018-01-18 at 10:10 -0500, Paul Durrant wrote:
> > Lastly the previous code did not properly emulate an EOI if a missed EOI
> > was discovered in vlapic_has_pending_irq(); it merely cleared the bit in
> > the ISR. The new code instead calls vlapic_EOI_set().
>
> Hm, this *halves* my observed performance running a 32-thread
> 'diskspd.exe' on a Windows box with attached NVME devices, which makes
> me sad.

Yes, that's clearly not what is expected :-(

>
> It's the call to hvm_dpci_msi_eoi() that does it.
>
> Commenting out the call to pt_pirq_iterate() and leaving *just* the
> domain-global spinlock bouncing cache lines between all my CPUs, it's
> already down to 1.6M IOPS from 2.2M on my test box before it does
> *anything* at all.
>
> Calling an *inline* version of pt_pirq_iterate so no retpoline for the
> indirect calls, and I'm down to 1.1M even when I've nopped out the
> whole of the _hvm_dpci_msi_eoi function that it's calling. Put it all
> back, and I'm down to about 1.0M. So it's worse than halved.
>
> And what's all this for? The code here is making my eyes bleed but I
> believe it's for unmaskable MSIs, and these aren't unmaskable.
>

I believe APIC assist is intended for fully synthetic interrupts. Is it definitely this patch that causes the problem? It was only intended to fix previous incorrectness but, if this is the culprit, then it's clearly caused collateral damage in a logically unrelated area.

Paul

> Tempted to make it all go away by having a per-domain bitmap of vectors
> for which all this work is actually required, and bypassing the whole
> bloody lot in hvm_dpci_msi_eoi() if the corresponding bit in that
> bitmap isn't set.
>
> The hackish version of that (which seems to work, but would probably
> want testing with an actual unmaskable MSI in the system, and I have
> absolutely no confidence I understand what's going on here) looks
> something like this:
>
> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index bab3aa3..24df008 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -24,6 +24,7 @@
>  #include <asm/hvm/irq.h>
>  #include <asm/hvm/support.h>
>  #include <asm/io_apic.h>
> +#include <asm/msi.h>
>
>  static DEFINE_PER_CPU(struct list_head, dpci_list);
>
> @@ -282,6 +283,7 @@ int pt_irq_create_bind(
>      struct hvm_pirq_dpci *pirq_dpci;
>      struct pirq *info;
>      int rc, pirq = pt_irq_bind->machine_irq;
> +    irq_desc_t *desc;
>
>      if ( pirq < 0 || pirq >= d->nr_pirqs )
>          return -EINVAL;
> @@ -422,6 +425,13 @@ int pt_irq_create_bind(
>
>          dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
>          pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
> + BUG_ON(!local_irq_is_enabled());
> +        desc = pirq_spin_lock_irq_desc(info, NULL);
> +        if ( desc && desc->msi_desc && !msi_maskable_irq(desc->msi_desc) )
> +            set_bit(pirq_dpci->gmsi.gvec,
> +                    hvm_domain_irq(d)->unmaskable_msi_vecs);
> +        spin_unlock_irq(&desc->lock);
> +
>          spin_unlock(&d->event_lock);
>
>          pirq_dpci->gmsi.posted = false;
> @@ -869,7 +874,8 @@ static int _hvm_dpci_msi_eoi(struct domain *d,
>
>  void hvm_dpci_msi_eoi(struct domain *d, int vector)
>  {
> -    if ( !iommu_enabled || !hvm_domain_irq(d)->dpci )
> +    if ( !iommu_enabled || !hvm_domain_irq(d)->dpci ||
> +  !test_bit(vector, hvm_domain_irq(d)->unmaskable_msi_vecs) )
>         return;
>
>      spin_lock(&d->event_lock);
> diff --git a/xen/include/asm-x86/hvm/irq.h b/xen/include/asm-
> x86/hvm/irq.h
> index 8a43cb9..d9d4652 100644
> --- a/xen/include/asm-x86/hvm/irq.h
> +++ b/xen/include/asm-x86/hvm/irq.h
> @@ -78,6 +78,7 @@ struct hvm_irq {
>      u8 round_robin_prev_vcpu;
>
>      struct hvm_irq_dpci *dpci;
> +    DECLARE_BITMAP(unmaskable_msi_vecs, 256);
>
>      /*
>       * Number of wires asserting each GSI.
Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code
On Mon, 2018-09-03 at 10:12 +0000, Paul Durrant wrote:
>
> I believe APIC assist is intended for fully synthetic interrupts.

Hm, if by 'fully synthetic interrupts' you mean
vlapic_virtual_intr_delivery_enabled(), then no, I think APIC assist
doesn't get used in that case at all.

> Is it definitely this patch that causes the problem? It was only
> intended to fix previous incorrectness but, if this is the culprit,
> then it's clearly caused collateral damage in a logically unrelated
> area.

Not entirely. The performance gain we observed with APIC assist in the
first place was basically stolen. It wasn't just bypassing the vmexit
for that EOI; it was *so* much faster because it actually didn't ever
do the EOI properly at all.

You fixed that omission and unsurprisingly it got slower again; most of
the apparent benefit of APIC assist is lost. But that's because it was
never really doing the right thing in the first place.

That EOI handling for unmaskable MSI is really painfully slow, so my 
hack bypasses it in the common case where it isn't really necessary.
FWIW I've done it in my tree with a single per-domain flag rather than
a per-vector bitmap now, which makes it slightly simpler.
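
For reference, a minimal sketch of that per-domain variant, re-using the
hook points of the bitmap patch earlier in the thread (the flag name is
invented):

/* In struct hvm_irq, replacing the 256-bit vector bitmap: */
bool has_unmaskable_msi;

/* In pt_irq_create_bind(), where the bitmap patch set the bit: */
if ( desc && desc->msi_desc && !msi_maskable_irq(desc->msi_desc) )
    hvm_domain_irq(d)->has_unmaskable_msi = true; /* set once, never cleared */

/* In hvm_dpci_msi_eoi(), skipping the lock and iteration entirely: */
if ( !iommu_enabled || !hvm_domain_irq(d)->dpci ||
     !hvm_domain_irq(d)->has_unmaskable_msi )
    return;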
Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code
> -----Original Message-----
> From: David Woodhouse [mailto:dwmw2@infradead.org]
> Sent: 04 September 2018 21:31
> To: Paul Durrant <Paul.Durrant@citrix.com>; xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>; Jan Beulich
> <jbeulich@suse.com>; Eslam Elnikety <elnikety@amazon.de>; Shan Haitao
> <haitao.shan@intel.com>
> Subject: Re: [Xen-devel] [PATCH v2] x86/hvm: re-work viridian APIC assist
> code
>
> On Mon, 2018-09-03 at 10:12 +0000, Paul Durrant wrote:
> >
> > I believe APIC assist is intended for fully synthetic interrupts.
>
> Hm, if by 'fully synthetic interrupts' you mean
> vlapic_virtual_intr_delivery_enabled(), then no I think APIC assist
> doesn't get used in that case at all.

No, what I meant was that it was my belief that Windows would avoid APIC assist for what it sees as devices on a hardware bus. But - now I think about it - that would not be wonderfully useful behaviour, and it would certainly squash the reason I put the support in anyway, which is the edge-triggered per-CPU event channel upcalls from Xen (since the driver registering the interrupts is bound to the Xen platform PCI device).

>
> > Is it definitely this patch that causes the problem? It was only
> > intended to fix previous incorrectness but, if this is the culprit,
> > then it's clearly caused collateral damage in a logically unrelated
> > area.
>
> Not entirely. The performance gain we observed with APIC assist in the
> first place was basically stolen. It wasn't just bypassing the vmexit
> for that EOI; it was *so* much faster because it actually didn't ever
> do the EOI properly at all.
>
> You fixed that omission and unsurprisingly it got slower again; most of
> the apparent benefit of APIC assist is lost. But that's because it was
> never really doing the right thing in the first place.
>
> That EOI handling for unmaskable MSI is really painfully slow, so my
> hack bypasses it in the common case where it isn't really necessary.
> FWIW I've done it in my tree with a single per-domain flag rather than
> a per-vector bitmap now, which makes it slightly simpler.

I see. Given that Windows has used APIC assist to circumvent its EOI, I wonder whether we can get away with essentially doing the same. I.e. for a completed APIC assist found in vlapic_has_pending_irq() we simply clear the APIC assist and highest vector in the ISR, rather than calling through to vlapic_EOI_set() and suffering the overhead. I'll spin up a patch and give it a whirl.
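
A rough sketch of that shortcut in vlapic_has_pending_irq(), re-using the
helpers from the v2 patch (not the patch that was eventually posted):

if ( viridian_apic_assist_completed(v) )
{
    int missed = vlapic_find_highest_isr(vlapic);

    /*
     * Emulate only the ISR side effect of the avoided EOI, skipping
     * vlapic_EOI_set() and hence the hvm_dpci_msi_eoi() overhead.
     */
    if ( missed != -1 )
        vlapic_clear_vector(missed, &vlapic->regs->data[APIC_ISR]);
}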

Paul
Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code
On Wed, 2018-09-05 at 09:36 +0000, Paul Durrant wrote:
>
> I see. Given that Windows has used APIC assist to circumvent its EOI
> then I wonder whether we can get away with essentially doing the
> same. I.e. for a completed APIC assist found in
> vlapic_has_pending_irq() we simply clear the APIC assist and highest
> vector in the ISR, rather than calling through to vlapic_EOI_set()
> and suffering the overhead. I'll spin up a patch and give it a whirl.

I think that's fine if you don't actually pass unmaskable MSIs through
to the guest in question, but if you *do* then you still need the EOI
to happen properly to "unmask" it.

Hence my approach which is basically doing what you said and bypassing
the expensive part of hvm_dpci_msi_eoi(), but *only* if there's no
unmaskable MSI to worry about.
Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code
> -----Original Message-----
> From: David Woodhouse [mailto:dwmw2@infradead.org]
> Sent: 05 September 2018 10:40
> To: Paul Durrant <Paul.Durrant@citrix.com>; xen-devel@lists.xenproject.org
> Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>; Jan Beulich
> <jbeulich@suse.com>; Eslam Elnikety <elnikety@amazon.de>; Shan Haitao
> <haitao.shan@intel.com>
> Subject: Re: [Xen-devel] [PATCH v2] x86/hvm: re-work viridian APIC assist
> code
>
> On Wed, 2018-09-05 at 09:36 +0000, Paul Durrant wrote:
> >
> > I see. Given that Windows has used APIC assist to circumvent its EOI
> > then I wonder whether we can get away with essentially doing the
> > same. I.e. for a completed APIC assist found in
> > vlapic_has_pending_irq() we simply clear the APIC assist and highest
> > vector in the ISR, rather than calling through to vlapic_EOI_set()
> > and suffering the overhead. I'll spin up a patch and give it a whirl.
>
> I think that's fine if you don't actually pass unmaskable MSIs through
> to the guest in question, but if you *do* then you still need the EOI
> to happen properly to "unmask" it.
>
> Hence my approach which is basically doing what you said and bypassing
> the expensive part of hvm_dpci_msi_eoi(), but *only* if there's no
> unmaskable MSI to worry about.

Yeah, I'm kind of trusting Windows to only use APIC assist in the case where it is sure there is no need to EOI. Perhaps your approach is safer.

Paul
Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code
> -----Original Message-----
> From: Xen-devel [mailto:xen-devel-bounces@lists.xenproject.org] On Behalf
> Of Paul Durrant
> Sent: 05 September 2018 10:43
> To: 'David Woodhouse' <dwmw2@infradead.org>; xen-
> devel@lists.xenproject.org
> Cc: Eslam Elnikety <elnikety@amazon.de>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>; Shan Haitao <haitao.shan@intel.com>; Jan
> Beulich <jbeulich@suse.com>
> Subject: Re: [Xen-devel] [PATCH v2] x86/hvm: re-work viridian APIC assist
> code
>
> > -----Original Message-----
> > From: David Woodhouse [mailto:dwmw2@infradead.org]
> > Sent: 05 September 2018 10:40
> > To: Paul Durrant <Paul.Durrant@citrix.com>; xen-
> devel@lists.xenproject.org
> > Cc: Andrew Cooper <Andrew.Cooper3@citrix.com>; Jan Beulich
> > <jbeulich@suse.com>; Eslam Elnikety <elnikety@amazon.de>; Shan Haitao
> > <haitao.shan@intel.com>
> > Subject: Re: [Xen-devel] [PATCH v2] x86/hvm: re-work viridian APIC assist
> > code
> >
> > On Wed, 2018-09-05 at 09:36 +0000, Paul Durrant wrote:
> > >
> > > I see. Given that Windows has used APIC assist to circumvent its EOI
> > > then I wonder whether we can get away with essentially doing the
> > > same. I.e. for a completed APIC assist found in
> > > vlapic_has_pending_irq() we simply clear the APIC assist and highest
> > > vector in the ISR, rather than calling through to vlapic_EOI_set()
> > > and suffering the overhead. I'll spin up a patch and give it a whirl.
> >
> > I think that's fine if you don't actually pass unmaskable MSIs through
> > to the guest in question, but if you *do* then you still need the EOI
> > to happen properly to "unmask" it.
> >
> > Hence my approach which is basically doing what you said and bypassing
> > the expensive part of hvm_dpci_msi_eoi(), but *only* if there's no
> > unmaskable MSI to worry about.
>
> Yeah, I'm kind of trusting Windows to only use APIC assist in the case where
> it is sure there is no need to EOI. Perhaps your approach is safer.
>

Actually the neatest approach would be to get information into the vlapic code as to whether APIC assist is suitable for the given vector, so that the code there can selectively enable it; Xen would then know it was safe to avoid fully emulating an EOI for anything that did have assist enabled.
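
Hypothetically (vector_needs_full_eoi() is an invented name, purely to
illustrate the shape), vlapic_ack_pending_irq() might then read:

isr = vlapic_find_highest_isr(vlapic);
if ( isr == -1 && vector > 0x10 &&
     !vector_needs_full_eoi(v->domain, vector) ) /* hypothetical check */
{
    /* Only vectors known not to need full EOI emulation get assist. */
    viridian_set_apic_assist(v);
}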

Paul

> Paul
Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code
On Wed, 2018-09-05 at 10:40 +0000, Paul Durrant wrote:
>
> Actually the neatest approach would be to get information into the
> vlapic code as to whether APIC assist is suitable for the given
> vector so that the code there can selectively enable it, and then Xen
> would know it was safe to avoid fully emulating an EOI for anything
> that did have assist enabled.

I'm not sure I understand why an assisted EOI should be any different
from "normal" EOI.

The global lock and indirect function calls and all that stuff that
hvm_dpci_msi_eoi() does are *expensive*, for a rare case.

Why not bypass all that when it's not needed, for "normal" EOI and
APIC-assisted EOI alike? Why have a distinction between the two?
Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code
> -----Original Message-----
> From: David Woodhouse [mailto:dwmw2@infradead.org]
> Sent: 05 September 2018 11:45
> To: Paul Durrant <Paul.Durrant@citrix.com>; xen-devel@lists.xenproject.org
> Cc: Eslam Elnikety <elnikety@amazon.de>; Andrew Cooper
> <Andrew.Cooper3@citrix.com>; Shan Haitao <haitao.shan@intel.com>; Jan
> Beulich <jbeulich@suse.com>
> Subject: Re: [Xen-devel] [PATCH v2] x86/hvm: re-work viridian APIC assist
> code
>
> On Wed, 2018-09-05 at 10:40 +0000, Paul Durrant wrote:
> >
> > Actually the neatest approach would be to get information into the
> > vlapic code as to whether APIC assist is suitable for the given
> > vector so that the code there can selectively enable it, and then Xen
> > would know it was safe to avoid fully emulating an EOI for anything
> > that did have assist enabled.
>
> I'm not sure I understand why an assisted EOI should be any different
> from "normal" EOI.
>
> The global lock and indirect function calls and all that stuff that
> hvm_dpci_msi_eoi() does are *expensive*, for a rare case.
>
> Why not bypass all that when it's not needed, for "normal" EOI and
> APIC-assisted EOI alike? Why have a distinction between the two?

True. I guess that's more globally applicable... i.e. it would help guests not aware of the enlightenment, or those with it disabled.

Paul
Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code
Resending this straw man patch at Roger's request, to restart discussion.

Redux: In order to cope with the relatively rare case of unmaskable
legacy MSIs, each vlapic EOI takes a domain-global spinlock just to
iterate over all IRQs and determine that there's actually nothing to
do.
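
Condensed from the code quoted later in this thread, the hot path hit on
every guest EOI is:

void hvm_dpci_msi_eoi(struct domain *d, int vector)
{
    if ( !iommu_enabled || !hvm_domain_irq(d)->dpci )
        return;

    /* Domain-global lock, bounced between all vCPUs on every EOI... */
    spin_lock(&d->event_lock);
    /* ...just to walk every bound pirq and, usually, find nothing. */
    pt_pirq_iterate(d, _hvm_dpci_msi_eoi, (void *)(long)vector);
    spin_unlock(&d->event_lock);
}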

In my testing, I observe that this drops Windows performance on passed-
through NVMe from 2.2M IOPS down to about 1.0M IOPS.

I have a variant of this patch which just has a single per-domain "I
attached legacy unmaskable MSIs" flag, which is never cleared. The
patch below is per-vector (but Roger points out it should be per-vCPU
per-vector). I don't know that we really care enough to do more than
the single per-domain flag, which in real life would never happen
anyway unless you have crappy hardware, at which point you get back to
today's status quo.

My main concern is that this code is fairly sparsely documented and I'm
only 99% sure that this code path really *is* only for unmaskable MSIs,
and doesn't have some other esoteric use. A second opinion on that
would be particularly welcome.


(NB: APIC assist was a red herring here, except that it had a bug which
stopped the gratuitous EOI work from ever happening at all — which
nobody ever cared about since it doesn't matter on sane hardware, but
then when Paul *fixed* it, we saw it as a performance regression.)


On Sat, 2018-08-25 at 00:38 +0100, David Woodhouse wrote:
> On Thu, 2018-01-18 at 10:10 -0500, Paul Durrant wrote:
> > Lastly the previous code did not properly emulate an EOI if a missed EOI
> > was discovered in vlapic_has_pending_irq(); it merely cleared the bit in
> > the ISR. The new code instead calls vlapic_EOI_set().
>
> Hm, this *halves* my observed performance running a 32-thread
> 'diskspd.exe' on a Windows box with attached NVME devices, which makes
> me sad.
>
> It's the call to hvm_dpci_msi_eoi() that does it.
>
> Commenting out the call to pt_pirq_iterate() and leaving *just* the
> domain-global spinlock bouncing cache lines between all my CPUs, it's
> already down to 1.6M IOPS from 2.2M on my test box before it does
> *anything* at all.
>
> Calling an *inline* version of pt_pirq_iterate so no retpoline for the
> indirect calls, and I'm down to 1.1M even when I've nopped out the
> whole of the _hvm_dpci_msi_eoi function that it's calling. Put it all
> back, and I'm down to about 1.0M. So it's worse than halved.
>
> And what's all this for? The code here is making my eyes bleed but I
> believe it's for unmaskable MSIs, and these aren't unmaskable.
>
> Tempted to make it all go away by having a per-domain bitmap of vectors
> for which all this work is actually required, and bypassing the whole
> bloody lot in hvm_dpci_msi_eoi() if the corresponding bit in that
> bitmap isn't set.
>
> The hackish version of that (which seems to work, but would probably
> want testing with an actual unmaskable MSI in the system, and I have
> absolutely no confidence I understand what's going on here) looks
> something like this:
>
> diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> index bab3aa3..24df008 100644
> --- a/xen/drivers/passthrough/io.c
> +++ b/xen/drivers/passthrough/io.c
> @@ -24,6 +24,7 @@
> #include <asm/hvm/irq.h>
> #include <asm/hvm/support.h>
> #include <asm/io_apic.h>
> +#include <asm/msi.h>
>
> static DEFINE_PER_CPU(struct list_head, dpci_list);
>
> @@ -282,6 +283,7 @@ int pt_irq_create_bind(
> struct hvm_pirq_dpci *pirq_dpci;
> struct pirq *info;
> int rc, pirq = pt_irq_bind->machine_irq;
> + irq_desc_t *desc;
>
> if ( pirq < 0 || pirq >= d->nr_pirqs )
> return -EINVAL;
> @@ -422,6 +425,13 @@ int pt_irq_create_bind(
>
> dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
> pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
> + BUG_ON(!local_irq_is_enabled());
> + desc = pirq_spin_lock_irq_desc(info, NULL);
> > +        if ( desc && desc->msi_desc && !msi_maskable_irq(desc->msi_desc) )
> + set_bit(pirq_dpci->gmsi.gvec,
> + hvm_domain_irq(d)->unmaskable_msi_vecs);
> + spin_unlock_irq(&desc->lock);
> +
> spin_unlock(&d->event_lock);
>
> pirq_dpci->gmsi.posted = false;
> @@ -869,7 +874,8 @@ static int _hvm_dpci_msi_eoi(struct domain *d,
>
> void hvm_dpci_msi_eoi(struct domain *d, int vector)
> {
> - if ( !iommu_enabled || !hvm_domain_irq(d)->dpci )
> + if ( !iommu_enabled || !hvm_domain_irq(d)->dpci ||
> + !test_bit(vector, hvm_domain_irq(d)->unmaskable_msi_vecs) )
> return;
>
> spin_lock(&d->event_lock);
> diff --git a/xen/include/asm-x86/hvm/irq.h b/xen/include/asm-
> x86/hvm/irq.h
> index 8a43cb9..d9d4652 100644
> --- a/xen/include/asm-x86/hvm/irq.h
> +++ b/xen/include/asm-x86/hvm/irq.h
> @@ -78,6 +78,7 @@ struct hvm_irq {
> u8 round_robin_prev_vcpu;
>
> struct hvm_irq_dpci *dpci;
> + DECLARE_BITMAP(unmaskable_msi_vecs, 256);
>
> /*
> * Number of wires asserting each GSI.
Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code
Sorry for not replying earlier, wanted to finish something first.

On Tue, Aug 11, 2020 at 02:25:16PM +0100, David Woodhouse wrote:
> Resending this straw man patch at Roger's request, to restart discussion.
>
> Redux: In order to cope with the relatively rare case of unmaskable
> legacy MSIs, each vlapic EOI takes a domain-global spinlock just to
> iterate over all IRQs and determine that there's actually nothing to
> do.
>
> In my testing, I observe that this drops Windows performance on passed-
> through NVMe from 2.2M IOPS down to about 1.0M IOPS.
>
> I have a variant of this patch which just has a single per-domain "I
> attached legacy unmaskable MSIs" flag, which is never cleared. The
> patch below is per-vector (but Roger points out it should be per-vCPU
> per-vector). I don't know that we really care enough to do more than
> the single per-domain flag, which in real life would never happen
> anyway unless you have crappy hardware, at which point you get back to
> today's status quo.

I've just posted a series [0] that should allow setting of EOI
callbacks for lapic vectors, which I think could be relevant for the
use-case here. Note the series itself doesn't change the current
behavior much, as it will still register a callback for each injected
MSI, regardless of whether it's maskable or non-maskable.

I think you could easily make a patch on top of my series that
prevents registering the EOI callback for maskable MSIs, and it would
avoid having to add a per-domain flag or anything like that.

> My main concern is that this code is fairly sparsely documented and I'm
> only 99% sure that this code path really *is* only for unmaskable MSIs,
> and doesn't have some other esoteric use. A second opinion on that
> would be particularly welcome.

That's also my reading: maskable MSIs will have ack_type set to
ACKTYPE_NONE according to irq_acktype, which will then prevent
do_IRQ_guest from setting pirq->masked and thus turn desc_guest_eoi
into a noop.

I assume this is because maskable MSIs can always be masked by Xen if
required, so there's no reason to hold them pending in the lapic. OTOH
Xen has no way to prevent non-maskable MSIs except by keeping the
vector ISR bit set.
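
Condensed into pseudocode, the flow described above looks roughly like
this (a sketch, not the exact Xen functions; the ACKTYPE_EOI fallback is
an assumption):

/* At bind time (irq_acktype): Xen can mask a maskable MSI itself. */
ack_type = msi_maskable_irq(desc->msi_desc) ? ACKTYPE_NONE
                                            : ACKTYPE_EOI; /* assumed */

/* At injection (do_IRQ_guest): only record that an ack is owed. */
if ( ack_type != ACKTYPE_NONE )
    pirq->masked = 1;

/* At guest EOI (desc_guest_eoi): nothing owed, nothing to do. */
if ( !pirq->masked )
    return;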

[0] https://lore.kernel.org/xen-devel/20200812124709.4165-1-roger.pau@citrix.com/T/#mb300899b0d95d5b6e78ca8caea21482d91b1cea8

Roger.
RE: [PATCH v2] x86/hvm: re-work viridian APIC assist code
> -----Original Message-----
> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of David Woodhouse
> Sent: 11 August 2020 14:25
> To: Paul Durrant <paul.durrant@citrix.com>; xen-devel@lists.xenproject.org; Roger Pau Monne
> <roger.pau@citrix.com>
> Cc: Eslam Elnikety <elnikety@amazon.de>; Andrew Cooper <andrew.cooper3@citrix.com>; Shan Haitao
> <haitao.shan@intel.com>; Jan Beulich <jbeulich@suse.com>
> Subject: Re: [Xen-devel] [PATCH v2] x86/hvm: re-work viridian APIC assist code
>
> Resending this straw man patch at Roger's request, to restart discussion.
>
> Redux: In order to cope with the relatively rare case of unmaskable
> legacy MSIs, each vlapic EOI takes a domain-global spinlock just to
> iterate over all IRQs and determine that there's actually nothing to
> do.
>
> In my testing, I observe that this drops Windows performance on passed-
> through NVMe from 2.2M IOPS down to about 1.0M IOPS.
>
> I have a variant of this patch which just has a single per-domain "I
> attached legacy unmaskable MSIs" flag, which is never cleared. The
> patch below is per-vector (but Roger points out it should be per-vCPU
> per-vector). I don't know that we really care enough to do more than
> the single per-domain flag, which in real life would never happen
> anyway unless you have crappy hardware, at which point you get back to
> today's status quo.
>
> My main concern is that this code is fairly sparsely documented and I'm
> only 99% sure that this code path really *is* only for unmaskable MSIs,
> and doesn't have some other esoteric use. A second opinion on that
> would be particularly welcome.
>

The loop appears to be there to handle the case where multiple devices assigned to a domain have MSIs programmed with the same dest/vector... which seems like an odd thing for a guest to do but I guess it is at liberty to do it. Does it matter whether they are maskable or not?

Paul

>
> (NB: APIC assist was a red herring here, except that it had a bug which
> stopped the gratuitous EOI work from ever happening at all — which
> nobody ever cared about since it doesn't matter on sane hardware, but
> then when Paul *fixed* it, we saw it as a performance regression.)
>
>
> On Sat, 2018-08-25 at 00:38 +0100, David Woodhouse wrote:
> > On Thu, 2018-01-18 at 10:10 -0500, Paul Durrant wrote:
> > > Lastly the previous code did not properly emulate an EOI if a missed EOI
> > > was discovered in vlapic_has_pending_irq(); it merely cleared the bit in
> > > the ISR. The new code instead calls vlapic_EOI_set().
> >
> > Hm, this *halves* my observed performance running a 32-thread
> > 'diskspd.exe' on a Windows box with attached NVME devices, which makes
> > me sad.
> >
> > It's the call to hvm_dpci_msi_eoi() that does it.
> >
> > Commenting out the call to pt_pirq_iterate() and leaving *just* the
> > domain-global spinlock bouncing cache lines between all my CPUs, it's
> > already down to 1.6M IOPS from 2.2M on my test box before it does
> > *anything* at all.
> >
> > Calling an *inline* version of pt_pirq_iterate so no retpoline for the
> > indirect calls, and I'm down to 1.1M even when I've nopped out the
> > whole of the _hvm_dpci_msi_eoi function that it's calling. Put it all
> > back, and I'm down to about 1.0M. So it's worse than halved.
> >
> > And what's all this for? The code here is making my eyes bleed but I
> > believe it's for unmaskable MSIs, and these aren't unmaskable.
> >
> > Tempted to make it all go away by having a per-domain bitmap of vectors
> > for which all this work is actually required, and bypassing the whole
> > bloody lot in hvm_dpci_msi_eoi() if the corresponding bit in that
> > bitmap isn't set.
> >
> > The hackish version of that (which seems to work, but would probably
> > want testing with an actual unmaskable MSI in the system, and I have
> > absolutely no confidence I understand what's going on here) looks
> > something like this:
> >
> > diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
> > index bab3aa3..24df008 100644
> > --- a/xen/drivers/passthrough/io.c
> > +++ b/xen/drivers/passthrough/io.c
> > @@ -24,6 +24,7 @@
> > #include <asm/hvm/irq.h>
> > #include <asm/hvm/support.h>
> > #include <asm/io_apic.h>
> > +#include <asm/msi.h>
> >
> > static DEFINE_PER_CPU(struct list_head, dpci_list);
> >
> > @@ -282,6 +283,7 @@ int pt_irq_create_bind(
> > struct hvm_pirq_dpci *pirq_dpci;
> > struct pirq *info;
> > int rc, pirq = pt_irq_bind->machine_irq;
> > + irq_desc_t *desc;
> >
> > if ( pirq < 0 || pirq >= d->nr_pirqs )
> > return -EINVAL;
> > @@ -422,6 +425,13 @@ int pt_irq_create_bind(
> >
> > dest_vcpu_id = hvm_girq_dest_2_vcpu_id(d, dest, dest_mode);
> > pirq_dpci->gmsi.dest_vcpu_id = dest_vcpu_id;
> > + BUG_ON(!local_irq_is_enabled());
> > + desc = pirq_spin_lock_irq_desc(info, NULL);
> > +        if ( desc && desc->msi_desc && !msi_maskable_irq(desc->msi_desc) )
> > + set_bit(pirq_dpci->gmsi.gvec,
> > + hvm_domain_irq(d)->unmaskable_msi_vecs);
> > + spin_unlock_irq(&desc->lock);
> > +
> > spin_unlock(&d->event_lock);
> >
> > pirq_dpci->gmsi.posted = false;
> > @@ -869,7 +874,8 @@ static int _hvm_dpci_msi_eoi(struct domain *d,
> >
> > void hvm_dpci_msi_eoi(struct domain *d, int vector)
> > {
> > - if ( !iommu_enabled || !hvm_domain_irq(d)->dpci )
> > + if ( !iommu_enabled || !hvm_domain_irq(d)->dpci ||
> > + !test_bit(vector, hvm_domain_irq(d)->unmaskable_msi_vecs) )
> > return;
> >
> > spin_lock(&d->event_lock);
> > diff --git a/xen/include/asm-x86/hvm/irq.h b/xen/include/asm-
> > x86/hvm/irq.h
> > index 8a43cb9..d9d4652 100644
> > --- a/xen/include/asm-x86/hvm/irq.h
> > +++ b/xen/include/asm-x86/hvm/irq.h
> > @@ -78,6 +78,7 @@ struct hvm_irq {
> > u8 round_robin_prev_vcpu;
> >
> > struct hvm_irq_dpci *dpci;
> > + DECLARE_BITMAP(unmaskable_msi_vecs, 256);
> >
> > /*
> > * Number of wires asserting each GSI.
Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code
On Thu, Aug 13, 2020 at 09:10:31AM +0100, Paul Durrant wrote:
> > -----Original Message-----
> > From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of David Woodhouse
> > Sent: 11 August 2020 14:25
> > To: Paul Durrant <paul.durrant@citrix.com>; xen-devel@lists.xenproject.org; Roger Pau Monne
> > <roger.pau@citrix.com>
> > Cc: Eslam Elnikety <elnikety@amazon.de>; Andrew Cooper <andrew.cooper3@citrix.com>; Shan Haitao
> > <haitao.shan@intel.com>; Jan Beulich <jbeulich@suse.com>
> > Subject: Re: [Xen-devel] [PATCH v2] x86/hvm: re-work viridian APIC assist code
> >
> > Resending this straw man patch at Roger's request, to restart discussion.
> >
> > Redux: In order to cope with the relatively rare case of unmaskable
> > legacy MSIs, each vlapic EOI takes a domain-global spinlock just to
> > iterate over all IRQs and determine that there's actually nothing to
> > do.
> >
> > In my testing, I observe that this drops Windows performance on passed-
> > through NVMe from 2.2M IOPS down to about 1.0M IOPS.
> >
> > I have a variant of this patch which just has a single per-domain "I
> > attached legacy unmaskable MSIs" flag, which is never cleared. The
> > patch below is per-vector (but Roger points out it should be per-vCPU
> > per-vector). I don't know that we really care enough to do more than
> > the single per-domain flag, which in real life would never happen
> > anyway unless you have crappy hardware, at which point you get back to
> > today's status quo.
> >
> > My main concern is that this code is fairly sparsely documented and I'm
> > only 99% sure that this code path really *is* only for unmaskable MSIs,
> > and doesn't have some other esoteric use. A second opinion on that
> > would be particularly welcome.
> >
>
> The loop appears to be there to handle the case where multiple
> devices assigned to a domain have MSIs programmed with the same
> dest/vector... which seems like an odd thing for a guest to do but I
> guess it is at liberty to do it. Does it matter whether they are
> maskable or not?

Such a configuration would never work properly, as lapic vectors are
edge triggered and thus can't be safely shared between devices?

I think the iteration is there in order to get the hvm_pirq_dpci
struct that injected that specific vector, so that you can perform the
ack if required. Having lapic EOI callbacks should simplify this, as you
can pass a hvm_pirq_dpci when injecting a vector, and that would be
forwarded to the EOI callback, so there should be no need to iterate
over the list of hvm_pirq_dpci for a domain.

Roger.
Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code
On Thu, 2020-08-13 at 11:45 +0200, Roger Pau Monné wrote:
> > The loop appears to be there to handle the case where multiple
> > devices assigned to a domain have MSIs programmed with the same
> > dest/vector... which seems like an odd thing for a guest to do but I
> > guess it is at liberty to do it. Does it matter whether they are
> > maskable or not?
>
> Such configuration would never work properly, as lapic vectors are
> edge triggered and thus can't be safely shared between devices?
>
> I think the iteration is there in order to get the hvm_pirq_dpci
> struct that injected that specific vector, so that you can perform the
> ack if required. Having lapic EOI callbacks should simply this, as you
> can pass a hvm_pirq_dpci when injecting a vector, and that would be
> forwarded to the EOI callback, so there should be no need to iterate
> over the list of hvm_pirq_dpci for a domain.

If we didn't have the loop — or more to the point if we didn't grab the
domain-global d->event_lock that protects it — then I wouldn't even
care about optimising the whole thing away for the modern MSI case.

It isn't the act of not doing any work in the _hvm_dpci_msi_eoi()
function that takes the time. It's that domain-global lock, and a
little bit the retpoline-stalled indirect call from pt_pirq_iterate().

I suppose with Roger's series, we'll still suffer the retpoline stall
for a callback that ultimately does nothing, but it's nowhere near as
expensive as the lock.
Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code [ In reply to ]
On Fri, Aug 14, 2020 at 03:13:51PM +0100, David Woodhouse wrote:
> On Thu, 2020-08-13 at 11:45 +0200, Roger Pau Monné wrote:
> > > The loop appears to be there to handle the case where multiple
> > > devices assigned to a domain have MSIs programmed with the same
> > > dest/vector... which seems like an odd thing for a guest to do but I
> > > guess it is at liberty to do it. Does it matter whether they are
> > > maskable or not?
> >
> > Such configuration would never work properly, as lapic vectors are
> > edge triggered and thus can't be safely shared between devices?
> >
> > I think the iteration is there in order to get the hvm_pirq_dpci
> > struct that injected that specific vector, so that you can perform the
> > ack if required. Having lapic EOI callbacks should simply this, as you
> > can pass a hvm_pirq_dpci when injecting a vector, and that would be
> > forwarded to the EOI callback, so there should be no need to iterate
> > over the list of hvm_pirq_dpci for a domain.
>
> If we didn't have the loop — or more to the point if we didn't grab the
> domain-global d->event_lock that protects it — then I wouldn't even
> care about optimising the whole thing away for the modern MSI case.
>
> It isn't the act of not doing any work in the _hvm_dpci_msi_eoi()
> function that takes the time. It's that domain-global lock, and a
> little bit the retpoline-stalled indirect call from pt_pirq_interate().
>
> I suppose with Roger's series, we'll still suffer the retpoline stall
> for a callback that ultimately does nothing, but it's nowhere near as
> expensive as the lock.

I think we could ultimately avoid the callback (and the vmexit when
running on Intel with virtual interrupt delivery) by not registering
any callback when injecting a vector that originated from a source
that doesn't require any ack. The diff below should be applied on top
of my series, and I think it removes the execution of a callback when
there's no ack to perform. It's still pretty much a proof of concept
and could do with some further cleanup.

---8<---
diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index e192c4c6da..483c69deb3 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -110,7 +110,7 @@ int vmsi_deliver(struct domain *d, int vector, uint8_t dest, uint8_t dest_mode,
                           trig_mode, NULL, NULL);
 }
 
-void vmsi_deliver_pirq(struct domain *d, const struct hvm_pirq_dpci *pirq_dpci)
+void vmsi_deliver_pirq(struct domain *d, struct hvm_pirq_dpci *pirq_dpci)
 {
     uint32_t flags = pirq_dpci->gmsi.gflags;
     int vector = pirq_dpci->gmsi.gvec;
@@ -118,6 +118,7 @@ void vmsi_deliver_pirq(struct domain *d, const struct hvm_pirq_dpci *pirq_dpci)
     bool dest_mode = flags & XEN_DOMCTL_VMSI_X86_DM_MASK;
     uint8_t delivery_mode = MASK_EXTR(flags, XEN_DOMCTL_VMSI_X86_DELIV_MASK);
     bool trig_mode = flags & XEN_DOMCTL_VMSI_X86_TRIG_MASK;
+    struct pirq *pirq = dpci_pirq(pirq_dpci);
 
     HVM_DBG_LOG(DBG_LEVEL_IOAPIC,
                 "msi: dest=%x dest_mode=%x delivery_mode=%x "
@@ -127,7 +128,7 @@ void vmsi_deliver_pirq(struct domain *d, const struct hvm_pirq_dpci *pirq_dpci)
     ASSERT(pirq_dpci->flags & HVM_IRQ_DPCI_GUEST_MSI);
 
     vmsi_deliver_callback(d, vector, dest, dest_mode, delivery_mode, trig_mode,
-                          hvm_dpci_msi_eoi, NULL);
+                          pirq->masked ? hvm_dpci_msi_eoi : NULL, pirq_dpci);
 }
 
 /* Return value, -1 : multi-dests, non-negative value: dest_vcpu_id */
diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c
index 3793029b29..2a0b7014f2 100644
--- a/xen/drivers/passthrough/io.c
+++ b/xen/drivers/passthrough/io.c
@@ -851,29 +851,6 @@ static void __msi_pirq_eoi(struct hvm_pirq_dpci *pirq_dpci)
     }
 }
 
-static int _hvm_dpci_msi_eoi(struct domain *d,
-                             struct hvm_pirq_dpci *pirq_dpci, void *arg)
-{
-    int vector = (long)arg;
-
-    if ( (pirq_dpci->flags & HVM_IRQ_DPCI_MACH_MSI) &&
-         (pirq_dpci->gmsi.gvec == vector) )
-    {
-        unsigned int dest = MASK_EXTR(pirq_dpci->gmsi.gflags,
-                                      XEN_DOMCTL_VMSI_X86_DEST_ID_MASK);
-        bool dest_mode = pirq_dpci->gmsi.gflags & XEN_DOMCTL_VMSI_X86_DM_MASK;
-
-        if ( vlapic_match_dest(vcpu_vlapic(current), NULL, 0, dest,
-                               dest_mode) )
-        {
-            __msi_pirq_eoi(pirq_dpci);
-            return 1;
-        }
-    }
-
-    return 0;
-}
-
 void hvm_dpci_msi_eoi(struct vcpu *v, unsigned int vector, void *data)
 {
     struct domain *d = v->domain;
@@ -883,7 +860,7 @@ void hvm_dpci_msi_eoi(struct vcpu *v, unsigned int vector, void *data)
         return;
 
     spin_lock(&d->event_lock);
-    pt_pirq_iterate(d, _hvm_dpci_msi_eoi, (void *)(long)vector);
+    __msi_pirq_eoi(data);
     spin_unlock(&d->event_lock);
 }

diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
index be0d8b0a4d..c28fbf96f9 100644
--- a/xen/include/asm-x86/hvm/hvm.h
+++ b/xen/include/asm-x86/hvm/hvm.h
@@ -264,7 +264,7 @@ int vmsi_deliver(
     uint8_t dest, uint8_t dest_mode,
     uint8_t delivery_mode, uint8_t trig_mode);
 struct hvm_pirq_dpci;
-void vmsi_deliver_pirq(struct domain *d, const struct hvm_pirq_dpci *);
+void vmsi_deliver_pirq(struct domain *d, struct hvm_pirq_dpci *);
 int hvm_girq_dest_2_vcpu_id(struct domain *d, uint8_t dest, uint8_t dest_mode);
 
 enum hvm_intblk
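
For clarity, the scheme the diff relies on can be reduced to the
following self-contained sketch (plain C, all names hypothetical, not
the actual code from my series): the injector registers an EOI callback
plus an opaque pointer per vector, or nothing at all when the source
needs no Ack, so the EOI path neither takes a domain-wide lock nor
walks any list:

#include <stddef.h>
#include <stdio.h>

#define NR_VECTORS 256

/* Hypothetical per-vector EOI callback slot, registered at injection
 * time and consumed (at most once) when the guest EOIs the vector.
 * A real implementation would make this per-vCPU, as discussed later
 * in the thread. */
typedef void (*eoi_cb_t)(unsigned int vector, void *data);

static struct {
    eoi_cb_t cb;   /* NULL when the source needs no Ack */
    void *data;    /* e.g. the hvm_pirq_dpci that injected the vector */
} vec_cb[NR_VECTORS];

static void inject_vector(unsigned int vec, eoi_cb_t cb, void *data)
{
    vec_cb[vec].cb = cb;
    vec_cb[vec].data = data;
    /* ... set the IRR bit etc. ... */
}

static void handle_eoi(unsigned int vec)
{
    eoi_cb_t cb = vec_cb[vec].cb;

    vec_cb[vec].cb = NULL;
    if ( cb )   /* nothing registered -> the EOI costs nothing extra */
        cb(vec, vec_cb[vec].data);
}

static void msi_ack(unsigned int vec, void *data)
{
    (void)data;
    printf("acking the pirq behind vector %#x\n", vec);
}

int main(void)
{
    inject_vector(0x30, msi_ack, NULL); /* unmaskable MSI: Ack needed */
    inject_vector(0x40, NULL, NULL);    /* maskable MSI: no callback */
    handle_eoi(0x40);                   /* no-op */
    handle_eoi(0x30);                   /* runs the Ack */
    return 0;
}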
Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code
On 13.08.2020 11:45, Roger Pau Monné wrote:
> On Thu, Aug 13, 2020 at 09:10:31AM +0100, Paul Durrant wrote:
>>> -----Original Message-----
>>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of David Woodhouse
>>> Sent: 11 August 2020 14:25
>>> To: Paul Durrant <paul.durrant@citrix.com>; xen-devel@lists.xenproject.org; Roger Pau Monne
>>> <roger.pau@citrix.com>
>>> Cc: Eslam Elnikety <elnikety@amazon.de>; Andrew Cooper <andrew.cooper3@citrix.com>; Shan Haitao
>>> <haitao.shan@intel.com>; Jan Beulich <jbeulich@suse.com>
>>> Subject: Re: [Xen-devel] [PATCH v2] x86/hvm: re-work viridian APIC assist code
>>>
>>> Resending this straw man patch at Roger's request, to restart discussion.
>>>
>>> Redux: In order to cope with the relatively rare case of unmaskable
>>> legacy MSIs, each vlapic EOI takes a domain-global spinlock just to
>>> iterate over all IRQs and determine that there's actually nothing to
>>> do.
>>>
>>> In my testing, I observe that this drops Windows performance on passed-
>>> through NVMe from 2.2M IOPS down to about 1.0M IOPS.
>>>
>>> I have a variant of this patch which just has a single per-domain "I
>>> attached legacy unmaskable MSIs" flag, which is never cleared. The
>>> patch below is per-vector (but Roger points out it should be per-vCPU
>>> per-vector). I don't know that we really care enough to do more than
>>> the single per-domain flag, which in real life would never happen
>>> anyway unless you have crappy hardware, at which point you get back to
>>> today's status quo.
>>>
>>> My main concern is that this code is fairly sparsely documented and I'm
>>> only 99% sure that this code path really *is* only for unmaskable MSIs,
>>> and doesn't have some other esoteric use. A second opinion on that
>>> would be particularly welcome.
>>>
>>
>> The loop appears to be there to handle the case where multiple
>> devices assigned to a domain have MSIs programmed with the same
>> dest/vector... which seems like an odd thing for a guest to do but I
>> guess it is at liberty to do it. Does it matter whether they are
>> maskable or not?
>
> Such a configuration would never work properly, as lapic vectors are
> edge triggered and thus can't be safely shared between devices?

Wait - there are two aspects here: vectors are difficult to share on
the same CPU (though not impossible, if the devices and their drivers
meet certain conditions). But the bitmap gets installed as a
per-domain rather than a per-vcpu one, and using the same vector on
different CPUs is definitely possible, as demonstrated by both Xen
itself and Linux.
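
To make the second aspect concrete, here is a minimal self-contained
illustration (plain C, all names hypothetical) of why such tracking
wants to be per-vCPU per-vector rather than per-domain: the same
vector can legitimately be in flight on two vCPUs for two different
devices, and a single domain-wide bit can't record whose EOI still
needs handling:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NR_VECTORS 256

/* Hypothetical per-vCPU state: one bit per vector that still needs an
 * ack when the guest EOIs it. */
struct vcpu_state {
    uint8_t needs_ack[NR_VECTORS / 8];
};

static void set_needs_ack(struct vcpu_state *v, unsigned int vec)
{
    v->needs_ack[vec / 8] |= 1u << (vec % 8);
}

static bool test_and_clear_needs_ack(struct vcpu_state *v, unsigned int vec)
{
    bool set = v->needs_ack[vec / 8] & (1u << (vec % 8));

    v->needs_ack[vec / 8] &= ~(1u << (vec % 8));
    return set;
}

int main(void)
{
    struct vcpu_state vcpu0 = {{0}}, vcpu1 = {{0}};

    /* Vector 0x30 is in flight on vCPU0 for one device, while another
     * device may use the same vector on vCPU1. */
    set_needs_ack(&vcpu0, 0x30);

    /* With per-vCPU state, vCPU1's EOI of 0x30 correctly does nothing;
     * a per-domain bitmap would have consumed vCPU0's pending ack. */
    printf("vcpu1 EOI 0x30 -> ack? %d\n",
           test_and_clear_needs_ack(&vcpu1, 0x30)); /* 0 */
    printf("vcpu0 EOI 0x30 -> ack? %d\n",
           test_and_clear_needs_ack(&vcpu0, 0x30)); /* 1 */
    return 0;
}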

Jan
Re: [PATCH v2] x86/hvm: re-work viridian APIC assist code
On Wed, Aug 19, 2020 at 09:12:02AM +0200, Jan Beulich wrote:
> On 13.08.2020 11:45, Roger Pau Monné wrote:
> > On Thu, Aug 13, 2020 at 09:10:31AM +0100, Paul Durrant wrote:
> >>> -----Original Message-----
> >>> From: Xen-devel <xen-devel-bounces@lists.xenproject.org> On Behalf Of David Woodhouse
> >>> Sent: 11 August 2020 14:25
> >>> To: Paul Durrant <paul.durrant@citrix.com>; xen-devel@lists.xenproject.org; Roger Pau Monne
> >>> <roger.pau@citrix.com>
> >>> Cc: Eslam Elnikety <elnikety@amazon.de>; Andrew Cooper <andrew.cooper3@citrix.com>; Shan Haitao
> >>> <haitao.shan@intel.com>; Jan Beulich <jbeulich@suse.com>
> >>> Subject: Re: [Xen-devel] [PATCH v2] x86/hvm: re-work viridian APIC assist code
> >>>
> >>> Resending this straw man patch at Roger's request, to restart discussion.
> >>>
> >>> Redux: In order to cope with the relatively rare case of unmaskable
> >>> legacy MSIs, each vlapic EOI takes a domain-global spinlock just to
> >>> iterate over all IRQs and determine that there's actually nothing to
> >>> do.
> >>>
> >>> In my testing, I observe that this drops Windows performance on passed-
> >>> through NVMe from 2.2M IOPS down to about 1.0M IOPS.
> >>>
> >>> I have a variant of this patch which just has a single per-domain "I
> >>> attached legacy unmaskable MSIs" flag, which is never cleared. The
> >>> patch below is per-vector (but Roger points out it should be per-vCPU
> >>> per-vector). I don't know that we really care enough to do more than
> >>> the single per-domain flag, which in real life would never happen
> >>> anyway unless you have crappy hardware, at which point you get back to
> >>> today's status quo.
> >>>
> >>> My main concern is that this code is fairly sparsely documented and I'm
> >>> only 99% sure that this code path really *is* only for unmaskable MSIs,
> >>> and doesn't have some other esoteric use. A second opinion on that
> >>> would be particularly welcome.
> >>>
> >>
> >> The loop appears to be there to handle the case where multiple
> >> devices assigned to a domain have MSIs programmed with the same
> >> dest/vector... which seems like an odd thing for a guest to do but I
> >> guess it is at liberty to do it. Does it matter whether they are
> >> maskable or not?
> >
> > Such a configuration would never work properly, as lapic vectors are
> > edge triggered and thus can't be safely shared between devices?
>
> Wait - there are two aspects here: vectors are difficult to share on
> the same CPU (though not impossible, if the devices and their drivers
> meet certain conditions). But the bitmap gets installed as a
> per-domain rather than a per-vcpu one, and using the same vector on
> different CPUs is definitely possible, as demonstrated by both Xen
> itself and Linux.

Yes, that's why I've requested the bitmap to be per-vcpu, but given
the work I'm doing related to interrupt EOI callbacks maybe this won't
be needed.

Thanks, Roger.