[PATCH] x86/pv: Flush TLB in response to paging structure changes
With MMU_UPDATE, a PV guest can make changes to higher level pagetables. This
is safe from Xen's point of view (as the update only affects guest mappings), and
the guest is required to flush suitably after making updates.

However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
writeable pagetables, etc.) is an implementation detail outside of the
API/ABI.

Changes in the paging structure require invalidations in the linear pagetable
range for subsequent accesses into the linear pagetables to access non-stale
mappings. Xen must provide suitable flushing to prevent intermixed guest
actions from accidentally accessing/modifying the wrong pagetable.

For all L2 and higher modifications, flush the full TLB. (This could in
principle be an order 39 flush starting at LINEAR_PT_VIRT_START, but no such
mechanism exists in practice.)
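
For concreteness, a sketch of the aliasing at issue (the arithmetic below is
illustrative only, not part of this patch): with a recursive L4 slot, the L1
entry governing a virtual address is itself readable through a virtual address
inside the linear range, so a structure change can leave stale translations
anywhere in that 2^39-byte (512GB) window.

    /* Illustrative: linear-pagetable address of the L1 entry mapping va. */
    #define LINEAR_L1E_OF(va)                            \
        (LINEAR_PT_VIRT_START +                          \
         ((((unsigned long)(va) & ((1UL << 48) - 1))     \
           >> PAGE_SHIFT) * sizeof(intpte_t)))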

As this combines with sync_guest for XPTI L4 "shadowing", replace the
sync_guest boolean with flush_flags and accumulate flags. The sync_guest case
now always needs to flush, so there is no point trying to exclude the current CPU
from the flush mask. Use pt_owner->dirty_cpumask directly.

This is XSA-286.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>

A couple of minor points.

* PV guests can create global mappings. I can't see any safe way to relax
FLUSH_TLB_GLOBAL to just FLUSH_TLB.

* Performance tests are still ongoing, but so far it is faring better than the
embargoed alternative.
---
xen/arch/x86/mm.c | 31 +++++++++++++++----------------
1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 918ee2bbe3..a6a7fcb56c 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3883,11 +3883,10 @@ long do_mmu_update(
void *va = NULL;
unsigned long gpfn, gmfn;
struct page_info *page;
- unsigned int cmd, i = 0, done = 0, pt_dom;
+ unsigned int cmd, i = 0, done = 0, pt_dom, flush_flags = 0;
struct vcpu *curr = current, *v = curr;
struct domain *d = v->domain, *pt_owner = d, *pg_owner;
mfn_t map_mfn = INVALID_MFN, mfn;
- bool sync_guest = false;
uint32_t xsm_needed = 0;
uint32_t xsm_checked = 0;
int rc = put_old_guest_table(curr);
@@ -4037,6 +4036,8 @@ long do_mmu_update(
break;
rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+ if ( !rc )
+ flush_flags |= FLUSH_TLB_GLOBAL;
break;

case PGT_l3_page_table:
@@ -4044,6 +4045,8 @@ long do_mmu_update(
break;
rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+ if ( !rc )
+ flush_flags |= FLUSH_TLB_GLOBAL;
break;

case PGT_l4_page_table:
@@ -4051,6 +4054,8 @@ long do_mmu_update(
break;
rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+ if ( !rc )
+ flush_flags |= FLUSH_TLB_GLOBAL;
if ( !rc && pt_owner->arch.pv.xpti )
{
bool local_in_use = false;
@@ -4071,7 +4076,7 @@ long do_mmu_update(
(1 + !!(page->u.inuse.type_info & PGT_pinned) +
mfn_eq(pagetable_get_mfn(curr->arch.guest_table_user),
mfn) + local_in_use) )
- sync_guest = true;
+ flush_flags |= FLUSH_ROOT_PGTBL;
}
break;

@@ -4173,19 +4178,13 @@ long do_mmu_update(
if ( va )
unmap_domain_page(va);

- if ( sync_guest )
- {
- /*
- * Force other vCPU-s of the affected guest to pick up L4 entry
- * changes (if any).
- */
- unsigned int cpu = smp_processor_id();
- cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
-
- cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
- if ( !cpumask_empty(mask) )
- flush_mask(mask, FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
- }
+ /*
+ * Flush TLBs if an L2 or higher was changed (invalidates the structure of
+ * the linear pagetables), or an L4 in use by other CPUs was changed (needs
+ * to resync the XPTI copy of the table).
+ */
+ if ( flush_flags )
+ flush_mask(pt_owner->dirty_cpumask, flush_flags);

perfc_add(num_page_updates, i);

--
2.11.0
Re: [PATCH] x86/pv: Flush TLB in response to paging structure changes
On 20/10/2020 16:24, Andrew Cooper wrote:
> With MMU_UPDATE, a PV guest can make changes to higher level pagetables. This
> is safe from Xen's point of view (as the update only affects guest mappings), and
> the guest is required to flush suitably after making updates.
>
> However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
> writeable pagetables, etc.) is an implementation detail outside of the
> API/ABI.
>
> Changes in the paging structure require invalidations in the linear pagetable
> range for subsequent accesses into the linear pagetables to access non-stale
> mappings. Xen must provide suitable flushing to prevent intermixed guest
> actions from accidentally accessing/modifying the wrong pagetable.
>
> For all L2 and higher modifications, flush the full TLB. (This could in
> principle be an order 39 flush starting at LINEAR_PT_VIRT_START, but no such
> mechanism exists in practice.)
>
> As this combines with sync_guest for XPTI L4 "shadowing", replace the
> sync_guest boolean with flush_flags and accumulate flags. The sync_guest case
> now always needs to flush, so there is no point trying to exclude the current CPU
> from the flush mask. Use pt_owner->dirty_cpumask directly.
>
> This is XSA-286.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Wei Liu <wl@xen.org>
>
> A couple of minor points.
>
> * PV guests can create global mappings. I can't see any safe way to relax
> FLUSH_TLB_GLOBAL to just FLUSH_TLB.

Sorry - forgot one of the points here.

We could in principle relax the flush entirely if we know that we're
editing from a not-present to present entry, but plumbing this up
through mod_l?_entry() isn't trivial, and it's also not obvious how
much of an optimisation it would be in practice.
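
A sketch of the shape such a check might take (hypothetical; mod_l?_entry()
does not currently pass the old entry back out to the caller):

    /*
     * Hypothetical: a not-present -> present transition cannot have stale
     * translations cached beneath it, so a flush is only needed when the
     * old entry was present.
     */
    bool flush_needed = old_e & _PAGE_PRESENT;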

~Andrew

> * Performance tests are still ongoing, but so far it is faring better than the
> embargoed alternative.
> ---
> xen/arch/x86/mm.c | 31 +++++++++++++++----------------
> 1 file changed, 15 insertions(+), 16 deletions(-)
>
> diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
> index 918ee2bbe3..a6a7fcb56c 100644
> --- a/xen/arch/x86/mm.c
> +++ b/xen/arch/x86/mm.c
> @@ -3883,11 +3883,10 @@ long do_mmu_update(
> void *va = NULL;
> unsigned long gpfn, gmfn;
> struct page_info *page;
> - unsigned int cmd, i = 0, done = 0, pt_dom;
> + unsigned int cmd, i = 0, done = 0, pt_dom, flush_flags = 0;
> struct vcpu *curr = current, *v = curr;
> struct domain *d = v->domain, *pt_owner = d, *pg_owner;
> mfn_t map_mfn = INVALID_MFN, mfn;
> - bool sync_guest = false;
> uint32_t xsm_needed = 0;
> uint32_t xsm_checked = 0;
> int rc = put_old_guest_table(curr);
> @@ -4037,6 +4036,8 @@ long do_mmu_update(
> break;
> rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
> cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
> + if ( !rc )
> + flush_flags |= FLUSH_TLB_GLOBAL;
> break;
>
> case PGT_l3_page_table:
> @@ -4044,6 +4045,8 @@ long do_mmu_update(
> break;
> rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
> cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
> + if ( !rc )
> + flush_flags |= FLUSH_TLB_GLOBAL;
> break;
>
> case PGT_l4_page_table:
> @@ -4051,6 +4054,8 @@ long do_mmu_update(
> break;
> rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
> cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
> + if ( !rc )
> + flush_flags |= FLUSH_TLB_GLOBAL;
> if ( !rc && pt_owner->arch.pv.xpti )
> {
> bool local_in_use = false;
> @@ -4071,7 +4076,7 @@ long do_mmu_update(
> (1 + !!(page->u.inuse.type_info & PGT_pinned) +
> mfn_eq(pagetable_get_mfn(curr->arch.guest_table_user),
> mfn) + local_in_use) )
> - sync_guest = true;
> + flush_flags |= FLUSH_ROOT_PGTBL;
> }
> break;
>
> @@ -4173,19 +4178,13 @@ long do_mmu_update(
> if ( va )
> unmap_domain_page(va);
>
> - if ( sync_guest )
> - {
> - /*
> - * Force other vCPU-s of the affected guest to pick up L4 entry
> - * changes (if any).
> - */
> - unsigned int cpu = smp_processor_id();
> - cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
> -
> - cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
> - if ( !cpumask_empty(mask) )
> - flush_mask(mask, FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
> - }
> + /*
> + * Flush TLBs if an L2 or higher was changed (invalidates the structure of
> + * the linear pagetables), or an L4 in use by other CPUs was changed (needs
> + * to resync the XPTI copy of the table).
> + */
> + if ( flush_flags )
> + flush_mask(pt_owner->dirty_cpumask, flush_flags);
>
> perfc_add(num_page_updates, i);
>
Re: [PATCH] x86/pv: Flush TLB in response to paging structure changes
On 20.10.2020 17:24, Andrew Cooper wrote:
> With MMU_UPDATE, a PV guest can make changes to higher level pagetables. This
> is safe from Xen's point of view (as the update only affects guest mappings), and
> the guest is required to flush suitably after making updates.
>
> However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
> writeable pagetables, etc.) is an implementation detail outside of the
> API/ABI.
>
> Changes in the paging structure require invalidations in the linear pagetable
> range for subsequent accesses into the linear pagetables to access non-stale
> mappings. Xen must provide suitable flushing to prevent intermixed guest
> actions from accidentally accessing/modifying the wrong pagetable.
>
> For all L2 and higher modifications, flush the full TLB. (This could in
> principle be an order 39 flush starting at LINEAR_PT_VIRT_START, but no such
> mechanism exists in practice.)
>
> As this combines with sync_guest for XPTI L4 "shadowing", replace the
> sync_guest boolean with flush_flags and accumulate flags. The sync_guest case
> now always needs to flush, so there is no point trying to exclude the current CPU
> from the flush mask. Use pt_owner->dirty_cpumask directly.

Why is there no point? There's no need for the FLUSH_ROOT_PGTBL
part of the flushing on the local CPU. The draft you had sent
earlier looked better in this regard.

Jan
Re: [PATCH] x86/pv: Flush TLB in response to paging structure changes
On 20/10/2020 16:48, Jan Beulich wrote:
> On 20.10.2020 17:24, Andrew Cooper wrote:
>> With MMU_UPDATE, a PV guest can make changes to higher level pagetables. This
>> is safe from Xen's point of view (as the update only affects guest mappings), and
>> the guest is required to flush suitably after making updates.
>>
>> However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
>> writeable pagetables, etc.) is an implementation detail outside of the
>> API/ABI.
>>
>> Changes in the paging structure require invalidations in the linear pagetable
>> range for subsequent accesses into the linear pagetables to access non-stale
>> mappings. Xen must provide suitable flushing to prevent intermixed guest
>> actions from accidentally accessing/modifying the wrong pagetable.
>>
>> For all L2 and higher modifications, flush the full TLB. (This could in
>> principle be an order 39 flush starting at LINEAR_PT_VIRT_START, but no such
>> mechanism exists in practice.)
>>
>> As this combines with sync_guest for XPTI L4 "shadowing", replace the
>> sync_guest boolean with flush_flags and accumulate flags. The sync_guest case
>> now always needs to flush, so there is no point trying to exclude the current CPU
>> from the flush mask. Use pt_owner->dirty_cpumask directly.
> Why is there no point? There's no need for the FLUSH_ROOT_PGTBL
> part of the flushing on the local CPU. The draft you had sent
> earlier looked better in this regard.

This was the area which broke.  It is to do with a subtle difference in
the scope of L4 updates.

ROOT_PGTBL needs to resync current (if in use), and be broadcasted if
other references to the pages are found.

The TLB flush needs to be broadcast to the whole domain dirty mask, as
we can't (easily) know if the update was part of the current structure.
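
Roughly, in pseudocode (a sketch of the two scopes only, not the patch
itself; other_refs stands in for the PGT_count_mask reference check):

    /* L4 resync: local CPU if this L4 is in use, remote if other refs. */
    if ( local_in_use )
        get_cpu_info()->root_pgt_changed = true;
    if ( other_refs )
        flush_flags |= FLUSH_ROOT_PGTBL;

    /* TLB invalidation: the whole domain dirty mask, as the modified
     * table may sit anywhere within a paging structure currently in use. */
    flush_mask(pt_owner->dirty_cpumask, flush_flags | FLUSH_TLB_GLOBAL);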

~Andrew
Re: [PATCH] x86/pv: Flush TLB in response to paging structure changes
On 20/10/2020 17:20, Andrew Cooper wrote:
> On 20/10/2020 16:48, Jan Beulich wrote:
>> On 20.10.2020 17:24, Andrew Cooper wrote:
>>> With MMU_UPDATE, a PV guest can make changes to higher level pagetables. This
>>> is safe from Xen's point of view (as the update only affects guest mappings), and
>>> the guest is required to flush suitably after making updates.
>>>
>>> However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
>>> writeable pagetables, etc.) is an implementation detail outside of the
>>> API/ABI.
>>>
>>> Changes in the paging structure require invalidations in the linear pagetable
>>> range for subsequent accesses into the linear pagetables to access non-stale
>>> mappings. Xen must provide suitable flushing to prevent intermixed guest
>>> actions from accidentally accessing/modifying the wrong pagetable.
>>>
>>> For all L2 and higher modifications, flush the full TLB. (This could in
>>> principle be an order 39 flush starting at LINEAR_PT_VIRT_START, but no such
>>> mechanism exists in practice.)
>>>
>>> As this combines with sync_guest for XPTI L4 "shadowing", replace the
>>> sync_guest boolean with flush_flags and accumulate flags. The sync_guest case
>>> now always needs to flush, so there is no point trying to exclude the current CPU
>>> from the flush mask. Use pt_owner->dirty_cpumask directly.
>> Why is there no point? There's no need for the FLUSH_ROOT_PGTBL
>> part of the flushing on the local CPU. The draft you had sent
>> earlier looked better in this regard.
> This was the area which broke.  It is to do with a subtle difference in
> the scope of L4 updates.
>
> ROOT_PGTBL needs to resync current (if in use), and be broadcasted if
> other references to the pages are found.
>
> The TLB flush needs to be broadcast to the whole domain dirty mask, as
> we can't (easily) know if the update was part of the current structure.

Actually - we can know whether an L4 update needs flushing locally or
not, in exactly the same way as the sync logic currently works.

However, unlike the opencoded get_cpu_info()->root_pgt_changed = true,
we can't just flush locally for free.

This is quite awkward to express.

~Andrew
Re: [PATCH] x86/pv: Flush TLB in response to paging structure changes
On 20/10/2020 18:10, Andrew Cooper wrote:
> On 20/10/2020 17:20, Andrew Cooper wrote:
>> On 20/10/2020 16:48, Jan Beulich wrote:
>>> On 20.10.2020 17:24, Andrew Cooper wrote:
>>>> With MMU_UPDATE, a PV guest can make changes to higher level pagetables. This
>>>> is safe from Xen's point of view (as the update only affects guest mappings), and
>>>> the guest is required to flush suitably after making updates.
>>>>
>>>> However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
>>>> writeable pagetables, etc.) is an implementation detail outside of the
>>>> API/ABI.
>>>>
>>>> Changes in the paging structure require invalidations in the linear pagetable
>>>> range for subsequent accesses into the linear pagetables to access non-stale
>>>> mappings. Xen must provide suitable flushing to prevent intermixed guest
>>>> actions from accidentally accessing/modifying the wrong pagetable.
>>>>
>>>> For all L2 and higher modifications, flush the full TLB. (This could in
>>>> principle be an order 39 flush starting at LINEAR_PT_VIRT_START, but no such
>>>> mechanism exists in practice.)
>>>>
>>>> As this combines with sync_guest for XPTI L4 "shadowing", replace the
>>>> sync_guest boolean with flush_flags and accumulate flags. The sync_guest case
>>>> now always needs to flush, so there is no point trying to exclude the current CPU
>>>> from the flush mask. Use pt_owner->dirty_cpumask directly.
>>> Why is there no point? There's no need for the FLUSH_ROOT_PGTBL
>>> part of the flushing on the local CPU. The draft you had sent
>>> earlier looked better in this regard.
>> This was the area which broke.  It is to do with a subtle difference in
>> the scope of L4 updates.
>>
>> ROOT_PGTBL needs to resync current (if in use), and be broadcasted if
>> other references to the pages are found.
>>
>> The TLB flush needs to be broadcast to the whole domain dirty mask, as
>> we can't (easily) know if the update was part of the current structure.
> Actually - we can know whether an L4 update needs flushing locally or
> not, in exactly the same way as the sync logic currently works.
>
> However, unlike the opencoded get_cpu_info()->root_pgt_changed = true,
> we can't just flush locally for free.
>
> This is quite awkward to express.

And not safe.  Flushes may accumulate from multiple levels in a batch,
and pt_owner may not be equal to current.

I stand by the version submitted as the security fix.

~Andrew
Re: [PATCH] x86/pv: Flush TLB in response to paging structure changes
On 20.10.2020 17:24, Andrew Cooper wrote:
> A couple of minor points.
>
> * PV guests can create global mappings. I can't see any safe way to relax
> FLUSH_TLB_GLOBAL to just FLUSH_TLB.

We only care about guest view here, and from guest view we only
care about non-leaf entries. Non-leaf entries can't be global,
and luckily (for now at least) the G bit also doesn't get
different meaning assigned in non-leaf entries.
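
Expressed as a sketch (illustrative, encoding the architectural rule that
the G bit only has meaning in leaf entries):

    /* Only leaf entries can create global TLB entries. */
    static bool entry_can_be_global(intpte_t e, unsigned int level)
    {
        if ( level == 1 )              /* L1 entries are always leaves. */
            return e & _PAGE_GLOBAL;
        /* L2/L3: the G bit only applies to superpage (leaf) mappings. */
        return (e & _PAGE_PSE) && (e & _PAGE_GLOBAL);
    }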

Jan
Re: [PATCH] x86/pv: Flush TLB in response to paging structure changes
On 20.10.2020 20:46, Andrew Cooper wrote:
> On 20/10/2020 18:10, Andrew Cooper wrote:
>> On 20/10/2020 17:20, Andrew Cooper wrote:
>>> On 20/10/2020 16:48, Jan Beulich wrote:
>>>> On 20.10.2020 17:24, Andrew Cooper wrote:
>>>>> With MMU_UPDATE, a PV guest can make changes to higher level pagetables. This
>>>>> is safe from Xen's point of view (as the update only affects guest mappings), and
>>>>> the guest is required to flush suitably after making updates.
>>>>>
>>>>> However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
>>>>> writeable pagetables, etc.) is an implementation detail outside of the
>>>>> API/ABI.
>>>>>
>>>>> Changes in the paging structure require invalidations in the linear pagetable
>>>>> range for subsequent accesses into the linear pagetables to access non-stale
>>>>> mappings. Xen must provide suitable flushing to prevent intermixed guest
>>>>> actions from accidentally accessing/modifying the wrong pagetable.
>>>>>
>>>>> For all L2 and higher modifications, flush the full TLB. (This could in
>>>>> principle be an order 39 flush starting at LINEAR_PT_VIRT_START, but no such
>>>>> mechanism exists in practice.)
>>>>>
>>>>> As this combines with sync_guest for XPTI L4 "shadowing", replace the
>>>>> sync_guest boolean with flush_flags and accumulate flags. The sync_guest case
>>>>> now always needs to flush, so there is no point trying to exclude the current CPU
>>>>> from the flush mask. Use pt_owner->dirty_cpumask directly.
>>>> Why is there no point? There's no need for the FLUSH_ROOT_PGTBL
>>>> part of the flushing on the local CPU. The draft you had sent
>>>> earlier looked better in this regard.
>>> This was the area which broke.  It is to do with a subtle difference in
>>> the scope of L4 updates.
>>>
>>> ROOT_PGTBL needs to resync current (if in use), and be broadcasted if
>>> other references to the pages are found.
>>>
>>> The TLB flush needs to be broadcast to the whole domain dirty mask, as
>>> we can't (easily) know if the update was part of the current structure.
>> Actually - we can know whether an L4 update needs flushing locally or
>> not, in exactly the same way as the sync logic currently works.
>>
>> However, unlike the opencoded get_cpu_info()->root_pgt_changed = true,
>> we can't just flush locally for free.
>>
>> This is quite awkward to express.
>
> And not safe.  Flushes may accumulate from multiple levels in a batch,
> and pt_owner may not be equal to current.

I'm not questioning the TLB flush - this needs to be the scope you
use (but just FLUSH_TLB as per my earlier reply). I'm questioning
the extra ROOT_PGTBL sync (meaning changes to levels other than L4
don't matter), which is redundant with the explicit setting right
after the call to mod_l4_entry(). But I guess since now you need
to issue _some_ flush_mask() for the local CPU anyway, perhaps
it's rather the explicit setting of ->root_pgt_changed which wants
dropping?

(If pt_owner != current->domain, then pt_owner->dirty_cpumask
can't have smp_processor_id()'s bit set, and hence there was no
reduction in scope in this case anyway. Similarly in this case
local_in_use is necessarily false, as page tables can't be
shared between domains.)

Taking both adjustments together:
- all L[234] changes require FLUSH_TLB on dirty CPUs of
pt_owner including the local CPU
- the converted sync_guest continues to require
FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL on remote dirty CPUs of
pt_owner
This difference, I think, still warrants treating the local CPU
specially, as the global flush continues to be unnecessary there.
Whether the local CPU's ->root_pgt_changed gets set via
flush_local() or explicitly is then a pretty benign secondary
aspect.
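
Expressed as code, that split might look like this (a sketch of the
suggestion only, with illustrative variable names; not a posted patch):

    unsigned int flush_flags_local = 0, flush_flags_remote = 0;

    /* Any L[234] change: plain TLB flush, locally and remotely. */
    flush_flags_local  |= FLUSH_TLB;
    flush_flags_remote |= FLUSH_TLB;

    /* Converted sync_guest case: remote CPUs additionally need a global
     * flush plus the XPTI root pagetable resync. */
    flush_flags_remote |= FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL;

    /* Locally, ->root_pgt_changed can be set directly, or equivalently
     * via FLUSH_ROOT_PGTBL in flush_local(). */
    if ( local_in_use )
        flush_flags_local |= FLUSH_ROOT_PGTBL;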

Jan
Re: [PATCH] x86/pv: Flush TLB in response to paging structure changes
On 21/10/2020 07:55, Jan Beulich wrote:
> On 20.10.2020 20:46, Andrew Cooper wrote:
>> On 20/10/2020 18:10, Andrew Cooper wrote:
>>> On 20/10/2020 17:20, Andrew Cooper wrote:
>>>> On 20/10/2020 16:48, Jan Beulich wrote:
>>>>> On 20.10.2020 17:24, Andrew Cooper wrote:
>>>>>> With MMU_UPDATE, a PV guest can make changes to higher level pagetables. This
>>>>>> is safe from Xen's point of view (as the update only affects guest mappings), and
>>>>>> the guest is required to flush suitably after making updates.
>>>>>>
>>>>>> However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
>>>>>> writeable pagetables, etc.) is an implementation detail outside of the
>>>>>> API/ABI.
>>>>>>
>>>>>> Changes in the paging structure require invalidations in the linear pagetable
>>>>>> range for subsequent accesses into the linear pagetables to access non-stale
>>>>>> mappings. Xen must provide suitable flushing to prevent intermixed guest
>>>>>> actions from accidentally accessing/modifying the wrong pagetable.
>>>>>>
>>>>>> For all L2 and higher modifications, flush the full TLB. (This could in
>>>>>> principle be an order 39 flush starting at LINEAR_PT_VIRT_START, but no such
>>>>>> mechanism exists in practice.)
>>>>>>
>>>>>> As this combines with sync_guest for XPTI L4 "shadowing", replace the
>>>>>> sync_guest boolean with flush_flags and accumulate flags. The sync_guest case
>>>>>> now always needs to flush, so there is no point trying to exclude the current CPU
>>>>>> from the flush mask. Use pt_owner->dirty_cpumask directly.
>>>>> Why is there no point? There's no need for the FLUSH_ROOT_PGTBL
>>>>> part of the flushing on the local CPU. The draft you had sent
>>>>> earlier looked better in this regard.
>>>> This was the area which broke.  It is to do with a subtle difference in
>>>> the scope of L4 updates.
>>>>
>>>> ROOT_PGTBL needs to resync current (if in use), and be broadcasted if
>>>> other references to the pages are found.
>>>>
>>>> The TLB flush needs to be broadcast to the whole domain dirty mask, as
>>>> we can't (easily) know if the update was part of the current structure.
>>> Actually - we can know whether an L4 update needs flushing locally or
>>> not, in exactly the same way as the sync logic currently works.
>>>
>>> However, unlike the opencoded get_cpu_info()->root_pgt_changed = true,
>>> we can't just flush locally for free.
>>>
>>> This is quite awkward to express.
>> And not safe.  Flushes may accumulate from multiple levels in a batch,
>> and pt_owner may not be equal to current.
> I'm not questioning the TLB flush - this needs to be the scope you
> use (but just FLUSH_TLB as per my earlier reply). I'm questioning
> the extra ROOT_PGTBL sync (meaning changes to levels other than L4
> don't matter), which is redundant with the explicit setting right
> after the call to mod_l4_entry(). But I guess since now you need
> to issue _some_ flush_mask() for the local CPU anyway, perhaps
> it's rather the explicit setting of ->root_pgt_changed which wants
> dropping?

No.  That was the delta which delayed posting in the first place.  Dom0
crashes very early without it.

The problem, as I said, is the asymmetry.

As dom0 is booting, the "only one use of this L4" logic kicks in, and
skips setting ROOT_PGTBL, which then causes the flush_mask() not to
flush on the local CPU either.

~Andrew
Re: [PATCH] x86/pv: Flush TLB in response to paging structure changes
On 21.10.2020 12:01, Andrew Cooper wrote:
> On 21/10/2020 07:55, Jan Beulich wrote:
>> On 20.10.2020 20:46, Andrew Cooper wrote:
>>> On 20/10/2020 18:10, Andrew Cooper wrote:
>>>> On 20/10/2020 17:20, Andrew Cooper wrote:
>>>>> On 20/10/2020 16:48, Jan Beulich wrote:
>>>>>> On 20.10.2020 17:24, Andrew Cooper wrote:
>>>>>>> With MMU_UPDATE, a PV guest can make changes to higher level pagetables. This
>>>>>>> is safe from Xen's point of view (as the update only affects guest mappings), and
>>>>>>> the guest is required to flush suitably after making updates.
>>>>>>>
>>>>>>> However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
>>>>>>> writeable pagetables, etc.) is an implementation detail outside of the
>>>>>>> API/ABI.
>>>>>>>
>>>>>>> Changes in the paging structure require invalidations in the linear pagetable
>>>>>>> range for subsequent accesses into the linear pagetables to access non-stale
>>>>>>> mappings. Xen must provide suitable flushing to prevent intermixed guest
>>>>>>> actions from accidentally accessing/modifying the wrong pagetable.
>>>>>>>
>>>>>>> For all L2 and higher modifications, flush the full TLB. (This could in
>>>>>>> principle be an order 39 flush starting at LINEAR_PT_VIRT_START, but no such
>>>>>>> mechanism exists in practice.)
>>>>>>>
>>>>>>> As this combines with sync_guest for XPTI L4 "shadowing", replace the
>>>>>>> sync_guest boolean with flush_flags and accumulate flags. The sync_guest case
>>>>>>> now always needs to flush, so there is no point trying to exclude the current CPU
>>>>>>> from the flush mask. Use pt_owner->dirty_cpumask directly.
>>>>>> Why is there no point? There's no need for the FLUSH_ROOT_PGTBL
>>>>>> part of the flushing on the local CPU. The draft you had sent
>>>>>> earlier looked better in this regard.
>>>>> This was the area which broke.  It is to do with a subtle difference in
>>>>> the scope of L4 updates.
>>>>>
>>>>> ROOT_PGTBL needs to resync current (if in use), and be broadcasted if
>>>>> other references to the pages are found.
>>>>>
>>>>> The TLB flush needs to be broadcast to the whole domain dirty mask, as
>>>>> we can't (easily) know if the update was part of the current structure.
>>>> Actually - we can know whether an L4 update needs flushing locally or
>>>> not, in exactly the same way as the sync logic currently works.
>>>>
>>>> However, unlike the opencoded get_cpu_info()->root_pgt_changed = true,
>>>> we can't just flush locally for free.
>>>>
>>>> This is quite awkward to express.
>>> And not safe.  Flushes may accumulate from multiple levels in a batch,
>>> and pt_owner may not be equal to current.
>> I'm not questioning the TLB flush - this needs to be the scope you
>> use (but just FLUSH_TLB as per my earlier reply). I'm questioning
>> the extra ROOT_PGTBL sync (meaning changes to levels other than L4
>> don't matter), which is redundant with the explicit setting right
>> after the call to mod_l4_entry(). But I guess since now you need
>> to issue _some_ flush_mask() for the local CPU anyway, perhaps
>> it's rather the explicit setting of ->root_pgt_changed which wants
>> dropping?
>
> No.  That was the delta which delayed posting in the first place.  Dom0
> crashes very early without it.
>
> The problem, as I said, is the asymmetry.
>
> As dom0 is booting, the "only one use of this L4" logic kicks in, and
> skips setting ROOT_PGTBL, which then causes the flush_mask() not to
> flush on the local CPU either.

Ah, I think I finally got it. This asymmetry wants expressing then
in two different sets of flush flags (not sure whether two different
variables are needed), since - as per other replies - the local CPU
has different requirements anyway.

- all CPUs need FLUSH_TLB
- the local CPU may additionally need FLUSH_ROOT_PGTBL
- other CPUs may additionally (or instead, if you like) need
FLUSH_ROOT_PGTBL | FLUSH_TLB_GLOBAL

But then of course the local CPU can as well have its
->root_pgt_changed updated directly - there's no difference whether
it gets done this way or by passing FLUSH_ROOT_PGTBL to flush_local().

Jan
[PATCH] x86/pv: Flush TLB in response to paging structure changes
With MMU_UPDATE, a PV guest can make changes to higher level pagetables. This
is safe from Xen's point of view (as the update only affects guest mappings), and
the guest is required to flush suitably after making updates.

However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
writeable pagetables, etc.) is an implementation detail outside of the
API/ABI.

Changes in the paging structure require invalidations in the linear pagetable
range for subsequent accesses into the linear pagetables to access non-stale
mappings. Xen must provide suitable flushing to prevent intermixed guest
actions from accidentally accessing/modifying the wrong pagetable.

For all L2 and higher modifications, flush the TLB. PV guests cannot create
L2 or higher entries with the Global bit set, so no mappings established in
the linear range can be global. (This could in principle be an order 39 flush
starting at LINEAR_PT_VIRT_START, but no such mechanism exists in practice.)

This combines with sync_guest for XPTI L4 "shadowing", but has some asymmetry
between local and remote flush requirements. Replace the sync_guest boolean
with flush_flags_{local,all} and accumulate flags, performing all required
flushing at the end.

This is XSA-286.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>

v2:
* Use two separate flush flags.
* Use non-global flushes.
---
xen/arch/x86/mm.c | 61 +++++++++++++++++++++++++++++++++++++++----------------
1 file changed, 43 insertions(+), 18 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 918ee2bbe3..87860c2ca3 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3887,7 +3887,7 @@ long do_mmu_update(
struct vcpu *curr = current, *v = curr;
struct domain *d = v->domain, *pt_owner = d, *pg_owner;
mfn_t map_mfn = INVALID_MFN, mfn;
- bool sync_guest = false;
+ unsigned int flush_flags_local = 0, flush_flags_all = 0;
uint32_t xsm_needed = 0;
uint32_t xsm_checked = 0;
int rc = put_old_guest_table(curr);
@@ -4037,6 +4037,9 @@ long do_mmu_update(
break;
rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+ /* Paging structure may have changed. Flush linear range. */
+ if ( !rc )
+ flush_flags_all |= FLUSH_TLB;
break;

case PGT_l3_page_table:
@@ -4044,6 +4047,9 @@ long do_mmu_update(
break;
rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+ /* Paging structure may have changed. Flush linear range. */
+ if ( !rc )
+ flush_flags_all |= FLUSH_TLB;
break;

case PGT_l4_page_table:
@@ -4051,27 +4057,28 @@ long do_mmu_update(
break;
rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
- if ( !rc && pt_owner->arch.pv.xpti )
+ /* Paging structure may have changed. Flush linear range. */
+ if ( !rc )
{
- bool local_in_use = false;
+ bool local_in_use = mfn_eq(
+ pagetable_get_mfn(curr->arch.guest_table), mfn);

- if ( mfn_eq(pagetable_get_mfn(curr->arch.guest_table),
- mfn) )
- {
- local_in_use = true;
- get_cpu_info()->root_pgt_changed = true;
- }
+ flush_flags_all |= FLUSH_TLB;
+
+ if ( local_in_use )
+ flush_flags_local |= FLUSH_TLB | FLUSH_ROOT_PGTBL;

/*
* No need to sync if all uses of the page can be
* accounted to the page lock we hold, its pinned
* status, and uses on this (v)CPU.
*/
- if ( (page->u.inuse.type_info & PGT_count_mask) >
+ if ( pt_owner->arch.pv.xpti &&
+ (page->u.inuse.type_info & PGT_count_mask) >
(1 + !!(page->u.inuse.type_info & PGT_pinned) +
mfn_eq(pagetable_get_mfn(curr->arch.guest_table_user),
mfn) + local_in_use) )
- sync_guest = true;
+ flush_flags_all |= FLUSH_ROOT_PGTBL;
}
break;

@@ -4173,18 +4180,36 @@ long do_mmu_update(
if ( va )
unmap_domain_page(va);

- if ( sync_guest )
+ /*
+ * Flushing needs to occur for one of several reasons.
+ *
+ * 1) An update to an L2 or higher occurred. This potentially changes the
+ * pagetable structure, requiring a flush of the linear range.
+ * 2) An update to an L4 occurred, and XPTI is enabled. All CPUs running
+ * on a copy of this L4 need refreshing.
+ */
+ if ( flush_flags_all || flush_flags_local )
{
+ cpumask_t *mask = pt_owner->dirty_cpumask;
+
/*
- * Force other vCPU-s of the affected guest to pick up L4 entry
- * changes (if any).
+ * Local flushing may be asymmetric with remote. If there is local
+ * flushing to do, perform it separately and omit the current CPU from
+ * pt_owner->dirty_cpumask.
*/
- unsigned int cpu = smp_processor_id();
- cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
+ if ( flush_flags_local )
+ {
+ unsigned int cpu = smp_processor_id();
+
+ mask = per_cpu(scratch_cpumask, cpu);
+ cpumask_copy(mask, pt_owner->dirty_cpumask);
+ __cpumask_clear_cpu(cpu, mask);
+
+ flush_local(flush_flags_local);
+ }

- cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
if ( !cpumask_empty(mask) )
- flush_mask(mask, FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
+ flush_mask(mask, flush_flags_all);
}

perfc_add(num_page_updates, i);
--
2.11.0
Re: [PATCH] x86/pv: Flush TLB in response to paging structure changes
On 21.10.2020 15:07, Andrew Cooper wrote:
> @@ -4037,6 +4037,9 @@ long do_mmu_update(
> break;
> rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
> cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
> + /* Paging structure may have changed. Flush linear range. */
> + if ( !rc )
> + flush_flags_all |= FLUSH_TLB;
> break;
>
> case PGT_l3_page_table:
> @@ -4044,6 +4047,9 @@ long do_mmu_update(
> break;
> rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
> cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
> + /* Paging structure may have changed. Flush linear range. */
> + if ( !rc )
> + flush_flags_all |= FLUSH_TLB;
> break;
>
> case PGT_l4_page_table:
> @@ -4051,27 +4057,28 @@ long do_mmu_update(
> break;
> rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
> cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
> - if ( !rc && pt_owner->arch.pv.xpti )
> + /* Paging structure may have changed. Flush linear range. */
> + if ( !rc )
> {
> - bool local_in_use = false;
> + bool local_in_use = mfn_eq(
> + pagetable_get_mfn(curr->arch.guest_table), mfn);
>
> - if ( mfn_eq(pagetable_get_mfn(curr->arch.guest_table),
> - mfn) )
> - {
> - local_in_use = true;
> - get_cpu_info()->root_pgt_changed = true;
> - }
> + flush_flags_all |= FLUSH_TLB;
> +
> + if ( local_in_use )
> + flush_flags_local |= FLUSH_TLB | FLUSH_ROOT_PGTBL;

Aiui here (and in the code consuming the flags) you build upon
flush_flags_local, when not zero, always being a superset of
flush_flags_all. I think this is a trap to fall into when later
wanting to change this code, but as per below this won't hold
anymore anyway, I think. Hence here I think you want to set
FLUSH_TLB unconditionally, and above for L3 and L2 you want to
set it in both variables. Or, if I'm wrong below, a comment to
that effect may help people avoid falling into this trap.

An alternative would be to have

flush_flags_local |= (flush_flags_all & FLUSH_TLB);

before doing the actual flush.

> /*
> * No need to sync if all uses of the page can be
> * accounted to the page lock we hold, its pinned
> * status, and uses on this (v)CPU.
> */
> - if ( (page->u.inuse.type_info & PGT_count_mask) >
> + if ( pt_owner->arch.pv.xpti &&

I assume you've moved this here to avoid the further non-trivial
checks when possible, but you've not put it around the setting
of FLUSH_ROOT_PGTBL in flush_flags_local because setting
->root_pgt_changed is benign when !XPTI?

> + (page->u.inuse.type_info & PGT_count_mask) >
> (1 + !!(page->u.inuse.type_info & PGT_pinned) +
> mfn_eq(pagetable_get_mfn(curr->arch.guest_table_user),
> mfn) + local_in_use) )
> - sync_guest = true;
> + flush_flags_all |= FLUSH_ROOT_PGTBL;

This needs to also specify FLUSH_TLB_GLOBAL, or else original
XPTI behavior gets broken. Since the local CPU doesn't need this,
the variable may then better be named flush_flags_remote?

Or if I'm wrong here, the reasoning behind the dropping of the
global flush in this case needs putting in the description, not
the least because it would mean the change introducing it went
too far.

> @@ -4173,18 +4180,36 @@ long do_mmu_update(
> if ( va )
> unmap_domain_page(va);
>
> - if ( sync_guest )
> + /*
> + * Flushing needs to occur for one of several reasons.
> + *
> + * 1) An update to an L2 or higher occurred. This potentially changes the
> + * pagetable structure, requiring a flush of the linear range.
> + * 2) An update to an L4 occurred, and XPTI is enabled. All CPUs running
> + * on a copy of this L4 need refreshing.
> + */
> + if ( flush_flags_all || flush_flags_local )

Minor remark: At least in x86 code it is more efficient to use
| instead of || in such cases, to avoid relying on the compiler
to carry out this small optimization. It may well be that all
compilers we care about do so, but it's certainly not the case
for all the compilers I've ever worked with.

> {
> + cpumask_t *mask = pt_owner->dirty_cpumask;
> +
> /*
> - * Force other vCPU-s of the affected guest to pick up L4 entry
> - * changes (if any).
> + * Local flushing may be asymmetric with remote. If there is local
> + * flushing to do, perform it separately and omit the current CPU from
> + * pt_owner->dirty_cpumask.
> */
> - unsigned int cpu = smp_processor_id();
> - cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
> + if ( flush_flags_local )
> + {
> + unsigned int cpu = smp_processor_id();
> +
> + mask = per_cpu(scratch_cpumask, cpu);
> + cpumask_copy(mask, pt_owner->dirty_cpumask);
> + __cpumask_clear_cpu(cpu, mask);
> +
> + flush_local(flush_flags_local);
> + }
>
> - cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));

I understand you're of the opinion that cpumask_copy() +
__cpumask_clear_cpu() is more efficient than cpumask_andnot()?
(I guess I agree for high enough CPU counts.)

Jan
Re: [PATCH] x86/pv: Flush TLB in response to paging structure changes
On 21/10/2020 14:56, Jan Beulich wrote:
> On 21.10.2020 15:07, Andrew Cooper wrote:
>> @@ -4037,6 +4037,9 @@ long do_mmu_update(
>> break;
>> rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
>> cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>> + /* Paging structure may have changed. Flush linear range. */
>> + if ( !rc )
>> + flush_flags_all |= FLUSH_TLB;
>> break;
>>
>> case PGT_l3_page_table:
>> @@ -4044,6 +4047,9 @@ long do_mmu_update(
>> break;
>> rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
>> cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>> + /* Paging structure may have changed. Flush linear range. */
>> + if ( !rc )
>> + flush_flags_all |= FLUSH_TLB;
>> break;
>>
>> case PGT_l4_page_table:
>> @@ -4051,27 +4057,28 @@ long do_mmu_update(
>> break;
>> rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>> cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>> - if ( !rc && pt_owner->arch.pv.xpti )
>> + /* Paging structure may have changed. Flush linear range. */
>> + if ( !rc )
>> {
>> - bool local_in_use = false;
>> + bool local_in_use = mfn_eq(
>> + pagetable_get_mfn(curr->arch.guest_table), mfn);
>>
>> - if ( mfn_eq(pagetable_get_mfn(curr->arch.guest_table),
>> - mfn) )
>> - {
>> - local_in_use = true;
>> - get_cpu_info()->root_pgt_changed = true;
>> - }
>> + flush_flags_all |= FLUSH_TLB;
>> +
>> + if ( local_in_use )
>> + flush_flags_local |= FLUSH_TLB | FLUSH_ROOT_PGTBL;
> Aiui here (and in the code consuming the flags) you build upon
> flush_flags_local, when not zero, always being a superset of
> flush_flags_all. I think this is a trap to fall into when later
> wanting to change this code, but as per below this won't hold
> anymore anyway, I think. Hence here I think you want to set
> FLUSH_TLB unconditionally, and above for L3 and L2 you want to
> set it in both variables. Or, if I'm wrong below, a comment to
> that effect may help people avoid falling into this trap.
>
> An alternative would be to have
>
> flush_flags_local |= (flush_flags_all & FLUSH_TLB);
>
> before doing the actual flush.

Honestly, this is what I meant by stating that the asymmetry is a total
mess.

I originally named 'all' as 'remote', but that is even less accurate, as it
may still contain the current CPU.

Our matrix of complexity:

* FLUSH_TLB for L2+ structure changes
* FLUSH_TLB_GLOBAL/FLUSH_ROOT_PGTBL for XPTI

with optimisations to skip GLOBAL/ROOT_PGTBL on the local CPU if none of
the updates hit the L4-in-use, and to skip the remote if we hold all
references on the L4.

Everything is complicated because pt_owner may not be current, for
toolstack operations constructing a new domain.

>
>> /*
>> * No need to sync if all uses of the page can be
>> * accounted to the page lock we hold, its pinned
>> * status, and uses on this (v)CPU.
>> */
>> - if ( (page->u.inuse.type_info & PGT_count_mask) >
>> + if ( pt_owner->arch.pv.xpti &&
> I assume you've moved this here to avoid the further non-trivial
> checks when possible, but you've not put it around the setting
> of FLUSH_ROOT_PGTBL in flush_flags_local because setting
> ->root_pgt_changed is benign when !XPTI?

No - that was accidental, while attempting to reduce the diff.

>
>> + (page->u.inuse.type_info & PGT_count_mask) >
>> (1 + !!(page->u.inuse.type_info & PGT_pinned) +
>> mfn_eq(pagetable_get_mfn(curr->arch.guest_table_user),
>> mfn) + local_in_use) )
>> - sync_guest = true;
>> + flush_flags_all |= FLUSH_ROOT_PGTBL;
> This needs to also specify FLUSH_TLB_GLOBAL, or else original
> XPTI behavior gets broken. Since the local CPU doesn't need this,
> the variable may then better be named flush_flags_remote?

See above.  'remote' is even more confusing than 'all'.

>
> Or if I'm wrong here, the reasoning behind the dropping of the
> global flush in this case needs putting in the description, not
> the least because it would mean the change introducing it went
> too far.
>
>> @@ -4173,18 +4180,36 @@ long do_mmu_update(
>> if ( va )
>> unmap_domain_page(va);
>>
>> - if ( sync_guest )
>> + /*
>> + * Flushing needs to occur for one of several reasons.
>> + *
>> + * 1) An update to an L2 or higher occurred. This potentially changes the
>> + * pagetable structure, requiring a flush of the linear range.
>> + * 2) An update to an L4 occurred, and XPTI is enabled. All CPUs running
>> + * on a copy of this L4 need refreshing.
>> + */
>> + if ( flush_flags_all || flush_flags_local )
> Minor remark: At least in x86 code it is more efficient to use
> | instead of || in such cases, to avoid relying on the compiler
> to carry out this small optimization.

This transformation should not be recommended in general.  There are
cases, including this one, where it is at best a no-op, and at worst
wrong.

I don't care about people using ancient compilers.  They've got far
bigger (== more impactful) problems than (the absence of) this
transformation, and the TLB flush will dwarf anything the compiler does
here.

However, the hand-"optimised" case prevents a compiler from spotting that
the entire second clause is actually redundant for now.

I specifically didn't encode the dependency, to avoid subtle bugs
if/when someone alters the logic.

>
>> {
>> + cpumask_t *mask = pt_owner->dirty_cpumask;
>> +
>> /*
>> - * Force other vCPU-s of the affected guest to pick up L4 entry
>> - * changes (if any).
>> + * Local flushing may be asymmetric with remote. If there is local
>> + * flushing to do, perform it separately and omit the current CPU from
>> + * pt_owner->dirty_cpumask.
>> */
>> - unsigned int cpu = smp_processor_id();
>> - cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
>> + if ( flush_flags_local )
>> + {
>> + unsigned int cpu = smp_processor_id();
>> +
>> + mask = per_cpu(scratch_cpumask, cpu);
>> + cpumask_copy(mask, pt_owner->dirty_cpumask);
>> + __cpumask_clear_cpu(cpu, mask);
>> +
>> + flush_local(flush_flags_local);
>> + }
>>
>> - cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
> I understand you're of the opinion that cpumask_copy() +
> __cpumask_clear_cpu() is more efficient than cpumask_andnot()?
> (I guess I agree for high enough CPU counts.)

It's faster in all cases, even at low CPU counts.
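
For reference, the two forms being compared, both as they appear in this
thread's diffs:

    /* v1 (and the original code): single andnot into the scratch mask. */
    cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));

    /* v2: bulk copy, then clear the one bit in place. */
    cpumask_copy(mask, pt_owner->dirty_cpumask);
    __cpumask_clear_cpu(cpu, mask);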

~Andrew
Re: [PATCH] x86/pv: Flush TLB in response to paging structure changes
On 21/10/2020 16:39, Andrew Cooper wrote:
>>> @@ -4051,27 +4057,28 @@ long do_mmu_update(
>>> break;
>>> rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>>> cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>> - if ( !rc && pt_owner->arch.pv.xpti )
>>> + /* Paging structure may have changed. Flush linear range. */
>>> + if ( !rc )
>>> {
>>> - bool local_in_use = false;
>>> + bool local_in_use = mfn_eq(
>>> + pagetable_get_mfn(curr->arch.guest_table), mfn);
>>>
>>> - if ( mfn_eq(pagetable_get_mfn(curr->arch.guest_table),
>>> - mfn) )
>>> - {
>>> - local_in_use = true;
>>> - get_cpu_info()->root_pgt_changed = true;
>>> - }
>>> + flush_flags_all |= FLUSH_TLB;
>>> +
>>> + if ( local_in_use )
>>> + flush_flags_local |= FLUSH_TLB | FLUSH_ROOT_PGTBL;
>> Aiui here (and in the code consuming the flags) you build upon
>> flush_flags_local, when not zero, always being a superset of
>> flush_flags_all. I think this is a trap to fall into when later
>> wanting to change this code, but as per below this won't hold
>> anymore anyway, I think. Hence here I think you want to set
>> FLUSH_TLB unconditionally, and above for L3 and L2 you want to
>> set it in both variables. Or, if I'm wrong below, a comment to
>> that effect may help people avoid falling into this trap.
>>
>> An alternative would be to have
>>
>> flush_flags_local |= (flush_flags_all & FLUSH_TLB);
>>
>> before doing the actual flush.

Also, something I forgot to say in the previous reply: this is still buggy
if, hypothetically speaking, FLUSH_CACHE had managed to be accumulated in
flush_flags_all.

You cannot have general accumulation logic, a special case for local,
and any catch-all logic like that, because the only correct way to do
the catch-all logic will clobber the special case for local.

~Andrew
Re: [PATCH] x86/pv: Flush TLB in response to paging structure changes
On 21/10/2020 16:55, Andrew Cooper wrote:
> On 21/10/2020 16:39, Andrew Cooper wrote:
>>>> @@ -4051,27 +4057,28 @@ long do_mmu_update(
>>>> break;
>>>> rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>>>> cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>>> - if ( !rc && pt_owner->arch.pv.xpti )
>>>> + /* Paging structure may have changed. Flush linear range. */
>>>> + if ( !rc )
>>>> {
>>>> - bool local_in_use = false;
>>>> + bool local_in_use = mfn_eq(
>>>> + pagetable_get_mfn(curr->arch.guest_table), mfn);
>>>>
>>>> - if ( mfn_eq(pagetable_get_mfn(curr->arch.guest_table),
>>>> - mfn) )
>>>> - {
>>>> - local_in_use = true;
>>>> - get_cpu_info()->root_pgt_changed = true;
>>>> - }
>>>> + flush_flags_all |= FLUSH_TLB;
>>>> +
>>>> + if ( local_in_use )
>>>> + flush_flags_local |= FLUSH_TLB | FLUSH_ROOT_PGTBL;
>>> Aiui here (and in the code consuming the flags) you build upon
>>> flush_flags_local, when not zero, always being a superset of
>>> flush_flags_all. I think this is a trap to fall into when later
>>> wanting to change this code, but as per below this won't hold
>>> anymore anyway, I think. Hence here I think you want to set
>>> FLUSH_TLB unconditionally, and above for L3 and L2 you want to
>>> set it in both variables. Or, if I'm wrong below, a comment to
>>> that effect may help people avoid falling into this trap.
>>>
>>> An alternative would be to have
>>>
>>> flush_flags_local |= (flush_flags_all & FLUSH_TLB);
>>>
>>> before doing the actual flush.
> Also, something I forgot to say in the previous reply: this is still buggy
> if, hypothetically speaking, FLUSH_CACHE had managed to be accumulated in
> flush_flags_all.
>
> You cannot have general accumulation logic, a special case for local,
> and any catch-all logic like that, because the only correct way to do
> the catch-all logic will clobber the special case for local.

I'm going to try a 3rd time with flush_flags and local_may_skip which
defaults to GLOBAL|ROOT_PGTBL, and may get clobbered.

This seems like it might be a less fragile way of expressing the
optimisation.
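
As a sketch, that shape might read as follows (hypothetical; no such v3
appears in this thread, and remote_mask is illustrative):

    unsigned int flush_flags = 0;
    unsigned int local_may_skip = FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL;

    /* ... accumulate flush_flags per update, clearing bits out of
     * local_may_skip whenever a change is seen to affect this CPU ... */

    flush_local(flush_flags & ~local_may_skip);
    if ( !cpumask_empty(remote_mask) )     /* dirty mask minus self */
        flush_mask(remote_mask, flush_flags);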

~Andrew
Re: [PATCH] x86/pv: Flush TLB in response to paging structure changes
On 21.10.2020 17:39, Andrew Cooper wrote:
> On 21/10/2020 14:56, Jan Beulich wrote:
>> On 21.10.2020 15:07, Andrew Cooper wrote:
>>> @@ -4037,6 +4037,9 @@ long do_mmu_update(
>>> break;
>>> rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
>>> cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>> + /* Paging structure may have changed. Flush linear range. */
>>> + if ( !rc )
>>> + flush_flags_all |= FLUSH_TLB;
>>> break;
>>>
>>> case PGT_l3_page_table:
>>> @@ -4044,6 +4047,9 @@ long do_mmu_update(
>>> break;
>>> rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
>>> cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>> + /* Paging structure may have changed. Flush linear range. */
>>> + if ( !rc )
>>> + flush_flags_all |= FLUSH_TLB;
>>> break;
>>>
>>> case PGT_l4_page_table:
>>> @@ -4051,27 +4057,28 @@ long do_mmu_update(
>>> break;
>>> rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
>>> cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
>>> - if ( !rc && pt_owner->arch.pv.xpti )
>>> + /* Paging structure may have changed. Flush linear range. */
>>> + if ( !rc )
>>> {
>>> - bool local_in_use = false;
>>> + bool local_in_use = mfn_eq(
>>> + pagetable_get_mfn(curr->arch.guest_table), mfn);
>>>
>>> - if ( mfn_eq(pagetable_get_mfn(curr->arch.guest_table),
>>> - mfn) )
>>> - {
>>> - local_in_use = true;
>>> - get_cpu_info()->root_pgt_changed = true;
>>> - }
>>> + flush_flags_all |= FLUSH_TLB;
>>> +
>>> + if ( local_in_use )
>>> + flush_flags_local |= FLUSH_TLB | FLUSH_ROOT_PGTBL;
>> Aiui here (and in the code consuming the flags) you build upon
>> flush_flags_local, when not zero, always being a superset of
>> flush_flags_all. I think this is a trap to fall into when later
>> wanting to change this code, but as per below this won't hold
>> anymore anyway, I think. Hence here I think you want to set
>> FLUSH_TLB unconditionally, and above for L3 and L2 you want to
>> set it in both variables. Or, if I'm wrong below, a comment to
>> that effect may help people avoid falling into this trap.
>>
>> An alternative would be to have
>>
>> flush_flags_local |= (flush_flags_all & FLUSH_TLB);
>>
>> before doing the actual flush.
>
> Honestly, this is what I meant by stating that the asymmetry is a total
> mess.
>
> I originally named 'all' as 'remote', but that is even less accurate, as it
> may still contain the current CPU.
>
> Our matrix of complexity:
>
> * FLUSH_TLB for L2+ structure changes
> * FLUSH_TLB_GLOBAL/FLUSH_ROOT_PGTBL for XPTI
>
> with optimisations to skip GLOBAL/ROOT_PGTBL on the local CPU if none of
> the updates hit the L4-in-use, and to skip the remote if we hold all
> references on the L4.
>
> Everything is complicated because pt_owner may not be current, for
> toolstack operations constructing a new domain.

Which is a case where I wonder why we flush in the first place.
The L4 under construction can't be in use yet, and hence
updates to it shouldn't need syncing. Otoh
pt_owner->dirty_cpumask obviously is empty at that point, i.e.
special casing this likely isn't really worth it.

>>> @@ -4173,18 +4180,36 @@ long do_mmu_update(
>>> if ( va )
>>> unmap_domain_page(va);
>>>
>>> - if ( sync_guest )
>>> + /*
>>> + * Flushing needs to occur for one of several reasons.
>>> + *
>>> + * 1) An update to an L2 or higher occurred. This potentially changes the
>>> + * pagetable structure, requiring a flush of the linear range.
>>> + * 2) An update to an L4 occurred, and XPTI is enabled. All CPUs running
>>> + * on a copy of this L4 need refreshing.
>>> + */
>>> + if ( flush_flags_all || flush_flags_local )
>> Minor remark: At least in x86 code it is more efficient to use
>> | instead of || in such cases, to avoid relying on the compiler
>> to carry out this small optimization.
>
> This transformation should not be recommended in general.  There are
> cases, including this one, where it is at best a no-op, and at worst
> wrong.
>
> I don't care about people using ancient compilers.  They've got far
> bigger (== more impactful) problems than (the absence of) this
> transformation, and the TLB flush will dwarf anything the compiler does
> here.
>
> However, the hand-"optimised" case prevents a compiler from spotting that
> the entire second clause is actually redundant for now.

Oh, you mean non-zero flush_flags_local implying non-zero
flush_flags_all? Not sure why a compiler recognizing this shouldn't
be able to optimize away | when it is able to optimize away || in
this case. Anyway - minor point, as said.

> I specifically didn't encode the dependency, to avoid subtle bugs
> if/when someone alters the logic.

Good point, but let me point out then that you encoded another
subtle dependency, which I didn't spell out completely before
because with the suggested adjustments it disappears: You may not
omit FLUSH_TLB from flush_flags_local when !local_in_use, as the L4
being changed may be one recursively hanging off of the L4 we're
running on.

>>> {
>>> + cpumask_t *mask = pt_owner->dirty_cpumask;
>>> +
>>> /*
>>> - * Force other vCPU-s of the affected guest to pick up L4 entry
>>> - * changes (if any).
>>> + * Local flushing may be asymmetric with remote. If there is local
>>> + * flushing to do, perform it separately and omit the current CPU from
>>> + * pt_owner->dirty_cpumask.
>>> */
>>> - unsigned int cpu = smp_processor_id();
>>> - cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
>>> + if ( flush_flags_local )
>>> + {
>>> + unsigned int cpu = smp_processor_id();
>>> +
>>> + mask = per_cpu(scratch_cpumask, cpu);
>>> + cpumask_copy(mask, pt_owner->dirty_cpumask);
>>> + __cpumask_clear_cpu(cpu, mask);
>>> +
>>> + flush_local(flush_flags_local);
>>> + }
>>>
>>> - cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
>> I understand you're of the opinion that cpumask_copy() +
>> __cpumask_clear_cpu() is more efficient than cpumask_andnot()?
>> (I guess I agree for high enough CPU counts.)
>
> It's faster in all cases, even at low CPU counts.

Is it? Data for BTR with a memory operand is available in the ORM only
for some Atom CPUs, and its latency and throughput aren't very good
there. Anyway - again a minor aspect, but I'll keep this in mind for
future code I write, as so far I've preferred the "andnot" form in
such cases.

Jan