Mailing List Archive

[xen stable-4.9] x86/mm: Always retain a general ref on partial
commit d3c4b609fafdf440ea159c13f2633693d320f39b
Author: George Dunlap <george.dunlap@citrix.com>
AuthorDate: Mon Nov 4 15:06:12 2019 +0100
Commit: Jan Beulich <jbeulich@suse.com>
CommitDate: Mon Nov 4 15:06:12 2019 +0100

x86/mm: Always retain a general ref on partial

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted. This is stored in two elements in the page struct:
nr_entries_validated and partial_flags.

The rule is that entries [.0, nr_entries_validated) should always be
validated and hold a general reference count. If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held. If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.

At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.

PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held. Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.

Unfortunately, because a refcount is not held, it is possible to
engineer a situation where PFT_partial_set is set but the page in
question has been assigned to another domain. A sketch is provided in
the appendix.

Fix this by having the parent page table entry hold a general
reference count whenever PFT_partial_set is set. (For clarity of
change, keep two separate flags. These will be collapsed in a
subsequent changeset.)

This has two basic implications. On the put_page_from_lNe() side,
this mean that the (partial_set && !partial_ref) case can never happen,
and no longer needs to be special-cased.

Secondly, because both flags are set together, there's no need to carry over
existing bits from partial_pte.

(NB there is still another issue with calling _put_page_type() on a
page which had PGT_partial set; that will be handled in a subsequent
patch.)

On the get_page_and_type_from_mfn() side, we need to distinguish
between callers which hold a reference on partial (i.e.,
alloc_lN_table()), and those which do not (new_cr3, PIN_LN_TABLE, and
so on): pass a flag if the type should be retained on interruption.

NB that since l1 promotion can't be preempted, that get_page_from_l2e
can't return -ERESTART.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
* Appendix: Engineering PTF_partial_set while a page belongs to a
foreign domain

Suppose A is a page which can be promoted to an l3, and B is a page
which can be promoted to an l2, and A[x] points to B. B has
PGC_allocated set but no other general references.

V1: PIN_L3 A.
A is validated, B is validated.
A.type_count = 1 | PGT_validated | PGT_pinned
B.type_count = 1 | PGT_validated
B.count = 2 | PGC_allocated (A[x] holds a general ref)

V1: UNPIN A.
A begins de-validation.
Arrange to be interrupted when i < x
V1->old_guest_table = A
V1->old_guest_table_ref_held = false
A.type_count = 1 | PGT_partial
A.nr_validated_entries = i < x
B.type_count = 0
B.count = 1 | PGC_allocated

V2: MOD_L4_ENTRY to point some l4e to A.
Picks up re-validation of A.
Arrange to be interrupted halfway through B's validation
B.type_count = 1 | PGT_partial
B.count = 2 | PGC_allocated (PGT_partial holds a general ref)
A.type_count = 1 | PGT_partial
A.nr_validated_entries = x
A.partial_pte = PTF_partial_set

V3: MOD_L3_ENTRY to point some other l3e (not in A) to B.
Validates B.
B.type_count = 1 | PGT_validated
B.count = 2 | PGC_allocated ("other l3e" holds a general ref)

V3: MOD_L3_ENTRY to clear l3e pointing to B.
Devalidates B.
B.type_count = 0
B.count = 1 | PGC_allocated

V3: decrease_reservation(B)
Clears PGC_allocated
B.count = 0 => B is freed

B gets assigned to a different domain

V1: Restarts UNPIN of A
put_old_guest_table(A)
...
free_l3_table(A)

Now since A.partial_flags has PTF_partial_set, free_l3_table() will
call put_page_from_l3e() on A[x], which points to B, while B is owned
by another domain.

If A[x] held a general refcount for B on partial validation, as it does
for partial de-validation, then B would still have a reference count of
1 after PGC_allocated was freed; so B wouldn't be freed until after
put_page_from_l3e() had happend on A[x].
master commit: 18b0ab697830a46ce3dacaf9210799322cb3732c
master date: 2019-10-31 16:14:36 +0100
---
xen/arch/x86/mm.c | 87 +++++++++++++++++++++++++++++-------------------
xen/include/asm-x86/mm.h | 15 +++++----
2 files changed, 61 insertions(+), 41 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index fce15f30ca..0ac8c4592d 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -777,10 +777,11 @@ static int __get_page_type(struct page_info *page, unsigned long type,
* page->pte[page->nr_validated_entries]. See the comment in mm.h for
* more information.
*/
-#define PTF_partial_set (1 << 0)
-#define PTF_partial_general_ref (1 << 1)
-#define PTF_preemptible (1 << 2)
-#define PTF_defer (1 << 3)
+#define PTF_partial_set (1 << 0)
+#define PTF_partial_general_ref (1 << 1)
+#define PTF_preemptible (1 << 2)
+#define PTF_defer (1 << 3)
+#define PTF_retain_ref_on_restart (1 << 4)

static int get_page_and_type_from_pagenr(unsigned long page_nr,
unsigned long type,
@@ -790,7 +791,11 @@ static int get_page_and_type_from_pagenr(unsigned long page_nr,
struct page_info *page = mfn_to_page(page_nr);
int rc;
bool preemptible = flags & PTF_preemptible,
- partial_ref = flags & PTF_partial_general_ref;
+ partial_ref = flags & PTF_partial_general_ref,
+ partial_set = flags & PTF_partial_set,
+ retain_ref = flags & PTF_retain_ref_on_restart;
+
+ ASSERT(partial_ref == partial_set);

if ( likely(!partial_ref) &&
unlikely(!get_page_from_pagenr(page_nr, d)) )
@@ -803,13 +808,15 @@ static int get_page_and_type_from_pagenr(unsigned long page_nr,
* - page is fully validated (rc == 0)
* - page is not validated (rc < 0) but:
* - We came in with a reference (partial_ref)
+ * - page is partially validated (rc == -ERESTART), and the
+ * caller has asked the ref to be retained in that case
* - page is partially validated but there's been an error
* (page == current->arch.old_guest_table)
*
* The partial_ref-on-error clause is worth an explanation. There
* are two scenarios where partial_ref might be true coming in:
- * - mfn has been partially demoted as type `type`; i.e. has
- * PGT_partial set
+ * - mfn has been partially promoted / demoted as type `type`;
+ * i.e. has PGT_partial set
* - mfn has been partially demoted as L(type+1) (i.e., a linear
* page; e.g. we're being called from get_page_from_l2e with
* type == PGT_l1_table, but the mfn is PGT_l2_table)
@@ -832,7 +839,8 @@ static int get_page_and_type_from_pagenr(unsigned long page_nr,
*/
if ( likely(!rc) || partial_ref )
/* nothing */;
- else if ( page == current->arch.old_guest_table )
+ else if ( page == current->arch.old_guest_table ||
+ (retain_ref && rc == -ERESTART) )
ASSERT(preemptible);
else
put_page(page);
@@ -1535,8 +1543,8 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
PTF_partial_set )
{
- ASSERT(!(flags & PTF_defer));
- rc = _put_page_type(pg, PTF_preemptible, ptpg);
+ /* partial_set should always imply partial_ref */
+ BUG();
}
else if ( flags & PTF_defer )
{
@@ -1581,8 +1589,8 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
PTF_partial_set )
{
- ASSERT(!(flags & PTF_defer));
- return _put_page_type(pg, PTF_preemptible, mfn_to_page(pfn));
+ /* partial_set should always imply partial_ref */
+ BUG();
}

if ( flags & PTF_defer )
@@ -1612,8 +1620,8 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
PTF_partial_set )
{
- ASSERT(!(flags & PTF_defer));
- return _put_page_type(pg, PTF_preemptible, mfn_to_page(pfn));
+ /* partial_set should always imply partial_ref */
+ BUG();
}

if ( flags & PTF_defer )
@@ -1740,13 +1748,22 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
(rc = get_page_from_l2e(pl2e[i], pfn, d, partial_flags)) > 0 )
continue;

- if ( rc == -ERESTART )
- {
- page->nr_validated_ptes = i;
- /* Set 'set', retain 'general ref' */
- page->partial_flags = partial_flags | PTF_partial_set;
- }
- else if ( rc == -EINTR && i )
+ /*
+ * It shouldn't be possible for get_page_from_l2e to return
+ * -ERESTART, since we never call this with PTF_preemptible.
+ * (alloc_l1_table may return -EINTR on an L1TF-vulnerable
+ * entry.)
+ *
+ * NB that while on a "clean" promotion, we can never get
+ * PGT_partial. It is possible to arrange for an l2e to
+ * contain a partially-devalidated l2; but in that case, both
+ * of the following functions will fail anyway (the first
+ * because the page in question is not an l1; the second
+ * because the page is not fully validated).
+ */
+ ASSERT(rc != -ERESTART);
+
+ if ( rc == -EINTR && i )
{
page->nr_validated_ptes = i;
page->partial_flags = 0;
@@ -1755,6 +1772,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
else if ( rc < 0 && rc != -EINTR )
{
gdprintk(XENLOG_WARNING, "Failure in alloc_l2_table: slot %#x\n", i);
+ ASSERT(current->arch.old_guest_table == NULL);
if ( i )
{
page->nr_validated_ptes = i;
@@ -1818,17 +1836,21 @@ static int alloc_l3_table(struct page_info *page)
PGT_l2_page_table |
PGT_pae_xen_l2,
d,
- partial_flags | PTF_preemptible);
+ partial_flags |
+ PTF_preemptible |
+ PTF_retain_ref_on_restart);
}
else if ( !is_guest_l3_slot(i) ||
- (rc = get_page_from_l3e(pl3e[i], pfn, d, partial_flags)) > 0 )
+ (rc = get_page_from_l3e(pl3e[i], pfn, d,
+ partial_flags |
+ PTF_retain_ref_on_restart)) > 0 )
continue;

if ( rc == -ERESTART )
{
page->nr_validated_ptes = i;
/* Set 'set', leave 'general ref' set if this entry was set */
- page->partial_flags = partial_flags | PTF_partial_set;
+ page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
}
else if ( rc == -EINTR && i )
{
@@ -1923,14 +1945,15 @@ static int alloc_l4_table(struct page_info *page)
i++, partial_flags = 0 )
{
if ( !is_guest_l4_slot(d, i) ||
- (rc = get_page_from_l4e(pl4e[i], pfn, d, partial_flags)) > 0 )
+ (rc = get_page_from_l4e(pl4e[i], pfn, d,
+ partial_flags | PTF_retain_ref_on_restart)) > 0 )
continue;

if ( rc == -ERESTART )
{
page->nr_validated_ptes = i;
/* Set 'set', leave 'general ref' set if this entry was set */
- page->partial_flags = partial_flags | PTF_partial_set;
+ page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
}
else if ( rc < 0 )
{
@@ -2029,9 +2052,7 @@ static int free_l2_table(struct page_info *page)
else if ( rc == -ERESTART )
{
page->nr_validated_ptes = i;
- page->partial_flags = (partial_flags & PTF_partial_set) ?
- partial_flags :
- (PTF_partial_set | PTF_partial_general_ref);
+ page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
}
else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
{
@@ -2082,9 +2103,7 @@ static int free_l3_table(struct page_info *page)
if ( rc == -ERESTART )
{
page->nr_validated_ptes = i;
- page->partial_flags = (partial_flags & PTF_partial_set) ?
- partial_flags :
- (PTF_partial_set | PTF_partial_general_ref);
+ page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
}
else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
{
@@ -2115,9 +2134,7 @@ static int free_l4_table(struct page_info *page)
if ( rc == -ERESTART )
{
page->nr_validated_ptes = i;
- page->partial_flags = (partial_flags & PTF_partial_set) ?
- partial_flags :
- (PTF_partial_set | PTF_partial_general_ref);
+ page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
}
else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
{
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 0bf5b60ba8..6ab642eb18 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -150,22 +150,25 @@ struct page_info
* page.
*
* This happens:
- * - During de-validation, if de-validation of the page was
+ * - During validation or de-validation, if the operation was
* interrupted
* - During validation, if an invalid entry is encountered and
* validation is preemptible
* - During validation, if PTF_partial_general_ref was set on
- * this entry to begin with (perhaps because we're picking
- * up from a partial de-validation).
+ * this entry to begin with (perhaps because it picked up a
+ * previous operation)
*
- * When resuming validation, if PTF_partial_general_ref is clear,
- * then a general reference must be re-acquired; if it is set, no
- * reference should be acquired.
+ * When resuming validation, if PTF_partial_general_ref is
+ * clear, then a general reference must be re-acquired; if it
+ * is set, no reference should be acquired.
*
* When resuming de-validation, if PTF_partial_general_ref is
* clear, no reference should be dropped; if it is set, a
* reference should be dropped.
*
+ * NB at the moment, PTF_partial_set should be set if and only if
+ * PTF_partial_general_ref is set.
+ *
* NB that PTF_partial_set and PTF_partial_general_ref are
* defined in mm.c, the only place where they are used.
*
--
generated by git-patchbot for /home/xen/git/xen.git#stable-4.9

_______________________________________________
Xen-changelog mailing list
Xen-changelog@lists.xenproject.org
https://lists.xenproject.org/xen-changelog