Mailing List Archive

RFC Patch for the "x86-64, mm: Put early page table high"
Without a fix for this, 2.6.39-rc5 fails during bootup.

It fails so early that all you see is Xen telling you that the domain crashed.

There is one patch to fix this, and the last word was here:
http://marc.info/?i=4DAF0ECB.8060009@goop.org

But after that, nothing more was heard from Peter.

So I decided to look at this from a different perspective: why not
contain the workaround inside the Xen early bootup code?

I've tested this code as DomU (1G, 2G, 3G) and Dom0 (8G, 4G, 1000M),
and they all booted fine. I haven't compile-tested it on 32-bit, and
I think it will actually complain. Let me fix that.

See the attached patch (also present in stable/bug-fixes-for-rc5), which also
has the "xen: mask_rw_pte mark RO all pagetable pages up to pgt_buf_top"
change integrated.

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index aef7af9..a54c7c2 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1463,6 +1463,115 @@ static int xen_pgd_alloc(struct mm_struct *mm)
return ret;
}

+#ifdef CONFIG_X86_64
+static __initdata u64 __last_pgt_set_rw = 0;
+static __initdata u64 __pgt_buf_start = 0;
+static __initdata u64 __pgt_buf_end = 0;
+static __initdata u64 __pgt_buf_top = 0;
+/*
+ * As a consequence of the commit:
+ *
+ * commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
+ * Author: Yinghai Lu <yinghai@kernel.org>
+ * Date: Fri Dec 17 16:58:28 2010 -0800
+ *
+ * x86-64, mm: Put early page table high
+ *
+ * at some point init_memory_mapping is going to reach the pagetable pages
+ * area and map those pages too (mapping them as normal memory that falls
+ * in the range of addresses passed to init_memory_mapping as argument).
+ * Some of those pages are already pagetable pages (they are in the range
+ * pgt_buf_start-pgt_buf_end) therefore they are going to be mapped RO and
+ * everything is fine.
+ * Some of these pages are not pagetable pages yet (they fall in the range
+ * pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
+ * are going to be mapped RW. When these pages become pagetable pages and
+ * are hooked into the pagetable, xen will find that the guest has already
+ * a RW mapping of them somewhere and fail the operation.
+ * The reason Xen requires pagetables to be RO is that the hypervisor needs
+ * to verify that the pagetables are valid before using them. The validation
+ * operations are called "pinning" (more details in arch/x86/xen/mmu.c).
+ *
+ * In order to fix the issue we mark all the pages in the entire range
+ * pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation
+ * is completed only the range pgt_buf_start-pgt_buf_end is reserved by
+ * init_memory_mapping. Hence the kernel is going to crash as soon as one
+ * of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those
+ * ranges are RO).
+ *
+ * For this reason, this function is introduced which is called _after_
+ * the init_memory_mapping has completed (in a perfect world we would
+ * call this function from init_memory_mapping, but lets ignore that).
+ *
+ * Because we are called _after_ init_memory_mapping the pgt_buf_[start,
+ * end,top] have all changed to new values (b/c another init_memory_mapping
+ * is called). Hence, the first time we enter this function, we save
+ * away the pgt_buf_start value and update the pgt_buf_[end,top].
+ *
+ * When we detect that the "old" pgt_buf_start through pgt_buf_end
+ * PFNs have been reserved (so memblock_x86_reserve_range has been called),
+ * we immediately set out to RW the "old" pgt_buf_end through pgt_buf_top.
+ *
+ * And then we update those "old" pgt_buf_[end|top] with the new ones
+ * so that we can redo this on the next pagetable.
+ */
+static __init void mark_rw_past_pgt(unsigned long pfn) {
+
+ if (pfn && pgt_buf_end > pgt_buf_start) {
+ u64 addr, size;
+
+ /* Save it away. */
+ if (!__pgt_buf_start) {
+ __pgt_buf_start = pgt_buf_start;
+ __pgt_buf_end = pgt_buf_end;
+ __pgt_buf_top = pgt_buf_top;
+ return;
+ }
+ /* If we get the range that starts at __pgt_buf_end that means
+ * the range is reserved, and that in 'init_memory_mapping'
+ * the 'memblock_x86_reserve_range' has been called with the
+ * outdated __pgt_buf_start, __pgt_buf_end (the "new"
+ * pgt_buf_[start|end|top] refer now to a new pagetable.
+ * Note: we are called _after_ the pgt_buf_[..] have been
+ * updated.*/
+
+ addr = memblock_x86_find_in_range_size(PFN_PHYS(__pgt_buf_start), &size, PAGE_SIZE);
+
+ /* Still not reserved, meaning 'memblock_x86_reserve_range'
+ * hasn't been called yet. Update the _end and _top.*/
+ if (addr == PFN_PHYS(__pgt_buf_start)) {
+ __pgt_buf_end = pgt_buf_end;
+ __pgt_buf_top = pgt_buf_top;
+ return;
+ }
+
+ /* OK, the area is reserved, meaning it is time for us to
+ * set RW for the old end->top PFNs. */
+
+ /* ..unless we had already done this. */
+ if (__pgt_buf_end == __last_pgt_set_rw)
+ return;
+
+ addr = PFN_PHYS(__pgt_buf_end);
+
+ /* set as RW the rest */
+ printk(KERN_DEBUG "xen: setting RW the range %llx - %llx\n",
+ PFN_PHYS(__pgt_buf_end), PFN_PHYS(__pgt_buf_top));
+
+ while (addr < PFN_PHYS(__pgt_buf_top)) {
+ make_lowmem_page_readwrite(__va(addr));
+ addr += PAGE_SIZE;
+ }
+ /* And update everything so that we are ready for the next
+ * pagetable (the one created for regions past 4GB) */
+ __last_pgt_set_rw = __pgt_buf_end;
+ __pgt_buf_start = pgt_buf_start;
+ __pgt_buf_end = pgt_buf_end;
+ __pgt_buf_top = pgt_buf_top;
+ }
+ return;
+}
+#endif
static void xen_pgd_free(struct mm_struct *mm, pgd_t *pgd)
{
#ifdef CONFIG_X86_64
@@ -1488,6 +1597,7 @@ static __init pte_t mask_rw_pte(pte_t *ptep, pte_t pte)
{
unsigned long pfn = pte_pfn(pte);

+ mark_rw_past_pgt(pfn);
/*
* If the new pfn is within the range of the newly allocated
* kernel pagetable, and it isn't being mapped into an
@@ -1495,7 +1605,7 @@ static __init pte_t mask_rw_pte(pte_t *ptep, pte_t pte)
* it is RO.
*/
if (((!is_early_ioremap_ptep(ptep) &&
- pfn >= pgt_buf_start && pfn < pgt_buf_end)) ||
+ pfn >= pgt_buf_start && pfn < pgt_buf_top)) ||
(is_early_ioremap_ptep(ptep) && pfn != (pgt_buf_end - 1)))
pte = pte_wrprotect(pte);

@@ -1997,6 +2107,8 @@ __init void xen_ident_map_ISA(void)

static __init void xen_post_allocator_init(void)
{
+ mark_rw_past_pgt(1);
+
#ifdef CONFIG_XEN_DEBUG
pv_mmu_ops.make_pte = PV_CALLEE_SAVE(xen_make_pte_debug);
#endif

Re: RFC Patch for the "x86-64, mm: Put early page table high"
On 04/29/2011 08:46 AM, Konrad Rzeszutek Wilk wrote:
> Without a fix for this, 2.6.39-rc5 fails during bootup.
>
> It fails so early that all you see is Xen telling you that the domain crashed.
>
> There is one patch to fix this, and the last word was here:
> http://marc.info/?i=4DAF0ECB.8060009@goop.org
>
> But after that, nothing more was heard from Peter.
>
> So I decided to look at this from a different perspective: why not
> contain the workaround inside the Xen early bootup code?
>
> I've tested this code as DomU (1G, 2G, 3G) and Dom0 (8G, 4G, 1000M),
> and they all booted fine. I haven't compile-tested it on 32-bit, and
> I think it will actually complain. Let me fix that.
>
> See the attached patch (also present in stable/bug-fixes-for-rc5), which also
> has the "xen: mask_rw_pte mark RO all pagetable pages up to pgt_buf_top"
> change integrated.

Well, if we can hide the fix away in our code, then that has obvious
advantages. But I worry that this change is pretty closely dependent on
how the other code works, and would be fragile in the face of further
changes to that code (esp since there's no obvious coupling between the
two, so anyone changing the arch/x86 code wouldn't have any clues to
look at the corresponding Xen code).

> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index aef7af9..a54c7c2 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -1463,6 +1463,115 @@ static int xen_pgd_alloc(struct mm_struct *mm)
> return ret;
> }
>
> +#ifdef CONFIG_X86_64
> +static __initdata u64 __last_pgt_set_rw = 0;
> +static __initdata u64 __pgt_buf_start = 0;
> +static __initdata u64 __pgt_buf_end = 0;
> +static __initdata u64 __pgt_buf_top = 0;
> +/*
> + * As a consequence of the commit:
> + *
> + * commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> + * Author: Yinghai Lu <yinghai@kernel.org>
> + * Date: Fri Dec 17 16:58:28 2010 -0800
> + *
> + * x86-64, mm: Put early page table high
> + *
> + * at some point init_memory_mapping is going to reach the pagetable pages
> + * area and map those pages too (mapping them as normal memory that falls
> + * in the range of addresses passed to init_memory_mapping as argument).
> + * Some of those pages are already pagetable pages (they are in the range
> + * pgt_buf_start-pgt_buf_end) therefore they are going to be mapped RO and
> + * everything is fine.
> + * Some of these pages are not pagetable pages yet (they fall in the range
> + * pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
> + * are going to be mapped RW. When these pages become pagetable pages and
> + * are hooked into the pagetable, xen will find that the guest has already
> + * a RW mapping of them somewhere and fail the operation.
> + * The reason Xen requires pagetables to be RO is that the hypervisor needs
> + * to verify that the pagetables are valid before using them. The validation
> + * operations are called "pinning" (more details in arch/x86/xen/mmu.c).
> + *
> + * In order to fix the issue we mark all the pages in the entire range
> + * pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation
> + * is completed only the range pgt_buf_start-pgt_buf_end is reserved by
> + * init_memory_mapping. Hence the kernel is going to crash as soon as one
> + * of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those
> + * ranges are RO).
> + *
> + * For this reason, this function is introduced which is called _after_

I think you should name "mark_rw_past_pgt" explicitly here, since "this" is
somewhat ambiguous.

> + * the init_memory_mapping has completed (in a perfect world we would
> + * call this function from init_memory_mapping, but lets ignore that).
> + *
> + * Because we are called _after_ init_memory_mapping the pgt_buf_[start,
> + * end,top] have all changed to new values (b/c another init_memory_mapping
> + * is called). Hence, the first time we enter this function, we save
> + * away the pgt_buf_start value and update the pgt_buf_[end,top].
> + *
> + * When we detect that the "old" pgt_buf_start through pgt_buf_end
> + * PFNs have been reserved (so memblock_x86_reserve_range has been called),
> + * we immediately set out to RW the "old" pgt_buf_end through pgt_buf_top.
> + *
> + * And then we update those "old" pgt_buf_[end|top] with the new ones
> + * so that we can redo this on the next pagetable.
> + */
> +static __init void mark_rw_past_pgt(unsigned long pfn) {
> +
> + if (pfn && pgt_buf_end > pgt_buf_start) {
> + u64 addr, size;
> +
> + /* Save it away. */
> + if (!__pgt_buf_start) {
> + __pgt_buf_start = pgt_buf_start;
> + __pgt_buf_end = pgt_buf_end;
> + __pgt_buf_top = pgt_buf_top;
> + return;
> + }
> + /* If we get the range that starts at __pgt_buf_end that means
> + * the range is reserved, and that in 'init_memory_mapping'
> + * the 'memblock_x86_reserve_range' has been called with the
> + * outdated __pgt_buf_start, __pgt_buf_end (the "new"
> + * pgt_buf_[start|end|top] refer now to a new pagetable.
> + * Note: we are called _after_ the pgt_buf_[..] have been
> + * updated.*/
> +
> + addr = memblock_x86_find_in_range_size(PFN_PHYS(__pgt_buf_start), &size, PAGE_SIZE);
> +
> + /* Still not reserved, meaning 'memblock_x86_reserve_range'
> + * hasn't been called yet. Update the _end and _top.*/
> + if (addr == PFN_PHYS(__pgt_buf_start)) {
> + __pgt_buf_end = pgt_buf_end;
> + __pgt_buf_top = pgt_buf_top;
> + return;
> + }
> +
> + /* OK, the area is reserved, meaning it is time for us to
> + * set RW for the old end->top PFNs. */
> +
> + /* ..unless we had already done this. */
> + if (__pgt_buf_end == __last_pgt_set_rw)
> + return;
> +
> + addr = PFN_PHYS(__pgt_buf_end);
> +
> + /* set as RW the rest */
> + printk(KERN_DEBUG "xen: setting RW the range %llx - %llx\n",
> + PFN_PHYS(__pgt_buf_end), PFN_PHYS(__pgt_buf_top));
> +
> + while (addr < PFN_PHYS(__pgt_buf_top)) {
> + make_lowmem_page_readwrite(__va(addr));
> + addr += PAGE_SIZE;
> + }
> + /* And update everything so that we are ready for the next
> + * pagetable (the one created for regions past 4GB) */
> + __last_pgt_set_rw = __pgt_buf_end;
> + __pgt_buf_start = pgt_buf_start;
> + __pgt_buf_end = pgt_buf_end;
> + __pgt_buf_top = pgt_buf_top;
> + }
> + return;
> +}
> +#endif
> static void xen_pgd_free(struct mm_struct *mm, pgd_t *pgd)
> {
> #ifdef CONFIG_X86_64
> @@ -1488,6 +1597,7 @@ static __init pte_t mask_rw_pte(pte_t *ptep, pte_t pte)
> {
> unsigned long pfn = pte_pfn(pte);
>
> + mark_rw_past_pgt(pfn);
> /*
> * If the new pfn is within the range of the newly allocated
> * kernel pagetable, and it isn't being mapped into an
> @@ -1495,7 +1605,7 @@ static __init pte_t mask_rw_pte(pte_t *ptep, pte_t pte)
> * it is RO.
> */
> if (((!is_early_ioremap_ptep(ptep) &&
> - pfn >= pgt_buf_start && pfn < pgt_buf_end)) ||
> + pfn >= pgt_buf_start && pfn < pgt_buf_top)) ||
> (is_early_ioremap_ptep(ptep) && pfn != (pgt_buf_end - 1)))
> pte = pte_wrprotect(pte);
>
> @@ -1997,6 +2107,8 @@ __init void xen_ident_map_ISA(void)
>
> static __init void xen_post_allocator_init(void)
> {
> + mark_rw_past_pgt(1);

Why '1'? Perhaps this should be a different function rather than
overloading a single one?

J

Re: RFC Patch for the "x86-64, mm: Put early page table high"
On Fri, Apr 29, 2011 at 01:58:33PM -0700, Jeremy Fitzhardinge wrote:
> On 04/29/2011 08:46 AM, Konrad Rzeszutek Wilk wrote:
> > Without a fix for this, 2.6.39-rc5 fails during bootup.
> >
> > It fails so early that all you see is Xen telling you that the domain crashed.
> >
> > There is one patch to fix this, and the last word was here:
> > http://marc.info/?i=4DAF0ECB.8060009@goop.org
> >
> > But after that, nothing more was heard from Peter.
> >
> > So I decided to look at this from a different perspective: why not
> > contain the workaround inside the Xen early bootup code?
> >
> > I've tested this code as DomU (1G, 2G, 3G) and Dom0 (8G, 4G, 1000M),
> > and they all booted fine. I haven't compile-tested it on 32-bit, and
> > I think it will actually complain. Let me fix that.
> >
> > See the attached patch (also present in stable/bug-fixes-for-rc5), which also
> > has the "xen: mask_rw_pte mark RO all pagetable pages up to pgt_buf_top"
> > change integrated.
>
> Well, if we can hide the fix away in our code, then that has obvious
> advantages. But I worry that this change is pretty closely dependent on
> how the other code works, and would be fragile in the face of further
> changes to that code (esp since there's no obvious coupling between the
> two, so anyone changing the arch/x86 code wouldn't have any clues to
> look at the corresponding Xen code).

True. I am hoping to remove this in 2.6.40 though...
>
> > diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> > index aef7af9..a54c7c2 100644
> > --- a/arch/x86/xen/mmu.c
> > +++ b/arch/x86/xen/mmu.c
> > @@ -1463,6 +1463,115 @@ static int xen_pgd_alloc(struct mm_struct *mm)
> > return ret;
> > }
> >
> > +#ifdef CONFIG_X86_64
> > +static __initdata u64 __last_pgt_set_rw = 0;
> > +static __initdata u64 __pgt_buf_start = 0;
> > +static __initdata u64 __pgt_buf_end = 0;
> > +static __initdata u64 __pgt_buf_top = 0;
> > +/*
> > + * As a consequence of the commit:
> > + *
> > + * commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
> > + * Author: Yinghai Lu <yinghai@kernel.org>
> > + * Date: Fri Dec 17 16:58:28 2010 -0800
> > + *
> > + * x86-64, mm: Put early page table high
> > + *
> > + * at some point init_memory_mapping is going to reach the pagetable pages
> > + * area and map those pages too (mapping them as normal memory that falls
> > + * in the range of addresses passed to init_memory_mapping as argument).
> > + * Some of those pages are already pagetable pages (they are in the range
> > + * pgt_buf_start-pgt_buf_end) therefore they are going to be mapped RO and
> > + * everything is fine.
> > + * Some of these pages are not pagetable pages yet (they fall in the range
> > + * pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
> > + * are going to be mapped RW. When these pages become pagetable pages and
> > + * are hooked into the pagetable, xen will find that the guest has already
> > + * a RW mapping of them somewhere and fail the operation.
> > + * The reason Xen requires pagetables to be RO is that the hypervisor needs
> > + * to verify that the pagetables are valid before using them. The validation
> > + * operations are called "pinning" (more details in arch/x86/xen/mmu.c).
> > + *
> > + * In order to fix the issue we mark all the pages in the entire range
> > + * pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation
> > + * is completed only the range pgt_buf_start-pgt_buf_end is reserved by
> > + * init_memory_mapping. Hence the kernel is going to crash as soon as one
> > + * of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those
> > + * ranges are RO).
> > + *
> > + * For this reason, this function is introduced which is called _after_
>
> I think you should name "mark_rw_past_pgt" explicitly here, since "this" is
> somewhat ambiguous.

<nods>
>
> > + * the init_memory_mapping has completed (in a perfect world we would
> > + * call this function from init_memory_mapping, but lets ignore that).
> > + *
> > + * Because we are called _after_ init_memory_mapping the pgt_buf_[start,
> > + * end,top] have all changed to new values (b/c another init_memory_mapping
> > + * is called). Hence, the first time we enter this function, we save
> > + * away the pgt_buf_start value and update the pgt_buf_[end,top].
> > + *
> > + * When we detect that the "old" pgt_buf_start through pgt_buf_end
> > + * PFNs have been reserved (so memblock_x86_reserve_range has been called),
> > + * we immediately set out to RW the "old" pgt_buf_end through pgt_buf_top.
> > + *
> > + * And then we update those "old" pgt_buf_[end|top] with the new ones
> > + * so that we can redo this on the next pagetable.
> > + */
> > +static __init void mark_rw_past_pgt(unsigned long pfn) {
> > +
> > + if (pfn && pgt_buf_end > pgt_buf_start) {
> > + u64 addr, size;
> > +
> > + /* Save it away. */
> > + if (!__pgt_buf_start) {
> > + __pgt_buf_start = pgt_buf_start;
> > + __pgt_buf_end = pgt_buf_end;
> > + __pgt_buf_top = pgt_buf_top;
> > + return;
> > + }
> > + /* If we get the range that starts at __pgt_buf_end that means
> > + * the range is reserved, and that in 'init_memory_mapping'
> > + * the 'memblock_x86_reserve_range' has been called with the
> > + * outdated __pgt_buf_start, __pgt_buf_end (the "new"
> > + * pgt_buf_[start|end|top] refer now to a new pagetable.
> > + * Note: we are called _after_ the pgt_buf_[..] have been
> > + * updated.*/
> > +
> > + addr = memblock_x86_find_in_range_size(PFN_PHYS(__pgt_buf_start), &size, PAGE_SIZE);
> > +
> > + /* Still not reserved, meaning 'memblock_x86_reserve_range'
> > + * hasn't been called yet. Update the _end and _top.*/
> > + if (addr == PFN_PHYS(__pgt_buf_start)) {
> > + __pgt_buf_end = pgt_buf_end;
> > + __pgt_buf_top = pgt_buf_top;
> > + return;
> > + }
> > +
> > + /* OK, the area is reserved, meaning it is time for us to
> > + * set RW for the old end->top PFNs. */
> > +
> > + /* ..unless we had already done this. */
> > + if (__pgt_buf_end == __last_pgt_set_rw)
> > + return;
> > +
> > + addr = PFN_PHYS(__pgt_buf_end);
> > +
> > + /* set as RW the rest */
> > + printk(KERN_DEBUG "xen: setting RW the range %llx - %llx\n",
> > + PFN_PHYS(__pgt_buf_end), PFN_PHYS(__pgt_buf_top));
> > +
> > + while (addr < PFN_PHYS(__pgt_buf_top)) {
> > + make_lowmem_page_readwrite(__va(addr));
> > + addr += PAGE_SIZE;
> > + }
> > + /* And update everything so that we are ready for the next
> > + * pagetable (the one created for regions past 4GB) */
> > + __last_pgt_set_rw = __pgt_buf_end;
> > + __pgt_buf_start = pgt_buf_start;
> > + __pgt_buf_end = pgt_buf_end;
> > + __pgt_buf_top = pgt_buf_top;
> > + }
> > + return;
> > +}
> > +#endif
> > static void xen_pgd_free(struct mm_struct *mm, pgd_t *pgd)
> > {
> > #ifdef CONFIG_X86_64
> > @@ -1488,6 +1597,7 @@ static __init pte_t mask_rw_pte(pte_t *ptep, pte_t pte)
> > {
> > unsigned long pfn = pte_pfn(pte);
> >
> > + mark_rw_past_pgt(pfn);
> > /*
> > * If the new pfn is within the range of the newly allocated
> > * kernel pagetable, and it isn't being mapped into an
> > @@ -1495,7 +1605,7 @@ static __init pte_t mask_rw_pte(pte_t *ptep, pte_t pte)
> > * it is RO.
> > */
> > if (((!is_early_ioremap_ptep(ptep) &&
> > - pfn >= pgt_buf_start && pfn < pgt_buf_end)) ||
> > + pfn >= pgt_buf_start && pfn < pgt_buf_top)) ||
> > (is_early_ioremap_ptep(ptep) && pfn != (pgt_buf_end - 1)))
> > pte = pte_wrprotect(pte);
> >
> > @@ -1997,6 +2107,8 @@ __init void xen_ident_map_ISA(void)
> >
> > static __init void xen_post_allocator_init(void)
> > {
> > + mark_rw_past_pgt(1);
>
> Why '1'? Perhaps this should be a different function rather than
> overloading a single one?

The 'mask_rw_pte' gets called quite often during startup, and a lot of the time
for pte(0), so that was an optimization to avoid doing the memblock search.

But that can just as well be expressed in 'mask_rw_pte' by

	if (pfn)
		mark_rw_past_pgt();

Let me do that.

>
> J

Re: Re: RFC Patch for the "x86-64, mm: Put early page table high"
On Fri, Apr 29, 2011 at 05:33:37PM -0400, Konrad Rzeszutek Wilk wrote:
> On Fri, Apr 29, 2011 at 01:58:33PM -0700, Jeremy Fitzhardinge wrote:
> > On 04/29/2011 08:46 AM, Konrad Rzeszutek Wilk wrote:
> > > Without a fix for this, 2.6.39-rc5 fails during bootup.
> > >
> > > It fails so early that all you see is Xen telling you that the domain crashed.
> > >
> > > There is one patch to fix this, and the last word was here:
> > > http://marc.info/?i=4DAF0ECB.8060009@goop.org
> > >
> > > But after that, nothing more was heard from Peter.
> > >
> > > So I decided to look at this from a different perspective: why not
> > > contain the workaround inside the Xen early bootup code?
> > >
> > > I've tested this code as DomU (1G, 2G, 3G) and Dom0 (8G, 4G, 1000M),
> > > and they all booted fine. I haven't compile-tested it on 32-bit, and
> > > I think it will actually complain. Let me fix that.

OK, fixed that, and this is the patch (also present in stable/bug-fixes-for-rc5).

Would appreciate it if somebody else tried it as well :-)

commit becd252268baed54d5f5d1736ed64ce94c1a8aee
Author: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri Apr 29 11:34:00 2011 -0400

xen/mmu: Add workaround "x86-64, mm: Put early page table high"

As a consequence of the commit:

commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
Author: Yinghai Lu <yinghai@kernel.org>
Date: Fri Dec 17 16:58:28 2010 -0800

x86-64, mm: Put early page table high

it causes the Linux kernel to crash under Xen:

mapping kernel into physical memory
Xen: setup ISA identity maps
about to get started...
(XEN) mm.c:2466:d0 Bad type (saw 7400000000000001 != exp 1000000000000000) for mfn b1d89 (pfn bacf7)
(XEN) mm.c:3027:d0 Error while pinning mfn b1d89
(XEN) traps.c:481:d0 Unhandled invalid opcode fault/trap [#6] on VCPU 0 [ec=0000]
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
...

The reason is that at some point init_memory_mapping is going to reach
the pagetable pages area and map those pages too (mapping them as normal
memory that falls in the range of addresses passed to init_memory_mapping
as argument). Some of those pages are already pagetable pages (they are
in the range pgt_buf_start-pgt_buf_end) therefore they are going to be
mapped RO and everything is fine.
Some of these pages are not pagetable pages yet (they fall in the range
pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
are going to be mapped RW. When these pages become pagetable pages and
are hooked into the pagetable, xen will find that the guest has already
a RW mapping of them somewhere and fail the operation.
The reason Xen requires pagetables to be RO is that the hypervisor needs
to verify that the pagetables are valid before using them. The validation
operations are called "pinning" (more details in arch/x86/xen/mmu.c).

In order to fix the issue we mark all the pages in the entire range
pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation
is completed only the range pgt_buf_start-pgt_buf_end is reserved by
init_memory_mapping. Hence the kernel is going to crash as soon as one
of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those
ranges are RO).

For this reason, this function is introduced which is called _after_
the init_memory_mapping has completed (in a perfect world we would
call this function from init_memory_mapping, but lets ignore that).

Because we are called _after_ init_memory_mapping the pgt_buf_[start,
end,top] have all changed to new values (b/c another init_memory_mapping
is called). Hence, the first time we enter this function, we save
away the pgt_buf_start value and update the pgt_buf_[end,top].

When we detect that the "old" pgt_buf_start through pgt_buf_end
PFNs have been reserved (so memblock_x86_reserve_range has been called),
we immediately set out to RW the "old" pgt_buf_end through pgt_buf_top.

And then we update those "old" pgt_buf_[end|top] with the new ones
so that we can redo this on the next pagetable.

Reviewed-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
[v1: Updated with Jeremy's comments]
[v2: Added the crash output]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index aef7af9..1bca25f 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1463,6 +1463,119 @@ static int xen_pgd_alloc(struct mm_struct *mm)
return ret;
}

+#ifdef CONFIG_X86_64
+static __initdata u64 __last_pgt_set_rw = 0;
+static __initdata u64 __pgt_buf_start = 0;
+static __initdata u64 __pgt_buf_end = 0;
+static __initdata u64 __pgt_buf_top = 0;
+/*
+ * As a consequence of the commit:
+ *
+ * commit 4b239f458c229de044d6905c2b0f9fe16ed9e01e
+ * Author: Yinghai Lu <yinghai@kernel.org>
+ * Date: Fri Dec 17 16:58:28 2010 -0800
+ *
+ * x86-64, mm: Put early page table high
+ *
+ * at some point init_memory_mapping is going to reach the pagetable pages
+ * area and map those pages too (mapping them as normal memory that falls
+ * in the range of addresses passed to init_memory_mapping as argument).
+ * Some of those pages are already pagetable pages (they are in the range
+ * pgt_buf_start-pgt_buf_end) therefore they are going to be mapped RO and
+ * everything is fine.
+ * Some of these pages are not pagetable pages yet (they fall in the range
+ * pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end) so they
+ * are going to be mapped RW. When these pages become pagetable pages and
+ * are hooked into the pagetable, xen will find that the guest has already
+ * a RW mapping of them somewhere and fail the operation.
+ * The reason Xen requires pagetables to be RO is that the hypervisor needs
+ * to verify that the pagetables are valid before using them. The validation
+ * operations are called "pinning".
+ *
+ * In order to fix the issue we mark all the pages in the entire range
+ * pgt_buf_start-pgt_buf_top as RO, however when the pagetable allocation
+ * is completed only the range pgt_buf_start-pgt_buf_end is reserved by
+ * init_memory_mapping. Hence the kernel is going to crash as soon as one
+ * of the pages in the range pgt_buf_end-pgt_buf_top is reused (b/c those
+ * ranges are RO).
+ *
+ * For this reason, 'mark_rw_past_pgt' is introduced which is called _after_
+ * the init_memory_mapping has completed (in a perfect world we would
+ * call this function from init_memory_mapping, but lets ignore that).
+ *
+ * Because we are called _after_ init_memory_mapping the pgt_buf_[start,
+ * end,top] have all changed to new values (b/c init_memory_mapping
+ * is called and setting up another new page-table). Hence, the first time
+ * we enter this function, we save away the pgt_buf_start value and update
+ * the pgt_buf_[end,top].
+ *
+ * When we detect that the "old" pgt_buf_start through pgt_buf_end
+ * PFNs have been reserved (so memblock_x86_reserve_range has been called),
+ * we immediately set out to RW the "old" pgt_buf_end through pgt_buf_top.
+ *
+ * And then we update those "old" pgt_buf_[end|top] with the new ones
+ * so that we can redo this on the next pagetable.
+ */
+static __init void mark_rw_past_pgt(void) {
+
+ if (pgt_buf_end > pgt_buf_start) {
+ u64 addr, size;
+
+ /* Save it away. */
+ if (!__pgt_buf_start) {
+ __pgt_buf_start = pgt_buf_start;
+ __pgt_buf_end = pgt_buf_end;
+ __pgt_buf_top = pgt_buf_top;
+ return;
+ }
+ /* If we get the range that starts at __pgt_buf_end that means
+ * the range is reserved, and that in 'init_memory_mapping'
+ * the 'memblock_x86_reserve_range' has been called with the
+ * outdated __pgt_buf_start, __pgt_buf_end (the "new"
+ * pgt_buf_[start|end|top] refer now to a new pagetable.
+ * Note: we are called _after_ the pgt_buf_[..] have been
+ * updated.*/
+
+ addr = memblock_x86_find_in_range_size(PFN_PHYS(__pgt_buf_start),
+ &size, PAGE_SIZE);
+
+ /* Still not reserved, meaning 'memblock_x86_reserve_range'
+ * hasn't been called yet. Update the _end and _top.*/
+ if (addr == PFN_PHYS(__pgt_buf_start)) {
+ __pgt_buf_end = pgt_buf_end;
+ __pgt_buf_top = pgt_buf_top;
+ return;
+ }
+
+ /* OK, the area is reserved, meaning it is time for us to
+ * set RW for the old end->top PFNs. */
+
+ /* ..unless we had already done this. */
+ if (__pgt_buf_end == __last_pgt_set_rw)
+ return;
+
+ addr = PFN_PHYS(__pgt_buf_end);
+
+ /* set as RW the rest */
+ printk(KERN_DEBUG "xen: setting RW the range %llx - %llx\n",
+ PFN_PHYS(__pgt_buf_end), PFN_PHYS(__pgt_buf_top));
+
+ while (addr < PFN_PHYS(__pgt_buf_top)) {
+ make_lowmem_page_readwrite(__va(addr));
+ addr += PAGE_SIZE;
+ }
+ /* And update everything so that we are ready for the next
+ * pagetable (the one created for regions past 4GB) */
+ __last_pgt_set_rw = __pgt_buf_end;
+ __pgt_buf_start = pgt_buf_start;
+ __pgt_buf_end = pgt_buf_end;
+ __pgt_buf_top = pgt_buf_top;
+ }
+ return;
+}
+#else
+static __init void mark_rw_past_pgt(void) { }
+#endif
static void xen_pgd_free(struct mm_struct *mm, pgd_t *pgd)
{
#ifdef CONFIG_X86_64
@@ -1489,6 +1602,14 @@ static __init pte_t mask_rw_pte(pte_t *ptep, pte_t pte)
unsigned long pfn = pte_pfn(pte);

/*
+ * A bit of optimization. We do not need to call the workaround
+ * when xen_set_pte_init is called with a PTE with 0 as PFN.
+ * That is b/c the pagetable at that point are just being populated
+ * with empty values and we can save some cycles by not calling
+ * the 'memblock' code.*/
+ if (pfn)
+ mark_rw_past_pgt();
+ /*
* If the new pfn is within the range of the newly allocated
* kernel pagetable, and it isn't being mapped into an
* early_ioremap fixmap slot as a freshly allocated page, make sure
@@ -1997,6 +2118,8 @@ __init void xen_ident_map_ISA(void)

static __init void xen_post_allocator_init(void)
{
+ mark_rw_past_pgt();
+
#ifdef CONFIG_XEN_DEBUG
pv_mmu_ops.make_pte = PV_CALLEE_SAVE(xen_make_pte_debug);
#endif

Re: RFC Patch for the "x86-64, mm: Put early page table high"
On 04/29/2011 02:33 PM, Konrad Rzeszutek Wilk wrote:
> On Fri, Apr 29, 2011 at 01:58:33PM -0700, Jeremy Fitzhardinge wrote:
>> On 04/29/2011 08:46 AM, Konrad Rzeszutek Wilk wrote:
>>> Without a fix for this, 2.6.39-rc5 fails during bootup.
>>>
>>> It fails so early that all you see is Xen telling you that the domain crashed.
>>>
>>> There is one patch to fix this, and the last word was here:
>>> http://marc.info/?i=4DAF0ECB.8060009@goop.org
>>>
>>> But after that, nothing more was heard from Peter.
>>>
>>> So I decided to look at this from a different perspective: why not
>>> contain the workaround inside the Xen early bootup code?
>>>
>>> I've tested this code as DomU (1G, 2G, 3G) and Dom0 (8G, 4G, 1000M),
>>> and they all booted fine. I haven't compile-tested it on 32-bit, and
>>> I think it will actually complain. Let me fix that.
>>>
>>> See the attached patch (also present in stable/bug-fixes-for-rc5), which also
>>> has the "xen: mask_rw_pte mark RO all pagetable pages up to pgt_buf_top"
>>> change integrated.
>> Well, if we can hide the fix away in our code, then that has obvious
>> advantages. But I worry that this change is pretty closely dependent on
>> how the other code works, and would be fragile in the face of further
>> changes to that code (esp since there's no obvious coupling between the
>> two, so anyone changing the arch/x86 code wouldn't have any clues to
>> look at the corresponding Xen code).
> True. I am hoping to remove this in 2.6.40 though...

That makes it more palatable. What's the solution for 40+?

J

Re: RFC Patch for the "x86-64, mm: Put early page table high"
> >>> So I decided to look at this from a different perspective: why not
> >>> contain the workaround inside Xen early bootup code.
> >>>
> >>> I've tested this code as DomU (1G, 2G, 3G), Dom0 (8G, 4G, 1000M)
> >>> and they all booted fine. Hadn't compile tested it on 32-bit and
> >>> I think it will actually complain about it. Let me fix that.
> >>>
> >>> See attached patch (also present in stable/bug-fixes-for-rc5) which also
> >>> has the "xen: mask_rw_pte mark RO all pagetable pages up to pgt_buf_top"
> >>> integrated in.
> >> Well, if we can hide the fix away in our code, then that has obvious
> >> advantages. But I worry that this change is pretty closely dependent on
> >> how the other code works, and would be fragile in the face of further
> >> changes to that code (esp since there's no obvious coupling between the
> >> two, so anyone changing the arch/x86 code wouldn't have any clues to
> >> look at the corresponding Xen code).
> > True. I am hoping to remove this in 2.6.40 though...
>
> That makes it more palatable. What's the solution for 40+?

None yet. I am thinking of scheduling some form of meeting between hpa, Yinghai,
Stefano, you, and whoever else, to hammer something out.

So far we have two workarounds in the Xen MMU code - there is the 32-bit dance with
the swapper_pg being copied from initial_..something and then back (Ian came up
with a fix for that), and then this one.

Did you have a chance to test this patch on your box?
