Mailing List Archive

[PATCH V5] x86/kexec: Change NMI and MCE handling on kexec path
xen/arch/x86/crash.c | 116 ++++++++++++++++++++++++++++++++++-----
xen/arch/x86/machine_kexec.c | 19 ++++++
xen/arch/x86/x86_64/entry.S | 34 +++++++++++
xen/include/asm-x86/desc.h | 45 +++++++++++++++
xen/include/asm-x86/processor.h | 4 +
5 files changed, 203 insertions(+), 15 deletions(-)


Experimentally, certain crash kernels will triple fault very early after
starting if started with NMIs disabled. This was discovered when
experimenting with a debug keyhandler which deliberately created a
reentrant NMI, causing stack corruption.

Because of this bug, and because future changes to the NMI handling
will make the kexec path more fragile, take the time now to
bullet-proof the kexec behaviour so it is safer in more circumstances.

This patch adds three new low level routines:
* nmi_crash
This is a special NMI handler for use during a kexec crash.
* enable_nmis
This function enables NMIs by executing an iret-to-self, to
disengage the hardware NMI latch.
* trap_nop
This is a no-op handler which irets immediately. It is actually in
the middle of enable_nmis to reuse the iret instruction, without
having a single lone aligned iret inflating the code size.

And adds three new IDT entry helper routines:
* _write_gate_lower
This is a substitute for using cmpxchg16b to update a 128bit
structure at once. It assumes that the top 64 bits are unchanged
(and ASSERT()s the fact) and performs a regular write on the lower
64 bits.
* _set_gate_lower
This is functionally equivalent to the already present _set_gate(),
except it uses _write_gate_lower rather than updating both 64bit
values.
* _update_gate_addr_lower
This is designed to update an IDT entry handler only, without
altering any other settings in the entry. It also uses
_write_gate_lower.
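
For illustration (not part of the patch; the helper name is invented), the
layout these helpers write can be read back as follows. Bits 0-15 and
48-63 of the gate's low word hold the low 32 bits of the handler address,
and the low 32 bits of the high word hold the upper 32 bits:

    /* Illustrative decoder only: recover the handler address from a gate
     * laid out as _set_gate_lower()/_update_gate_addr_lower() write it. */
    static inline unsigned long gate_handler_addr(const idt_entry_t *gate)
    {
        return (gate->a & 0xFFFFUL) |               /* address bits 0-15  */
               ((gate->a >> 32) & 0xFFFF0000UL) |   /* address bits 16-31 */
               ((gate->b & 0xFFFFFFFFUL) << 32);    /* address bits 32-63 */
    }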

The IDT entry helpers are required because:
* It is unsafe to attempt a disable/update/re-enable cycle on the NMI
or MCE IDT entries.
* We need to be able to update NMI handlers without changing the IST
entry.


As a result, the new behaviour of the kexec_crash path is:

nmi_shootdown_cpus() will:

* Disable the crashing cpu's NMI/MCE interrupt stack tables.
Disabling the stack tables removes race conditions which would lead
to corrupt exception frames and infinite loops. As this pcpu is
never planning to execute a sysret back to a pv vcpu, the update is
safe from a security point of view.

* Swap the NMI trap handlers.
The crashing pcpu gets the nop handler, to prevent it getting stuck in
an NMI context, causing a hang instead of a crash. The non-crashing
pcpus all get the nmi_crash handler which is designed never to
return.

do_nmi_crash() will:

* Save the crash notes and shut the pcpu down.
There is now an extra per-cpu variable to prevent us from executing
this multiple times. If we re-enter midway through, we attempt the
whole operation again rather than leave it incomplete.

* Set up another NMI at the LAPIC.
Even when the LAPIC has been disabled, the ID and command registers
are still usable. As a result, we can deliberately queue up a new
NMI to re-interrupt us later if NMIs get unlatched. Because of the
call to __stop_this_cpu(), we have to hand craft self_nmi() to be
safe from General Protection Faults.

* Fall into infinite loop.

machine_kexec() will:

* Swap the MCE handlers to be a nop.
We cannot prevent MCEs from being delivered when we pass off to the
crash kernel, and the less Xen context is touched, the better.

* Explicitly enable NMIs.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

--

Changes since v4:
* Stale comment fixes, spelling corrections and coding style changes.

Changes since v3:
* Added IDT entry helpers to safely update NMI/MCE entries.

Changes since v2:

* Rework the alteration of the stack tables to completely remove the
possibility of a PV domain getting very lucky with the "NMI or MCE in
a 1 instruction race window on sysret" and managing to execute code
in the hypervisor context.
* Make use of set_ist() from the previous patch in the series to avoid
open-coding the IST manipulation.

Changes since v1:
* Reintroduce atomic_dec(&waiting_for_crash_ipi); which got missed
during the original refactoring.
* Fold trap_nop into the middle of enable_nmis to reuse the iret.
* Expand comments in areas as per Tim's suggestions.

diff -r ef8c1b607b10 -r 96b068439bc4 xen/arch/x86/crash.c
--- a/xen/arch/x86/crash.c
+++ b/xen/arch/x86/crash.c
@@ -32,41 +32,127 @@

static atomic_t waiting_for_crash_ipi;
static unsigned int crashing_cpu;
+static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);

-static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
+/* This becomes the NMI handler for non-crashing CPUs, when Xen is crashing. */
+void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
{
- /* Don't do anything if this handler is invoked on crashing cpu.
- * Otherwise, system will completely hang. Crashing cpu can get
- * an NMI if system was initially booted with nmi_watchdog parameter.
+ int cpu = smp_processor_id();
+
+ /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
+ ASSERT( cpu != crashing_cpu );
+
+ /* Save crash information and shut down CPU. Attempt only once. */
+ if ( ! this_cpu(crash_save_done) )
+ {
+ /* Disable the interrupt stack table for the MCE handler. This
+ * prevents race conditions between clearing MCIP and receiving a
+ * new MCE, during which the exception frame would be clobbered
+ * and the MCE handler fall into an infinite loop. We are soon
+ * going to disable the NMI watchdog, so the loop would not be
+ * caught.
+ *
+ * We do not need to change the NMI IST, as the nmi_crash
+ * handler is immune to corrupt exception frames, by virtue of
+ * being designed never to return.
+ *
+ * This update is safe from a security point of view, as this
+ * pcpu is never going to try to sysret back to a PV vcpu.
+ */
+ set_ist(&idt_tables[cpu][TRAP_machine_check], IST_NONE);
+
+ kexec_crash_save_cpu();
+ __stop_this_cpu();
+
+ this_cpu(crash_save_done) = 1;
+ atomic_dec(&waiting_for_crash_ipi);
+ }
+
+ /* Poor man's self_nmi(). __stop_this_cpu() has reverted the LAPIC
+ * back to its boot state, so we are unable to rely on the regular
+ * apic_* functions, due to 'x2apic_enabled' being possibly wrong.
+ * (The likely scenario is that we have reverted from x2apic mode to
+ * xapic, at which point #GPFs will occur if we use the apic_*
+ * functions)
+ *
+ * The ICR and APIC ID of the LAPIC are still valid even during
+ * software disable (Intel SDM Vol 3, 10.4.7.2). As a result, we
+ * can deliberately queue up another NMI at the LAPIC which will not
+ * be delivered as the hardware NMI latch is currently in effect.
+ * This means that if NMIs become unlatched (e.g. following a
+ * non-fatal MCE), the LAPIC will force us back here rather than
+ * wandering back into regular Xen code.
*/
- if ( cpu == crashing_cpu )
- return 1;
- local_irq_disable();
+ switch ( current_local_apic_mode() )
+ {
+ u32 apic_id;

- kexec_crash_save_cpu();
+ case APIC_MODE_X2APIC:
+ apic_id = apic_rdmsr(APIC_ID);

- __stop_this_cpu();
+ apic_wrmsr(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL | ((u64)apic_id << 32));
+ break;

- atomic_dec(&waiting_for_crash_ipi);
+ case APIC_MODE_XAPIC:
+ apic_id = GET_xAPIC_ID(apic_mem_read(APIC_ID));
+
+ while ( apic_mem_read(APIC_ICR) & APIC_ICR_BUSY )
+ cpu_relax();
+
+ apic_mem_write(APIC_ICR2, apic_id << 24);
+ apic_mem_write(APIC_ICR, APIC_DM_NMI | APIC_DEST_PHYSICAL);
+ break;
+
+ default:
+ break;
+ }

for ( ; ; )
halt();
-
- return 1;
}

static void nmi_shootdown_cpus(void)
{
unsigned long msecs;
+ int i, cpu = smp_processor_id();

local_irq_disable();

- crashing_cpu = smp_processor_id();
+ crashing_cpu = cpu;
local_irq_count(crashing_cpu) = 0;

atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
- /* Would it be better to replace the trap vector here? */
- set_nmi_callback(crash_nmi_callback);
+
+ /* Change NMI trap handlers. Non-crashing pcpus get nmi_crash which
+ * invokes do_nmi_crash (above), which causes them to write state and
+ * fall into a loop. The crashing pcpu gets the nop handler to
+ * cause it to return to this function ASAP.
+ */
+ for ( i = 0; i < nr_cpu_ids; ++i )
+ if ( idt_tables[i] )
+ {
+
+ if ( i == cpu )
+ {
+ /* Disable the interrupt stack tables for this cpu's MCE
+ * and NMI handlers, and alter the NMI handler to have
+ * no operation. Disabling the stack tables prevents
+ * stack corruption race conditions, while changing the
+ * handler helps prevent cascading faults; we are
+ * certainly going to crash by this point.
+ *
+ * This update is safe from a security point of view, as
+ * this pcpu is never going to try to sysret back to a
+ * PV vcpu.
+ */
+ _set_gate_lower(&idt_tables[i][TRAP_nmi], 14, 0, &trap_nop);
+ set_ist(&idt_tables[i][TRAP_machine_check], IST_NONE);
+ }
+ else
+ /* Do not update stack table for other pcpus. */
+ _update_gate_addr_lower(&idt_tables[i][TRAP_nmi], &nmi_crash);
+ }
+
/* Ensure the new callback function is set before sending out the NMI. */
wmb();

diff -r ef8c1b607b10 -r 96b068439bc4 xen/arch/x86/machine_kexec.c
--- a/xen/arch/x86/machine_kexec.c
+++ b/xen/arch/x86/machine_kexec.c
@@ -81,12 +81,31 @@ void machine_kexec(xen_kexec_image_t *im
.base = (unsigned long)(boot_cpu_gdt_table - FIRST_RESERVED_GDT_ENTRY),
.limit = LAST_RESERVED_GDT_BYTE
};
+ int i;

/* We are about to permenantly jump out of the Xen context into the kexec
* purgatory code. We really dont want to be still servicing interupts.
*/
local_irq_disable();

+ /* Now that regular interrupts are disabled, we need to reduce the impact
+ * of interrupts not disabled by 'cli'.
+ *
+ * The NMI handlers have already been set up by nmi_shootdown_cpus(). All
+ * pcpus other than us have the nmi_crash handler, while we have the nop
+ * handler.
+ *
+ * The MCE handlers touch extensive areas of Xen code and data. At this
+ * point, there is nothing we can usefully do, so set the nop handler.
+ */
+ for ( i = 0; i < nr_cpu_ids; ++i )
+ if ( idt_tables[i] )
+ _update_gate_addr_lower(&idt_tables[i][TRAP_machine_check], &trap_nop);
+
+ /* Explicitly enable NMIs on this CPU. Some crashdump kernels do
+ * not like running with NMIs disabled. */
+ enable_nmis();
+
/*
* compat_machine_kexec() returns to idle pagetables, which requires us
* to be running on a static GDT mapping (idle pagetables have no GDT
diff -r ef8c1b607b10 -r 96b068439bc4 xen/arch/x86/x86_64/entry.S
--- a/xen/arch/x86/x86_64/entry.S
+++ b/xen/arch/x86/x86_64/entry.S
@@ -635,11 +635,45 @@ ENTRY(nmi)
movl $TRAP_nmi,4(%rsp)
jmp handle_ist_exception

+ENTRY(nmi_crash)
+ cli
+ pushq $0
+ movl $TRAP_nmi,4(%rsp)
+ SAVE_ALL
+ movq %rsp,%rdi
+ callq do_nmi_crash /* Does not return */
+ ud2
+
ENTRY(machine_check)
pushq $0
movl $TRAP_machine_check,4(%rsp)
jmp handle_ist_exception

+/* Enable NMIs. No special register assumptions. Only %rax is not preserved. */
+ENTRY(enable_nmis)
+ movq %rsp, %rax /* Grab RSP before pushing */
+
+ /* Set up stack frame */
+ pushq $0 /* SS */
+ pushq %rax /* RSP */
+ pushfq /* RFLAGS */
+ pushq $__HYPERVISOR_CS /* CS */
+ leaq 1f(%rip),%rax
+ pushq %rax /* RIP */
+
+ iretq /* Disable the hardware NMI latch */
+1:
+ retq
+
+/* No op trap handler. Required for kexec crash path. This is not
+ * declared with the ENTRY() macro to avoid wasted alignment space.
+ */
+.globl trap_nop
+trap_nop:
+ iretq
+
+
+
.section .rodata, "a", @progbits

ENTRY(exception_table)
diff -r ef8c1b607b10 -r 96b068439bc4 xen/include/asm-x86/desc.h
--- a/xen/include/asm-x86/desc.h
+++ b/xen/include/asm-x86/desc.h
@@ -106,6 +106,21 @@ typedef struct {
u64 a, b;
} idt_entry_t;

+/* Write the lower 64 bits of an IDT Entry. This relies on the upper 32
+ * bits of the address not changing, which is a safe assumption as all
+ * functions we are likely to load will live inside the 1GB
+ * code/data/bss address range.
+ *
+ * Ideally, we would use cmpxchg16b, but this is not supported on some
+ * old AMD 64bit capable processors, and has no safe equivalent.
+ */
+static inline void _write_gate_lower(volatile idt_entry_t *gate,
+ const idt_entry_t *new)
+{
+ ASSERT(gate->b == new->b);
+ gate->a = new->a;
+}
+
#define _set_gate(gate_addr,type,dpl,addr) \
do { \
(gate_addr)->a = 0; \
@@ -122,6 +137,36 @@ do {
(1UL << 47); \
} while (0)

+static inline void _set_gate_lower(idt_entry_t *gate, unsigned long type,
+ unsigned long dpl, void *addr)
+{
+ idt_entry_t idte;
+ idte.b = gate->b;
+ idte.a =
+ (((unsigned long)(addr) & 0xFFFF0000UL) << 32) |
+ ((unsigned long)(dpl) << 45) |
+ ((unsigned long)(type) << 40) |
+ ((unsigned long)(addr) & 0xFFFFUL) |
+ ((unsigned long)__HYPERVISOR_CS64 << 16) |
+ (1UL << 47);
+ _write_gate_lower(gate, &idte);
+}
+
+/* Update the lower half handler of an IDT Entry, without changing any
+ * other configuration. */
+static inline void _update_gate_addr_lower(idt_entry_t *gate, void *addr)
+{
+ idt_entry_t idte;
+ idte.a = gate->a;
+
+ idte.b = ((unsigned long)(addr) >> 32);
+ idte.a &= 0x0000FFFFFFFF0000ULL;
+ idte.a |= (((unsigned long)(addr) & 0xFFFF0000UL) << 32) |
+ ((unsigned long)(addr) & 0xFFFFUL);
+
+ _write_gate_lower(gate, &idte);
+}
+
#define _set_tssldt_desc(desc,addr,limit,type) \
do { \
(desc)[0].b = (desc)[1].b = 0; \
diff -r ef8c1b607b10 -r 96b068439bc4 xen/include/asm-x86/processor.h
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -527,6 +527,7 @@ void do_ ## _name(struct cpu_user_regs *
DECLARE_TRAP_HANDLER(divide_error);
DECLARE_TRAP_HANDLER(debug);
DECLARE_TRAP_HANDLER(nmi);
+DECLARE_TRAP_HANDLER(nmi_crash);
DECLARE_TRAP_HANDLER(int3);
DECLARE_TRAP_HANDLER(overflow);
DECLARE_TRAP_HANDLER(bounds);
@@ -545,6 +546,9 @@ DECLARE_TRAP_HANDLER(alignment_check);
DECLARE_TRAP_HANDLER(spurious_interrupt_bug);
#undef DECLARE_TRAP_HANDLER

+void trap_nop(void);
+void enable_nmis(void);
+
void syscall_enter(void);
void sysenter_entry(void);
void sysenter_eflags_saved(void);

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Re: [PATCH V5] x86/kexec: Change NMI and MCE handling on kexec path
>>> On 12.12.12 at 12:35, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> * Explicitly enable NMIs.
>
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Acked-by: Jan Beulich <jbeulich@suse.com>

(but for committing I'd want at least one other knowledgeable
person's ack)

Re: [PATCH V5] x86/kexec: Change NMI and MCE handling on kexec path
On 12/12/2012 12:50, "Jan Beulich" <JBeulich@suse.com> wrote:

>> * Explicitly enable NMIs.
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>
> Acked-by: Jan Beulich <jbeulich@suse.com>
>
> (but for committing I'd want at least one other knowledgeable
> person's ack)

It looks okay to me too. Should I wait for an Ack from Tim too?

-- Keir

Re: [PATCH V5] x86/kexec: Change NMI and MCE handling on kexec path
>>> On 12.12.12 at 16:33, Keir Fraser <keir@xen.org> wrote:
> On 12/12/2012 12:50, "Jan Beulich" <JBeulich@suse.com> wrote:
>
>>> * Explicitly enable NMIs.
>>>
>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>
>> Acked-by: Jan Beulich <jbeulich@suse.com>
>>
>> (but for committing I'd want at least one other knowledgeable
>> person's ack)
>
> It looks okay to me too. Should I wait for an Ack from Tim too?

An extra pair of eyes certainly wouldn't hurt.

Jan


Re: [PATCH V5] x86/kexec: Change NMI and MCE handling on kexec path
At 11:35 +0000 on 12 Dec (1355312134), Andrew Cooper wrote:
> diff -r ef8c1b607b10 -r 96b068439bc4 xen/arch/x86/crash.c
> --- a/xen/arch/x86/crash.c
> +++ b/xen/arch/x86/crash.c
> @@ -32,41 +32,127 @@
>
> static atomic_t waiting_for_crash_ipi;
> static unsigned int crashing_cpu;
> +static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
>
> -static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
> +/* This becomes the NMI handler for non-crashing CPUs, when Xen is crashing. */
> +void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
> {
> - /* Don't do anything if this handler is invoked on crashing cpu.
> - * Otherwise, system will completely hang. Crashing cpu can get
> - * an NMI if system was initially booted with nmi_watchdog parameter.
> + int cpu = smp_processor_id();
> +
> + /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
> + ASSERT( cpu != crashing_cpu );

nmi_shootdown_cpus() has a go, but I think it would have to do an atomic
compare-exchange on crashing_cpu to actually be sure that only one CPU
is crashing at a time. If two CPUs try to lead crashes at the same
time, it will deadlock here, with NMIs disabled on all CPUs.

The old behaviour was also not entirely race-free, but a little more
likely to make progress, as in most cases at least one CPU would see
cpu == crashing_cpu here and return.

Using cmpxchg in nmi_shootdown_cpus() would have the drawback that if
one CPU tried to kexec and wedged, another CPU couldn't then have a go.
Not sure which of all these unlikely scenarios is worst. :)
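
A minimal sketch of such a guard (assuming Xen's cmpxchg() semantics; the
helper name and sentinel value are invented, this is not code from the
patch) might look like:

    static unsigned int crashing_cpu = ~0u;  /* no crash lead claimed yet */

    /* Whichever pcpu first swaps its own id into crashing_cpu leads the
     * crash; any later caller sees the exchange fail and backs off. */
    static bool_t claim_crash_lead(unsigned int cpu)
    {
        return cmpxchg(&crashing_cpu, ~0u, cpu) == ~0u;
    }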

> diff -r ef8c1b607b10 -r 96b068439bc4 xen/arch/x86/x86_64/entry.S
> --- a/xen/arch/x86/x86_64/entry.S
> +++ b/xen/arch/x86/x86_64/entry.S
> @@ -635,11 +635,45 @@ ENTRY(nmi)
> movl $TRAP_nmi,4(%rsp)
> jmp handle_ist_exception
>
> +ENTRY(nmi_crash)
> + cli

Aren't interrupts disabled already because we came in through an
interrupt gate?

> + pushq $0
> + movl $TRAP_nmi,4(%rsp)
> + SAVE_ALL
> + movq %rsp,%rdi

Why? do_nmi_crash() doesn't actually use any of its arguments.
AFAICT, we could pretty much use do_nmi_crash explicitly in the IDT
entry.

> + callq do_nmi_crash /* Does not return */
> + ud2
> +
> ENTRY(machine_check)
> pushq $0
> movl $TRAP_machine_check,4(%rsp)
> jmp handle_ist_exception
>
> +/* Enable NMIs. No special register assumptions. Only %rax is not preserved. */
> +ENTRY(enable_nmis)
> + movq %rsp, %rax /* Grab RSP before pushing */
> +
> + /* Set up stack frame */
> + pushq $0 /* SS */
> + pushq %rax /* RSP */
> + pushfq /* RFLAGS */
> + pushq $__HYPERVISOR_CS /* CS */
> + leaq 1f(%rip),%rax
> + pushq %rax /* RIP */
> +
> + iretq /* Disable the hardware NMI latch */
> +1:
> + retq

This could be made a bit smaller by something like:

popq %rcx /* Remember return address */
movq %rsp, %rax /* Grab RSP before pushing */

/* Set up stack frame */
pushq $0 /* SS */
pushq %rax /* RSP */
pushfq /* RFLAGS */
pushq $__HYPERVISOR_CS /* CS */
pushq %rcx /* RIP */

iretq /* Disable the hardware NMI latch and return */

which also keeps the call/return counts balanced. But tbh I'm not sure
it's worth it. And I suspect you'll tell me why it's wrong. :)

> +
> +/* No op trap handler. Required for kexec crash path. This is not
> + * declared with the ENTRY() macro to avoid wasted alignment space.
> + */
> +.globl trap_nop
> +trap_nop:
> + iretq

The commit message says this reuses the iretq in enable_nmis().

Cheers,

Tim.

Re: [PATCH V5] x86/kexec: Change NMI and MCE handling on kexec path
On 13/12/12 10:59, Tim Deegan wrote:
> At 11:35 +0000 on 12 Dec (1355312134), Andrew Cooper wrote:
>> diff -r ef8c1b607b10 -r 96b068439bc4 xen/arch/x86/crash.c
>> --- a/xen/arch/x86/crash.c
>> +++ b/xen/arch/x86/crash.c
>> @@ -32,41 +32,127 @@
>>
>> static atomic_t waiting_for_crash_ipi;
>> static unsigned int crashing_cpu;
>> +static DEFINE_PER_CPU_READ_MOSTLY(bool_t, crash_save_done);
>>
>> -static int crash_nmi_callback(struct cpu_user_regs *regs, int cpu)
>> +/* This becomes the NMI handler for non-crashing CPUs, when Xen is crashing. */
>> +void __attribute__((noreturn)) do_nmi_crash(struct cpu_user_regs *regs)
>> {
>> - /* Don't do anything if this handler is invoked on crashing cpu.
>> - * Otherwise, system will completely hang. Crashing cpu can get
>> - * an NMI if system was initially booted with nmi_watchdog parameter.
>> + int cpu = smp_processor_id();
>> +
>> + /* nmi_shootdown_cpus() should ensure that this assertion is correct. */
>> + ASSERT( cpu != crashing_cpu );
> nmi_shootdown_cpus() has a go, but I think it would have to do an atomic
> compare-exchange on crashing_cpu to actually be sure that only one CPU
> is crashing at a time. If two CPUs try to lead crashes at the same
> time, it will deadlock here, with NMIs disabled on all CPUs.
>
> The old behaviour was also not entirely race-free, but a little more
> likey to make progress, as in most cases at least one CPU would see
> cpu == crashing_cpu here and return.
>
> Using cmpxchg in nmi_shootdown_cpus() would have the drawback that if
> one CPU tried to kexec and wedged, another CPU couldn't then have a go.
> Not sure which of all these unlikely scenarios is worst. :)

kexec_common_shutdown() calls one_cpu_only() which gives the impression
that nmi_shootdown_cpus() can only be reached once. I think this is
race-free for different pcpus attempting to crash at the same time, but
is certainly not safe for a cascade crash on the same pcpu.
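
For context, one_cpu_only() amounts to something like the following (an
approximation of the logic in xen/common/kexec.c, not a verbatim copy):

    static unsigned long kexec_flags;
    #define KEXEC_FLAG_IN_PROGRESS 0

    /* The first caller atomically claims the in-progress flag and gets
     * to lead the crash; every later caller is turned away. */
    static int one_cpu_only(void)
    {
        if ( test_and_set_bit(KEXEC_FLAG_IN_PROGRESS, &kexec_flags) )
            return -EBUSY;
        return 0;
    }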

As far as I can reason, this race can be worked around by doing a
combined atomic set of the KEXEC_IN_PROGRESS flag, and a set of the
crashing cpu id. However, there are also other race conditions along
the crash path which are addressed in subsequent patches which are still
in progress. As the choice here is between two differently nasty race
conditions, I would prefer to defer fixing it to later and doing it
properly.

>
>> diff -r ef8c1b607b10 -r 96b068439bc4 xen/arch/x86/x86_64/entry.S
>> --- a/xen/arch/x86/x86_64/entry.S
>> +++ b/xen/arch/x86/x86_64/entry.S
>> @@ -635,11 +635,45 @@ ENTRY(nmi)
>> movl $TRAP_nmi,4(%rsp)
>> jmp handle_ist_exception
>>
>> +ENTRY(nmi_crash)
>> + cli
> Aren't interrupts disabled already because we came in through an
> interrupt gate?

I was taking the pragmatic approach that in this case, we really have no
guarantees about Xen state. Having said that, we will (hopefully) have
entered here through a gate, which will be intact, and the function
itself is deliberately safe against interruption. I shall remove it.

>
>> + pushq $0
>> + movl $TRAP_nmi,4(%rsp)
>> + SAVE_ALL
>> + movq %rsp,%rdi
> Why? do_nmi_crash() doesn't actually use any of its arguments.
> AFAICT, we could pretty much use do_nmi_crash explicitly in the IDT
> entry.

I was going with the DECLARE_TRAP_HANDLER() way of doing things, for
consistency with the other handlers. I can change it if you wish, but
as this is not a hotpath, I would err on the lazy side and leave it as
it is.
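
For reference, the shape of declaration that macro generates is roughly
the following (an approximation of asm-x86/processor.h, not a verbatim
copy), which is why nmi_crash keeps the usual regs-passing convention
even though do_nmi_crash() never reads its argument:

    struct cpu_user_regs;

    /* Declares both the asm entry stub and the C handler for a trap. */
    #define DECLARE_TRAP_HANDLER(_name)              \
        void _name(void);                            \
        void do_ ## _name(struct cpu_user_regs *regs)

    DECLARE_TRAP_HANDLER(nmi_crash);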

>
>> + callq do_nmi_crash /* Does not return */
>> + ud2
>> +
>> ENTRY(machine_check)
>> pushq $0
>> movl $TRAP_machine_check,4(%rsp)
>> jmp handle_ist_exception
>>
>> +/* Enable NMIs. No special register assumptions. Only %rax is not preserved. */
>> +ENTRY(enable_nmis)
>> + movq %rsp, %rax /* Grab RSP before pushing */
>> +
>> + /* Set up stack frame */
>> + pushq $0 /* SS */
>> + pushq %rax /* RSP */
>> + pushfq /* RFLAGS */
>> + pushq $__HYPERVISOR_CS /* CS */
>> + leaq 1f(%rip),%rax
>> + pushq %rax /* RIP */
>> +
>> + iretq /* Disable the hardware NMI latch */
>> +1:
>> + retq
> This could be made a bit smaller by something like:
>
> popq %rcx /* Remember return address */
> movq %rsp, %rax /* Grab RSP before pushing */
>
> /* Set up stack frame */
> pushq $0 /* SS */
> pushq %rax /* RSP */
> pushfq /* RFLAGS */
> pushq $__HYPERVISOR_CS /* CS */
> pushq %rcx /* RIP */
>
> iretq /* Disable the hardware NMI latch and return */
>
> which also keeps the call/return counts balanced. But tbh I'm not sure
> it's worth it. And I suspect you'll tell me why it's wrong. :)

I can't see any problem with it.

>
>> +
>> +/* No op trap handler. Required for kexec crash path. This is not
>> + * declared with the ENTRY() macro to avoid wasted alignment space.
>> + */
>> +.globl trap_nop
>> +trap_nop:
>> + iretq
> The commit message says this reuses the iretq in enable_nmis().
>
> Cheers,
>
> Tim.

Ok - I will fix the stale message.

~Andrew

Re: [PATCH V5] x86/kexec: Change NMI and MCE handling on kexec path
At 11:47 +0000 on 13 Dec (1355399235), Andrew Cooper wrote:
> >nmi_shootdown_cpus() has a go, but I think it would have to do an atomic
> >compare-exchange on crashing_cpu to actually be sure that only one CPU
> >is crashing at a time. If two CPUs try to lead crashes at the same
> >time, it will deadlock here, with NMIs disabled on all CPUs.
>
> kexec_common_shutdown() calls one_cpu_only() which gives the impression
> that nmi_shootdown_cpus() can only be gotten to once.

Right, I missed that. So,

Acked-by: Tim Deegan <tim@xen.org>

You can tidy up entry.S or not, as you like. There's no correctness
issue with anything there, just things that could be neater if you were
already updating the patch.

Cheers,

Tim.

Re: [PATCH V5] x86/kexec: Change NMI and MCE handling on kexec path
On 13/12/12 11:55, Tim Deegan wrote:
> At 11:47 +0000 on 13 Dec (1355399235), Andrew Cooper wrote:
>>> nmi_shootdown_cpus() has a go, but I think it would have to do an atomic
>>> compare-exchange on crashing_cpu to actually be sure that only one CPU
>>> is crashing at a time. If two CPUs try to lead crashes at the same
>>> time, it will deadlock here, with NMIs disabled on all CPUs.
>> kexec_common_shutdown() calls one_cpu_only() which gives the impression
>> that nmi_shootdown_cpus() can only be gotten to once.
> Right, I missed that. So,
>
> Acked-by: Tim Deegan <tim@xen.org>
>
> You can tidy up entry.S or not, as you like. There's no correctness
> issue with anything there, just things that could be neater if you were
> already updating the patch.
>
> Cheers,
>
> Tim.

I am just re-spinning the patch for the cli and the commit message. Nothing
major.

~Andrew
