Mailing List Archive

[PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
kexec disables (or "shoots down") all CPUs other than the crashing CPU before
entering the 2nd kernel. But the MCE handler is still enabled after that,
so if an MCE happens and broadcasts over the CPUs after the main thread starts
the 2nd kernel (which might not have initialized its MCE handler yet, or might
decide not to enable it), the MCE handler runs only on the other CPUs (not on
the main thread), leading to a kernel panic during MCE synchronization. The
user-visible effect of this bug is kdump failure.

Our standard MCE handler do_machine_check() makes a number of assumptions
about the system's status, and it's hard to alter it to cover the kexec/kdump
context, so let's add another, kdump-specific handler and switch to it.

Note that this problem has existed since the current MCE handler was
implemented in 2.6.32, and commit 716079f66eac ("mce: Panic when a core has
reached a timeout") recently made it more visible by changing the default
behavior of the synchronization timeout from "ignore" to "panic".

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
ChangeLog: v7 -> v8
- rebase onto tip/master (resolve conflict with mce_severity rework)
- fix printk messages and comments
- s/switch_mce_handler_for_kdump/kdump_setup_mce/g
- print out msg in non-fatal MCE case too

ChangeLog: v6 -> v7
- use rdmsrl() instead of rdmsrl_safe() with mca_cfg.disabled check

ChangeLog v5 -> v6:
- drop "CC stable" tag
- stop using/exporting mce_gather_info(), mce_(rd|wr)msrl(), and mce_panic()
- drop quirk_no_way_out() part, because quirk_sandybridge_ifu() (the only
possible callback) could just change an MCE_PANIC_SEVERITY case to an
MCE_AR_SEVERITY case, which doesn't affect the panic/return decision.

ChangeLog v4 -> v5:
- drop MCE_UC/AR_SEVERITY re-ordering
- move most of code to arch/x86/kernel/crash.c
- export some MCE internal variables/routines via arch/x86/include/asm/mce.h

ChangeLog v3 -> v4:
- fixed AR and UC order in enum severity_level because UC is more severe than
AR by definition. By chance, current code is not affected by this wrong order.
- check severity in machine_check_under_kdump(), and call mce_panic() if the
resultant severity is as bad as or worse than MCE_AR_SEVERITY.
- use static global variable kdump_cpu instead of mca_cfg->kdump_cpu
- reduce "#ifdef CONFIG_KEXEC"
- add "#ifdef CONFIG_X86_MCE" for declaration of machine_check_under_kdump()
in mce.h
- update comment on switch_mce_handler_for_kdump()

ChangeLog v2 -> v3
- go to "switch MCE handler" approach

ChangeLog v1 -> v2
- clear MSR_IA32_MCG_CTL, MSR_IA32_MCx_CTL, and CR4.MCE instead of using
global flag to ignore MCE events.
- fixed the description of the problem
---
arch/x86/include/asm/mce.h | 15 ++++++
arch/x86/kernel/cpu/mcheck/mce-internal.h | 13 -----
arch/x86/kernel/crash.c | 87 +++++++++++++++++++++++++++++++
3 files changed, 102 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 1f5a86d518db..d35ea7e8a764 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -255,4 +255,19 @@ struct cper_sec_mem_err;
extern void apei_mce_report_mem_error(int corrected,
struct cper_sec_mem_err *mem_err);

+enum severity_level {
+ MCE_NO_SEVERITY,
+ MCE_DEFERRED_SEVERITY,
+ MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
+ MCE_KEEP_SEVERITY,
+ MCE_SOME_SEVERITY,
+ MCE_AO_SEVERITY,
+ MCE_UC_SEVERITY,
+ MCE_AR_SEVERITY,
+ MCE_PANIC_SEVERITY,
+};
+
+extern int (*mce_severity)(struct mce *a, int tolerant, char **msg,
+ bool is_excp);
+
#endif /* _ASM_X86_MCE_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index fe32074b865b..47d1e5218fb5 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -1,18 +1,6 @@
#include <linux/device.h>
#include <asm/mce.h>

-enum severity_level {
- MCE_NO_SEVERITY,
- MCE_DEFERRED_SEVERITY,
- MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
- MCE_KEEP_SEVERITY,
- MCE_SOME_SEVERITY,
- MCE_AO_SEVERITY,
- MCE_UC_SEVERITY,
- MCE_AR_SEVERITY,
- MCE_PANIC_SEVERITY,
-};
-
#define ATTR_LEN 16
#define INITIAL_CHECK_INTERVAL 5 * 60 /* 5 minutes */

@@ -24,7 +12,6 @@ struct mce_bank {
char attrname[ATTR_LEN]; /* attribute name */
};

-extern int (*mce_severity)(struct mce *a, int tolerant, char **msg, bool is_excp);
struct dentry *mce_get_debugfs_dir(void);

extern struct mce_bank *mce_banks;
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index b71ff16cbdb0..0b12eb5452f6 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -34,6 +34,7 @@
#include <asm/cpu.h>
#include <asm/reboot.h>
#include <asm/virtext.h>
+#include <asm/mce.h>

/* Alignment required for elf header segment */
#define ELF_CORE_HEADER_ALIGN 4096
@@ -76,6 +77,90 @@ struct crash_memmap_data {

int in_crash_kexec;

+#ifdef CONFIG_X86_MCE
+static int kdump_cpu;
+
+/*
+ * kdump-specific machine check handler
+ *
+ * When kexec/kdump is running, what the MCE handler is expected to do
+ * changes depending on whether the CPU is running the main thread or not.
+ *
+ * The crashing CPU, controlling the whole system exclusively, should try to
+ * get kdump as hard as possible even if an MCE happens concurrently, because
+ * some types of MCEs (for example, uncorrected errors like SRAO,)
+ * are not fatal or don't ruin reliability of the kdump (consider that an
+ * MCE can hit the other CPU, in which case corrupted data is never consumed.)
+ * If an MCE critically breaks the kdump operation, we are unlucky so let's
+ * accept the fate of whatever HW causes, hoping a dying message reaches admins.
+ *
+ * The other CPUs are supposed to be quiet during kexec/kdump, so after the
+ * crashing CPU shot them down, they should not do anything except clearing
+ * MCG_STATUS (without this the system is reset, which is undesirable.)
+ * Note that this is also true after the crashing CPU enters the 2nd kernel.
+ */
+static void machine_check_under_kdump(struct pt_regs *regs, long error_code)
+{
+ struct mce m = {};
+ char *msg = NULL;
+ char *nmsg = NULL;
+ int i;
+ int worst = 0;
+ int severity;
+
+ if (mca_cfg.disabled)
+ return;
+ if (!mca_cfg.banks)
+ goto out;
+ if (kdump_cpu != smp_processor_id())
+ goto clear_mcg_status;
+
+ rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus);
+ if (regs && m.mcgstatus & (MCG_STATUS_RIPV|MCG_STATUS_EIPV))
+ m.cs = regs->cs;
+
+ for (i = 0; i < mca_cfg.banks; i++) {
+ rdmsrl(MSR_IA32_MCx_STATUS(i), m.status);
+ severity = mce_severity(&m, mca_cfg.tolerant, &nmsg, true);
+ if (severity > worst) {
+ worst = severity;
+ msg = nmsg;
+ }
+ }
+
+ if (worst >= MCE_UC_SEVERITY)
+ panic("kdump: Fatal machine check: %s", msg);
+
+ pr_emerg("kdump: Non-fatal MCE detected: %s - kernel dump might be unreliable.\n", msg);
+
+clear_mcg_status:
+ wrmsrl(MSR_IA32_MCG_STATUS, 0);
+out:
+ sync_core();
+}
+
+/*
+ * Switch the MCE handler to kdump-specific one
+ *
+ * Standard MCE handler do_machine_check() is not designed for kexec/kdump
+ * context, where we can't expect MCE's recovering and logging to work fine
+ * because the kernel might be unstable (all CPUs except one must be idle.)
+ *
+ * In such situations, getting a kernel dump is more important than handling
+ * MCEs because what the users are really interested in is to find out what
+ * caused the crash.
+ *
+ * So let's switch MCE handler to the one suitable for kexec/kdump situation.
+ */
+void kdump_setup_mce(void)
+{
+ kdump_cpu = smp_processor_id();
+ machine_check_vector = machine_check_under_kdump;
+}
+#else
+static inline void kdump_setup_mce(void) {}
+#endif
+
/*
* This is used to VMCLEAR all VMCSs loaded on the
* processor. And when loading kvm_intel module, the
@@ -166,6 +251,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
/* The kernel is broken so disable interrupts */
local_irq_disable();

+ kdump_setup_mce();
+
kdump_nmi_shootdown_cpus();

/*
--
2.1.0
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
On Tue, Apr 07, 2015 at 08:02:18AM +0000, Naoya Horiguchi wrote:
> kexec disables (or "shoots down") all CPUs other than a crashing CPU before
> entering the 2nd kernel. But the MCE handler is still enabled after that,
> so if MCE happens and broadcasts over the CPUs after the main thread starts
> the 2nd kernel (which might not initialize its MCE handler yet, or might decide
> not to enable it,) MCE handler runs only on the other CPUs (not on the main
> thread,) leading to kernel panic with MCE synchronization. The user-visible
> effect of this bug is kdump failure.
>
> Our standard MCE handler do_machine_check() assumes some about system's
> status and it's hard to alter it to cover kexec/kdump context, so let's add
> another kdump-specific one and switch to it.
>
> Note that this problem exists since current MCE handler was implemented in
> 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
> a timeout") made it more visible by changing the default behavior of the
> synchronization timeout from "ignore" to "panic".
>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> ---
> ChangeLog: v7 -> v8

Ok, I cleaned it up a bit and straightened some comments, see below. But
other than minor issues I don't see anything wrong with this patch so
far.

Btw, Ingo had some reservations about this. Ingo?

Also Tony hasn't Ack/Naked this...

The more important question for me is how are you testing this? I did
try injecting some MCEs with qemu while the second kernel is booting but
that caused a triple-fault or the guest froze completely.

Hmmm.

Thanks.

---
From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Date: Tue, 7 Apr 2015 08:02:18 +0000
Subject: [PATCH] x86/mce, crash: Switch MCE handler for kexec/kdump

kexec disables (or "shoots down") all CPUs other than the crashing
CPU before entering the 2nd kernel. However, MCA is still enabled so
if an MCE happens and broadcasts to the CPUs after the main thread
starts the 2nd kernel (which might not initialize its MCE handler yet,
or might decide not to enable it) the MCE handler runs only on the
other CPUs (not on the main thread) leading to kernel panic during MCE
synchronization. The user-visible effect of this bug is a kdump failure.

Our standard MCE handler do_machine_check() assumes a bunch of things
about system's status and it's hard to alter it to cover kexec/kdump
context, so add another, kdump-specific one and switch to it.

Note that this problem exists since current MCE handler was implemented
in 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core
has reached a timeout") made it more visible by changing the default
behavior of the synchronization timeout from "ignore" to "panic".

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Cc: Junichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Link: http://lkml.kernel.org/r/1425373306-26187-1-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Borislav Petkov <bp@suse.de>
---
arch/x86/include/asm/mce.h | 14 +++++
arch/x86/kernel/cpu/mcheck/mce-internal.h | 13 -----
arch/x86/kernel/crash.c | 89 +++++++++++++++++++++++++++++++
3 files changed, 103 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 1f5a86d518db..a88a74e19d14 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -255,4 +255,18 @@ struct cper_sec_mem_err;
extern void apei_mce_report_mem_error(int corrected,
struct cper_sec_mem_err *mem_err);

+enum severity_level {
+ MCE_NO_SEVERITY,
+ MCE_DEFERRED_SEVERITY,
+ MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
+ MCE_KEEP_SEVERITY,
+ MCE_SOME_SEVERITY,
+ MCE_AO_SEVERITY,
+ MCE_UC_SEVERITY,
+ MCE_AR_SEVERITY,
+ MCE_PANIC_SEVERITY,
+};
+
+extern int (*mce_severity)(struct mce *a, int tolerant, char **msg, bool is_excp);
+
#endif /* _ASM_X86_MCE_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index fe32074b865b..47d1e5218fb5 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -1,18 +1,6 @@
#include <linux/device.h>
#include <asm/mce.h>

-enum severity_level {
- MCE_NO_SEVERITY,
- MCE_DEFERRED_SEVERITY,
- MCE_UCNA_SEVERITY = MCE_DEFERRED_SEVERITY,
- MCE_KEEP_SEVERITY,
- MCE_SOME_SEVERITY,
- MCE_AO_SEVERITY,
- MCE_UC_SEVERITY,
- MCE_AR_SEVERITY,
- MCE_PANIC_SEVERITY,
-};
-
#define ATTR_LEN 16
#define INITIAL_CHECK_INTERVAL 5 * 60 /* 5 minutes */

@@ -24,7 +12,6 @@ struct mce_bank {
char attrname[ATTR_LEN]; /* attribute name */
};

-extern int (*mce_severity)(struct mce *a, int tolerant, char **msg, bool is_excp);
struct dentry *mce_get_debugfs_dir(void);

extern struct mce_bank *mce_banks;
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index aceb2f90c716..f4948a8d5fa6 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -34,6 +34,7 @@
#include <asm/cpu.h>
#include <asm/reboot.h>
#include <asm/virtext.h>
+#include <asm/mce.h>

/* Alignment required for elf header segment */
#define ELF_CORE_HEADER_ALIGN 4096
@@ -76,6 +77,92 @@ struct crash_memmap_data {

int in_crash_kexec;

+#ifdef CONFIG_X86_MCE
+static int kdump_cpu;
+
+/*
+ * kdump-specific machine check handler
+ *
+ * When kexec/kdump is running, what the MCE handler is expected to do
+ * changes depending on whether the CPU is running the main thread or not.
+ *
+ * The crashing CPU, controlling the whole system exclusively, should
+ * try to complete a dump as hard as possible even if an MCE happens
+ * concurrently because some types of non-fatal MCEs (for example,
+ * uncorrected errors like SRAO) don't necessarily impair kdump's
+ * reliability (consider that an MCE can hit another CPU, in which
+ * case corrupted data might not get consumed). If an MCE critically
+ * breaks kdump operation, we are unlucky and then have to accept the HW
+ * failure. In that case, we hope that at least a dying message reaches
+ * the admins.
+ *
+ * The other CPUs are supposed to be quiet during kexec/kdump, so after
+ * the crashing CPU shot them down, they should not do anything except
+ * clearing MCG_STATUS (they need to, otherwise the system gets reset,
+ * which is undesirable either). Note that this is also true after the
+ * crashing CPU enters the 2nd kernel.
+ */
+static void machine_check_under_kdump(struct pt_regs *regs, long error_code)
+{
+ char *msg = NULL, *nmsg = NULL;
+ int i, severity, worst = 0;
+ struct mce m = {};
+
+ if (mca_cfg.disabled)
+ return;
+
+ if (!mca_cfg.banks)
+ goto out;
+
+ if (kdump_cpu != smp_processor_id())
+ goto clear_mcg_status;
+
+ rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus);
+ if (regs && m.mcgstatus & (MCG_STATUS_RIPV|MCG_STATUS_EIPV))
+ m.cs = regs->cs;
+
+ for (i = 0; i < mca_cfg.banks; i++) {
+ rdmsrl(MSR_IA32_MCx_STATUS(i), m.status);
+ severity = mce_severity(&m, mca_cfg.tolerant, &nmsg, true);
+ if (severity > worst) {
+ worst = severity;
+ msg = nmsg;
+ }
+ }
+
+ if (worst >= MCE_UC_SEVERITY)
+ panic("kdump: Fatal machine check: %s", msg);
+
+ pr_emerg("kdump: Non-fatal MCE detected: %s - kernel dump might be unreliable.\n", msg);
+
+clear_mcg_status:
+ wrmsrl(MSR_IA32_MCG_STATUS, 0);
+out:
+ sync_core();
+}
+
+/*
+ * Switch the MCE handler to kdump-specific one
+ *
+ * Standard MCE handler do_machine_check() is not designed for kexec/kdump
+ * context, where we can't expect MCE's recovering and logging to work fine
+ * because the kernel might be unstable (all CPUs except one must be idle).
+ *
+ * In such situations, getting a kernel dump is more important than handling
+ * MCEs because what the users are really interested in is to find out what
+ * caused the crash.
+ *
+ * So let's switch MCE handler to the one suitable for kexec/kdump situation.
+ */
+void kdump_setup_mce(void)
+{
+ kdump_cpu = smp_processor_id();
+ machine_check_vector = machine_check_under_kdump;
+}
+#else
+static inline void kdump_setup_mce(void) {}
+#endif
+
/*
* This is used to VMCLEAR all VMCSs loaded on the
* processor. And when loading kvm_intel module, the
@@ -157,6 +244,8 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
/* The kernel is broken so disable interrupts */
local_irq_disable();

+ kdump_setup_mce();
+
kdump_nmi_shootdown_cpus();

/*
--
2.3.3


--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
On Thu, Apr 09, 2015 at 08:13:46AM +0200, Borislav Petkov wrote:
> On Tue, Apr 07, 2015 at 08:02:18AM +0000, Naoya Horiguchi wrote:
> > kexec disables (or "shoots down") all CPUs other than a crashing CPU before
> > entering the 2nd kernel. But the MCE handler is still enabled after that,
> > so if MCE happens and broadcasts over the CPUs after the main thread starts
> > the 2nd kernel (which might not initialize its MCE handler yet, or might decide
> > not to enable it,) MCE handler runs only on the other CPUs (not on the main
> > thread,) leading to kernel panic with MCE synchronization. The user-visible
> > effect of this bug is kdump failure.
> >
> > Our standard MCE handler do_machine_check() assumes some about system's
> > status and it's hard to alter it to cover kexec/kdump context, so let's add
> > another kdump-specific one and switch to it.
> >
> > Note that this problem exists since current MCE handler was implemented in
> > 2.6.32, and recently commit 716079f66eac ("mce: Panic when a core has reached
> > a timeout") made it more visible by changing the default behavior of the
> > synchronization timeout from "ignore" to "panic".
> >
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > ---
> > ChangeLog: v7 -> v8
>
> Ok, I cleaned it up a bit and straightened some comments, see below. But
> other than minor issues I don't see anything wrong with this patch so
> far.
>
> Btw, Ingo had some reservations about this. Ingo?
>
> Also Tony hasn't Ack/Naked this...
>
> The more important question for me is how are you testing this? I did
> try injecting some MCEs with qemu while the second kernel is booting but
> that caused a triple-fault or the guest froze completely.

Yes, I did see that at first, so I made two tweaks for the testing:

1) to fix qemu code. I think that qemu's current MCE injection code is buggy,
because when we try to inject an MCE in broadcast mode, all injections other
than the first one are done with MCG_STATUS_MCIP already set (see
cpu_x86_inject_mce()@target-i386/helper.c.) It looks like a bug to me, because
it means that every broadcast-mode MCE injection causes a triple-fault, which
does not mimic the real HW behavior.

2) to insert a delay (of a few seconds) into kdump_nmi_callback() before
disable_local_APIC(). This is because the MCE interrupt is delivered to CPUs
in different manners in qemu and on bare metal. Bare metal does respond to MCE
interrupts after disable_local_APIC(), but qemu does not.
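[Editorial note] For readers who want to reproduce tweak 1): qemu's human monitor has an `mce` command whose `-b` flag requests broadcast injection. A rough invocation looks like the following sketch - the monitor socket path and the bank/status/addr/misc values are illustrative, not taken from this thread:

```shell
# Connect to the qemu human monitor (socket path is illustrative; assumes
# the guest was started with something like:
#   -monitor unix:/tmp/qemu-mon.sock,server,nowait )
#
# mce [-b] <cpu> <bank> <status> <mcg_status> <addr> <misc>
socat - UNIX-CONNECT:/tmp/qemu-mon.sock <<'EOF'
mce -b 0 2 0xb200000000000000 0x5 0x1234 0x0
EOF
```

Here 0xb200000000000000 encodes VAL|UC|EN|PCC in MCi_STATUS and 0x5 encodes RIPV|MCIP in MCG_STATUS; landing such a broadcast MCE in the shootdown window should provoke the failure mode described in the patch, assuming the qemu MCIP fix mentioned above is applied.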


Unfortunately our testing (~15000 kdump/reboot cycles) with the debug
kernel on bare metal hasn't reproduced the problem yet, but I believe that
the above testing on qemu should hit the target.

Thanks,
Naoya Horiguchi

Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
On Thu, Apr 09, 2015 at 06:57:38AM +0000, Naoya Horiguchi wrote:
> Yes, I did see it at fisrt, so I did two tweaks for the testing:
>
> 1) to fix qemu code. I think that current mce injection code of qemu is buggy,
> because when we try to inject MCE in broadcast mode, all injections other than
> the first one are done with MCG_STATUS_MCIP (see cpu_x86_inject_mce()@target-i386/helper.c.)
> It looks to me a bug because this means that every (broadcast mode) MCE injection
> causes triplet-fault, which seems not mimicking the real HW behavior.
>
> 2) to insert the delay (for a few seconds) into kdump_nmi_callback() before
> disable_local_APIC(). This is because MCE interrupt is delivered to CPUs in
> different manners in qemu and in bare metal. Bare metals do respond to MCE
> interrupts after disable_local_APIC(), but qemu not.

Lemme take a look at that.

> Unfortunately our testing (~15000 times kdump/reboot cycles) with the debug
> kernel on bare metals didn't reproduce the problem yet, but I believe that
> the above testing on qemu should hit a target.

If only APEI EINJ could be taught to do delayed injection, regardless of
OS kernel running. Tony, is something like that even possible at all?

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
* Borislav Petkov <bp@alien8.de> wrote:

> Btw, Ingo had some reservations about this. Ingo?

Yeah, so my concerns are the following:

> kexec disables (or "shoots down") all CPUs other than the crashing
> CPU before entering the 2nd kernel. However, MCA is still enabled so
> if an MCE happens and broadcasts to the CPUs after the main thread
> starts the 2nd kernel (which might not initialize its MCE handler
> yet, or might decide not to enable it) the MCE handler runs only on
> the other CPUs (not on the main thread) leading to kernel panic
> during MCE synchronization. The user-visible effect of this bug is a
> kdump failure.

So the thing is, when we boot up the second kernel there will be a
window where the old handler isn't valid (because the new kernel has
its own pagetables, etc.) and the new handler is not installed yet.

If an MCE hits that window, it's bad luck. (unless the bootup sequence
is rearchitected significantly to allow cross-kernel inheritance of
MCE handlers.)

So I think we can ignore _that_ race.

The more significant question is: what happens when an MCE arrives
while the kdump is proceeding - as kdumps can take a long time to
finish when there's a lot of RAM.

But ... since the 'shootdown' is analogous to a CPU hotplug CPU-down
sequence, I suppose that the existing MCE code should already properly
handle the case where an MCE arrives on a (supposedly) dead CPU,
right? In that case installing a separate MCE handler looks like the
wrong thing.

> Our standard MCE handler do_machine_check() assumes a bunch of
> things about system's status and it's hard to alter it to cover
> kexec/kdump context, so add another, kdump-specific one and switch
> to it.

So I don't like this principle either: 'our current code is a mess
that might not work, add new one'.

> Note that this problem exists since current MCE handler was
> implemented in 2.6.32, and recently commit 716079f66eac ("mce: Panic
> when a core has reached a timeout") made it more visible by changing
> the default behavior of the synchronization timeout from "ignore" to
> "panic".

Looks like that's the real problem. How about the kdump crash dumper
sets it back to 'ignore' again when we crash, and also double check
how we handle various corner cases?

Thanks,

Ingo
Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump
On Thu, Apr 09, 2015 at 10:00:30AM +0200, Ingo Molnar wrote:
> So the thing is, when we boot up the second kernel there will be a
> window where the old handler isn't valid (because the new kernel has
> its own pagetables, etc.) and the new handler is not installed yet.
>
> If an MCE hits that window, it's bad luck. (unless the bootup sequence
> is rearchitected significantly to allow cross-kernel inheritance of
> MCE handlers.)
>
> So I think we can ignore _that_ race.

Yah, that's the "tough luck" race.

> The more significant question is: what happens when an MCE arrives
> while the kdump is proceeding - as kdumps can take a long time to
> finish when there's a lot of RAM.

We say that the dump might be unreliable.

> But ... since the 'shootdown' is analogous to a CPU hotplug CPU-down
> sequence, I suppose that the existing MCE code should already properly
> handle the case where an MCE arrives on a (supposedly) dead CPU,
> right? In that case installing a separate MCE handler looks like the
> wrong thing.

Hmm, so mce_start() does look only at the online CPUs. So if crash does
maintain those masks correctly...

> So I don't like this principle either: 'our current code is a mess
> that might not work, add new one'.

Well, we can try to simplify it in the sense that those consumers like
mcelog and other MCE-consuming crap and the notifier chain are tested for
their presence before using them...

I'd be open to this if we have a way to test this kdump scenario. For
now, not even qemu can do that.

> Looks like that's the real problem. How about the kdump crash dumper
> sets it back to 'ignore' again when we crash, and also double check
> how we handle various corner cases?

I think I even suggested that at some point. Or was it to increase the
tolerance level. So Naoya, what's wrong with this again? I forgot.

Because this would be the simplest. Simply set tolerance level to 3 and
dump away...

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump [ In reply to ]
On Thu, Apr 09, 2015 at 10:00:30AM +0200, Ingo Molnar wrote:
>
> * Borislav Petkov <bp@alien8.de> wrote:
>
> > Btw, Ingo had some reservations about this. Ingo?
>
> Yeah, so my concerns are the following:
>
> > kexec disables (or "shoots down") all CPUs other than the crashing
> > CPU before entering the 2nd kernel. However, MCA is still enabled so
> > if an MCE happens and broadcasts to the CPUs after the main thread
> > starts the 2nd kernel (which might not initialize its MCE handler
> > yet, or might decide not to enable it) the MCE handler runs only on
> > the other CPUs (not on the main thread) leading to kernel panic
> > during MCE synchronization. The user-visible effect of this bug is a
> > kdump failure.
>
> So the thing is, when we boot up the second kernel there will be a
> window where the old handler isn't valid (because the new kernel has
> its own pagetables, etc.) and the new handler is not installed yet.
>
> If an MCE hits that window, it's bad luck. (unless the bootup sequence
> is rearchitected significantly to allow cross-kernel inheritance of
> MCE handlers.)
>
> So I think we can ignore _that_ race.
>
> The more significant question is: what happens when an MCE arrives
> while the kdump is proceeding - as kdumps can take a long time to
> finish when there's a lot of RAM.

Without this patch, an MCE undesirably wakes up idling CPUs and makes them
needlessly run the MCE handler, which disturbs memory and so harms the kdump.
This patch improves not only the transition phase, but also that window.

> But ... since the 'shootdown' is analogous to a CPU hotplug CPU-down
> sequence, I suppose that the existing MCE code should already properly
> handle the case where an MCE arrives on a (supposedly) dead CPU,
> right?

Currently not, so Tony mentioned some idea about it (although not included
in this patch.)

> In that case installing a separate MCE handler looks like the
> wrong thing.

One difference between kdump and CPU offline is whether we need to handle
MCEs at that point or not. In the CPU offline situation, the running CPUs
have to continue their normal operations, so it's important to handle MCEs
(i.e. log and/or take recovery action), and I think that should be done in
our main MCE handler, do_machine_check().
But that's not the case in the kdump situation (logging or recovering is
no longer possible/necessary.) So it seems to make sense to me to
separate the handler.

> > Our standard MCE handler do_machine_check() assumes a bunch of
> > things about system's status and it's hard to alter it to cover
> > kexec/kdump context, so add another, kdump-specific one and switch
> > to it.
>
> So I don't like this principle either: 'our current code is a mess
> that might not work, add new one'.
>
> > Note that this problem exists since current MCE handler was
> > implemented in 2.6.32, and recently commit 716079f66eac ("mce: Panic
> > when a core has reached a timeout") made it more visible by changing
> > the default behavior of the synchronization timeout from "ignore" to
> > "panic".
>
> Looks like that's the real problem. How about the kdump crash dumper
> sets it back to 'ignore' again when we crash, and also double check
> how we handle various corner cases?

Boris mentions this in another email, so I'll reply to it.

Thanks,
Naoya Horiguchi
Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump [ In reply to ]
On Thu, Apr 09, 2015 at 10:21:25AM +0200, Borislav Petkov wrote:
> On Thu, Apr 09, 2015 at 10:00:30AM +0200, Ingo Molnar wrote:
> > So the thing is, when we boot up the second kernel there will be a
> > window where the old handler isn't valid (because the new kernel has
> > its own pagetables, etc.) and the new handler is not installed yet.
> >
> > If an MCE hits that window, it's bad luck. (unless the bootup sequence
> > is rearchitected significantly to allow cross-kernel inheritance of
> > MCE handlers.)
> >
> > So I think we can ignore _that_ race.
>
> Yah, that's the "tough luck" race.
>
> > The more significant question is: what happens when an MCE arrives
> > while the kdump is proceeding - as kdumps can take a long time to
> > finish when there's a lot of RAM.
>
> We say that the dump might be unreliable.
>
> > But ... since the 'shootdown' is analogous to a CPU hotplug CPU-down
> > sequence, I suppose that the existing MCE code should already properly
> > handle the case where an MCE arrives on a (supposedly) dead CPU,
> > right? In that case installing a separate MCE handler looks like the
> > wrong thing.
>
> Hmm, so mce_start() does look only at the online CPUs. So if crash does
> maintain those masks correctly...
>
> > So I don't like this principle either: 'our current code is a mess
> > that might not work, add new one'.
>
> Well, we can try to simplify it in the sense that those consumers like
> mcelog and other MCE-consuming crap and the notifier chain are tested for
> their presence before using them...
>
> I'd be open to this if we have a way to test this kdump scenario. For
> now, not even qemu can do that.

I replied about testing.
That might be a little tricky, but I hope it helps.

> > Looks like that's the real problem. How about the kdump crash dumper
> > sets it back to 'ignore' again when we crash, and also double check
> > how we handle various corner cases?
>
> I think I even suggested that at some point. Or was it to increase the
> tolerance level. So Naoya, what's wrong with this again? I forgot.

Even if we raise the tolerance level while running kdump, that doesn't
prevent idling CPUs from running MCE handlers when an MCE arrives, which
causes memory accesses (losing information from kdump's viewpoint) and spits
out "MCE synchronization timeout" messages (unclear and confusing for users.)
It also leaves a potential risk of being broken again when do_machine_check()
changes in the future (which may come from sharing code to handle different
situations.)

So raising tolerance is OK as a "minimum change" approach, but it has the
above downsides to be traded off.

Thanks,
Naoya Horiguchi

> Because this would be the simplest. Simply set tolerance level to 3 and
> dump away...
Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump [ In reply to ]
* Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> On Thu, Apr 09, 2015 at 10:00:30AM +0200, Ingo Molnar wrote:
> >
> > * Borislav Petkov <bp@alien8.de> wrote:
> >
> > > Btw, Ingo had some reservations about this. Ingo?
> >
> > Yeah, so my concerns are the following:
> >
> > > kexec disables (or "shoots down") all CPUs other than the crashing
> > > CPU before entering the 2nd kernel. However, MCA is still enabled so
> > > if an MCE happens and broadcasts to the CPUs after the main thread
> > > starts the 2nd kernel (which might not initialize its MCE handler
> > > yet, or might decide not to enable it) the MCE handler runs only on
> > > the other CPUs (not on the main thread) leading to kernel panic
> > > during MCE synchronization. The user-visible effect of this bug is a
> > > kdump failure.
> >
> > So the thing is, when we boot up the second kernel there will be a
> > window where the old handler isn't valid (because the new kernel has
> > its own pagetables, etc.) and the new handler is not installed yet.
> >
> > If an MCE hits that window, it's bad luck. (unless the bootup sequence
> > is rearchitected significantly to allow cross-kernel inheritance of
> > MCE handlers.)
> >
> > So I think we can ignore _that_ race.
> >
> > The more significant question is: what happens when an MCE arrives
> > while the kdump is proceeding - as kdumps can take a long time to
> > finish when there's a lot of RAM.
>
> Without this patch, an MCE undesirably wakes up idling CPUs and makes
> them needlessly run the MCE handler, which disturbs memory and so harms
> the kdump. This patch improves not only the transition phase, but
> also that window.

The way the kdump code stops CPUs already 'disturbs' the state of
those CPUs.

> > But ... since the 'shootdown' is analogous to a CPU hotplug
> > CPU-down sequence, I suppose that the existing MCE code should
> > already properly handle the case where an MCE arrives on a
> > (supposedly) dead CPU, right?
>
> Currently not, so Tony mentioned some idea about it (although not
> included in this patch.)
>
> > In that case installing a separate MCE handler looks like the
> > wrong thing.
>
> One difference between kdump and CPU offline is whether we need to handle
> MCEs at that point or not. In the CPU offline situation, the running CPUs
> have to continue their normal operations, so it's important to handle MCEs
> (i.e. log and/or take recovery action), and I think that should be done in
> our main MCE handler, do_machine_check().

I disagree: if offline CPUs are still active and can produce MCEs then
they should be reported regardless of whether they were shot down by
the CPU hotplug code or by kdump.

> But that's not the case in the kdump situation (logging or recovering is
> no longer possible/necessary.) So it seems to make sense to me to
> separate the handler.

I disagree: for example logging to the screen is still possible and
should be done if there's an uncorrectable error.

So I agree that MCE policy should be made non-fatal during kdump, but
I disagree that it needs a separate handler: it should be part of the
regular MCE handling routines to handle kdump gracefully.

Thanks,

Ingo
Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump [ In reply to ]
On Thu, Apr 09, 2015 at 08:59:44AM +0000, Naoya Horiguchi wrote:
> I replied about testing. That might be tricky a little, but I hope it helps.

Yeah, whatever we do, we need this properly tested before upstreaming.
That's a given.

> Even if we raise the tolerance level while running kdump, that doesn't
> prevent idling CPUs from running MCE handlers when an MCE arrives, which
> causes memory accesses (losing information from kdump's viewpoint) and spits
> out "MCE synchronization timeout" messages (unclear and confusing for users.)

Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
account for that, no?

And if those are offlined, they're very very unlikely to trigger an MCE
as they're idle and not executing code.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
RE: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump [ In reply to ]
> If only APEI EINJ could be taught to do delayed injection, regardless of
> OS kernel running. Tony, is something like that even possible at all?

Use:

# echo 1 > notrigger

That allows you to plant a land-mine in memory that will get tripped later. Pick the memory address in a clever way
and you can have the MCE trigger when some particular function runs.
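For reference, the `notrigger` trick maps roughly onto the debugfs EINJ interface like this (a sketch: the address, mask, and error type are placeholder values, and it requires CONFIG_ACPI_APEI_EINJ plus platform firmware support, including SET_ERROR_TYPE_WITH_ADDRESS for the param fields):

```shell
# Sketch of delayed ("land-mine") injection via APEI EINJ; all values
# are illustrative and depend on what the platform firmware supports.
cd /sys/kernel/debug/apei/einj
cat available_error_type              # check supported error types
echo 0x10 > error_type                # 0x10: memory uncorrectable non-fatal
echo 0x12345000 > param1              # physical address to poison
echo 0xfffffffffffff000 > param2      # address mask (page granularity)
echo 1 > notrigger                    # plant the error, do not consume it
echo 1 > error_inject                 # trips later, on first access
```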

-Tony
RE: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump [ In reply to ]
> Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> account for that, no?
>
> And if those are offlined, they're very very unlikely to trigger an MCE
> as they're idle and not executing code.

Let's step back a few feet and look at the big picture. There are three main classes of machine check
that we might see while trying to run kdump - and remember that all machine checks are currently
broadcast, so all CPUs, whether online or offline, will see them:

1) Fatal
We have to crash - lose the dump. Having a new machine check handler will make things a bit easier
to see what happened because we won't have any synchronization failed messages from the offline
cpus.

2) Execution path recoverable (SRAR in SDM parlance).
Also going to be fatal (kdump is all running in ring0, and we can't recover from errors in ring 0). Cleaner
messages as above. Potentially in the future we might be able to make the kdump machine check handler
actually recover by just skipping a page - if the location of the error was in the old kernel image.

3) Non-execution path recoverable (SRAO in SDM)
We ought to be able to keep kdump running if this happens - the "AO" stands for "action optional",
so we are going to choose to not take an action. Wherever the error was, it won't affect correctness
of execution of the current context.

-Tony
Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump [ In reply to ]
On Thu, Apr 09, 2015 at 06:22:02PM +0000, Luck, Tony wrote:
> > Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> > account for that, no?
> >
> > And if those are offlined, they're very very unlikely to trigger an MCE
> > as they're idle and not executing code.
>
> Let's step back a few feet and look at the big picture. There are three main classes of machine check
> that we might see while trying to run kdump - and remember that all machine checks are currently
> broadcast, so all cpus whether online or offline will see them
>
> 1) Fatal
> We have to crash - lose the dump. Having a new machine check handler will make things a bit easier
> to see what happened because we won't have any synchronization failed messages from the offline
> cpus.

But this should not be a problem if the kdump path keeps cpu_online_mask
up to date. I'm looking at kdump_nmi_callback() or crash_nmi_callback() or
so. Those should clear cpu_online_mask and then mce_start() will work
fine on the crashing CPU.

IMHO, of course.

> 2) Execution path recoverable (SRAR in SDM parlance).
> Also going to be fatal (kdump is all running in ring0, and we can't recover from errors in ring 0). Cleaner
> messages as above. Potentially in the future we might be able to make the kdump machine check handler
> actually recover by just skipping a page - if the location of the error was in the old kernel image.
>
> 3) Non-execution path recoverable (SRAO in SDM)
> We ought to be able to keep kdump running if this happens - the "AO" stands for "action optional",
> so we are going to choose to not take an action. Wherever the error was, it won't affect correctness
> of execution of the current context.

Those could be simply made to go to dmesg during kdump, i.e. decouple
any MCE consumers. And we do that now anyway, i.e. box without mcelog or
some other ras daemon running.

So we could reuse the normal handler - we just need to do some tweaking
first... AFAICT, of course. I believe in that endeavor, the devil will
be in the detail.

Thanks.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump [ In reply to ]
On Thu, Apr 09, 2015 at 09:05:51PM +0200, Borislav Petkov wrote:
> On Thu, Apr 09, 2015 at 06:22:02PM +0000, Luck, Tony wrote:
> > > Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> > > account for that, no?
> > >
> > > And if those are offlined, they're very very unlikely to trigger an MCE
> > > as they're idle and not executing code.
> >
> > Let's step back a few feet and look at the big picture. There are three main classes of machine check
> > that we might see while trying to run kdump - and remember that all machine checks are currently
> > broadcast, so all cpus whether online or offline will see them
> >
> > 1) Fatal
> > We have to crash - lose the dump. Having a new machine check handler will make things a bit easier
> > to see what happened because we won't have any synchronization failed messages from the offline
> > cpus.
>
> But this should not be a problem if the kdump path keeps cpu_online_mask
> up to date. I'm looking at kdump_nmi_callback() or crash_nmi_callback() or
> so. Those should clear cpu_online_mask and then mce_start() will work
> fine on the crashing CPU.
>
> IMHO, of course.

Sorry, I misread you. With cpu_online_mask cleared in the shootdown (not
done yet,) raising tolerance should work without the timeout message.
So I think you are right.

> > 2) Execution path recoverable (SRAR in SDM parlance).
> > Also going to be fatal (kdump is all running in ring0, and we can't recover from errors in ring 0). Cleaner
> > messages as above. Potentially in the future we might be able to make the kdump machine check handler
> > actually recover by just skipping a page - if the location of the error was in the old kernel image.
> >
> > 3) Non-execution path recoverable (SRAO in SDM)
> > We ought to be able to keep kdump running if this happens - the "AO" stands for "action optional",
> > so we are going to choose to not take an action. Wherever the error was, it won't affect correctness
> > of execution of the current context.
>
> Those could be simply made to go to dmesg during kdump, i.e. decouple
> any MCE consumers. And we do that now anyway, i.e. box without mcelog or
> some other ras daemon running.
>
> So we could reuse the normal handler - we just need to do some tweaking
> first... AFAICT, of course. I believe in that endeavor, the devil will
> be in the detail.

OK, I'll try this approach with updating cpu_online_mask.

Thanks,
Naoya Horiguchi
Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump [ In reply to ]
On Fri, Apr 10, 2015 at 12:49:33AM +0000, Horiguchi Naoya(堀口 直也) wrote:
> On Thu, Apr 09, 2015 at 09:05:51PM +0200, Borislav Petkov wrote:
> > On Thu, Apr 09, 2015 at 06:22:02PM +0000, Luck, Tony wrote:
> > > > Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> > > > account for that, no?
> > > >
> > > > And if those are offlined, they're very very unlikely to trigger an MCE
> > > > as they're idle and not executing code.
> > >
> > > Let's step back a few feet and look at the big picture. There are three main classes of machine check
> > > that we might see while trying to run kdump - and remember that all machine checks are currently
> > > broadcast, so all cpus whether online or offline will see them
> > >
> > > 1) Fatal
> > > We have to crash - lose the dump. Having a new machine check handler will make things a bit easier
> > > to see what happened because we won't have any synchronization failed messages from the offline
> > > cpus.
> >
> > But this should not be a problem if the kdump path keeps cpu_online_mask
> > up to date. I'm looking at kdump_nmi_callback() or crash_nmi_callback() or
> > so. Those should clear cpu_online_mask and then mce_start() will work
> > fine on the crashing CPU.
> >
> > IMHO, of course.
>
> Sorry, I misread you. With cpu_online_mask cleared in the shootdown (not
> done yet,) raising tolerance should work without the timeout message.
> So I think you are right.

... wait, changing cpu_online_mask might confuse admins who try to
analyze the kdump, especially when the problems causing the panic are
CPU-related issues?

Similarly, changing the tolerant value loses the original value,
although this is unlikely to be a problem. But if we do change it, would
it be better to use upper bits to save the original lowest 2 bits?
Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump [ In reply to ]
On Fri, Apr 10, 2015 at 04:07:26AM +0000, Naoya Horiguchi wrote:
> ... wait, changing cpu_online_mask might confuse admins who try to
> analyze the kdump, especially when the problems causing the panic are
> CPU-related issues?

Well, if you're offlining the CPUs before doing the dump, you're already
changing the system, aren't you? So why lie to people?

> Similarly, changing the tolerant value loses the original value,
> although this is unlikely to be a problem. But if we do change it, would
> it be better to use upper bits to save the original lowest 2 bits?

I don't think that you need to do that - you can see from the original
kernel's dmesg whether an MCE happened - we're normally very vocal. If
the user tried to deliberately suppress that, then that's her fault
only.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
Re: [PATCH v8] x86: mce: kexec: switch MCE handler for kexec/kdump [ In reply to ]
On 04/10/15 at 12:49am, Naoya Horiguchi wrote:
> On Thu, Apr 09, 2015 at 09:05:51PM +0200, Borislav Petkov wrote:
> > On Thu, Apr 09, 2015 at 06:22:02PM +0000, Luck, Tony wrote:
> > > > Why? Those CPUs are offlined and num_online_cpus() in mce_start() should
> > > > account for that, no?
> > > >
> > > > And if those are offlined, they're very very unlikely to trigger an MCE
> > > > as they're idle and not executing code.
> > >
> > > Let's step back a few feet and look at the big picture. There are three main classes of machine check
> > > that we might see while trying to run kdump - and remember that all machine checks are currently
> > > broadcast, so all cpus whether online or offline will see them
> > >
> > > 1) Fatal
> > > We have to crash - lose the dump. Having a new machine check handler will make things a bit easier
> > > to see what happened because we won't have any synchronization failed messages from the offline
> > > cpus.
> >
> > But this should not be a problem if the kdump path keeps cpu_online_mask
> > up to date. I'm looking at kdump_nmi_callback() or crash_nmi_callback() or
> > so. Those should clear cpu_online_mask and then mce_start() will work
> > fine on the crashing CPU.
> >
> > IMHO, of course.
>
> Sorry, I misread you. With cpu_online_mask cleared in the shootdown (not
> done yet,) raising tolerance should work without the timeout message.
> So I think you are right.

Hi Naoya,

Thanks for great efforts you have made on this issue.

I am trying to catch up, and have read the mails in this thread.
Please also add me to CC list when you post a new version. I would like to
review it.

Thanks
Baoquan

>
> > > 2) Execution path recoverable (SRAR in SDM parlance).
> > > Also going to be fatal (kdump is all running in ring0, and we can't recover from errors in ring 0). Cleaner
> > > messages as above. Potentially in the future we might be able to make the kdump machine check handler
> > > actually recover by just skipping a page - if the location of the error was in the old kernel image.
> > >
> > > 3) Non-execution path recoverable (SRAO in SDM)
> > > We ought to be able to keep kdump running if this happens - the "AO" stands for "action optional",
> > > so we are going to choose to not take an action. Wherever the error was, it won't affect correctness
> > > of execution of the current context.
> >
> > Those could be simply made to go to dmesg during kdump, i.e. decouple
> > any MCE consumers. And we do that now anyway, i.e. box without mcelog or
> > some other ras daemon running.
> >
> > So we could reuse the normal handler - we just need to do some tweaking
> > first... AFAICT, of course. I believe in that endeavor, the devil will
> > be in the detail.
>
> OK, I'll try this approach with updating cpu_online_mask.
>
> Thanks,
> Naoya Horiguchi