Mailing List Archive

[PATCH] NMI Crashdump with external/ipmi
Hi all,

I've written a small patch for externel/ipmi, so it's possible to
configure it not to reset a node, but trigger a crashdump via NMI.

If a node becomes unavailable for several reasons it will be fenced but
this makes investigating the root cause of the nodes unavailability very
difficult; if you have a chashdump you can reconstruct the root cause.

For this I added 3 new options:

crashdump -> set this to true to enable crashdump.
sshcheck -> if this is true, a ssh connection will be
established to eighter $sshipaddr, if this is not
set, $hostname will be used as remoteadress.
sshipaddr -> in case ssh is listening on an other interface,
where dns isn't equal $hostname.

Maybe it could be usefull for others too.

For any comments, suggestions I would be glad.


Tobias D. Oestreicher

--
Tobias D. Oestreicher
Linux Consultant & Trainer
Tel.: +49-160-5329935
Mail: oestreicher@b1-systems.de

B1 Systems GmbH
Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537
Re: [PATCH] NMI Crashdump with external/ipmi [ In reply to ]
Hi,

On Mon, Jan 14, 2013 at 11:12:11PM +0100, Tobias D. Oestreicher wrote:
> Hi all,
>
> I've written a small patch for externel/ipmi, so it's possible to
> configure it not to reset a node, but trigger a crashdump via NMI.
>
> If a node becomes unavailable for several reasons it will be fenced but
> this makes investigating the root cause of the nodes unavailability very
> difficult; if you have a chashdump you can reconstruct the root cause.
>
> For this I added 3 new options:
>
> crashdump -> set this to true to enable crashdump.
>
> sshcheck -> if this is true, a ssh connection will be
> established to eighter $sshipaddr, if this is not
> set, $hostname will be used as remoteadress.
> sshipaddr -> in case ssh is listening on an other interface,
> where dns isn't equal $hostname.

ssh is used only in case sshcheck is set to true? If so, then
that should be mentioned in the description of sshipaddr.
Further, the sshcheck parameter should come first (i.e. exchange
place with sshipaddr).

The test is Linux specific, that should also be noted in the
parameter description.

Please see below for notes on code.

> Maybe it could be usefull for others too.
>
> For any comments, suggestions I would be glad.
>
>
> Tobias D. Oestreicher
>
> --
> Tobias D. Oestreicher
> Linux Consultant & Trainer
> Tel.: +49-160-5329935
> Mail: oestreicher@b1-systems.de
>
> B1 Systems GmbH
> Osterfeldstraße 7 / 85088 Vohburg / http://www.b1-systems.de
> GF: Ralph Dehner / Unternehmenssitz: Vohburg / AG: Ingolstadt,HRB 3537

> diff -r da5832ae23dd lib/plugins/stonith/external/ipmi
> --- a/lib/plugins/stonith/external/ipmi Sun Dec 23 16:05:11 2012 +0100
> +++ b/lib/plugins/stonith/external/ipmi Mon Jan 14 22:01:57 2013 +0100
> @@ -36,7 +36,11 @@
> POWEROFF="power off"
> POWERON="power on"
> STATUS="power status"
> +CRASHDUMP="chassis power diag"
> +
> IPMITOOL=${ipmitool:-"`which ipmitool 2>/dev/null`"}
> +SYSCTL=`which sysctl 2>/dev/null`

Normally, sysctl is in the PATH. As well as ssh (SSH_BIN below).

> +SSH_OPTS="-q -o PasswordAuthentication=no -o StrictHostKeyChecking=no"

Add "-l root", just in case?

>
> have_ipmi() {
> test -x "${IPMITOOL}"
> @@ -138,7 +142,11 @@
> ;;
> reset)
> if ipmi_is_power_on; then
> - do_ipmi "${RESET}"
> + if [ "${crashdump}" == "true" ]; then
> + do_ipmi "${CRASHDUMP}"
> + else
> + do_ipmi "${RESET}"
> + fi
> else
> do_ipmi "${POWERON}"
> fi
> @@ -149,11 +157,40 @@
> # the managed node. Hence, only check if we can contact the
> # IPMI device with "power status" command, don't pay attention
> # to whether the node is in fact powered on or off.
> + if [ "${crashdump}" == "true" ]; then
> + if [ "${sshcheck}" == "true" ];then

This should go to a separate function, sth like
check_crashdump_eligibility or check_crashdump_setup. Then you
can do:

if [ "${crashdump}" == "true" -a "${sshcheck}" == "true" ]; then
check_crashdump_setup || exit
fi

Otherwise, it'd be hard to get the meaning of this big chunk of
code.

> + if [ -z "${hostname}" -a -z "${sshipaddr}" ]; then
> + ha_log.sh err "Neigther hostname nor sshipaddr is set, crashdump testing not possible"
+ ha_log.sh err "Neither ...
> + elif [ -z "${sshipaddr}" ]; then
> + REMOTESSHHOST="${hostname}"
> + else
> + REMOTESSHHOST="${sshipaddr}"
> + fi
> + SSH_BIN=`which ssh 2>/dev/null`
> + SSH_COMMAND="${SSH_BIN} ${REMOTESSHHOST} ${SSH_OPTS}"
> + remote_crashdump_state=`${SSH_COMMAND} "grep -c crashkernel /proc/cmdline;${SYSCTL} -n kernel.unknown_nmi_panic kernel.panic_on_unrecovered_nmi"`

What if crashkernel is set to nothing? Would crash dump work then
too?

> + if [ $? -ne 0 ];then
> + ha_log.sh err "Not possible to connect via ssh to ${REMOTESSHHOST}"
> + exit 1
> + fi
> + unknown_nmi=`echo ${remote_crashdump_state}|awk '{print $2}'`
> + unrecovered_nmi=`echo ${remote_crashdump_state}|awk '{print $3}'`
> + crashdump_kernel_option=`echo ${remote_crashdump_state}|awk '{print $1}'`
> + if [ ${crashdump_kernel_option} -ne 1 ];then
> + ha_log.sh err "Crashdump seems not to be configured on host ${REMOTESSHHOST}"
> + exit 1
> + fi
> + if [ ${unknown_nmi} -eq 0 -o ${unrecovered_nmi} -eq 0 ]; then
> + ha_log.sh err "Non Maskerable Interupts do not trigger a reset. Set \"kernel.unknown_nmi_panic\" and \"kernel.panic_on_unrecovered_nmi\" to \"1\""

Replace "Non Maskerable Interupts" with NMI. Easier to read. And
if they don't know what it is, then this message has been wasted
anyway.

Cheers,

Dejan

> + exit 1
> + fi
> + fi
> + fi
> do_ipmi "${STATUS}"
> exit $?
> ;;
> getconfignames)
> - for i in hostname ipaddr userid passwd interface; do
> + for i in hostname ipaddr userid passwd interface crashdump sshipaddr sshcheck; do
> echo $i
> done
> exit 0
> @@ -266,6 +303,39 @@
> </longdesc>
> </parameter>
>
> +<parameter name="crashdump" unique="0" required="0">
> +<content type="string" default="false"/>
> +<shortdesc lang="en">
> +Trigger Crahdump
> +</shortdesc>
> +<longdesc lang="en">
> +Instead of sending a reset to the IPMI board, submit a NMI signal to trigger a crashdump.
> +
> +!!! ATTENTION USE ONLY FOR DEBUGGING PURPOSES. NMI MUST BE TESTED PRIOR TO USE !!!
> +</longdesc>
> +</parameter>
> +
> +<parameter name="sshipaddr" unique="0">
> +<content type="string" />
> +<shortdesc lang="en">
> +IP Address of the node to stonith.
> +</shortdesc>
> +<longdesc lang="en">
> +The IP address of the node to contact via ssh in case it differs from hostname to perform checks regarding crashdump and NMI configuration.
> +</longdesc>
> +</parameter>
> +
> +<parameter name="sshcheck" unique="0">
> +<content type="string" default="false"/>
> +<shortdesc lang="en">
> +Checks whether node is configured for crashdump.
> +</shortdesc>
> +<longdesc lang="en">
> +This will be done via ssh and requires a password-less ssh connection.
> +Enable Crashdump Checks. (true|false)
> +</longdesc>
> +</parameter>
> +
> </parameters>
> IPMIXML
> exit 0
>
>
>

> begin:vcard
> fn:Tobias Dominik Oestreicher
> n:Oestreicher;Tobias Dominik
> org:B1-Systems
> adr;quoted-printable:;;Osterfeldstra=C3=9Fe 7;Vohburg;;85088;Deutschland
> email;internet:oestreicher@b1-systems.de
> title:Linux / Unix Consultant & Trainer
> tel;cell:+49 160 53 299 35
> url:http://www.b1-systems.de
> version:2.1
> end:vcard
>




> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/