Mailing List Archive

odd cluster failure
For the second time in a few weeks, we have had one node of a particular
cluster getting fenced. It isn't totally clear why this is happening. On
the surviving node I see:

Feb 2 16:48:52 vmc1 stonith-ng[4331]: notice: stonith-vm2 can fence (reboot) vmc2.ucar.edu: static-list
Feb 2 16:48:52 vmc1 stonith-ng[4331]: notice: stonith-vm2 can fence (reboot) vmc2.ucar.edu: static-list
Feb 2 16:49:00 vmc1 kernel: igb 0000:03:00.1 eth3: igb: eth3 NIC Link is Down
Feb 2 16:49:00 vmc1 kernel: xenbr0: port 1(eth3) entered disabled state
Feb 2 16:49:01 vmc1 corosync[2846]: [TOTEM ] A processor failed, forming new configuration.

OK, so from this point of view, it looks like the link between the two
hosts was lost, resulting in fencing. The link is a crossover cable, so
there is no networking hardware involved other than the host NICs and the
cable itself.
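
(For context: "static-list" in those stonith-ng messages just means the
fence device carries an explicit pcmk_host_list rather than asking the
agent which hosts it can fence. A minimal sketch of such a device, with the
agent and its parameters made up here since the real ones are not shown:

    primitive stonith-vm2 stonith:fence_ipmilan \
        params pcmk_host_list="vmc2.ucar.edu" \
               ipaddr="192.0.2.12" login="admin" passwd="secret" \
        op monitor interval=60s
    location loc-stonith-vm2 stonith-vm2 -inf: vmc2.ucar.edu

The location constraint is the usual way to keep a fence device off the
node it is meant to kill.)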

On the other side I see:

Feb 2 16:46:46 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb 2 16:46:46 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb 2 16:46:47 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb 2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb 2 16:46:48 vmc2 kernel: device vif17.0 left promiscuous mode
Feb 2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
Feb 2 16:46:48 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
Feb 2 16:46:49 vmc2 crmd[4191]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sending flush op to all hosts for: fail-count-VM-radnets (1)
Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sent update 37: fail-count-VM-radnets=1
Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sending flush op to all hosts for: last-failure-VM-radnets (1486079209)
Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sent update 39: last-failure-VM-radnets=1486079209
Feb 2 16:46:50 vmc2 pengine[4190]: notice: On loss of CCM Quorum: Ignore
Feb 2 16:46:50 vmc2 pengine[4190]: warning: Processing failed op monitor for VM-radnets on vmc2.ucar.edu: not running (7)
Feb 2 16:46:50 vmc2 pengine[4190]: notice: Recover VM-radnets#011(Started vmc2.ucar.edu)
Feb 2 16:46:50 vmc2 pengine[4190]: notice: Calculated Transition 2914: /var/lib/pacemaker/pengine/pe-input-317.bz2
Feb 2 16:46:50 vmc2 crmd[4191]: notice: Initiating action 15: stop VM-radnets_stop_0 on vmc2.ucar.edu (local)
Feb 2 16:46:51 vmc2 Xen(VM-radnets)[1016]: INFO: Xen domain radnets will be stopped (timeout: 80s)
Feb 2 16:46:52 vmc2 kernel: device vif21.0 entered promiscuous mode
Feb 2 16:46:52 vmc2 kernel: IPv6: ADDRCONF(NETDEV_UP): vif21.0: link is not ready
Feb 2 16:46:57 vmc2 kernel: xen-blkback:ring-ref 9, event-channel 10, protocol 1 (x86_64-abi)
Feb 2 16:46:57 vmc2 kernel: vif vif-21-0 vif21.0: Guest Rx ready
Feb 2 16:46:57 vmc2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vif21.0: link becomes ready
Feb 2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
Feb 2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
Feb 2 16:47:12 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state

(and then there are a bunch of null bytes, and the log resumes with the reboot)

There are more networking messages in there, but xenbr1 is not the bridge
device associated with the NIC in question.

I don't see any reason why the link between the hosts should suddenly stop
working, so I am suspecting a hardware problem that only crops up rarely
(but will most likely get worse over time).
Does anyone see anything in these logs that would suggest otherwise?
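
(A quick way to look for evidence of a flaky NIC or cable is to compare the
link state and error counters on both ends of the crossover link, for
example:

    ethtool eth3                              # link state, speed, duplex
    ethtool -S eth3 | grep -iE 'err|crc|drop' # per-NIC error counters

The device name is just the one from the log above; it may differ on the
other host.)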

Thank you,
--Greg
Re: odd cluster failure
Greg Woods <woods@ucar.edu> writes:

> For the second time in a few weeks, we have had one node of a particular
> cluster getting fenced. It isn't totally clear why this is happening. On
> the surviving node I see:
>
> Feb 2 16:48:52 vmc1 stonith-ng[4331]: notice: stonith-vm2 can fence (reboot) vmc2.ucar.edu: static-list
> Feb 2 16:48:52 vmc1 stonith-ng[4331]: notice: stonith-vm2 can fence (reboot) vmc2.ucar.edu: static-list
> Feb 2 16:49:00 vmc1 kernel: igb 0000:03:00.1 eth3: igb: eth3 NIC Link is Down
> Feb 2 16:49:00 vmc1 kernel: xenbr0: port 1(eth3) entered disabled state
> Feb 2 16:49:01 vmc1 corosync[2846]: [TOTEM ] A processor failed, forming new configuration.
>
> OK, so from this point of view, it looks like the link was lost
> between the two hosts, resulting in fencing.

I'd say the other way around: the fencing resulted in a link loss.

> Feb 2 16:46:46 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
> Feb 2 16:46:46 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
> Feb 2 16:46:47 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
> Feb 2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
> Feb 2 16:46:48 vmc2 kernel: device vif17.0 left promiscuous mode
> Feb 2 16:46:48 vmc2 kernel: xenbr1: port 16(vif17.0) entered disabled state
> Feb 2 16:46:48 vmc2 kernel: xen:balloon: Cannot add additional memory (-17)
> Feb 2 16:46:49 vmc2 crmd[4191]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sending flush op to all hosts for: fail-count-VM-radnets (1)
> Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sent update 37: fail-count-VM-radnets=1
> Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sending flush op to all hosts for: last-failure-VM-radnets (1486079209)
> Feb 2 16:46:49 vmc2 attrd[4189]: notice: Sent update 39: last-failure-VM-radnets=1486079209
> Feb 2 16:46:50 vmc2 pengine[4190]: notice: On loss of CCM Quorum: Ignore
> Feb 2 16:46:50 vmc2 pengine[4190]: warning: Processing failed op monitor for VM-radnets on vmc2.ucar.edu: not running (7)

Looks like your VM resource was destroyed (maybe due to the xen balloon
errors above), and the monitor operation noticed this.
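
The "not running (7)" in the pengine warning above is OCF_NOT_RUNNING: the
Xen agent's monitor essentially asks the toolstack whether the domain still
exists, conceptually something like this (a simplified sketch, not the
agent's actual code):

    # report 0 (OCF_SUCCESS) if the domain is known, 7 (OCF_NOT_RUNNING) if not
    if xl list radnets >/dev/null 2>&1; then
        exit 0
    else
        exit 7
    fi

So once the domain vanished, the next monitor run reported it as stopped
and the cluster scheduled a recovery.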

> Feb 2 16:46:50 vmc2 pengine[4190]: notice: Recover VM-radnets#011(Started vmc2.ucar.edu)
> Feb 2 16:46:50 vmc2 pengine[4190]: notice: Calculated Transition 2914: /var/lib/pacemaker/pengine/pe-input-317.bz2
> Feb 2 16:46:50 vmc2 crmd[4191]: notice: Initiating action 15: stop VM-radnets_stop_0 on vmc2.ucar.edu (local)
> Feb 2 16:46:51 vmc2 Xen(VM-radnets)[1016]: INFO: Xen domain radnets will be stopped (timeout: 80s)

If that stop operation failed for any reason, fencing could be expected.
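
That is because Pacemaker's default on-fail for a stop operation is "fence"
whenever STONITH is enabled. Spelled out against a resource definition (the
path and timings below are illustrative only):

    primitive VM-radnets ocf:heartbeat:Xen \
        params xmfile="/etc/xen/radnets.cfg" \
        op monitor interval=30s timeout=30s \
        op stop interval=0 timeout=100s on-fail=fence  # fence is already the default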

> Feb 2 16:46:52 vmc2 kernel: device vif21.0 entered promiscuous mode
> Feb 2 16:46:52 vmc2 kernel: IPv6: ADDRCONF(NETDEV_UP): vif21.0: link is not ready
> Feb 2 16:46:57 vmc2 kernel: xen-blkback:ring-ref 9, event-channel 10, protocol 1 (x86_64-abi)
> Feb 2 16:46:57 vmc2 kernel: vif vif-21-0 vif21.0: Guest Rx ready
> Feb 2 16:46:57 vmc2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vif21.0: link becomes ready
> Feb 2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
> Feb 2 16:46:57 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
> Feb 2 16:47:12 vmc2 kernel: xenbr1: port 2(vif21.0) entered forwarding state
>
> (and then there are a bunch of null bytes, and the log resumes with reboot)

Remote logging helps a lot with such issues.
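
Even a single forwarding rule (this assumes rsyslog; the log host below is
a placeholder) would have preserved the messages that the null bytes ate:

    # /etc/rsyslog.d/remote.conf -- ship everything to a central host over TCP
    *.* @@loghost.example.com:514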
--
Feri
Re: odd cluster failure
On Thu, Feb 9, 2017 at 2:28 AM, Ferenc Wágner <wferi@niif.hu> wrote:

> Looks like your VM resource was destroyed (maybe due to the xen balloon
> errors above), and the monitor operation noticed this.
>

Thank you for helping me interpret that. I think I see what happened: the
VM in question (radnets) is the only one that does not have maxmem
specified in its config file. It probably came under memory pressure and
the hypervisor tried to give it more memory, but ballooning is turned off
in the hypervisor, which is probably where the balloon errors come from.
The VM probably got hung up because it ran out of memory, causing the
monitor to fail.
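
(A quick way to check that theory on a running domain is to compare its
current allocation with its configured maximum, e.g. with the xl toolstack:

    xl list radnets                    # current memory allocation
    xl list -l radnets | grep -i max   # configured maximum

The exact commands and output depend on the toolstack version, so treat
this as a sketch.)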

There is a little guesswork going on here, because I do not fully
understand how Xen ballooning works (or is supposed to work), but it seems
like I should set maxmem for this VM like all the others, so I did that and
increased its available memory as well. Now I just have to wait and see
whether it happens again.
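
In config-file terms the change amounts to something like this (the path
and numbers are illustrative, not the real ones):

    # /etc/xen/radnets.cfg
    name   = "radnets"
    memory = 4096
    maxmem = 4096   # equal to memory, so the balloon has no headroom to grow into

Whether maxmem should equal memory or sit somewhat above it depends on
whether you actually want ballooning for the guest; equal values pin the
allocation.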

--Greg