Mailing List Archive

Pacemaker doesn't actually call STONITH; instead it stops itself
Hi all,

I've run into a problem where a node is not actually rebooted when a
resource fails to stop on it.
The fence agent is self-written, and it works in the case of a network
outage and in all other cases.
I went through all the logs on both nodes and couldn't understand why
node-0 is not actually rebooted.
I would appreciate some help here.
Below are two brief snippets from the logs (the most interesting parts, in
my view).
I've also attached the log files and a screenshot of "crm_mon" on node-1.


The problem:
-------------------
"stop" action for "sm1dh" fails on "node-0", and "node-0" is not actually
rebooted by "node-1".


The setup:
---------------
There are two nodes: node-0,node-1


Fence agents are configured with:
-------------------------------------
crm configure primitive STONITH_node-1 stonith:fence_avid_sbb_hw
crm configure primitive STONITH_node-0 stonith:fence_avid_sbb_hw \
params delay="10"

crm configure location dont_run_STONITH_node-1_on_node-1 \
    STONITH_node-1 -inf: node-1
crm configure location dont_run_STONITH_node-0_on_node-0 \
    STONITH_node-0 -inf: node-0
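
To rule out the devices themselves, fencing can also be triggered by hand
from node-1 with the standard Pacemaker tooling (a sketch, using the device
and host names configured above):

```shell
# Ask stonithd to reboot node-0 through whatever device is registered for it
stonith_admin --reboot node-0

# Check the status of the specific fence device for node-0
stonith_admin --query STONITH_node-0
```

If the manual reboot above also returns OK without node-0 actually going
down, the problem is in the fence agent rather than in Pacemaker.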


A few lines from /var/log/cluster/corosync.log on "node-0":
-------------------------------------------------------------------------------------------
Feb 10 19:29:40 [3204] isis-seth943f pengine: info: native_print:
sm1dh (ocf::avid:diskHelper): FAILED node-0
...
Feb 10 19:29:40 [3201] isis-seth943f stonithd: notice:
handle_request: Client
crmd.3205.09022f74 wants to fence (reboot) 'node-0' with device '(any)'
Feb 10 19:29:40 [3201] isis-seth943f stonithd: notice:
initiate_remote_stonith_op: Initiating remote operation reboot for node-0:
51063a89-0df0-4dd7-8f22-667ca5db05f0 (0)
Feb 10 19:29:41 [3201] isis-seth943f stonithd: info:
process_remote_stonith_query: Query result 2 of 2 from node-1 for
node-0/reboot (1 devices) 51063a89-0df0-4dd7-8f22-667ca5db05f0
...
Feb 10 19:29:51 [3205] isis-seth943f crmd: crit:
tengine_stonith_notify: We were alegedly just fenced by node-1 for node-0!
...
Feb 10 19:29:51 [3198] isis-seth943f pacemakerd: error:
pcmk_child_exit: Child
process crmd (3205) exited: Network is down (100)
Feb 10 19:29:51 [3198] isis-seth943f pacemakerd: warning:
pcmk_child_exit: Pacemaker
child process crmd no longer wishes to be respawned. Shutting ourselves down


A few lines from /var/log/cluster/corosync.log on "node-1":
-------------------------------------------------------------------------------------------
Feb 10 19:28:15 [3184] isis-seth944b stonithd: notice:
log_operation: Operation
'reboot' [4596] (call 2 from crmd.3205) for host 'node-0' with device
'STONITH_node-0' returned: 0 (OK)
Feb 10 19:28:15 [3184] isis-seth944b stonithd: warning:
get_xpath_object: No match for //@st_delegate in /st-reply
Feb 10 19:28:15 [3184] isis-seth944b stonithd: notice:
remote_op_done: Operation
reboot of node-0 by node-1 for crmd.3205@node-0.51063a89: OK
Feb 10 19:28:15 [3188] isis-seth944b crmd: notice:
tengine_stonith_notify: Peer node-0 was terminated (reboot) by node-1 for
node-0: OK (ref=51063a89-0df0-4dd7-8f22-667ca5db05f0) by client crmd.3205


Time difference between the nodes (sorry for that):
------------------------------------------------------------------------
node-0: t
node-1: t - 97 seconds
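
For reading the two logs side by side: shifting a node-1 timestamp by the
97-second skew puts it onto node-0's clock, so node-1's "reboot ... OK" at
19:28:15 corresponds to 19:29:52 on node-0, i.e. right around when crmd on
node-0 reports being fenced at 19:29:51. A quick arithmetic sketch:

```shell
# 19:28:15 on node-1, converted to seconds since midnight,
# plus the 97-second skew, gives the equivalent node-0 time
s=$((19*3600 + 28*60 + 15 + 97))
printf '%02d:%02d:%02d\n' $((s/3600)) $((s%3600/60)) $((s%60))
# prints 19:29:52
```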


Thank you,
Kostya