Mailing List Archive

Pacemaker fencing and DLM/cLVM
Hello,

In my pacemaker/corosync cluster it looks like I have some issues with
fencing ACK on DLM/cLVM.

When a node is fenced, DLM/cLVM are not aware of the fencing result, and
LVM commands hang unless I run “dlm_tool fence_ack <ID_OF_THE_NODE>”.
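
For reference, the manual workaround looks roughly like this (a sketch; the
node id is the corosync nodeid that appears in the dlm_controld log messages
below, e.g. 1084811078 for nebula1):

    # find the corosync nodeid of the fenced node
    corosync-quorumtool -l

    # tell dlm_controld that the node really has been fenced
    dlm_tool fence_ack 1084811078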

Here are some logs from around the fencing of nebula1:

Nov 24 09:51:06 nebula3 crmd[6043]: warning: update_failcount: Updating failcount for clvm on nebula1 after failed stop: rc=1 (update=INFINITY, time=1416819066)
Nov 24 09:51:06 nebula3 pengine[6042]: warning: unpack_rsc_op: Processing failed op stop for clvm:0 on nebula1: unknown error (1)
Nov 24 09:51:06 nebula3 pengine[6042]: warning: pe_fence_node: Node nebula1 will be fenced because of resource failure(s)
Nov 24 09:51:06 nebula3 pengine[6042]: warning: stage6: Scheduling Node nebula1 for STONITH
Nov 24 09:51:06 nebula3 pengine[6042]: notice: native_stop_constraints: Stop of failed resource clvm:0 is implicit after nebula1 is fenced
Nov 24 09:51:06 nebula3 pengine[6042]: notice: LogActions: Move Stonith-nebula3-IPMILAN#011(Started nebula1 -> nebula2)
Nov 24 09:51:06 nebula3 pengine[6042]: notice: LogActions: Stop dlm:0#011(nebula1)
Nov 24 09:51:06 nebula3 pengine[6042]: notice: LogActions: Stop clvm:0#011(nebula1)
Nov 24 09:51:06 nebula3 pengine[6042]: warning: process_pe_message: Calculated Transition 4: /var/lib/pacemaker/pengine/pe-warn-1.bz2
Nov 24 09:51:06 nebula3 pengine[6042]: warning: unpack_rsc_op: Processing failed op stop for clvm:0 on nebula1: unknown error (1)
Nov 24 09:51:06 nebula3 pengine[6042]: warning: pe_fence_node: Node nebula1 will be fenced because of resource failure(s)
Nov 24 09:51:06 nebula3 pengine[6042]: warning: stage6: Scheduling Node nebula1 for STONITH
Nov 24 09:51:06 nebula3 pengine[6042]: notice: native_stop_constraints: Stop of failed resource clvm:0 is implicit after nebula1 is fenced
Nov 24 09:51:06 nebula3 pengine[6042]: notice: LogActions: Move Stonith-nebula3-IPMILAN#011(Started nebula1 -> nebula2)
Nov 24 09:51:06 nebula3 pengine[6042]: notice: LogActions: Stop dlm:0#011(nebula1)
Nov 24 09:51:06 nebula3 pengine[6042]: notice: LogActions: Stop clvm:0#011(nebula1)
Nov 24 09:51:06 nebula3 pengine[6042]: warning: process_pe_message: Calculated Transition 5: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Nov 24 09:51:06 nebula3 crmd[6043]: notice: te_fence_node: Executing reboot fencing operation (79) on nebula1 (timeout=30000)
Nov 24 09:51:06 nebula3 stonith-ng[6039]: notice: handle_request: Client crmd.6043.5ec58277 wants to fence (reboot) 'nebula1' with device '(any)'
Nov 24 09:51:06 nebula3 stonith-ng[6039]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for nebula1: 50c93bed-e66f-48a5-bd2f-100a9e7ca7a1 (0)
Nov 24 09:51:06 nebula3 stonith-ng[6039]: notice: can_fence_host_with_device: Stonith-nebula1-IPMILAN can fence nebula1: static-list
Nov 24 09:51:06 nebula3 stonith-ng[6039]: notice: can_fence_host_with_device: Stonith-nebula2-IPMILAN can not fence nebula1: static-list
Nov 24 09:51:06 nebula3 stonith-ng[6039]: notice: can_fence_host_with_device: Stonith-ONE-Frontend can not fence nebula1: static-list
Nov 24 09:51:09 nebula3 corosync[5987]: [TOTEM ] A processor failed, forming new configuration.
Nov 24 09:51:13 nebula3 corosync[5987]: [TOTEM ] A new membership (192.168.231.71:81200) was formed. Members left: 1084811078
Nov 24 09:51:13 nebula3 lvm[6311]: confchg callback. 0 joined, 1 left, 2 members
Nov 24 09:51:13 nebula3 corosync[5987]: [QUORUM] Members[2]: 1084811079 1084811080
Nov 24 09:51:13 nebula3 corosync[5987]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 24 09:51:13 nebula3 pacemakerd[6036]: notice: crm_update_peer_state: pcmk_quorum_notification: Node nebula1[1084811078] - state is now lost (was member)
Nov 24 09:51:13 nebula3 crmd[6043]: notice: crm_update_peer_state: pcmk_quorum_notification: Node nebula1[1084811078] - state is now lost (was member)
Nov 24 09:51:13 nebula3 kernel: [ 510.140107] dlm: closing connection to node 1084811078
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence status 1084811078 receive 1 from 1084811079 walltime 1416819073 local 509
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence request 1084811078 pid 7142 nodedown time 1416819073 fence_all dlm_stonith
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence result 1084811078 pid 7142 result 1 exit status
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence status 1084811078 receive 1 from 1084811080 walltime 1416819073 local 509
Nov 24 09:51:13 nebula3 dlm_controld[6263]: 509 fence request 1084811078 no actor
Nov 24 09:51:13 nebula3 stonith-ng[6039]: notice: remote_op_done: Operation reboot of nebula1 by nebula2 for crmd.6043@nebula3.50c93bed: OK
Nov 24 09:51:13 nebula3 crmd[6043]: notice: tengine_stonith_callback: Stonith operation 4/79:5:0:817919e5-fa6d-4381-b0bd-42141ce0bb41: OK (0)
Nov 24 09:51:13 nebula3 crmd[6043]: notice: tengine_stonith_notify: Peer nebula1 was terminated (reboot) by nebula2 for nebula3: OK (ref=50c93bed-e66f-48a5-bd2f-100a9e7ca7a1) by client crmd.6043
Nov 24 09:51:13 nebula3 crmd[6043]: notice: te_rsc_command: Initiating action 22: start Stonith-nebula3-IPMILAN_start_0 on nebula2
Nov 24 09:51:14 nebula3 crmd[6043]: notice: run_graph: Transition 5 (Complete=11, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Nov 24 09:51:14 nebula3 pengine[6042]: notice: process_pe_message: Calculated Transition 6: /var/lib/pacemaker/pengine/pe-input-2.bz2
Nov 24 09:51:14 nebula3 crmd[6043]: notice: te_rsc_command: Initiating action 21: monitor Stonith-nebula3-IPMILAN_monitor_1800000 on nebula2
Nov 24 09:51:15 nebula3 crmd[6043]: notice: run_graph: Transition 6 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2.bz2): Complete
Nov 24 09:51:15 nebula3 crmd[6043]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Nov 24 09:52:10 nebula3 dlm_controld[6263]: 566 datastores wait for fencing
Nov 24 09:52:10 nebula3 dlm_controld[6263]: 566 clvmd wait for fencing
Nov 24 09:55:10 nebula3 dlm_controld[6263]: 747 fence status 1084811078 receive -125 from 1084811079 walltime 1416819310 local 747

When the node is fenced I have “clvmd wait for fencing” and “datastores
wait for fencing” (datastores is my GFS2 volume).
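
To see which lockspaces are stuck, something like this should show the wait
state and dlm_controld's view of the pending fence request (a sketch):

    # lockspaces blocked behind fencing show a fencing wait condition
    dlm_tool ls

    # dlm_controld's debug buffer keeps the same fence request/result
    # lines as syslog
    dlm_tool dump | grep -i fence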

Any idea what I can check when this happens?

Regards.
--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
Re: Pacemaker fencing and DLM/cLVM
On Monday, 24 November 2014, 15:14:26, Daniel Dehennin wrote:
> Hello,
>
> In my pacemaker/corosync cluster it looks like I have some issues with
> fencing ACK on DLM/cLVM.
>
> When a node is fenced, DLM/cLVM are not aware of the fencing result, and
> LVM commands hang unless I run “dlm_tool fence_ack <ID_OF_THE_NODE>”.
>
> Here are some logs from around the fencing of nebula1:
>
> [...]
>
> When the node is fenced I have “clvmd wait for fencing” and “datastores
> wait for fencing” (datastores is my GFS2 volume).
>
> Any idea what I can check when this happens?
>
> Regards.

Yes. You have to tell all the underlying infrastructure to use the fencing of
pacemaker. I assume that you are working on a RH clone.

See: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Clusters_from_Scratch/ch08s02s03.html
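
Roughly, the two usual integration points are these (a sketch, not taken from
the linked document; the cluster.conf path only exists on CMAN-based stacks):

    # CMAN-based (RH-style) stacks redirect fencing to Pacemaker through
    # the fence_pcmk agent referenced by each clusternode entry
    grep -B2 -A2 fence_pcmk /etc/cluster/cluster.conf

On a corosync 2 stack there is no cluster.conf; dlm_controld itself hands
fencing off to Pacemaker through its dlm_stonith helper, which is what the
"fence_all dlm_stonith" line in the logs above refers to.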


Kind regards,

Michael Schwartzkopff

--
[*] sys4 AG

http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
Franziskanerstraße 15, 81669 München

Registered office: Munich, Munich District Court (Amtsgericht München): HRB 199263
Management board: Patrick Ben Koetter, Marc Schiffbauer
Chairman of the supervisory board: Florian Kirstein
Re: Pacemaker fencing and DLM/cLVM
Michael Schwartzkopff <ms@sys4.de> writes:

> Yes. You have to tell all the underlying infrastructure to use the fencing of
> pacemaker. I assume that you are working on a RH clone.
>
> See: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-plugin/html/Clusters_from_Scratch/ch08s02s03.html

Sorry, this is my fault.

I'm using Ubuntu 14.04:

- corosync 2.3.3-1ubuntu1
- pacemaker 1.1.10+git20130802-1ubuntu2.1

I thought everything was integrated in such a configuration.
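
One way to double-check that integration on this stack (a sketch; the
dlm.conf path and option names are assumptions for the Ubuntu packages):

    # dlm_controld should report the dlm_stonith helper as its fence agent
    dlm_tool dump | grep -i fence_all

    # there should be no enable_fencing=0 style override
    grep -v '^#' /etc/dlm/dlm.conf 2>/dev/null

    # and Pacemaker-level fencing must be enabled
    crm configure show | grep stonith-enabled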

Regards.

--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
Re: Pacemaker fencing and DLM/cLVM
Daniel Dehennin <daniel.dehennin@baby-gnu.org> writes:

> I'm using Ubuntu 14.04:
>
> - corosync 2.3.3-1ubuntu1
> - pacemaker 1.1.10+git20130802-1ubuntu2.1
>
> I thought everything was integrated in such configuration.

Here is some more information:

- the pacemaker configuration
- the log of the DC nebula1 with marks for each step
- the log of the nebula2 with marks for each step
- the log of the nebula3 with marks for each step
- the output of “dlm_tool ls” and “dlm_tool status” before/during/after
nebula2 fencing
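
Something along these lines reproduces that capture (a sketch; the interval
and output file are arbitrary):

    # snapshot the DLM state every 10 seconds around the fencing test
    while true; do
        date
        dlm_tool ls
        dlm_tool status
        sleep 10
    done >> /tmp/dlm-state.log 2>&1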

The steps are:

1. All nodes up, cluster down
2. Start corosync on all nodes
3. Start pacemaker on all nodes
4. Start resource ONE-Storage-Clone (dlm, cLVM, VG, GFS2)
5. Crash nebula2
6. Start corosync on nebula2 after reboot
7. Start pacemaker on nebula2 after reboot

Does anyone understand why DLM did not get the fencing ACK automatically
from stonith?

And why does ONE-Storage-Clone not manage to start on nebula2 after the
fencing?

Regards.
--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
Re: Pacemaker fencing and DLM/cLVM
On 25/11/14 10:45, Daniel Dehennin wrote:
> Daniel Dehennin <daniel.dehennin@baby-gnu.org> writes:
>
>> I'm using Ubuntu 14.04:
>>
>> - corosync 2.3.3-1ubuntu1
>> - pacemaker 1.1.10+git20130802-1ubuntu2.1
>>
>> I thought everything was integrated in such configuration.
>
> Here is some more information:
>
> - the pacemaker configuration
> - the log of the DC nebula1 with marks for each step
> - the log of the nebula2 with marks for each step
> - the log of the nebula3 with marks for each step
> - the output of “dlm_tool ls” and “dlm_tool status” before/during/after
> nebula2 fencing
>
> The steps are:
>
> 1. All nodes up, cluster down
> 2. Start corosync on all nodes
> 3. Start pacemaker on all nodes
> 4. Start resource ONE-Storage-Clone (dlm, cLVM, VG, GFS2)
> 5. Crash nebula2
> 6. Start corosync on nebula2 after reboot
> 7. Start pacemaker on nebula2 after reboot
>
> Does anyone understand why DLM did not get the fencing ACK automatically
> from stonith?
>
> And why does ONE-Storage-Clone not manage to start on nebula2 after the
> fencing?
>

It seems to me that fencing is failing for some reason, though I can't
tell from the logs exactly why, so you might have to investigate your
setup for IPMI to see just what is happening (I'm no IPMI expert, sorry).
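
If it helps, the BMCs can be exercised directly, outside the cluster stack
(a sketch; addresses and credentials are placeholders):

    # query the power state of a node through its BMC
    ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power status

    # a full power cycle exercises the same path an IPMI stonith agent uses
    ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power cycle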

The log files tell me this, though:

Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request 1084811079 pid 7358 nodedown time 1416909392 fence_all dlm_stonith
Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence result 1084811079 pid 7358 result 1 exit status
Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence status 1084811079 receive 1 from 1084811080 walltime 1416909392 local 1035
Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request 1084811079 no actor


Showing a status code '1' from dlm_stonith - the result should be 0 if
fencing completed successfully.


Chrissie

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: Pacemaker fencing and DLM/cLVM
Christine Caulfield <ccaulfie@redhat.com> writes:

> It seems to me that fencing is failing for some reason, though I can't
> tell from the logs exactly why, so you might have to investigate your
> setup for IPMI to see just what is happening (I'm no IPMI expert,
> sorry).

Thanks for looking, but IPMI stonith is actually working; for all the nodes
I tested:

stonith_admin --reboot <node>

And it works.
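
It may also be worth checking what stonith-ng has recorded about the
operation, since that appears to be what the dlm_stonith helper asks
Pacemaker for (a sketch; --last may not be available in every version):

    # fencing history stonith-ng keeps for the node
    stonith_admin --history nebula2

    # when stonith-ng thinks nebula2 was last fenced; this seems to
    # exercise the same kind of history lookup dlm_stonith performs
    stonith_admin --last nebula2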

> The log files tell me this, though:
>
> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request
> 1084811079 pid 7358 nodedown time 1416909392 fence_all dlm_stonith
> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence result
> 1084811079 pid 7358 result 1 exit status
> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence status
> 1084811079 receive 1 from 1084811080 walltime 1416909392 local 1035
> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request
> 1084811079 no actor
>
>
> Showing a status code '1' from dlm_stonith - the result should be 0 if
> fencing completed successfully.

But 1084811080 is nebula3 and in its logs I see:

Nov 25 10:56:33 nebula3 stonith-ng[6232]: notice: can_fence_host_with_device: Stonith-nebula2-IPMILAN can fence nebula2: static-list
[...]
Nov 25 10:56:34 nebula3 stonith-ng[6232]: notice: log_operation: Operation 'reboot' [7359] (call 4 from crmd.5038) for host 'nebula2' with device 'Stonith-nebula2-IPMILAN' returned: 0 (OK)
Nov 25 10:56:34 nebula3 stonith-ng[6232]: error: crm_abort: crm_glib_handler: Forked child 7376 to record non-fatal assert at logging.c:63 : Source ID 20 was not found when attempting to remove it
Nov 25 10:56:34 nebula3 stonith-ng[6232]: error: crm_abort: crm_glib_handler: Forked child 7377 to record non-fatal assert at logging.c:63 : Source ID 21 was not found when attempting to remove it
Nov 25 10:56:34 nebula3 stonith-ng[6232]: notice: remote_op_done: Operation reboot of nebula2 by nebula1 for crmd.5038@nebula1.34bed18c: OK
Nov 25 10:56:34 nebula3 crmd[6236]: notice: tengine_stonith_notify: Peer nebula2 was terminated (reboot) by nebula1 for nebula1: OK (ref=34bed18c-c395-4de2-b323-e00208cac6c7) by client crmd.5038
Nov 25 10:56:34 nebula3 crmd[6236]: notice: crm_update_peer_state: tengine_stonith_notify: Node nebula2[0] - state is now lost (was (null))

Which means to me that stonith-ng managed to fence the node and notified
its success.

How could the “returned: 0 (OK)” become “receive 1”?

A logic issue somewhere between stonith-ng and dlm_controld?

Thanks.
--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF
Re: Pacemaker fencing and DLM/cLVM
On 25/11/14 19:55, Daniel Dehennin wrote:
> Christine Caulfield <ccaulfie@redhat.com> writes:
>
>> It seems to me that fencing is failing for some reason, though I can't
>> tell from the logs exactly why, so you might have to investigate your
>> setup for IPMI to see just what is happening (I'm no IPMI expert,
>> sorry).
>
> Thanks for looking, but actually IPMI stonith is working, for all nodes
> I tested:
>
> stonith_admin --reboot <node>
>
> And it works.
>
>> The log files tell me this, though:
>>
>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request
>> 1084811079 pid 7358 nodedown time 1416909392 fence_all dlm_stonith
>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence result
>> 1084811079 pid 7358 result 1 exit status
>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence status
>> 1084811079 receive 1 from 1084811080 walltime 1416909392 local 1035
>> Nov 25 10:56:32 nebula3 dlm_controld[6465]: 1035 fence request
>> 1084811079 no actor
>>
>>
>> Showing a status code '1' from dlm_stonith - the result should be 0 if
>> fencing completed successfully.
>
> But 1084811080 is nebula3 and in its logs I see:
>
> Nov 25 10:56:33 nebula3 stonith-ng[6232]: notice: can_fence_host_with_device: Stonith-nebula2-IPMILAN can fence nebula2: static-list
> [...]
> Nov 25 10:56:34 nebula3 stonith-ng[6232]: notice: log_operation: Operation 'reboot' [7359] (call 4 from crmd.5038) for host 'nebula2' with device 'Stonith-nebula2-IPMILAN' returned: 0 (OK)
> Nov 25 10:56:34 nebula3 stonith-ng[6232]: error: crm_abort: crm_glib_handler: Forked child 7376 to record non-fatal assert at logging.c:63 : Source ID 20 was not found when attempting to remove it
> Nov 25 10:56:34 nebula3 stonith-ng[6232]: error: crm_abort: crm_glib_handler: Forked child 7377 to record non-fatal assert at logging.c:63 : Source ID 21 was not found when attempting to remove it
> Nov 25 10:56:34 nebula3 stonith-ng[6232]: notice: remote_op_done: Operation reboot of nebula2 by nebula1 for crmd.5038@nebula1.34bed18c: OK
> Nov 25 10:56:34 nebula3 crmd[6236]: notice: tengine_stonith_notify: Peer nebula2 was terminated (reboot) by nebula1 for nebula1: OK (ref=34bed18c-c395-4de2-b323-e00208cac6c7) by client crmd.5038
> Nov 25 10:56:34 nebula3 crmd[6236]: notice: crm_update_peer_state: tengine_stonith_notify: Node nebula2[0] - state is now lost (was (null))
>
> Which means to me that stonith-ng managed to fence the node and notified
> its success.
>
> How could the “returned: 0 (OK)” become “receive 1”?
>
> A logic issue somewhere between stonith-ng and dlm_controld?
>

It could be; I don't know enough about pacemaker to be able to comment
on that, sorry. The 'no actor' message from dlm_controld worries me,
though.


Chrissie

Re: Pacemaker fencing and DLM/cLVM
> On 27 Nov 2014, at 8:17 pm, Christine Caulfield <ccaulfie@redhat.com> wrote:
>
> On 25/11/14 19:55, Daniel Dehennin wrote:
>> [...]
>>
>> A logic issue somewhere between stonith-ng and dlm_controld?
>>
>
> It could be; I don't know enough about pacemaker to be able to comment on that, sorry. The 'no actor' message from dlm_controld worries me, though.

This was fixed a few months ago:

+ David Vossel (9 months ago) 054fedf: Fix: stonith_api_time_helper now returns when the most recent fencing operation completed (origin/pr/444)
+ Andrew Beekhof (9 months ago) d9921e5: Fix: Fencing: Pass the correct options when looking up the history by node name
+ Andrew Beekhof (9 months ago) b0a8876: Log: Fencing: Send details of stonith_api_time() and stonith_api_kick() to syslog

It doesn't seem Ubuntu has these patches.
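
One way to check whether a given build already carries them (a sketch,
assuming a checkout of the Pacemaker source tree):

    # look for the commits in the source history
    git log --oneline --all | grep -E '054fedf|d9921e5|b0a8876'

    # and compare with what the distribution actually ships
    dpkg-query -W -f='${Version}\n' pacemaker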
Re: Pacemaker fencing and DLM/cLVM
Andrew Beekhof <andrew@beekhof.net> writes:

> This was fixed a few months ago:
>
> + David Vossel (9 months ago) 054fedf: Fix: stonith_api_time_helper now returns when the most recent fencing operation completed (origin/pr/444)
> + Andrew Beekhof (9 months ago) d9921e5: Fix: Fencing: Pass the correct options when looking up the history by node name
> + Andrew Beekhof (9 months ago) b0a8876: Log: Fencing: Send details of stonith_api_time() and stonith_api_kick() to syslog
>
> It doesn't seem Ubuntu has these patches

Thanks, I just opened a bug report[1].

Footnotes:
[1] https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1397278

--
Daniel Dehennin
Retrieve my GPG key: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6 2AAD CC1E 9E5B 7A6F E2DF