More Diagnosis help
Hi

Had another node die


Everything on this node is looking good; I am guessing corosync tried to talk to the other node and the communication failed.

Nov 1 00:08:48 demorp2 ntpd[2461]: peers refreshed
Nov 1 00:08:51 demorp2 corosync[2039]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 1 00:08:51 demorp2 corosync[2039]: [CPG ] chosen downlist: sender r(0) ip(10.172.218.52) ; members(old:1 left:0)
Nov 1 00:08:51 demorp2 corosync[2039]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 1 00:09:05 demorp2 corosync[2039]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 1 00:09:05 demorp2 corosync[2039]: [CMAN ] quorum regained, resuming activity
Nov 1 00:09:05 demorp2 corosync[2039]: [QUORUM] This node is within the primary component and will provide service.
Nov 1 00:09:05 demorp2 corosync[2039]: [QUORUM] Members[2]: 1 2
Nov 1 00:09:05 demorp2 corosync[2039]: [QUORUM] Members[2]: 1 2
Nov 1 00:09:05 demorp2 crmd[2725]: notice: cman_event_callback: Membership 320: quorum acquired
Nov 1 00:09:05 demorp2 crmd[2725]: notice: crm_update_peer_state: cman_event_callback: Node demorp1[1] - state is now member (was lost)
Nov 1 00:09:05 demorp2 corosync[2039]: [CPG ] chosen downlist: sender r(0) ip(10.172.218.52) ; members(old:1 left:0)
Nov 1 00:09:05 demorp2 corosync[2039]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 1 00:09:05 demorp2 crmd[2725]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]
Nov 1 00:09:06 demorp2 corosync[2039]: cman killed by node 1 because we were killed by cman_tool or other application
Nov 1 00:09:06 demorp2 attrd[2723]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Nov 1 00:09:06 demorp2 attrd[2723]: crit: attrd_cs_destroy: Lost connection to Corosync service!
Nov 1 00:09:06 demorp2 attrd[2723]: notice: main: Exiting...
Nov 1 00:09:06 demorp2 attrd[2723]: notice: main: Disconnecting client 0xdc3020, pid=2725...
Nov 1 00:09:06 demorp2 pacemakerd[2712]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Nov 1 00:09:06 demorp2 pacemakerd[2712]: error: mcp_cpg_destroy: Connection destroyed
Nov 1 00:09:06 demorp2 stonith-ng[2721]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Nov 1 00:09:06 demorp2 crmd[2725]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Nov 1 00:09:06 demorp2 crmd[2725]: error: crmd_cs_destroy: connection terminated
Nov 1 00:09:06 demorp2 gfs_controld[2173]: cluster is down, exiting
Nov 1 00:09:06 demorp2 gfs_controld[2173]: daemon cpg_dispatch error 2
Nov 1 00:09:06 demorp2 attrd[2723]: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
Nov 1 00:09:06 demorp2 fenced[2098]: cluster is down, exiting
Nov 1 00:09:06 demorp2 fenced[2098]: daemon cpg_dispatch error 2
Nov 1 00:09:06 demorp2 dlm_controld[2124]: cluster is down, exiting
Nov 1 00:09:06 demorp2 dlm_controld[2124]: daemon cpg_dispatch error 2
Nov 1 00:09:06 demorp2 stonith-ng[2721]: error: stonith_peer_cs_destroy: Corosync connection terminated
Nov 1 00:09:06 demorp2 cib[2720]: warning: qb_ipcs_event_sendv: new_event_notification (2720-2721-11): Broken pipe (32)
Nov 1 00:09:06 demorp2 cib[2720]: warning: cib_notify_send_one: Notification of client crmd/4c1076bf-8a95-4f77-b866-e1bbf5e2ceda failed
Nov 1 00:09:06 demorp2 cib[2720]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Nov 1 00:09:06 demorp2 cib[2720]: error: cib_cs_destroy: Corosync connection lost! Exiting.
Nov 1 00:09:06 demorp2 crmd[2725]: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
Nov 1 00:09:06 demorp2 lrmd[2722]: warning: qb_ipcs_event_sendv: new_event_notification (2722-2725-6): Bad file descriptor (9)
Nov 1 00:09:06 demorp2 lrmd[2722]: warning: send_client_notify: Notification of client crmd/3598d3e2-600a-4f15-aae2-e087437d6213 failed
Nov 1 00:09:06 demorp2 lrmd[2722]: warning: send_client_notify: Notification of client crmd/3598d3e2-600a-4f15-aae2-e087437d6213 failed
Nov 1 00:09:08 demorp2 kernel: dlm: closing connection to node 1
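
For reference, a few standard commands that can help confirm each node's view of membership and fencing after an event like this (illustrative only; exact availability depends on the installed cman/corosync packages):

  # cman's view of cluster membership and per-node state
  cman_tool nodes
  # quorum status and cluster generation as cman sees it
  cman_tool status
  # fence domain membership and any pending fence operations
  fence_tool ls
  # local corosync ring/interface status
  corosync-cfgtool -s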


The other node

It looks to me like VMware took too long to give this VM a time slice and corosync responded by killing one node.


Nov 1 00:08:50 demorp1 lrmd[2433]: warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 32026) timed out
Nov 1 00:08:50 demorp1 lrmd[2433]: warning: operation_finished: ybrpstat_monitor_5000:32026 - timed out after 20000ms
Nov 1 00:08:51 demorp1 crmd[2436]: error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (17) Timed Out (timeout=20000ms)
Nov 1 00:08:52 demorp1 crmd[2436]: notice: process_lrm_event: demorp1-ybrpstat_monitor_5000:17 [ Service running for 18 hours 8 minutes 30 seconds.\n ]
Nov 1 00:08:53 demorp1 lrmd[2433]: warning: child_timeout_callback: ybrpip_monitor_5000 process (PID 32033) timed out
Nov 1 00:08:53 demorp1 lrmd[2433]: warning: operation_finished: ybrpip_monitor_5000:32033 - timed out after 20000ms
Nov 1 00:08:53 demorp1 crmd[2436]: error: process_lrm_event: LRM operation ybrpip_monitor_5000 (22) Timed Out (timeout=20000ms)
Nov 1 00:09:05 demorp1 corosync[1748]: [MAIN ] Corosync main process was not scheduled for 16241.7002 ms (threshold is 8000.0000 ms). Consider token timeout increase.
Nov 1 00:09:05 demorp1 corosync[1748]: [TOTEM ] A processor failed, forming new configuration.
Nov 1 00:09:05 demorp1 corosync[1748]: [TOTEM ] Process pause detected for 15555 ms, flushing membership messages.
Nov 1 00:09:05 demorp1 corosync[1748]: [MAIN ] Corosync main process was not scheduled for 15555.0029 ms (threshold is 8000.0000 ms). Consider token timeout increase.
Nov 1 00:09:05 demorp1 corosync[1748]: [CMAN ] quorum lost, blocking activity
Nov 1 00:09:05 demorp1 corosync[1748]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Nov 1 00:09:05 demorp1 corosync[1748]: [QUORUM] Members[1]: 1
Nov 1 00:09:05 demorp1 corosync[1748]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 1 00:09:05 demorp1 corosync[1748]: [CMAN ] quorum regained, resuming activity
Nov 1 00:09:05 demorp1 corosync[1748]: [QUORUM] This node is within the primary component and will provide service.
Nov 1 00:09:05 demorp1 corosync[1748]: [QUORUM] Members[2]: 1 2
Nov 1 00:09:05 demorp1 corosync[1748]: [QUORUM] Members[2]: 1 2
Nov 1 00:09:05 demorp1 corosync[1748]: [CPG ] chosen downlist: sender r(0) ip(10.172.218.51) ; members(old:2 left:1)
Nov 1 00:09:05 demorp1 corosync[1748]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 1 00:09:05 demorp1 crmd[2436]: notice: process_lrm_event: LRM operation ybrpip_monitor_5000 (call=22, rc=0, cib-update=17, confirmed=false) ok
Nov 1 00:09:05 demorp1 crmd[2436]: notice: peer_update_callback: Our peer on the DC is dead
Nov 1 00:09:05 demorp1 crmd[2436]: notice: cman_event_callback: Membership 320: quorum lost
Nov 1 00:09:05 demorp1 crmd[2436]: notice: cman_event_callback: Membership 320: quorum acquired
Nov 1 00:09:05 demorp1 crmd[2436]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
Nov 1 00:09:05 demorp1 crmd[2436]: notice: process_lrm_event: LRM operation ybrpstat_monitor_5000 (call=17, rc=0, cib-update=18, confirmed=false) ok
Nov 1 00:09:06 demorp1 crmd[2436]: warning: do_log: FSA: Input I_JOIN_OFFER from route_message() received in state S_ELECTION
Nov 1 00:09:06 demorp1 crmd[2436]: notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Nov 1 00:09:06 demorp1 fenced[1822]: telling cman to remove nodeid 2 from cluster
Nov 1 00:09:06 demorp1 fenced[1822]: receive_start 2:3 add node with started_count 1
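
Corosync itself hints at the remedy in the demorp1 log ("Consider token timeout increase"): the 8000 ms threshold it reports is 80% of the 10000 ms token that cman uses by default, while the observed scheduling pauses were roughly 15-16 s. Below is a minimal sketch of raising the token on a cman-based stack; the cluster name and config_version are placeholders, and 30000 ms is only an example value chosen to sit comfortably above the observed pauses:

  <!-- /etc/cluster/cluster.conf: bump config_version, then raise the totem token (milliseconds) -->
  <cluster name="demorp" config_version="43">
    <totem token="30000"/>
    <!-- existing clusternodes / fencedevices / rm sections unchanged -->
  </cluster>

  # validate the edited configuration, then activate the new version
  # (propagation details vary: ricci/ccs, or copy cluster.conf to the other node by hand)
  ccs_config_validate
  cman_tool version -r

With a 30000 ms token the "not scheduled" warning threshold becomes 24000 ms, above the ~16 s pauses VMware introduced, at the cost of slower failure detection.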




_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: More Diagnosis help
> On 1 Nov 2014, at 7:07 am, Alex Samad - Yieldbroker <Alex.Samad@yieldbroker.com> wrote:
>
> It looks to me like VMware took too long to give this VM a time slice and corosync responded by killing one node.

That does sound reasonable from the logs you posted.
(sorry, I'm only just catching up on old posts)