Mailing List Archive

One node thinks everyone is online, but the other node doesn't think so
Hello.

I have a 3-node cluster where node1 and node2 are running
corosync+pacemaker and node3 is running corosync only (for quorum).
Corosync 2.3.3, pacemaker 1.1.10. Everything worked fine for the first
couple of days.

At some point I discovered the following situation: node2 thinks that
both node1 and node2 are online, but node1 thinks that node2 is down. Could
you please explain how this can happen? There are no connectivity problems
between the nodes at the moment (maybe there were earlier, but in that case
why hasn't the system recovered?). The "crm status" output is below. What
other logs should I attach for diagnostics?

Also, "service corosync stop" on node1 hangs forever with no additional
lines in the logs, so I cannot even stop the service. (After running
"service corosync stop" on node1, node2 does start reporting node1 as
offline, even though the command itself still hangs.)


root@node2:~# crm status
Current DC: node1 (1760315215) - partition with quorum
2 Nodes configured
6 Resources configured
Online: [ node1 node2 ]
Master/Slave Set: ms_drbd [drbd]
Masters: [ node2 ]
Slaves: [ node1 ]
Resource Group: server
fs (ocf::heartbeat:Filesystem): Started node2
postgresql (lsb:postgresql): Started node2
bind9 (lsb:bind9): Started node2
nginx (lsb:nginx): Started node2


root@node1:/var/log# crm status
Current DC: node1 (1760315215) - partition with quorum
2 Nodes configured
6 Resources configured
Online: [ node1 ]
OFFLINE: [ node2 ]
Master/Slave Set: ms_drbd [drbd]
Masters: [ node1 ]
Stopped: [ node2 ]
Resource Group: server
fs (ocf::heartbeat:Filesystem): Started node1
postgresql (lsb:postgresql): Started node1
bind9 (lsb:bind9): Started node1
nginx (lsb:nginx): Started node1
Failed actions:
drbd_promote_0 (node=node1, call=634, rc=1, status=Timed Out,
last-rc-change=Thu Jan 22 10:30:08 2015, queued=20004ms, exec=0ms): unknown
error


A part of "crm configure show":

property $id="cib-bootstrap-options" \
dc-version="1.1.10-42f2063" \
cluster-infrastructure="corosync" \
stonith-enabled="false" \
last-lrm-refresh="1421250983"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
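Note that stonith-enabled="false" here is exactly what the pengine errors
on node1 below complain about. As a hedged sketch only (the
stonith:external/ipmi agent and all addresses/credentials are placeholders;
substitute whatever fencing hardware the nodes actually have), enabling
fencing with crmsh could look like:

```shell
# Define one fencing device per node; every parameter value here is
# hypothetical and must be replaced with real BMC addresses/credentials.
crm configure primitive fence-node1 stonith:external/ipmi \
    params hostname=node1 ipaddr=10.0.0.11 userid=admin passwd=secret \
    op monitor interval=60s
crm configure primitive fence-node2 stonith:external/ipmi \
    params hostname=node2 ipaddr=10.0.0.12 userid=admin passwd=secret \
    op monitor interval=60s
# A fencing device should not run on the node it is meant to fence.
crm configure location fence-node1-placement fence-node1 -inf: node1
crm configure location fence-node2-placement fence-node2 -inf: node2
# Only then turn fencing on.
crm configure property stonith-enabled=true
```

With fencing enabled, a node that falls out of membership gets powered off
instead of both partitions promoting DRBD, which is the "RESOURCES ARE NOW
LIKELY COMPROMISED" risk the pengine warns about.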


I also see the following in the logs on node1 (possibly related to the issue, possibly not):

Jan 22 10:14:02 node1 pengine[2772]: warning: pe_fence_node: Node node2 is
unclean because it is partially and/or un-expectedly down
Jan 22 10:14:02 node1 pengine[2772]: warning: determine_online_status:
Node node2 is unclean
Jan 22 10:14:02 node1 pengine[2772]: warning: stage6: Node node2 is
unclean!
Jan 22 10:14:02 node1 pengine[2772]: warning: stage6: YOUR RESOURCES ARE
NOW LIKELY COMPROMISED
Jan 22 10:14:02 node1 pengine[2772]: error: stage6: ENABLE STONITH TO
KEEP YOUR RESOURCES SAFE


On node2 the logs are:

Jan 22 10:13:57 node2 corosync[32761]: [TOTEM ] A new membership (
188.166.54.190:6276) was formed. Members left: 1760315215 13071578
Jan 22 10:13:57 node2 crmd[311]: notice: peer_update_callback: Our peer
on the DC is dead
Jan 22 10:13:57 node2 crmd[311]: notice: do_state_transition: State
transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION
cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
Jan 22 10:13:57 node2 corosync[32761]: [QUORUM] This node is within the
non-primary component and will NOT provide any services.
Jan 22 10:13:57 node2 corosync[32761]: [QUORUM] Members[1]: 1017525950
Jan 22 10:13:57 node2 crmd[311]: notice: pcmk_quorum_notification:
Membership 6276: quorum lost (1)
Jan 22 10:13:57 node2 crmd[311]: notice: crm_update_peer_state:
pcmk_quorum_notification: Node (null)[13071578] - state is now lost (was
member)
Jan 22 10:13:57 node2 crmd[311]: notice: crm_update_peer_state:
pcmk_quorum_notification: Node node1[1760315215] - state is now lost (was
member)
Jan 22 10:13:57 node2 pacemakerd[302]: notice: pcmk_quorum_notification:
Membership 6276: quorum lost (1)
Jan 22 10:13:57 node2 pacemakerd[302]: notice: crm_update_peer_state:
pcmk_quorum_notification: Node node1[1760315215] - state is now lost (was
member)
Jan 22 10:13:57 node2 pacemakerd[302]: notice: crm_update_peer_state:
pcmk_quorum_notification: Node (null)[13071578] - state is now lost (was
member)
Jan 22 10:13:57 node2 corosync[32761]: [MAIN ] Completed service
synchronization, ready to provide service.
Jan 22 10:14:01 node2 corosync[32761]: [TOTEM ] A new membership (
104.236.71.79:6288) was formed. Members joined: 1760315215 13071578
Jan 22 10:14:02 node2 crmd[311]: error: pcmk_cpg_membership: Node
node1[1760315215] appears to be online even though we think it is dead
Jan 22 10:14:02 node2 crmd[311]: notice: crm_update_peer_state:
pcmk_cpg_membership: Node node1[1760315215] - state is now member (was lost)
Jan 22 10:14:03 node2 corosync[32761]: [QUORUM] This node is within the
primary component and will provide service.
Jan 22 10:14:03 node2 corosync[32761]: [QUORUM] Members[3]: 1760315215
13071578 1017525950
Jan 22 10:14:03 node2 crmd[311]: notice: pcmk_quorum_notification:
Membership 6288: quorum acquired (3)
Jan 22 10:14:03 node2 pacemakerd[302]: notice: pcmk_quorum_notification:
Membership 6288: quorum acquired (3)
Jan 22 10:14:03 node2 pacemakerd[302]: notice: crm_update_peer_state:
pcmk_quorum_notification: Node node1[1760315215] - state is now member (was
lost)
Jan 22 10:14:03 node2 corosync[32761]: [MAIN ] Completed service
synchronization, ready to provide service.
Jan 22 10:14:03 node2 crmd[311]: notice: corosync_node_name: Unable to
get node name for nodeid 13071578
Jan 22 10:14:03 node2 crmd[311]: notice: crm_update_peer_state:
pcmk_quorum_notification: Node (null)[13071578] - state is now member (was
lost)
Jan 22 10:14:03 node2 pacemakerd[302]: notice: corosync_node_name: Unable
to get node name for nodeid 13071578
Jan 22 10:14:03 node2 pacemakerd[302]: notice: crm_update_peer_state:
pcmk_quorum_notification: Node (null)[13071578] - state is now member (was
lost)
Jan 22 10:14:03 node2 crmd[311]: warning: do_log: FSA: Input I_JOIN_OFFER
from route_message() received in state S_ELECTION
Jan 22 10:14:04 node2 crmd[311]: notice: do_state_transition: State
transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
origin=do_election_count_vote ]
Jan 22 10:14:05 node2 attrd[310]: notice: attrd_local_callback: Sending
full refresh (origin=crmd)
Jan 22 10:14:05 node2 attrd[310]: notice: attrd_trigger_update: Sending
flush op to all hosts for: master-drbd (10000)
Jan 22 10:14:05 node2 attrd[310]: notice: attrd_trigger_update: Sending
flush op to all hosts for: probe_complete (true)
Jan 22 10:14:05 node2 crmd[311]: notice: do_state_transition: State
transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE
origin=do_cl_join_finalize_respond ]
Jan 22 10:15:11 node2 corosync[32761]: [TOTEM ] A new membership (
104.236.71.79:6296) was formed. Members left: 13071578
Jan 22 10:15:14 node2 corosync[32761]: [TOTEM ] A new membership (
128.199.116.218:6312) was formed. Members joined: 13071578 left: 1760315215
Jan 22 10:15:17 node2 corosync[32761]: [TOTEM ] A new membership (
104.236.71.79:6324) was formed. Members joined: 1760315215
Jan 22 10:15:19 node2 crmd[311]: notice: peer_update_callback: Our peer
on the DC is dead
Jan 22 10:15:19 node2 crmd[311]: notice: do_state_transition: State
transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION
cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
Jan 22 10:15:20 node2 kernel: [690741.179442] block drbd0: peer( Primary ->
Secondary )
Jan 22 10:15:20 node2 corosync[32761]: [QUORUM] Members[3]: 1760315215
13071578 1017525950
Jan 22 10:15:20 node2 corosync[32761]: [MAIN ] Completed service
synchronization, ready to provide service.
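The repeated "Unable to get node name for nodeid 13071578" notices above
typically mean that node has no entry in the corosync nodelist, so pacemaker
cannot map the numeric id to a name and falls back to "(null)". A hedged
corosync.conf fragment (node IDs taken from the logs above; the ring0_addr
values are placeholders) that gives each node a name:

```
nodelist {
    node {
        ring0_addr: node1.example.com   # placeholder address
        name: node1
        nodeid: 1760315215
    }
    node {
        ring0_addr: node2.example.com   # placeholder address
        name: node2
        nodeid: 1017525950
    }
    node {
        ring0_addr: node3.example.com   # placeholder address
        name: node3
        nodeid: 13071578
    }
}
```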
Re: One node thinks everyone is online, but the other node doesn't think so
> On 26 Jan 2015, at 10:53 am, Dmitry Koterov <dmitry.koterov@gmail.com> wrote:
>
> Hello.
>
> I have a 3-node cluster where node1 and node2 are running corosync+pacemaker and node3 is running corosync only (for quorum). Corosync 2.3.3, pacemaker 1.1.10. Everything worked fine the first couple of days.
>
> Once upon a time I discovered the following situation: node2 thinks that both node1 and node2 are online, but node1 thinks that node2 is down. Could you please say: how could it be? There are no connectivity problems between the nodes at the moment (maybe they were, but why the system hasn't recovered?).

The logs show connectivity problems occurring, so no doubt there.
As to why it hasn't recovered, first check corosync - if it does not have a consistent view of the world, pacemaker has no hope.
Alternatively, I recall there was a bug in your version that could prevent recovery. So if corosync looks fine, perhaps try an upgrade.
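"First check corosync" can be made concrete with the stock corosync 2.x
command-line tools; run these on each node and compare the output to see
whether the membership views agree (no assumptions here beyond a corosync
2.x install):

```shell
# Run on every node; all nodes should agree on membership and quorum.
corosync-quorumtool -s    # quorum state, member node IDs, vote counts
corosync-cfgtool -s       # ring status per configured ring
# Dump the runtime member list from the cmap database:
corosync-cmapctl | grep members
```

If the member lists differ between nodes, the problem is at the corosync
layer and no amount of pacemaker restarting will fix it.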



_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: One node thinks everyone is online, but the other node doesn't think so
On Tue, Feb 24, 2015 at 2:07 AM, Andrew Beekhof <andrew@beekhof.net> wrote:

> The logs show connectivity problems occurring, so no doubt there.
> As to why it hasn't recovered, first check corosync - if it does not have
> a consistent view of the world pacemaker has no hope.
> Alternatively, I recall there was a bug that could be preventing this in
> your version. So if corosync looks fine, perhaps try an upgrade.


Thanks.
Are you talking about this bug:
https://bugs.launchpad.net/ubuntu/+source/libqb/+bug/1341496 ?

I believe I have reproduced the problem one more time (it is hard to
reproduce reliably); the symptoms were the following:

1. At some point node2 went down.
2. The last message from corosync on node1 was "Quorum lost" (I suspect
there was a temporary loss of connectivity to node3).
3. Then, a couple of days later, "service corosync stop" hung on node3
(only killall -9 helped). I ran strace while the service was stopping; it
shows:

[pid 19449] futex(0x7f580b4c62e0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished
...>
[pid 19448] --- SIGQUIT {si_signo=SIGQUIT, si_code=SI_USER, si_pid=28183,
si_uid=0} ---
[pid 19448] write(6, "\3\0\0\0", 4) = 4
[pid 19448] rt_sigreturn() = 360
... <and repeats for 19448 again and again>

where pstree shows:

init,1
├─corosync,19448
│ └─{corosync},19449

4. Likewise on node1: "service corosync stop" hangs there too with the
same symptoms, and again only killall -9 helps.
5. Restarting corosync and pacemaker on node1 and node2 resolved the
problem.

Could you say whether this is related to the above libqb bug?
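The inspection steps described above (pstree plus strace on the hung
daemon) can be collected in one sequence; a sketch, assuming the usual
procps/psmisc/strace tools are installed:

```shell
# Attach to the corosync process while the stop script hangs, to see
# which thread is stuck and on what syscall.
pid=$(pidof corosync)
pstree -p "$pid"                 # main process plus its {corosync} thread
timeout 10 strace -f -p "$pid"   # -f follows all threads; detach after 10s
# If the daemon never exits, SIGKILL is the last resort:
killall -9 corosync
```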




Re: One node thinks everyone is online, but the other node doesn't think so
> On 11 Mar 2015, at 2:21 am, Dmitry Koterov <dmitry.koterov@gmail.com> wrote:
>
>
> Could you please say is it related to the above bug in libqb?

I'm no expert in libqb, but it certainly looks likely.

> > Jan 22 10:14:03 node2 pacemakerd[302]: notice: crm_update_peer_state: pcmk_quorum_notification: Node node1[1760315215] - state is now member (was lost)
> > Jan 22 10:14:03 node2 corosync[32761]: [MAIN ] Completed service synchronization, ready to provide service.
> > Jan 22 10:14:03 node2 crmd[311]: notice: corosync_node_name: Unable to get node name for nodeid 13071578
> > Jan 22 10:14:03 node2 crmd[311]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now member (was lost)
> > Jan 22 10:14:03 node2 pacemakerd[302]: notice: corosync_node_name: Unable to get node name for nodeid 13071578
> > Jan 22 10:14:03 node2 pacemakerd[302]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[13071578] - state is now member (was lost)
> > Jan 22 10:14:03 node2 crmd[311]: warning: do_log: FSA: Input I_JOIN_OFFER from route_message() received in state S_ELECTION
> > Jan 22 10:14:04 node2 crmd[311]: notice: do_state_transition: State transition S_ELECTION -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
> > Jan 22 10:14:05 node2 attrd[310]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
> > Jan 22 10:14:05 node2 attrd[310]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd (10000)
> > Jan 22 10:14:05 node2 attrd[310]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> > Jan 22 10:14:05 node2 crmd[311]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
> > Jan 22 10:15:11 node2 corosync[32761]: [TOTEM ] A new membership (104.236.71.79:6296) was formed. Members left: 13071578
> > Jan 22 10:15:14 node2 corosync[32761]: [TOTEM ] A new membership (128.199.116.218:6312) was formed. Members joined: 13071578 left: 1760315215
> > Jan 22 10:15:17 node2 corosync[32761]: [TOTEM ] A new membership (104.236.71.79:6324) was formed. Members joined: 1760315215
> > Jan 22 10:15:19 node2 crmd[311]: notice: peer_update_callback: Our peer on the DC is dead
> > Jan 22 10:15:19 node2 crmd[311]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> > Jan 22 10:15:20 node2 kernel: [690741.179442] block drbd0: peer( Primary -> Secondary )
> > Jan 22 10:15:20 node2 corosync[32761]: [QUORUM] Members[3]: 1760315215 13071578 1017525950
> > Jan 22 10:15:20 node2 corosync[32761]: [MAIN ] Completed service synchronization, ready to provide service.
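[The repeated "Unable to get node name for nodeid 13071578" notices suggest that node3 (the quorum-only node) is not declared with a name in corosync's nodelist, so pacemaker can only see it as a raw nodeid. A sketch of the relevant corosync.conf fragment: the nodeids are taken from the log lines above (node2 reports itself as 1017525950, node1 as 1760315215, and 13071578 is the unnamed member), but the ring0_addr values are placeholders that must be filled in with the real addresses:]

```
# Sketch only: ring0_addr values are placeholders; nodeids are taken
# from the quorum membership lines in the logs above.
nodelist {
    node {
        ring0_addr: <node1-address>
        nodeid: 1760315215
        name: node1
    }
    node {
        ring0_addr: <node2-address>
        nodeid: 1017525950
        name: node2
    }
    node {
        ring0_addr: <node3-address>
        nodeid: 13071578
        name: node3
    }
}
```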
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org