Mailing List Archive

Sometimes "not in our membership" failure in 2-node cluster
Hi,

I have some problems with a 2-node cluster running Heartbeat 3.0.3 on
Debian 6.0.3. I set up the two nodes as VMware VMs. To test failover behaviour I
disable eth0 and eth1. The NICs are configured in heartbeat: eth0 with
multicast and eth1 with broadcast. In most cases when I reactivate the
NICs crm_mon shows me both nodes running after a few seconds. But
sometimes not. Then the ha-log shows the following errors:

Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: Link
debian60-clnode1:eth1 up.
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: CRIT: Cluster node
debian60-clnode1 returning after partition.
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: For
information on cluster partitions, See URL:
http://linux-ha.org/wiki/Split_Brain
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: WARN: Deadtime
value may be too small.
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: See FAQ for
information on tuning deadtime.
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: URL:
http://linux-ha.org/wiki/FAQ#Heavy_Load
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: WARN: Late
heartbeat: Node debian60-clnode1: interval 234500 ms
Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: Status update
for node debian60-clnode1: status active
Nov 14 12:05:21 debian60-clnode2 crmd: [32584]: notice:
crmd_ha_status_callback: Status update: Node debian60-clnode1 now has
status [active] (DC=true)
Nov 14 12:05:21 debian60-clnode2 crmd: [32584]: info:
crm_update_peer_proc: debian60-clnode1.ais is now online
Nov 14 12:05:21 debian60-clnode2 cib: [32580]: WARN: cib_peer_callback:
Discarding cib_apply_diff message (1518) from debian60-clnode1: not in
our membership
Nov 14 12:05:22 debian60-clnode2 ccm: [32579]: info: Break tie for 2
nodes cluster
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event:
Got an event OC_EV_MS_INVALID from ccm
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event:
no mbr_track info
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event:
Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event:
instance=514, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info:
crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP
(id=514)
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: ccm_event_detail:
NEW MEMBERSHIP: trans=514, nodes=1, new=0, lost=0 n_idx=0, new_idx=1,
old_idx=3
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info:
ccm_event_detail: CURRENT: debian60-clnode2 [nodeid=1, born=514]
Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info:
populate_cib_nodes_ha: Requesting the list of configured nodes
Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event:
Got an event OC_EV_MS_INVALID from ccm
Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event:
no mbr_track info
Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event:
Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event:
instance=514, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info:
cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=514)
Nov 14 12:05:22 debian60-clnode2 ccm: [32579]: WARN: ccm_state_joined:
dropping message of type CCM_TYPE_PROTOVERSION_RESP. Is this a
Byzantine failure?
Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info:
cib_process_request: Operation complete: op cib_modify for section nodes
(origin=local/crmd/1497, version=0.1131.11): ok (rc=0)


The only way to get the cluster back to a consistent status is to
restart heartbeat on one node.

I tested different downtimes, and the behavior does not seem to depend
on the downtime.

Is this behavior of heartbeat correct?


Re: Sometimes "not in our membership" failure in 2-node cluster
On Mon, Nov 14, 2011 at 12:43:03PM +0100, Marcus wrote:
> Hi,
>
> I have some problems with a 2-node cluster running Heartbeat 3.0.3 on

please reproduce with most recent heartbeat,
http://hg.linux-ha.org/heartbeat-STABLE_3_0/
(http:// ... /archive/tip.tar.bz2)

There have been changes in the cleanup path for dead nodes.
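
For reference, fetching and building that snapshot might look roughly like
this (the build steps are an assumption based on the usual autotools flow;
the extracted directory name depends on the hg revision):

  # fetch the current tip of the STABLE_3_0 branch
  wget http://hg.linux-ha.org/heartbeat-STABLE_3_0/archive/tip.tar.bz2
  tar xjf tip.tar.bz2
  cd heartbeat-STABLE_3_0-*                    # name varies with the revision
  ./bootstrap && ./configure && make && make install   # install as root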

> Debian 6.0.3. I set up the two nodes as VMware VMs.

You did clone one into the other,
or did a clean, independent install of both?

Pacemaker version?
Glue version?

> To test failover behaviour I
> disable eth0 and eth1.

How, exactly? Is this "symmetrical" or "asymmetrical",
i.e. does that block only incoming/only outgoing/both?
Have both nodes seen the respective other as dead?

Does that block both links at the same time, or one after the other?
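
For illustration, one way to produce the two cases (just a sketch with
iptables, not necessarily the method used here) would be:

  # asymmetric: drop only incoming cluster traffic on both links
  iptables -A INPUT -i eth0 -j DROP
  iptables -A INPUT -i eth1 -j DROP
  # symmetric: additionally drop outgoing traffic as well
  iptables -A OUTPUT -o eth0 -j DROP
  iptables -A OUTPUT -o eth1 -j DROP
  # undo later with the matching -D rules (or iptables -F, if nothing else is configured)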

> The NICs are configured in heartbeat: eth0 with
> multicast and eth1 with broadcast. In most cases when I reactivate the
> NICs crm_mon shows me both nodes running after a few seconds. But
> sometimes not. Then the ha-log shows the following errors:
>
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: Link debian60-clnode1:eth1 up.
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: CRIT: Cluster node debian60-clnode1 returning after partition.
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: For information on cluster partitions, See URL: http://linux-ha.org/wiki/Split_Brain
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: WARN: Deadtime value may be too small.
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: See FAQ for information on tuning deadtime.
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: URL: http://linux-ha.org/wiki/FAQ#Heavy_Load
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: WARN: Late heartbeat: Node debian60-clnode1: interval 234500 ms
> Nov 14 12:05:21 debian60-clnode2 heartbeat: [32565]: info: Status update for node debian60-clnode1: status active
> Nov 14 12:05:21 debian60-clnode2 crmd: [32584]: notice: crmd_ha_status_callback: Status update: Node debian60-clnode1 now has status [active] (DC=true)
> Nov 14 12:05:21 debian60-clnode2 crmd: [32584]: info: crm_update_peer_proc: debian60-clnode1.ais is now online
> Nov 14 12:05:21 debian60-clnode2 cib: [32580]: WARN: cib_peer_callback: Discarding cib_apply_diff message (1518) from debian60-clnode1: not in our membership
> Nov 14 12:05:22 debian60-clnode2 ccm: [32579]: info: Break tie for 2 nodes cluster
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event: no mbr_track info
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: mem_handle_event: instance=514, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: crmd_ccm_msg_callback: Quorum (re)attained after event=NEW MEMBERSHIP (id=514)
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: ccm_event_detail: NEW MEMBERSHIP: trans=514, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3
> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: ccm_event_detail: CURRENT: debian60-clnode2 [nodeid=1, born=514]

Membership only sees _one_ node here, still.
nodes=1; this somehow looks like stale data.

What does /usr/lib/heartbeat/ccm_test_client say
during such an experiment? On both nodes?

(or lib64; it may be in the -dev package)
btw, as long as it can talk to ccm,
it does not terminate by itself;
you need to ctrl-c it...

> Nov 14 12:05:22 debian60-clnode2 crmd: [32584]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
> Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
> Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event: no mbr_track info
> Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: mem_handle_event: instance=514, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: cib_ccm_msg_callback: Processing CCM event=NEW MEMBERSHIP (id=514)
> Nov 14 12:05:22 debian60-clnode2 ccm: [32579]: WARN: ccm_state_joined: dropping message of type CCM_TYPE_PROTOVERSION_RESP. Is this a Byzantine failure?
> Nov 14 12:05:22 debian60-clnode2 cib: [32580]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/1497, version=0.1131.11): ok (rc=0)
>
> The only way to get the cluster back to a consistent status is to
> restart heartbeat on one node.
>
> I tested different downtimes, and the behavior does not seem to depend
> on the downtime.
>
> Is this behavior of heartbeat correct?

I don't think so.
Anything about lost packets?

If you can "easily" reproduce (with most recent heartbeat),
a tcpdump may be useful from, say, 10 heartbeats before you disable,
to half a minute after you re-enable the network link.
(or just crank up debugging so high that you even see message dumps in the logs...)
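
For example, something along these lines on both nodes, assuming the default
udpport 694 from ha.cf (adjust interfaces, port and paths to your setup):

  tcpdump -i eth0 -w /tmp/hb-eth0.pcap udp port 694 &
  tcpdump -i eth1 -w /tmp/hb-eth1.pcap udp port 694 &
  # ... run the disable/enable test ...
  kill %1 %2   # stop the captures afterwards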


--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
Re: Sometimes "not in our membership" failure in 2-node cluster
Hi, sorry for the delay.

I tested now with tip.tar.bz2
> please reproduce with most recent heartbeat,
> http://hg.linux-ha.org/heartbeat-STABLE_3_0/
> (http:// ... /archive/tip.tar.bz2)
>
> There have been changes in the cleanup path for dead nodes.
>
>
> You did clone one into the other,
> or did a clean, independent install of both?
No clones. Both machines are set up with the installer.
> Pacemaker version?
> Glue version?
Pacemaker is 1.0.9.1+hg15626-1 (from Debian 6.0.3)
Glue is 1.0.6 (from Debian 6.0.3)
running on Kernel 2.6.32

> How, exactly? Is this "symmetrical" or "asymmetrical", i.e. does that
> block only incoming/only outgoing/both? Have both nodes seen the
> respective other as dead?

> Does that block both links at the same time, or one after the other?

I have now tested both, symmetrical and asymmetrical, with a delay of 2 seconds.
One time the error occurred after disabling/enabling asymmetrically.
But the error does not occur every time.
Both nodes show the other offline.
> Membership only sees _one_ node here, still.
> nodes=1; this somehow looks like stale data.
>
> What does /usr/lib/heartbeat/ccm_test_client say
> during such an experiment? both nodes?
It shows:

./ccm_testclient[27359]: 2011/11/17_12:53:50 info: NODES IN THE PRIMARY
MEMBERSHIP
./ccm_testclient[27359]: 2011/11/17_12:53:50 info: nodeid=0,
uname=debian60-clnode1, born=514
./ccm_testclient[27359]: 2011/11/17_12:53:50 info: MY NODE IS A MEMBER
OF THE MEMBERSHIP LIST
./ccm_testclient[27359]: 2011/11/17_12:53:50 info: NEW MEMBERS
./ccm_testclient[27359]: 2011/11/17_12:53:50 info: NONE
./ccm_testclient[27359]: 2011/11/17_12:53:50 info: MEMBERS LOST
./ccm_testclient[27359]: 2011/11/17_12:53:50 info: NONE
./ccm_testclient[27359]: 2011/11/17_12:53:50 info: -----------------------
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: mem_handle_event: Got
an event OC_EV_MS_INVALID from ccm
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: mem_handle_event: no
mbr_track info
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: mem_handle_event: Got
an event OC_EV_MS_NEW_MEMBERSHIP from ccm
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: mem_handle_event:
instance=515, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: event=NEW MEMBERSHIP:
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: instance=515
# ttl members=1, ttl_idx=0
# new members=0, new_idx=1
# out members=0, out_idx=3
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: NODES IN THE PRIMARY
MEMBERSHIP
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: nodeid=0,
uname=debian60-clnode1, born=515
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: MY NODE IS A MEMBER
OF THE MEMBERSHIP LIST
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: NEW MEMBERS
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: NONE
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: MEMBERS LOST
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: NONE
./ccm_testclient[27359]: 2011/11/17_12:53:51 info: -----------------------
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: mem_handle_event: Got
an event OC_EV_MS_INVALID from ccm
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: mem_handle_event: no
mbr_track info
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: mem_handle_event: Got
an event OC_EV_MS_NEW_MEMBERSHIP from ccm
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: mem_handle_event:
instance=516, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: event=NEW MEMBERSHIP:
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: instance=516
# ttl members=1, ttl_idx=0
# new members=0, new_idx=1
# out members=0, out_idx=3
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: NODES IN THE PRIMARY
MEMBERSHIP
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: nodeid=0,
uname=debian60-clnode1, born=516
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: MY NODE IS A MEMBER
OF THE MEMBERSHIP LIST
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: NEW MEMBERS
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: NONE
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: MEMBERS LOST
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: NONE
./ccm_testclient[27359]: 2011/11/17_12:53:52 info: -----------------------
./ccm_testclient[27359]: 2011/11/17_12:53:53 info: mem_handle_event: Got
an event OC_EV_MS_INVALID from ccm
./ccm_testclient[27359]: 2011/11/17_12:53:53 info: mem_handle_event: no
mbr_track info
./ccm_testclient[27359]: 2011/11/17_12:53:53 info: mem_handle_event: Got
an event OC_EV_MS_NEW_MEMBERSHIP from ccm
./ccm_testclient[27359]: 2011/11/17_12:53:53 info: mem_handle_event:
instance=517, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
./ccm_testclient[27359]: 2011/11/17_12:53:53 info: event=NEW MEMBERSHIP:
./ccm_testclient[27359]: 2011/11/17_12:53:53 info: instance=517
# ttl members=1, ttl_idx=0
# new members=0, new_idx=1
# out members=0, out_idx=3

.
.
.

> (or lib64, may be in the -dev package)
> btw, as long as it can talk to ccm,
> that does not terminate by itself,
> you need to ctrl-c it...
>
> I don't think so.
> Anything about lost packets?
>

> If you can "easily" reproduce (with most recent heartbeat),
> a tcpdump may be useful from, say, 10 heartbeats before you disable,
> to half a minute after you re-enable the network link.
> (or just crank up debugging so high that you even see message dumps in the logs...)
>
>
I still have to sniff the traffic.

In this case when the nodes show each other offline, I disconnected and
reconnected the interfaces again. After a few seconds they show each
other online.
After that ccm_testclient shows:

./ccm_testclient[27417]: 2011/11/17_12:56:48 info: mem_handle_event: Got
an event OC_EV_MS_NEW_MEMBERSHIP from ccm
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: mem_handle_event:
instance=574, nodes=2, new=2, lost=0, n_idx=0, new_idx=0, old_idx=4
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: event=NEW MEMBERSHIP:
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: instance=574
# ttl members=2, ttl_idx=0
# new members=2, new_idx=0
# out members=0, out_idx=4
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: NODES IN THE PRIMARY
MEMBERSHIP
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: nodeid=1,
uname=debian60-clnode2, born=1
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: nodeid=0,
uname=debian60-clnode1, born=574
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: MY NODE IS A MEMBER
OF THE MEMBERSHIP LIST
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: NEW MEMBERS
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: nodeid=1,
uname=debian60-clnode2, born=1
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: nodeid=0,
uname=debian60-clnode1, born=574
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: MEMBERS LOST
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: NONE
./ccm_testclient[27417]: 2011/11/17_12:56:48 info: -----------------------


I also checked times. They are in sync.



Re: Sometimes "not in our membership" failure in 2-node cluster
On Thu, Nov 17, 2011 at 01:09:19PM +0100, Marcus wrote:
> Hi, sorry for the delay.
>
> I tested now with tip.tar.bz2
> > please reproduce with most recent heartbeat,
> > http://hg.linux-ha.org/heartbeat-STABLE_3_0/
> > (http:// ... /archive/tip.tar.bz2)
> >
> > There have been changes in the cleanup path for dead nodes.

Did you already upgrade to this?

> > You did clone one into the other,
> > or did a clean, independent install of both?
> No clones. Both machines are set up with the installer.
> > Pacemaker version?
> > Glue version?
> Pacemaker is 1.0.9.1+hg15626-1 (from Debian 6.0.3)
> Glue is 1.0.6 (from Debian 6.0.3)
> running on Kernel 2.6.32
>
> > How, exactly? Is this "symmetrical" or "asymmetrical", i.e. does that
> > block only incoming/only outgoing/both? Have both nodes seen the
> > respective other as dead?
>
> > Does that block both links at the same time, or one after the other?
>
> I have now tested both, symmetrical and asymmetrical, with a delay of 2 seconds.
> One time the error occurred after disabling/enabling asymmetrically.

Again: *HOW* do you disable/enable?

Is there any ifconfig down involved,
or any network manager seeing the interface(s) as down, or similar?
Or do you break the connectivity without touching the interfaces?
(iptables -j DROP, "unplug" the network cables, disable the
"switchport", or similar; what exactly do you do?)

If the interfaces actually "vanish" and then "reappear", it will take
heartbeat a "long" time to be able to send again,
while potentially receiving may already be working again.

> But the error does not occur every time.

Well, how frequently?
Do you need something like 1 out of 3, 1 out of 10, or 1 out of 1000 iterations to reproduce it?
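
If it takes many iterations, a small loop may help to quantify it; a rough
sketch, where break_links/unbreak_links are hypothetical wrappers around
whatever method you actually use to kill and restore the links:

  for i in $(seq 1 20); do
      break_links            # hypothetical helper, e.g. adding the iptables rules
      sleep 2
      unbreak_links          # hypothetical helper, removing the rules again
      sleep 60               # give heartbeat time to re-merge
      echo "iteration $i:"; crm_mon -1 | grep -i online
  done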

What's your ha.cf?

> Both nodes show the other offline.

I asked about the ccm "dead" status, not the pacemaker "offline" status.
The logs should show
heartbeat: [3971]: WARN: node XYZ: is dead
heartbeat: [3971]: info: Link XYZ:ethx dead.
heartbeat: [3971]: info: Link XYZ:ethy dead.
...

That's also what I asked for the ccm_testclient log, see below.

> > Membership only sees _one_ node here, still.
> > nodes=1; this somehow looks like stale data.
> >
> > What does /usr/lib/heartbeat/ccm_test_client say
> > during such an experiment? both nodes?
> It shows:

I meant that you should start the ccm_testclient
while all is good and connected,
then do your break links, sleep, unbreak links.
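
I.e. roughly this sequence, on both nodes (the exact binary name and path may
differ: ccm_testclient vs ccm_test_client, lib vs lib64):

  # 1. while both nodes still see each other:
  /usr/lib/heartbeat/ccm_testclient 2>&1 | tee /tmp/ccm-$(hostname).log
  # 2. in another shell: break the links, sleep, unbreak the links
  # 3. ctrl-c the client only after the cluster has (or has not) re-merged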

So I would expect the membership to start with

> ./ccm_testclient[27359]: 2011/11/17_12:53:50 info: NODES IN THE PRIMARY
> MEMBERSHIP
> ./ccm_testclient[27359]: 2011/11/17_12:53:50 info: nodeid=0,
> uname=debian60-clnode1, born=514

Both nodes here, then

> ./ccm_testclient[27359]: 2011/11/17_12:53:50 info: MY NODE IS A MEMBER
> OF THE MEMBERSHIP LIST
> ./ccm_testclient[27359]: 2011/11/17_12:53:50 info: NEW MEMBERS
> ./ccm_testclient[27359]: 2011/11/17_12:53:50 info: NONE
> ./ccm_testclient[27359]: 2011/11/17_12:53:50 info: MEMBERS LOST
> ./ccm_testclient[27359]: 2011/11/17_12:53:50 info: NONE

then both nodes lose the respective other, ...

> ./ccm_testclient[27359]: 2011/11/17_12:53:50 info: -----------------------
> ./ccm_testclient[27359]: 2011/11/17_12:53:51 info: mem_handle_event: Got
> an event OC_EV_MS_INVALID from ccm
> ./ccm_testclient[27359]: 2011/11/17_12:53:51 info: mem_handle_event: no
> mbr_track info
> ./ccm_testclient[27359]: 2011/11/17_12:53:51 info: mem_handle_event: Got
> an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> ./ccm_testclient[27359]: 2011/11/17_12:53:51 info: mem_handle_event:
> instance=515, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> ./ccm_testclient[27359]: 2011/11/17_12:53:51 info: event=NEW MEMBERSHIP:
> ./ccm_testclient[27359]: 2011/11/17_12:53:51 info: instance=515
> # ttl members=1, ttl_idx=0
> # new members=0, new_idx=1
> # out members=0, out_idx=3
> ./ccm_testclient[27359]: 2011/11/17_12:53:51 info: NODES IN THE PRIMARY
> MEMBERSHIP
> ./ccm_testclient[27359]: 2011/11/17_12:53:51 info: nodeid=0,
> uname=debian60-clnode1, born=515
> ./ccm_testclient[27359]: 2011/11/17_12:53:51 info: MY NODE IS A MEMBER
> OF THE MEMBERSHIP LIST
> ./ccm_testclient[27359]: 2011/11/17_12:53:51 info: NEW MEMBERS
> ./ccm_testclient[27359]: 2011/11/17_12:53:51 info: NONE

... then both nodes see the respective other as new member again.

> ./ccm_testclient[27359]: 2011/11/17_12:53:51 info: MEMBERS LOST
> ./ccm_testclient[27359]: 2011/11/17_12:53:51 info: NONE

Which is not what I can see here.

You apparently only started it after you had already broken and
unbroken the links, so this does not help at all.

> ./ccm_testclient[27359]: 2011/11/17_12:53:52 info: mem_handle_event: Got
> an event OC_EV_MS_INVALID from ccm
> ./ccm_testclient[27359]: 2011/11/17_12:53:52 info: mem_handle_event: no
> mbr_track info
> ./ccm_testclient[27359]: 2011/11/17_12:53:52 info: mem_handle_event: Got
> an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> ./ccm_testclient[27359]: 2011/11/17_12:53:52 info: mem_handle_event:
> instance=516, nodes=1, new=0, lost=0, n_idx=0, new_idx=1, old_idx=3
> ./ccm_testclient[27359]: 2011/11/17_12:53:52 info: event=NEW MEMBERSHIP:

Still, it is unusual for heartbeat to send out "NEW MEMBERSHIP" events,
if nothing actually changed: no new members, no members lost...

> ./ccm_testclient[27359]: 2011/11/17_12:53:52 info: instance=516
> # ttl members=1, ttl_idx=0
> # new members=0, new_idx=1
> # out members=0, out_idx=3
> ./ccm_testclient[27359]: 2011/11/17_12:53:52 info: NODES IN THE PRIMARY
> MEMBERSHIP
> ./ccm_testclient[27359]: 2011/11/17_12:53:52 info: nodeid=0,
> uname=debian60-clnode1, born=516
> ./ccm_testclient[27359]: 2011/11/17_12:53:52 info: MY NODE IS A MEMBER
> OF THE MEMBERSHIP LIST
> ./ccm_testclient[27359]: 2011/11/17_12:53:52 info: NEW MEMBERS
> ./ccm_testclient[27359]: 2011/11/17_12:53:52 info: NONE
> ./ccm_testclient[27359]: 2011/11/17_12:53:52 info: MEMBERS LOST
> ./ccm_testclient[27359]: 2011/11/17_12:53:52 info: NONE
> ./ccm_testclient[27359]: 2011/11/17_12:53:52 info: -----------------------

> > (or lib64, may be in the -dev package)
> > btw, as long as it can talk to ccm,
> > that does not terminate by itself,
> > you need to ctrl-c it...
> >
> > I don't think so.
> > Anything about lost packets?

Let me rephrase: are there any logs similar to "WARN: [0-9]* lost packet(s) for ..."?
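
Something like this should find them, assuming the default log location
(adjust to wherever your ha.cf actually logs):

  grep 'lost packet' /var/log/ha-log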

> > If you can "easily" reproduce (with most recent heartbeat),
> > a tcpdump may be useful from, say, 10 heartbeats before you disable,
> > to half a minute after you re-enable the network link.
> > (or just crank up debugging so high that you even see message dumps in the logs...)
> >
> >
> I still have to sniff the traffic.

As I said: with a high enough debug level, you get insane amounts of
logging, including heartbeat packet traces. But that may be too much
logging, so it may be better to tcpdump/tshark/wireshark it, yes.

In any case, at least use debug level 1.
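
I.e. something like the following in ha.cf on both nodes, then restart
heartbeat (the debugfile path is only an example):

  debug 1
  debugfile /var/log/ha-debug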

> In this case when the nodes show each other offline, I disconnected and
> reconnected the interfaces again. After a few seconds they show each
> other online.
> After that ccm_testclient shows:
>
> ./ccm_testclient[27417]: 2011/11/17_12:56:48 info: mem_handle_event: Got
> an event OC_EV_MS_NEW_MEMBERSHIP from ccm
> ./ccm_testclient[27417]: 2011/11/17_12:56:48 info: mem_handle_event:
> instance=574, nodes=2, new=2, lost=0, n_idx=0, new_idx=0, old_idx=4
> ./ccm_testclient[27417]: 2011/11/17_12:56:48 info: event=NEW MEMBERSHIP:
> ./ccm_testclient[27417]: 2011/11/17_12:56:48 info: instance=574
> # ttl members=2, ttl_idx=0
> # new members=2, new_idx=0
> # out members=0, out_idx=4
> ./ccm_testclient[27417]: 2011/11/17_12:56:48 info: NODES IN THE PRIMARY
> MEMBERSHIP
> ./ccm_testclient[27417]: 2011/11/17_12:56:48 info: nodeid=1,
> uname=debian60-clnode2, born=1
> ./ccm_testclient[27417]: 2011/11/17_12:56:48 info: nodeid=0,
> uname=debian60-clnode1, born=574
> ./ccm_testclient[27417]: 2011/11/17_12:56:48 info: MY NODE IS A MEMBER
> OF THE MEMBERSHIP LIST
> ./ccm_testclient[27417]: 2011/11/17_12:56:48 info: NEW MEMBERS
> ./ccm_testclient[27417]: 2011/11/17_12:56:48 info: nodeid=1,
> uname=debian60-clnode2, born=1
> ./ccm_testclient[27417]: 2011/11/17_12:56:48 info: nodeid=0,
> uname=debian60-clnode1, born=574

clnode1 seems to have some serious problem there;
it was "born" much more often...
does it keep respawning the ccm process for some reason?
---> look at the debug logs.

Which means this one even lost _itself_ from the membership?
(it sees two new nodes... it should only see one new node...)
Any log message like "No local heartbeat. Forcing restart."
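
A quick way to check might be something like this (the log path is an
assumption, and the exact respawn message wording may differ between versions):

  grep -Ei 'respawn|no local heartbeat' /var/log/ha-log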

I suspect your test method is somehow "invalid",
you seem to test something that is not supposed to work...

Do you have "crm respawn" in your ha.cf, by any chance?

If so, does it behave "better" with "crm on"?
(in which case, if the ccm terminates,
it will not be respawned, but cause a reboot).
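
For clarity, the two ha.cf variants being discussed:

  crm respawn   # a dead crm/ccm process gets respawned
  crm on        # if the ccm terminates, it is not respawned; the node reboots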

> ./ccm_testclient[27417]: 2011/11/17_12:56:48 info: MEMBERS LOST
> ./ccm_testclient[27417]: 2011/11/17_12:56:48 info: NONE
> ./ccm_testclient[27417]: 2011/11/17_12:56:48 info: -----------------------
>
>
> I also checked times. They are in sync.

Wall clock time is completely irrelevant for heartbeat,
though ntp synced time certainly makes correlating logs much easier.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com