Mailing List Archive

unable to recover from split-brain in a two-node cluster
Hi,

New to this list and hope I can get some help here.

I'm using pacemaker 1.0.10 and heartbeat 3.0.5 for a two-node cluster. I'm having a split-brain problem: heartbeat messages sometimes get dropped when the system is under high load. The problem is that the cluster never recovers once the system load drops again.

I created a test setup to reproduce this by setting the dead time to 6 seconds and using iptables to continuously drop one-way heartbeat packets (UDP destination port 694) for 5~8 seconds, then let the traffic through for 1~2 seconds (sketched below the crm_mon output). After the system got into the split-brain state, I stopped the test and allowed all heartbeat traffic through. Sometimes the system recovered, but sometimes it didn't. There are various symptoms when the system didn't recover from split-brain:

1. In one instance, cl_status listnodes becomes empty, and the syslog keeps showing:
2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.warning] [2853]: WARN: Message hist queue is filling up (436 messages in queue)
2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.debug] [2853]: debug: hist->ackseq =12111
2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.debug] [2853]: debug: hist->lowseq =12111, hist->hiseq=12547
2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.debug] [2853]: debug: expecting from node-1
2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.debug] [2853]: debug: it's ackseq=12111

2. In another instance, cl_status nodestatus <node> shows both nodes as active, but "crm_mon -1" shows that each of the two nodes thinks it is the DC and that the peer node is offline. The pengine process is running on one node only. The node not running pengine (but still considering itself the DC) has a log entry showing that crmd terminated pengine because it detected that the peer was active. After that, the peer status keeps flapping between dead and active, but pengine is never started again. The last log entry shows the peer as active (after I stopped the test and allowed all traffic), yet "crm_mon -1" still shows the node itself as the DC and the peer as offline:

[root@node-1 ~]# crm_mon -1
============
Last updated: Fri Jun 20 19:12:23 2014
Stack: Heartbeat
Current DC: node-1 (bf053fc5-5afd-b483-2ad2-3c9fc354f7fa) - partition with quorum
Version: 1.0.9-da7075976b5ff0bee71074385f8fd02f296ec8a3
2 Nodes configured, unknown expected votes
1 Resources configured.
============

Online: [ node-1 ]
OFFLINE: [ node-0 ]

cluster (heartbeat:ha): Started node-1
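
For reference, the packet-drop loop I use for the test looks roughly like this (a sketch; the exact block/allow durations vary per run, as described above):

# drop inbound heartbeat traffic (one way), then let it through again
while true; do
    iptables -A INPUT -p udp --dport 694 -j DROP
    sleep $((RANDOM % 4 + 5))    # block for 5~8 seconds (longer than the 6s dead time)
    iptables -D INPUT -p udp --dport 694 -j DROP
    sleep $((RANDOM % 2 + 1))    # allow traffic for 1~2 seconds
done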


Any help is appreciated: a pointer to the place in the source code where the problem might be, or an existing bug filed for this (I did some searching but didn't find matching symptoms).

Thanks,
-Kaiwei
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Re: unable to recover from split-brain in a two-node cluster
On 20/06/14 03:18 PM, fank@vmware.com wrote:
> [...]

Hi Kaiwei,

Is this a new install? If so, that is some very old (and deprecated)
software. If it is an existing install, then you might find it hard to
get an answer here (but by all means, you might). Heartbeat hasn't been
developed in a loooong time, and pacemaker 1.0.x is also very old.
However, Linbit still offers commercial support for heartbeat. So if you
don't get help here, you might want to drop them a line.

Cheers, and best of luck.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
Re: unable to recover from split-brain in a two-node cluster
Thanks, Digimer. This is an existing setup, so I'm stuck with these versions. My current workaround is to increase the dead time so the membership doesn't flap and cause all these issues.
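
In ha.cf terms that means raising the timing directives, roughly like this (values illustrative):

# /etc/ha.d/ha.cf -- relevant timing directives only
keepalive 2      # interval between heartbeat packets
warntime 10      # warn about late heartbeats
deadtime 30      # was 6 seconds; raised so short drop bursts don't declare the peer dead
initdead 60      # extra allowance while the cluster first comes up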

Best,
-Kaiwei

----- Original Message -----
From: "Digimer" <lists@alteeve.ca>
To: "General Linux-HA mailing list" <linux-ha@lists.linux-ha.org>
Sent: Friday, June 20, 2014 4:19:29 PM
Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node cluster

[...]
Re: unable to recover from split-brain in a two-node cluster
Oh, great!

In the meantime, I would _strongly_ recommend you start work on a
migration plan. There is no danger in the foreseeable future as Linbit
has no plans to end support, but the stack itself is quite deprecated.

The stack that all major distros have now settled on is corosync +
pacemaker, so this is what I would recommend moving towards. You'll
probably find that the shift isn't very difficult, as pacemaker was born
out of the heartbeat project.
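
For a two-node cluster, a minimal corosync 2.x configuration looks
roughly like this (a sketch; cluster name and node addresses are
placeholders):

# /etc/corosync/corosync.conf
totem {
        version: 2
        cluster_name: mycluster
        transport: udpu
}
nodelist {
        node {
                ring0_addr: node-0
                nodeid: 1
        }
        node {
                ring0_addr: node-1
                nodeid: 2
        }
}
quorum {
        provider: corosync_votequorum
        two_node: 1    # two-node mode: keep quorum when one node is lost
}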

Cheers

On 21/06/14 01:47 AM, fank@vmware.com wrote:
> Thanks, Digimer. This is an existing setup, so I'm stuck with these versions. My current workaround is to increase the dead time so the membership doesn't flap and cause all these issues.
> [...]


Re: unable to recover from split-brain in a two-node cluster
On 21 Jun 2014, at 5:18 am, fank@vmware.com wrote:

> [...]

This is happening at the heartbeat level.

Not much Pacemaker can do, I'm afraid. Perhaps check whether heartbeat is scheduled with real-time priority; if not, that may explain why it's being starved of CPU and can't get its messages out.
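
Something like this shows the scheduling class (a sketch; "RR" or "FF"
in the CLS column means real-time round-robin/FIFO scheduling):

# scheduling class and real-time priority of the heartbeat processes
ps -eo pid,cls,rtprio,comm | grep heartbeat
# or, for a single process:
chrt -p <pid-of-heartbeat>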

Re: unable to recover from split-brain in a two-node cluster
Hi,

I understand that the split-brain is initially caused by the heartbeat messaging layer and that not much can be done while packets are being dropped. However, the problem is that sometimes when the load is gone (or when iptables allows all traffic again in my test setup), it doesn't recover.

In the second case I described, heartbeat on both nodes did find each other and both were active, but Pacemaker on both nodes still thinks the peer is offline. I don't know whether this is heartbeat's problem or Pacemaker's problem, though.

Thanks,
-Kaiwei

----- Original Message -----
From: "Andrew Beekhof" <andrew@beekhof.net>
To: "General Linux-HA mailing list" <linux-ha@lists.linux-ha.org>
Sent: Sunday, June 22, 2014 3:45:00 PM
Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node cluster


[...]
Re: unable to recover from split-brain in a two-node cluster
On 24 Jun 2014, at 1:52 am, fank@vmware.com wrote:

> [...]
>
> In the second case I described, heartbeat on both nodes did find each other and both were active, but Pacemaker on both nodes still thinks the peer is offline. I don't know whether this is heartbeat's problem or Pacemaker's problem, though.

Do you see any messages from 'crmd' saying the node left/returned?
If you only see the node going away, then it's almost certainly a heartbeat problem.
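
Something along these lines (the log path varies by distro):

# look for crmd node status transitions around the time of the test
grep -E 'crmd.*(status|membership|ccm)' /var/log/messages | tail -50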

You may have better luck with a corosync-based cluster, or even a newer version of pacemaker (or both! The 1.0.x codebase is quite old at this point).

I was never all that happy with heartbeat's membership code; it was a near-abandoned mystery box even when I started Pacemaker 10 years ago.
Corosync membership had its problems in the beginning, but personally I take comfort in the fact that it's actively being worked on.
Opinions differ, but IMHO it surpassed heartbeat for reliability 3-4 years ago.

> [...]
Re: unable to recover from split-brain in a two-node cluster
On Tue, Jun 24, 2014 at 12:23:30PM +1000, Andrew Beekhof wrote:
> [...]
>
> I was never all that happy with heartbeat's membership code; it was a near-abandoned mystery box even when I started Pacemaker 10 years ago.
> Corosync membership had its problems in the beginning, but personally I take comfort in the fact that it's actively being worked on.
> Opinions differ, but IMHO it surpassed heartbeat for reliability 3-4 years ago.

Possibly. But especially with nodes
"unexpectedly returning after having been declared dead",
I've still seen more problems with corosync than with heartbeat,
even within the last few years.

Anyways:
Andrew is right, you should use (recent!) corosync and recent pacemaker.
And working node-level fencing (aka STONITH).

That said: you said earlier that you are using heartbeat 3.0.5,
and that heartbeat successfully re-established membership.
Can you confirm that "ccm_testclient" on both nodes reports
the same, expected membership?

Is that the 3.0.5 release tag, or a more "recent" hg checkout?
You need heartbeat with at least this commit:
http://hg.linux-ha.org/heartbeat-STABLE_3_0/rev/fd1b907a0de6
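
To build from that revision, something like this should do (a sketch;
it assumes the usual autotools bootstrap, and configure options depend
on your packaging):

hg clone http://hg.linux-ha.org/heartbeat-STABLE_3_0
cd heartbeat-STABLE_3_0
hg update -r fd1b907a0de6    # the commit above, or anything newer
./bootstrap && ./configure && make && make install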

(I have been meaning to add a 3.0.6 release tag since at least the time
I pushed that commit, but because of packaging inconsistencies I want to
fix, and other commitments, I have deferred that for much too long.)

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
Re: unable to recover from split-brain in a two-node cluster
Hi Andrew,

On node-1, the last status update I see from crmd is the following, but crm_mon -1 still shows node-0 as offline:
crmd_ha_status_callback: Status update: Node node-0 now has status [active] [DC=false]
The same on node-0: crmd shows node-1 now has status [active], but crm_mon -1 shows it as offline.

Thanks,
-Kaiwei

----- Original Message -----
From: "Andrew Beekhof" <andrew@beekhof.net>
To: "General Linux-HA mailing list" <linux-ha@lists.linux-ha.org>
Sent: Monday, June 23, 2014 7:23:30 PM
Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node cluster


[...]
Re: unable to recover from split-brain in a two-node cluster
Hi Lars,

Thanks for pointing out the patch. It is not in the heartbeat version on the system (this system is running Heartbeat-3-0-7e3a82377fa8). I'll try that out.

As for ccm_testclient: the system has been stripped of files that aren't needed during normal operation, including gcc, so ccm_testclient complains that gcc is not found and I cannot test it on that system. cl_status listnodes shows both nodes on both systems, and cl_status nodestatus shows both are active, though.
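
For the record, these are the checks I run on each node (node names as
in my setup):

cl_status listnodes            # should list both node-0 and node-1
cl_status nodestatus node-0    # should report "active"
cl_status nodestatus node-1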

Thanks,
-Kaiwei

----- Original Message -----
From: "Lars Ellenberg" <lars.ellenberg@linbit.com>
To: linux-ha@lists.linux-ha.org
Sent: Tuesday, June 24, 2014 7:03:47 AM
Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node cluster

[...]
Re: unable to recover from split-brain in a two-node cluster
On 25 Jun 2014, at 12:03 am, Lars Ellenberg <Lars.Ellenberg@linbit.com> wrote:

> [...]
>
> Possibly. But especially with nodes
> "unexpectedly returning after having been declared dead",
> I've still seen more problems with corosync than with heartbeat,
> even within the last few years.

Unfortunately a fair share of those have also been pacemaker bugs :(
Yan is working on another one related to slow fencing devices.

>
> Anyways:
> Andrew is right, you should use (recent!) corosync and recent pacemaker.
> And working node level fencing aka stonith.
>
> That said, you said earlier you are using heartbeat 3.0.5,
> and that heartbeat successfully re-established membership.
> So you can confirm "ccm_testclient" on both nodes reports
> the expected and same membership?
>
> Is that 3.0.5 release tag, or a more "recent" hg checkout?
> You need heartbeat up to at least this commit:
> http://hg.linux-ha.org/heartbeat-STABLE_3_0/rev/fd1b907a0de6
>
> (I meant to add a 3.0.6 release tag since at least I pushed that commit,
> but because of packaging inconsistencies I want to fix,
> and other commitments, I deferred that much too long).
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
Re: unable to recover from split-brain in a two-node cluster
On Tue, Jun 24, 2014 at 08:48:03AM -0700, fank@vmware.com wrote:
> Hi Lars,
>
> Thanks for pointing out the patch. It is not in the heartbeat version on the system (this system is running Heartbeat-3-0-7e3a82377fa8). I'll try that out.
>
> As for ccm_testclient: the system has been stripped of files that aren't
> needed during normal operation, including gcc,
> so ccm_testclient complains that gcc is not found

Uh? Why would it think it needs gcc?
Can you copy the exact message please?

> and I cannot test it on that
> system. cl_status listnodes shows both nodes on both systems, and cl_status
> nodestatus shows both are active, though.

Re: unable to recover from split-brain in a two-node cluster
I don't know which ccm_testclient I was running. I'm pretty sure it was a shell script, and it was complaining that gcc was not found.
I rebuilt heartbeat and upgraded pacemaker, and now ccm_testclient is a binary that I can run without problems....

-Kaiwei

----- Original Message -----
From: "Lars Ellenberg" <lars.ellenberg@linbit.com>
To: linux-ha@lists.linux-ha.org
Sent: Thursday, June 26, 2014 4:24:29 AM
Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node cluster

[...]
Re: unable to recover from split-brain in a two-node cluster
Ahh, I was running the ccm_testclient in membership/ccm/ generated by libtool during the build, not the one that was compiled and installed. My bad.
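
That also explains the gcc message: libtool leaves a relinking wrapper
script in the build tree, and only the installed copy is the real
binary. Something like this shows the difference (the install path may
differ on your system):

file membership/ccm/ccm_testclient        # build tree: a libtool shell-script wrapper
file /usr/lib/heartbeat/ccm_testclient    # installed copy: an ELF executable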

----- Original Message -----
From: fank@vmware.com
To: "General Linux-HA mailing list" <linux-ha@lists.linux-ha.org>
Sent: Thursday, June 26, 2014 9:45:27 AM
Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node cluster

[...]
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems