Mailing List Archive: heartbeat 3.0.3 crashes if there are networking/multicast issues (ERROR: lowseq cannnot be greater than ackseq)

Jun 24, 2014, 1:20 PM

Post #1 of 3

Hello!

I've been seeing heartbeat cluster problems in Linux-based Vyatta and more recent VyOS networking/router appliances.
These are currently based on Debian Squeeze, and thus are using:

Package: heartbeat
Version: 1:3.0.3-2

VyOS bug report: http://bugzilla.vyos.net/show_bug.cgi?id=244

The problem is that when (unexpected) networking problems cause multicast issues
and disrupt communication between the cluster nodes, the heartbeat processes die on those nodes.
That is bad, right? I assume heartbeat should never die, especially not because of temporary networking issues.

I've also seen heartbeat die during temporary network maintenance breaks.

Basically, I first see messages like these:

Jun 23 17:55:02 vyos03 heartbeat: [4119]: WARN: node vyos01: is dead
Jun 23 17:59:23 vyos03 heartbeat: [4119]: CRIT: Cluster node vyos01 returning after partition.
Jun 23 17:59:23 vyos03 heartbeat: [4119]: WARN: Deadtime value may be too small.
Jun 23 17:59:23 vyos03 heartbeat: [4119]: WARN: Late heartbeat: Node vyos01: interval 273580 ms
Jun 23 17:59:23 vyos03 harc[4961]: info: Running /etc/ha.d//rc.d/status status
Jun 23 17:59:25 vyos03 ResourceManager[4991]: info: Releasing resource group: vyos01 IPaddr2-vyatta::10.0.0.10/24/eth1
Jun 23 17:59:25 vyos03 ResourceManager[4991]: info: Running /etc/ha.d/resource.d/IPaddr2-vyatta 10.0.0.10/24/eth1 stop
Jun 23 17:59:26 vyos03 heartbeat: [4119]: WARN: 1 lost packet(s) for [vyos01] [421:423]
Jun 23 17:59:39 vyos03 heartbeat: [4119]: WARN: Logging daemon is disabled --enabling logging daemon is recommended
Jun 23 17:59:40 vyos03 harc[5102]: info: Running /etc/ha.d//rc.d/status status

These seem normal for a networking problem. But then later:

Jun 23 19:31:22 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (494 messages in queue)
Jun 23 19:31:22 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (495 messages in queue)
Jun 23 19:31:23 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (496 messages in queue)
Jun 23 19:31:24 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (497 messages in queue)
Jun 23 19:31:24 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (498 messages in queue)
Jun 23 19:31:25 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (499 messages in queue)
Jun 23 19:31:26 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)
Jun 23 19:31:42 vyos03 heartbeat: last message repeated 25 times

The "hist queue" size keeps increasing, and when it reaches 500 messages bad things start happening:

Jun 23 19:31:43 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)
Jun 23 19:31:49 vyos03 heartbeat: last message repeated 9 times
Jun 23 19:31:49 vyos03 heartbeat: [10921]: ERROR: lowseq cannnot be greater than ackseq
Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency Shutdown: Master Control process died.
Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10921 with SIGTERM
Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10924 with SIGTERM
Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Killing pid 10925 with SIGTERM
Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency Shutdown(MCP dead): Killing ourselves.

At this point clustering has failed, because the heartbeat processes are no longer running.

Has anyone else seen this?

It seems the bug is triggered at 500 messages in the hist queue:
I then always see "ERROR: lowseq cannnot be greater than ackseq", and heartbeat dies.
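
To check a longer syslog for the same failure signature, a small Python sketch along these lines might help. It is only an illustration: the function name `scan` and the summary keys are made up here, and the patterns are copied from the log lines above (including the "cannnot" spelling), so other heartbeat versions may word the messages differently.

```python
import re

# Patterns copied from the heartbeat 3.0.3 log excerpts in this post.
# The misspelling "cannnot" is part of the real message, so it is kept.
QUEUE_RE = re.compile(
    r"heartbeat: \[(\d+)\]: ERROR: Message hist queue is filling up "
    r"\((\d+) messages in queue\)"
)
LOWSEQ_RE = re.compile(
    r"heartbeat: \[(\d+)\]: ERROR: lowseq cannnot be greater than ackseq"
)
SHUTDOWN_RE = re.compile(r"CRIT: Emergency Shutdown")


def scan(lines):
    """Summarize the failure signature found in an iterable of syslog lines.

    `scan` and the keys of the returned dict are hypothetical names
    chosen for this example; they are not part of heartbeat.
    """
    summary = {
        "max_queue": 0,               # largest hist queue size reported
        "lowseq_error": False,        # "lowseq cannnot be greater..." seen
        "emergency_shutdown": False,  # "Emergency Shutdown" seen
    }
    for line in lines:
        match = QUEUE_RE.search(line)
        if match:
            summary["max_queue"] = max(summary["max_queue"], int(match.group(2)))
        elif LOWSEQ_RE.search(line):
            summary["lowseq_error"] = True
        elif SHUTDOWN_RE.search(line):
            summary["emergency_shutdown"] = True
    return summary


# A few lines taken from the excerpts above, used as sample input.
SAMPLE = [
    "Jun 23 19:31:25 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (499 messages in queue)",
    "Jun 23 19:31:26 vyos03 heartbeat: [10921]: ERROR: Message hist queue is filling up (500 messages in queue)",
    "Jun 23 19:31:49 vyos03 heartbeat: [10921]: ERROR: lowseq cannnot be greater than ackseq",
    "Jun 23 19:31:50 vyos03 heartbeat: [10923]: CRIT: Emergency Shutdown: Master Control process died.",
]

if __name__ == "__main__":
    print(scan(SAMPLE))
```

On a real system the lines would come from the syslog file instead of `SAMPLE`, for example by passing an open file object to `scan`.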

Thanks,

-- Pasi

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: heartbeat 3.0.3 crashes if there are networking/multicast issues (ERROR: lowseq cannnot be greater than ackseq)

lars.ellenberg at linbit

Jun 26, 2014, 4:30 AM

Post #2 of 3

On Tue, Jun 24, 2014 at 11:20:48PM +0300, Pasi Kärkkäinen wrote:
> Hello!
>
> I've been seeing heartbeat cluster problems in Linux-based Vyatta and more recent VyOS networking/router appliances.
> These are currently based on Debian Squeeze, and thus are using:
>
> Package: heartbeat
> Version: 1:3.0.3-2

Please use 3.0.5:
http://hg.linux-ha.org/heartbeat-STABLE_3_0/archive/37f57a36a2dd.tar.bz2

> Has anyone else seen this?

It was fixed years ago ...

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.

Re: heartbeat 3.0.3 crashes if there are networking/multicast issues (ERROR: lowseq cannnot be greater than ackseq)

pasik at iki

Jun 30, 2014, 2:33 AM

Post #3 of 3

On Thu, Jun 26, 2014 at 01:30:01PM +0200, Lars Ellenberg wrote:
> On Tue, Jun 24, 2014 at 11:20:48PM +0300, Pasi Kärkkäinen wrote:
> > Hello!
> >
> > I've been seeing heartbeat cluster problems in Linux-based Vyatta and more recent VyOS networking/router appliances.
> > These are currently based on Debian Squeeze, and thus are using:
> >
> > Package: heartbeat
> > Version: 1:3.0.3-2
>
> Please use 3.0.5:
> http://hg.linux-ha.org/heartbeat-STABLE_3_0/archive/37f57a36a2dd.tar.bz2
>

Do you think v3.0.5 fixes the heartbeat process crash?

This patch perhaps? http://hg.linux-ha.org/heartbeat-STABLE_3_0/rev/3e51db646a21

Thanks,

-- Pasi
