Mailing List Archive

Question about Heartbeat "heartbeats"
Hi!

Why does heartbeat use broadcast to send status messages and the like?
I know that if you have many hosts in the cluster it is very
useful, or at least the easiest way to do it, but we are working
with two-host environments right now and the "add_new_hosts" feature
is commented out in the source code. Is there any reason for that (the
broadcasts)?
We're doing tests using keepalive = 1... so our sysadmin goes crazy
every time we start heartbeat... (with two nodes this means 2
broadcasts each second...).
If you're using a two-node cluster and have the names of both nodes
in the configuration file, you can do the job using unicast
messages. This way you avoid generating load for the other machines
on your network.
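
Just to illustrate what I mean (a rough sketch only, not heartbeat's actual
code; the peer address in the example is made up), the status packet would
simply be addressed straight to the other node:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Sketch: send one status packet directly to the peer instead of
     * to the subnet broadcast address.  The peer address is invented
     * for the example. */
    int send_status_unicast(const char *status, size_t len)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in peer;

        if (s < 0)
            return -1;

        memset(&peer, 0, sizeof(peer));
        peer.sin_family = AF_INET;
        peer.sin_port   = htons(694);                   /* udpport, e.g. 694 */
        inet_pton(AF_INET, "192.168.1.2", &peer.sin_addr);   /* the other node */

        if (sendto(s, status, len, 0,
                   (struct sockaddr *)&peer, sizeof(peer)) < 0) {
            close(s);
            return -1;
        }
        close(s);
        return 0;
    }
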
If this is a future vision, something for when we have lots of
nodes in the cluster, I have one more thought. Broadcasts aren't
routed... so your entire cluster must fit in the same subnet... that
isn't a good thing if you have load balancing and other services that
may be located on servers in different subnets. Multicast would handle
that more elegantly.

Hugs!

Luis

[ Luis Claudio R. Goncalves lclaudio@conectiva.com.br ]
[. BSc in Computer Science -- MSc coming soon -- Gospel User -- Linuxer ]
[. Fault Tolerance - Real-Time - Distributed Systems - IECLB - IS 40:31 ]
[. LateNite Programmer -- Jesus Is The Solid Rock On Which I Stand -- ]
Question about Heartbeat "heartbeats" [ In reply to ]
Triggered by Luis' note, and especially this latter comment:
==========
If this is a future vision, something for when we have lots of
nodes in the cluster, I have one more thought. Broadcasts aren't
routed... so your entire cluster must fit in the same subnet... that
isn't a good thing if you have load balancing and other services that
may be located on servers in different subnets. Multicast would handle
that more elegantly.

Hugs!

Luis

[ Luis Claudio R. Goncalves lclaudio@conectiva.com.br ]
===========
OK, I can throw in some hard-won experience here, not with Heartbeat, but
with our own work. The Topology Services (TS) component in IBM's Phoenix
cluster services (see Greg Pfister's "In Search of Clusters," or the
reference on the Linux-HA TODO page) performs the yeoman work of
"heartbeating." We use both unicast and broadcast messages, in addition,
we need to scale from 2 nodes to 512 nodes, which means that we have
encountered issues with nodes being on multiple subnets.

For the actual keepalive heartbeats, we use a ring model, where a node only
heartbeats to its 'neighbor' (normally we do it unidirectionally; an
extension to do it bidirectionally is straightforward). We do not use
broadcasts for keepalives; nodes only see the unicast messages from their
neighbors. If a node does not see a number of heartbeats (based on the
number configured by the admin), it considers its neighbor dead. Neighbor
is based on IP address, NOT on physical location.
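
To make the ring idea concrete, something along these lines (purely
illustrative, not the TS source; the names are invented):

    #include <stdint.h>

    /* Members are kept sorted by IP address; each node heartbeats to
     * the next entry in the sorted list, wrapping around, and declares
     * its upstream neighbor dead after too many silent intervals. */
    struct member {
        uint32_t ip;        /* IPv4 address, host byte order, sorted ascending */
        int      missed;    /* keepalive intervals with nothing heard */
    };

    /* The node we send our keepalive to: the next IP in the ring. */
    static int ring_neighbor(int self, int nmembers)
    {
        return (self + 1) % nmembers;
    }

    /* Called when a keepalive arrives from the upstream neighbor. */
    static void heard_from(struct member *upstream)
    {
        upstream->missed = 0;
    }

    /* Called once per keepalive interval; nonzero means "declare dead". */
    static int neighbor_dead(struct member *upstream, int missed_max)
    {
        return ++upstream->missed > missed_max;
    }
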

We essentially have a logical hierarchy of nodes: a group leader (GL),
mayors, and peons. The GL determines overall cluster health, and
distributes that information via broadcasts to the other nodes in the
cluster. In the case of multiple subnets, the GL sends a unicast to a
mayor node on remote subnets, and the mayor in turn broadcasts the message
to its subnet. Peons are simply all of the other nodes :-) We need to
recover when a GL or mayor fails, and elect new ones. These protocols are
all combinations of unicast and broadcast, or all unicast if the comm
media doesn't support broadcast (or the admin configures TS not to use it).
We've always thought multicast would be good, but there are complexities
with it regarding setting up multicast masks, recovery, etc., and when we
started this work multicast wasn't supported on all of the networks and
adapters we needed to support!
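
In rough pseudo-C (again, illustrative only; the transport helpers and the
subnet structure are invented here, not TS interfaces), the distribution
step looks something like:

    /* Hypothetical transport helpers, assumed to exist elsewhere. */
    void send_broadcast(const void *msg, int len);
    void send_unicast(unsigned long addr, const void *msg, int len);

    struct subnet {
        int           is_local;     /* the GL sits on this subnet         */
        unsigned long mayor_addr;   /* mayor's address on a remote subnet */
    };

    /* The GL broadcasts on its own subnet and unicasts to one mayor per
     * remote subnet; each mayor re-broadcasts to its own subnet. */
    void gl_distribute(const struct subnet *nets, int n, const void *msg, int len)
    {
        int i;

        for (i = 0; i < n; i++) {
            if (nets[i].is_local)
                send_broadcast(msg, len);
            else
                send_unicast(nets[i].mayor_addr, msg, len);
        }
    }

    void mayor_on_gl_message(const void *msg, int len)
    {
        send_broadcast(msg, len);           /* re-broadcast locally */
    }
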

Overall cluster status is built by the GL, by collecting status from the
nodes. When a node sees its neighbor die (well, when a node sees nothing
from its neighbor) it reports "node dead" to the GL, who subsequently
distributes the new cluster membership to all of the surviving nodes.
Initial cluster bringup is a set of smaller clusters coalescing as nodes
PROCLAIM to each other; these subclusters eventually coalesce into the One
True Cluster. Note that TS requires a pre-defined list of nodes that are
to be allowed into the cluster; we then do discovery based on finding the
nodes on that list. Nodes not in the list are ignored. The list of nodes
can be dynamically modified ("refresh").
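
The list check itself is as simple as it sounds -- roughly this
(illustrative only, not the TS code):

    #include <string.h>

    struct node_list {
        const char **names;    /* the pre-defined, admin-supplied list */
        int          count;
    };

    /* A PROCLAIM (or any other message) from a sender not in the
     * configured list is simply ignored. */
    int node_is_configured(const struct node_list *nl, const char *sender)
    {
        int i;

        for (i = 0; i < nl->count; i++)
            if (strcmp(nl->names[i], sender) == 0)
                return 1;
        return 0;
    }
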

The biggest bite here is the need to handle multiple physical subnets, and
the need to distribute messages to large numbers of nodes in a timely
fashion -- this is heartbeating after all, so all of this must happen within
a specific number of seconds, even on 500 nodes. The use of broadcast thus
introduces significant complexity, since we have to store-and-forward protocol
messages. Given it all to do over again, using multicast from the
beginning may be more effective. Using pure unicast actually makes the
code simpler, but introduces significant network traffic and performance
concerns.

One comment about resource starvation, and TS thus causing nodes to be shut
down: we've seen that on heavily loaded nodes. We've found that setting
TS' process priority and pinning it in memory on AIX is not enough to avoid
all possible blockages on a node that is very heavily loaded. I/O
interrupts can be a significant problem, as they block any process-level
execution, and the efficiency of their handling is a concern (on AIX, at
least). Also, simply pinning TS is not sufficient; the various system
libraries and their data that TS uses also need to be
pinned. All of this begins to drag heavily on the overall memory resources
available. We (who know the Phoenix code) don't have enough experience yet
with Linux to be able to know the different effects that may be hit.

One last point (this is already too long, my apologies): Phoenix is
strictly layered. TS ONLY deals with heartbeating, and declares which
nodes (in fact, which specific comm adapters and networks) are UP or DOWN.
Above TS is Group Services, the n-phase commit protocol driver and group
membership service. It uses TS to determine which nodes are up/down, and
allows its clients to perform barrier synch protocols. Above GS are
various monitors, cluster managers, and such. It is up to the layers above
TS and GS to control resources, perform recovery and failover, etc.

Peter R. Badovinatz -- (503)578-5530 (TL 775)
Clusters and High Availability, Beaverton, OR
wombat@us.ibm.com or IBMUSM00(WOMBAT)
Question about Heartbeat "heartbeats" [ In reply to ]
David Brower <dbrower@us.oracle.com> wrote on 06/01/2000 05:47:43 PM:

Thanks for this informative note, and welcome
to the party!

-dB

wombat@us.ibm.com wrote:
<deleted, everyone has seen it!>
--
Butterflies tell me to say:
"The statements and opinions expressed here are my own and do not
necessarily
represent those of Oracle Corporation."
==========

Thank you, I'll try to earn my keep.

Ah, thanks for reminding me, my new signature (just moved from
Poughkeepsie, NY to Oregon) is lacking. New one below, please mentally
retrofit it to my previous note ;-)

These have been the opinions of:
Peter R. Badovinatz -- (503)578-5530 (TL 775)
Clusters and High Availability, Beaverton, OR
wombat@us.ibm.com
and in no way should be construed as official opinion of IBM, Corp., my
email id notwithstanding.
Question about Heartbeat "heartbeats" [ In reply to ]
Thanks for this informative note, and welcome
to the party!

-dB

wombat@us.ibm.com wrote:
>
> [SNIP -- Peter's note, quoted in full earlier in this thread]
--
Butterflies tell me to say:
"The statements and opinions expressed here are my own and do not necessarily
represent those of Oracle Corporation."
Question about Heartbeat "heartbeats" [ In reply to ]
"Luis Claudio R. Goncalves" wrote:
>
> Hi!
>
> Why does heartbeat use broadcast to send status messages and the like?
> I know that if you have many hosts in the cluster it is very
> useful, or at least the easiest way to do it, but we are working
> with two-host environments right now and the "add_new_hosts" feature
> is commented out in the source code. Is there any reason for that (the
> broadcasts)?

It was easy. It was effective. Someone else did it for me. It works.

> We're doing tests using keepalive = 1... so our sysadmin goes crazy
> every time we start heartbeat... (with two nodes this means 2
> broadcasts each second...).

I understand. Multicast is a great idea.

> If you're using a two-node cluster and have the names of both nodes
> in the configuration file, you can do the job using unicast
> messages. This way you avoid generating load for the other machines
> on your network.

Unicast would work fine for 2 nodes. For lots of nodes, though, it's
O(n^2), which is a problem.
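
A quick back-of-the-envelope, assuming one status packet per node per
keepalive interval (keepalive = 1):

    n = 2:    all-pairs unicast = n*(n-1) =       2 packets/s
    n = 512:  all-pairs unicast = n*(n-1) = 261,632 packets/s
              ring (one neighbor each)    =     512 packets/s
              broadcast                   =     512 packets/s on the wire,
                                            but every host on the segment
                                            has to receive and inspect them
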

> If this is a future vision, something for when we have lots of
> nodes in the cluster, I have one more thought. Broadcasts aren't
> routed... so your entire cluster must fit in the same subnet... that
> isn't a good thing if you have load balancing and other services that
> may be located on servers in different subnets. Multicast would handle
> that more elegantly.

Multicast is on the todo list. Patches are being accepted ;-)
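
For anyone tempted, the transport side is not much code. Roughly (a sketch
only, not heartbeat code; the group address, port and TTL are arbitrary
examples):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Open a UDP socket for sending to a multicast group. */
    int open_mcast_sender(const char *group, int port, struct sockaddr_in *dst)
    {
        int s = socket(AF_INET, SOCK_DGRAM, 0);
        unsigned char ttl = 1;              /* raise this to cross routers */

        setsockopt(s, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));

        memset(dst, 0, sizeof(*dst));
        dst->sin_family = AF_INET;
        dst->sin_port   = htons(port);
        inet_pton(AF_INET, group, &dst->sin_addr);
        return s;                           /* use with sendto(s, ..., dst) */
    }

    /* Join the group on a receiving socket so the kernel delivers it. */
    int join_mcast_group(int s, const char *group)
    {
        struct ip_mreq mr;

        memset(&mr, 0, sizeof(mr));
        inet_pton(AF_INET, group, &mr.imr_multiaddr);
        mr.imr_interface.s_addr = htonl(INADDR_ANY);
        return setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mr, sizeof(mr));
    }
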

-- Alan Robertson
alanr@suse.com
Question about Heartbeat "heartbeats" [ In reply to ]
wombat@us.ibm.com wrote:
>
[SNIP]
> One comment about resource starvation, and TS thus causing nodes to be shut
> down: we've seen that on heavily loaded nodes. We've found that setting
> TS' process priority and pinning it in memory on AIX is not enough to avoid
> all possible blockages on a node that is very heavily loaded. I/O
> interrupts can be a significant problem, as they block any process-level
> execution, and the efficiency of their handling is a concern (on AIX, at
> least). Also, simply pinning TS is not sufficient; the various system
> libraries and their data that TS uses also need to be
> pinned. All of this begins to drag heavily on the overall memory resources
> available. We (who know the Phoenix code) don't have enough experience yet
> with Linux to be able to know the different effects that may be hit.

Hmmm... I hadn't thought about needing to pin in shared libraries, but
the manual page says that mlockall(MCL_FUTURE) does this:

mlockall disables paging for all pages mapped into the
address space of the calling process. This includes the
pages of the code, data and stack segment, as well as
shared libraries, user space kernel data, shared memory
and memory mapped files

So, I should be relatively OK. I have a bias towards relying on code in
libraries as little as possible. This would seem to be good practice,
given the impact of locking in big libraries.

In terms of process priority, does AIX have soft real time functions,
and did your TS code use them? I use sched_setscheduler(SCHED_RR) to
help me out in this way.
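
In other words, roughly this (a minimal sketch of the two calls, not the
actual heartbeat source):

    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Lock all current and future pages (code, data, stack, shared
     * libraries, mapped files) and ask for soft real-time round-robin
     * scheduling.  Needs root. */
    static void make_unswappable_and_realtime(void)
    {
        struct sched_param sp;

        if (mlockall(MCL_CURRENT | MCL_FUTURE) < 0) {
            perror("mlockall");
            exit(1);
        }

        memset(&sp, 0, sizeof(sp));
        sp.sched_priority = sched_get_priority_min(SCHED_RR);
        if (sched_setscheduler(0, SCHED_RR, &sp) < 0) {
            perror("sched_setscheduler");
            exit(1);
        }
    }
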

As David said: Welcome to the Party!

Thanks!

-- Alan Robertson
alanr@suse.com
Question about Heartbeat "heartbeats" [ In reply to ]
Hmmm... I hadn't thought about needing to pin in shared libraries, but
the manual page says that mlockall(MCL_FUTURE) does this:

mlockall disables paging for all pages mapped into the
address space of the calling process. This includes the
pages of the code, data and stack segment, as well as
shared libraries, user space kernel data, shared memory
and memory mapped files

So, I should be relatively OK. I have a bias towards relying on code in
libraries as little as possible. This would seem to be good practice,
given the impact of locking in big libraries.

In terms of process priority, does AIX have soft real time functions,
and did your TS code use them? I use sched_setscheduler(SCHED_RR) to
help me out in this way.

As David said: Welcome to the Party!

Thanks!

-- Alan Robertson
alanr@suse.com
=============

AIX doesn't have mlockall(). Rather, we use plock(), which locks the code
and/or data+stack regions. It doesn't touch the shared libraries and other
memory regions :-( We too avoid library calls as much as possible, but it's
impossible to avoid them completely, unfortunately. When we first did TS we were
also too naive about these impacts, and the memory usage pattern is very
sloppy for the daemon (when you have 500 nodes, with 2 adapters per node,
TS and GS (Group Services) suck up memory like you wouldn't believe :-)
thus pinning is more painful than it would be if we'd been better at
managing our memory usage (too much OO, I think.)
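
For comparison with mlockall(), the AIX-style call is roughly this (a
sketch from memory -- check the AIX documentation for the exact semantics):

    #include <stdio.h>
    #include <sys/lock.h>

    /* plock() pins the text and/or data+stack segments of the calling
     * process (PROCLOCK does both).  Unlike mlockall(MCL_FUTURE), it
     * does not cover shared libraries or other mapped regions. */
    static int pin_self(void)
    {
        if (plock(PROCLOCK) < 0) {
            perror("plock");
            return -1;
        }
        return 0;
    }
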

AIX does have soft real-time; we used it as much as we could get away with.
If memory serves me, we do use SCHED_RR. I haven't been central to TS
(Group Services was my bailiwick), and I know that the current TS guru has
continued to chase strange behaviors. One thing that hit us way back
when was fork. TS forks to drive "self death" checks when it thinks an
adapter is dead. When we did this naively, fork was killing us, so we
pre-forked at startup, or carefully assigned forks to certain threads.
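
The shape of the pre-fork trick, very roughly (illustrative only; the names
and the pipe protocol are invented here):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int helper_fd = -1;      /* write end of the pipe to the helper */

    static void run_adapter_check(void)
    {
        /* the actual "self death" probing of the adapter would go here */
    }

    /* Fork the helper once, at startup, while the daemon is still small;
     * later requests are just a one-byte write, no fork on the critical path. */
    static void prefork_helper(void)
    {
        int fds[2];
        pid_t pid;
        char cmd;

        if (pipe(fds) < 0 || (pid = fork()) < 0) {
            perror("prefork_helper");
            exit(1);
        }
        if (pid == 0) {             /* helper child */
            close(fds[1]);
            while (read(fds[0], &cmd, 1) == 1)
                run_adapter_check();
            _exit(0);
        }
        close(fds[0]);              /* parent keeps only the write end */
        helper_fd = fds[1];
    }

    /* Called when an adapter looks dead. */
    static void request_self_death_check(void)
    {
        char cmd = 'c';

        if (write(helper_fd, &cmd, 1) != 1)
            perror("write to helper");
    }
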

Also, logging I/O (writing to a trace/log file) has been a pain when
memory is constrained (AIX has no separate buffer cache; rather, it just uses
all of memory to mix file caching and working storage). We set up a
dedicated logging thread; the other threads write to pre-allocated
internal buffers and the logging thread writes them to disk.
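
Schematically it is just a fixed ring of pre-allocated slots and one
consumer (a sketch, not the TS code; the sizes are arbitrary):

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define SLOTS   256
    #define SLOTLEN 256

    static char  ring[SLOTS][SLOTLEN];   /* pre-allocated at startup */
    static int   head, tail;
    static pthread_mutex_t lk = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

    /* Called by the worker threads: copy the message into a slot and
     * return; no disk I/O on the caller's path.  Drops if the ring is full. */
    void log_msg(const char *msg)
    {
        pthread_mutex_lock(&lk);
        if ((head + 1) % SLOTS != tail) {
            strncpy(ring[head], msg, SLOTLEN - 1);
            ring[head][SLOTLEN - 1] = '\0';
            head = (head + 1) % SLOTS;
            pthread_cond_signal(&cv);
        }
        pthread_mutex_unlock(&lk);
    }

    /* The dedicated logging thread: drain slots and do the writes. */
    void *logger_thread(void *arg)
    {
        FILE *fp = (FILE *)arg;
        char  line[SLOTLEN];

        for (;;) {
            pthread_mutex_lock(&lk);
            while (head == tail)
                pthread_cond_wait(&cv, &lk);
            memcpy(line, ring[tail], SLOTLEN);
            tail = (tail + 1) % SLOTS;
            pthread_mutex_unlock(&lk);

            fprintf(fp, "%s\n", line);   /* slow I/O happens only here */
            fflush(fp);
        }
        return NULL;
    }
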

Finally, as I mentioned before, heavy I/O traffic can be very troublesome,
depending on how fast the I/O interrupts arrive and how quickly they're
handled. We'd battled in the past with some AIX issues (since fixed) that
incorrectly funneled too many interrupts onto CPU 0, and if one of our
threads happened to get onto that CPU it could get stuck for a "long time."
Long enough for the node to be crashed.

These have been the opinions of:
Peter R. Badovinatz -- (503)578-5530 (TL 775)
Clusters and High Availability, Beaverton, OR
wombat@us.ibm.com
and in no way should be construed as official opinion of IBM, Corp., my
email id notwithstanding.