Mailing List Archive

Proposal: Heartbeat ping membership
Hi,

I've been thinking seriously about a new feature for heartbeat:
ping membership

I'd like your comments and thoughts on it:

What ping membership would do is allow a switch, router, or anything else that
you can ping to become a pseudo-member of the cluster.

A pseudo-member is one which we report on through the API almost as though it
were a real member, except that it doesn't have to run heartbeat - it just has
to respond to a ping.

Such pseudo-members could become tie-breakers for 2-node clusters.

For example:

Node1 and Node2 are real members. Switch1 is a pseudo-member. Switch1 would be
pinged at an appropriate interval, and as long as the pings returned often
enough and rapidly enough, Switch1 is thought to be a member of the cluster. If
it dies or connectivity to it is lost, then heartbeat thinks that it has left
the cluster.

This allows quorum decisions in which the pseudo-member has a "vote". For
example, if you pull the ethernet from node2, then it looks around, sees that
node1 and switch1 have "died". It sees that it does not have enough members to
constitute quorum, so it gives up resources and waits for something to change.

On the other hand, node1 sees that node2 has died but switch1 is still alive,
which is two out of three "votes". It can then continue as the viable cluster.
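The vote arithmetic in this example can be sketched as follows (a hypothetical helper for illustration, not heartbeat code):

```python
# Sketch of the quorum arithmetic described above (hypothetical helper,
# not actual heartbeat code). A node counts every member it can still
# see -- itself, real members, and ping pseudo-members -- and keeps
# quorum only with a strict majority of the configured total.

def has_quorum(visible_members, total_members):
    """True if a strict majority of all configured members is visible."""
    return len(visible_members) > total_members / 2

# The 2-node + switch example: node2's ethernet is pulled.
total = 3  # node1, node2, switch1

# node2 sees only itself: 1 of 3 votes -> gives up resources.
assert not has_quorum({"node2"}, total)

# node1 still sees itself and switch1: 2 of 3 votes -> keeps running.
assert has_quorum({"node1", "switch1"}, total)
```

Note the strict majority: with an even member count, seeing exactly half is not quorum, which is why the odd-numbered pseudo-member makes a useful tie-breaker.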

Of course, it isn't really necessary to make a new membership type, but it seems
like a nice, uniform way of looking at it.

Normal nodes have status "dead", "up" or "active". Pseudo nodes might have
status "ping" or "dead". I'm undecided whether the dead status should be the
same between the two types, or not. If they're the same, then there should be a
node-type API call that would tell you if it's a normal member or a
pseudo-member.
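The two status sets and the proposed node-type call can be sketched as follows; all names here are hypothetical, not actual heartbeat API calls:

```python
# Hypothetical sketch of the proposed status sets and node-type query;
# none of these names are actual heartbeat API calls.

NORMAL_STATUSES = {"dead", "up", "active"}
PING_STATUSES = {"ping", "dead"}   # "dead" shared between both types

nodes = {
    "node1":   {"type": "normal", "status": "active"},
    "switch1": {"type": "ping",   "status": "ping"},
}

def node_type(name):
    """The node-type call that disambiguates a shared 'dead' status."""
    return nodes[name]["type"]

# A "dead" report alone is ambiguous, so callers check the type:
assert node_type("switch1") == "ping"
assert node_type("node1") == "normal"
```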

A few observations:

You must choose the ping resource such that it's "impossible" for the two
machines both to be able to communicate with the pseudo-member, yet be unable
to communicate with each other using either this interface or another one.

This does not eliminate the need for I/O fencing.

Arbitrarily perverse hardware failures can cause arbitrarily perverse problems
- and this is no exception.

You still want more than one heartbeat network - especially if you have shared
storage.


Thoughts? Comments?

-- Alan Robertson
alanr@suse.com
Proposal: Heartbeat ping membership [ In reply to ]
On Fri, 11 Aug 2000, Alan Robertson wrote:

> Hi,
>
> I've been thinking seriously about a new feature for heartbeat:
> ping membership
>
> I'd like your comments and thoughts on it:
>
> What ping membership would do is allow a switch, or router or anything else that
> you can ping to become a pseudo-member of the cluster.
>
> A pseudo-member is one which we report on through the API almost as though it
> were a real member, except that it doesn't have to run heartbeat - it just has
> to respond to a ping.
>
> Such pseudo-members could become tie-breakers for 2-node clusters.
>
> For example:
>
> Node1 and Node2 are real members. Switch1 is a pseudo-member. Switch1 would be
> pinged at an appropriate interval, and as long as the pings returned often
> enough and rapidly enough, Switch1 is thought to be a member of the cluster. If
> it dies or connectivity to it is lost, then heartbeat thinks that it has left
> the cluster.
>
> This allows quorum decisions in which the pseudo-member has a "vote". For
> example, if you pull the ethernet from node2, then it looks around, sees that
> node1 and switch1 have "died". It sees that it does not have enough members to
> constitute quorum, so it gives up resources and waits for something to change.
>
> On the other hand, node1 sees that node2 has died but switch1 is still alive,
> which is two out of three "votes". It can then continue as the viable cluster.
>
> Of course, it isn't really necessary to make a new membership type, but it seems
> like a nice, uniform way of looking at it.
>
> Normal nodes have status "dead", "up" or "active". Pseudo nodes might have
> status "ping" or "dead". I'm undecided if the dead status should be the same
> between the two types, or not. If they're the same, then there should be a
> node-type API call that would tell you if it's a normal member or a
> pseudo-member.

IMO the second choice is better because we will not have to change current
heartbeat code to handle this new node type, but only add a new separate
function to report node type. (as you said)
Proposal: Heartbeat ping membership [ In reply to ]
On Fri, Aug 11, 2000 at 05:40:26PM -0300, Marcelo Tosatti wrote:
>
> On Fri, 11 Aug 2000, Alan Robertson wrote:
>
> > Hi,
> >
> > I've been thinking seriously about a new feature for heartbeat: ping
> > membership
> >
> > I'd like your comments and thoughts on it:
> >
> > What ping membership would do is allow a switch, or router or anything
> > else that you can ping to become a pseudo-member of the cluster.
> >
> > A pseudo-member is one which we report on through the API almost as
> > though it were a real member, except that it doesn't have to run
> > heartbeat - it just has to respond to a ping.
> >
> > Such pseudo-members could become tie-breakers for 2-node clusters.
> >
> > For example:
> >
> > Node1 and Node2 are real members. Switch1 is a pseudo-member. Switch1
> > would be pinged at an appropriate interval, and as long as the pings
> > returned often enough and rapidly enough, Switch1 is thought to be a
> > member of the cluster. If it dies or connectivity to it is lost, then
> > heartbeat thinks that it has left the cluster.
> >
> > This allows quorum decisions in which the pseudo-member has a "vote".
> > For example, if you pull the ethernet from node2, then it looks around,
> > sees that node1 and switch1 have "died". It sees that it does not have
> > enough members to constitute quorum, so it gives up resources and waits
> > for something to change.
> >
> > On the other hand, node1 sees that node2 has died but switch1 is still
> > alive, which is two out of three "votes". It can then continue as the
> > viable cluster.
> >
> > Of course, it isn't really necessary to make a new membership type, but
> > it seems like a nice, uniform way of looking at it.
> >
> > Normal nodes have status "dead", "up" or "active". Pseudo nodes might
> > have status "ping" or "dead". I'm undecided if the dead status should
> > be the same between the two types, or not. If they're the same, then
> > there should be a node-type API call that would tell you if it's a
> > normal member or a pseudo-member.
>
> IMO the second choice is better because we will not have to change
> current heartbeat code to handle this new node type, but only add a new
> separate function to report node type. (as you said)

I agree. I also believe that having the ability to achieve a pseudo-quorum
within the existing heartbeat infrastructure would be very useful indeed.
The only downside I see is that switch IP addresses aren't always
resources that are used very often, so essentially this may add extra
configuration/maintenance requirements to the network. This is of course
only if you wish to use the new feature.

--
Horms
Proposal: Heartbeat ping membership [ In reply to ]
Horms wrote:
>
> On Fri, Aug 11, 2000 at 05:40:26PM -0300, Marcelo Tosatti wrote:
> >
> > On Fri, 11 Aug 2000, Alan Robertson wrote:
> >
> > > Hi,
> > >
> > > I've been thinking seriously about a new feature for heartbeat: ping
> > > membership
> > >
> > > I'd like your comments and thoughts on it:
> > >
> > > What ping membership would do is allow a switch, or router or anything
> > > else that you can ping to become a pseudo-member of the cluster.
> > >
> > > A pseudo-member is one which we report on through the API almost as
> > > though it were a real member, except that it doesn't have to run
> > > heartbeat - it just has to respond to a ping.
> > >
> > > Such pseudo-members could become tie-breakers for 2-node clusters.
> > >
> > > For example:
> > >
> > > Node1 and Node2 are real members. Switch1 is a pseudo-member. Switch1
> > > would be pinged at an appropriate interval, and as long as the pings
> > > returned often enough and rapidly enough, Switch1 is thought to be a
> > > member of the cluster. If it dies or connectivity to it is lost, then
> > > heartbeat thinks that it has left the cluster.
> > >
> > > This allows quorum decisions in which the pseudo-member has a "vote".
> > > For example, if you pull the ethernet from node2, then it looks around,
> > > sees that node1 and switch1 have "died". It sees that it does not have
> > > enough members to constitute quorum, so it gives up resources and waits
> > > for something to change.
> > >
> > > On the other hand, node1 sees that node2 has died but switch1 is still
> > > alive, which is two out of three "votes". It can then continue as the
> > > viable cluster.
> > >
> > > Of course, it isn't really necessary to make a new membership type, but
> > > it seems like a nice, uniform way of looking at it.
> > >
> > > Normal nodes have status "dead", "up" or "active". Pseudo nodes might
> > > have status "ping" or "dead". I'm undecided if the dead status should
> > > be the same between the two types, or not. If they're the same, then
> > > there should be a node-type API call that would tell you if it's a
> > > normal member or a pseudo-member.
> >
> > IMO the second choice is better because we will not have to change
> > current heartbeat code to handle this new node type, but only add a new
> > separate function to report node type. (as you said)
>
> I agree. I also believe that having the ability to achieve a pseudo-quorum
> within the existing heartbeat infrastructure would be very useful indeed.

This, of course, is the point ;-) Glad you like it. It's implemented now.
The code is pretty nice, perhaps even elegant, and a little smaller than the
UDP heartbeat code (!).

What's implemented right now in CVS is:
normal nodes: down, up, active
ping nodes: down, ping

No way to tell which node is which type yet. This will happen, but I'm
at LinuxWorld this week. It may take a few weeks to get around to it.
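For reference, a ping pseudo-member would sit alongside the normal node list in the configuration file. A sketch only; the directive names follow later released heartbeat versions and may not match the CVS code described here:

```
# /etc/ha.d/ha.cf (sketch -- directive names may differ from the
# CVS code described in this message)
node node1              # real members run heartbeat
node node2
ping 10.0.0.254         # switch1: a ping pseudo-member / tie-breaker
```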

> The only downside I see is that switch IP addresses aren't always
> resources that are used very often so essentially this may add extra
> configuration/maintenance requirements to the network. This is of course
> only if you wish to use the new feature.

Yup. If you're using BayTech power switches for Stonith devices, it'd be good
to ping them, since they're essential to the takeover process anyway.

The point is: Choose something *essential*, something *important* to your
cluster, and something that can't get network-separated from some of your
cluster nodes while still being reachable from the others.

-- Alan Robertson
alanr@suse.com
AW: Proposal: Heartbeat ping membership [ In reply to ]
Hi,

> I agree. I also believe that having the ability to achieve a pseudo-quorum
> within the existing heartbeat infrastructure would be very useful indeed.
> The only downside I see is that switch IP addresses aren't always
> resources that are used very often so essentially this may add extra
> configuration/maintenance requirements to the network. This is of course
> only if you wish to use the new feature.

"Mee too" :-) Seriously - having some sort of quorum code to make a decision
to give up resources if "the rest of the world" goes away would be quite
useful.

I see a bit of a problem with the combination of ping pseudo-members and a
secondary heartbeat network: in a two-node + one-ping-device cluster, a loss
of network connectivity on one of the nodes would not result in that node
giving up any resources, as it can still see the 2nd node on the secondary
heartbeat medium.

Bye, Martin

"you have moved your mouse, please reboot to make this change take effect"
--------------------------------------------------
Martin Bene vox: +43-316-813824
simon media fax: +43-316-813824-6
Andreas-Hofer-Platz 9 e-mail: mb@sime.com
8010 Graz, Austria
--------------------------------------------------
finger mb@mail.sime.com for PGP public key
AW: Proposal: Heartbeat ping membership [ In reply to ]
Hi,

>> I see a bit of a problem from the combination of ping-pseudo
>> members and a secondary heartbeat network: a two nodes + one
>> ping device cluster, a loss of network connectivity in one
>> of the nodes would not result in that node giving up any
>> resources as it can still see the 2nd node on the secondary
>> heartbeat medium.
>
> I don't believe so, wouldn't it lose a vote as it is missing the
> ping-pseudo member?

Yes, it would lose a node, but it would just go from 3 nodes to 2 nodes and
so still be viable.

Before node1 burns its network interface: both nodes see a 3-node cluster,
all 3 nodes are part of the cluster and alive (node1, node2 "real" nodes;
pingnode as an external pingable device).

Afterwards: node1 still sees 2 nodes in the cluster (itself and node2); the
ping-node has vanished. Doesn't it now still have two out of three nodes
visible, and thus no reason to give up its resources?

Now, WITHOUT the secondary link between node1 and node2, node1 would find
itself isolated (node2 gone, pingnode gone) and give up its resources while
node2 would find node1 gone, pingnode up -> 2/3 majority and take over the
resource previously owned by node1.

Bye, Martin
AW: Proposal: Heartbeat ping membership [ In reply to ]
Martin Bene wrote:
>
> Hi,
>
> > I agree. I also believe that having the ability to achieve a pseudo-quorum
> > within the existing heartbeat infrastructure would be very useful indeed.
> > The only downside I see is that switch IP addresses aren't always
> > resources that are used very often so essentially this may add extra
> > configuration/maintenance requirements to the network. This is of course
> > only if you wish to use the new feature.
>
> "Mee too" :-) Seriously - having some sort of quorum code to make a decision
> to give up resources if "the rest of the world" goes away would be quite
> useful.
>
> I see a bit of a problem from the combination of ping-pseudo members and a
> secondary heartbeat network: a two nodes + one ping device cluster, a loss
> of network connectivity in one of the nodes would not result in that node
> giving up any resources as it can still see the 2nd node on the secondary
> heartbeat medium.

That's a different problem.

All heartbeat does now is measure if the *node* goes away. It does NOT do
resource monitoring, where an ethernet is an example of a kind of resource.

Luis Claudio R. Goncalves of Conectiva is looking at a cluster manager based on
the heartbeat API, and that is one of the kinds of things he's looking at.

-- Alan Robertson
alanr@suse.com
Proposal: Heartbeat ping membership [ In reply to ]
Horms wrote:
>
> On Tue, Aug 15, 2000 at 10:44:21AM +0200, Martin Bene wrote:
> > Hi,
> >
> > > I agree. I also believe that having the ability to achieve a pseudo-quorum
> > > within the existing heartbeat infrastructure would be very useful indeed.
> > > The only downside I see is that switch IP addresses aren't always
> > > resources that are used very often so essentially this may add extra
> > > configuration/maintenance requirements to the network. This is of course
> > > only if you wish to use the new feature.
> >
> > "Mee too" :-) Seriously - having some sort of quorum code to make a decision
> > to give up resources if "the rest of the world" goes away would be quite
> > useful.
> >
> > I see a bit of a problem from the combination of ping-pseudo members and a
> > secondary heartbeat network: a two nodes + one ping device cluster, a loss
> > of network connectivity in one of the nodes would not result in that node
> > giving up any resources as it can still see the 2nd node on the secondary
> > heartbeat medium.
>
> I don't believe so, wouldn't it lose a vote as it is missing the
> ping-pseudo member?

I think what he's referring to is that it would continue to provide service even
when the main network is actually down.

-- Alan Robertson
alanr@suse.com
Re: Proposal: Heartbeat ping membership [ In reply to ]
Hi,

On Fri, Aug 11, 2000 at 04:40:29PM -0600, Alan Robertson wrote:

> You must choose the ping resource such that it's "impossible" for the two
> machines both to be able to communicate with the pseudo-member, but not be able
> to communicate with each other using either this interface or another one.

This is the hard part. Ethernet makes such a failure mode quite
possible. With thin-wire, cabling problems can easily cause this, and
otherwise, arp problems can often cause two machines to lose sight of
each other even though their physical media are fine.

Cheers,
Stephen
Proposal: Heartbeat ping membership [ In reply to ]
On Tue, Aug 15, 2000 at 10:44:21AM +0200, Martin Bene wrote:
> Hi,
>
> > I agree. I also believe that having the ability to achieve a pseudo-quorum
> > within the existing heartbeat infrastructure would be very useful indeed.
> > The only downside I see is that switch IP addresses aren't always
> > resources that are used very often so essentially this may add extra
> > configuration/maintenance requirements to the network. This is of course
> > only if you wish to use the new feature.
>
> "Mee too" :-) Seriously - having some sort of quorum code to make a decision
> to give up resources if "the rest of the world" goes away would be quite
> useful.
>
> I see a bit of a problem from the combination of ping-pseudo members and a
> secondary heartbeat network: a two nodes + one ping device cluster, a loss
> of network connectivity in one of the nodes would not result in that node
> giving up any resources as it can still see the 2nd node on the secondary
> heartbeat medium.

I don't believe so, wouldn't it lose a vote as it is missing the
ping-pseudo member?

--
Horms
Re: Proposal: Heartbeat ping membership [ In reply to ]
From Stephen Tweedie:
Hi,

On Fri, Aug 11, 2000 at 04:40:29PM -0600, Alan Robertson wrote:

> You must choose the ping resource such that it's "impossible" for the two
> machines both to be able to communicate with the pseudo-member, but not
be able
> to communicate with each other using either this interface or another
one.

This is the hard part. Ethernet makes such a failure mode quite
possible. With thin-wire, cabling problems can easily cause this, and
otherwise, arp problems can often cause two machines to lose sight of
each other even though their physical media are fine.

Cheers,
Stephen

=============

Yes, this is a key issue! In a 3-element setup, 2 nodes + quorum device,
if the 2 nodes lose direct connectivity (Stephen's note gives but two of the
many examples I've unfortunately seen), they may BOTH be able to ping the
quorum device. Thus, both nodes consider that their partner is dead, but
each still considers it has quorum. Lacking anything better, if they act
via STONITH, perhaps they'll shut each other down :-)

Which leaves you with some options:
- nodes must heartbeat to each other through the quorum device, in addition
to their "direct" network paths. This will tell the nodes that they are
both still alive.
- nodes must heartbeat through the shared data path, be it shared SCSI, FC,
whatever. Since they will be fighting it out for this data, the fact that
they can communicate with each other through this path means each knows the
other is up. Alternately, use of a serial path in addition to the IP
paths can provide an element of this safety.

This latter point can often actually BE the quorum device, e.g., a shared
disk on which each node writes, so long as the current owner is writing on
the proper basis, it continues to own the disk (or, set of disks 'locked'
by that disk.) The backup only gets involved when this disk heartbeat, and
all other heartbeat paths, go silent. This goes some way to meeting Alan's
above-stated rules for the pseudo-member, although it requires the ability
of the disks to be written by multiple systems, and for the heartbeat layer
to "merge" the information from all of the paths to determine death of the
other node.

To protect the disks now, some options:
- backup node uses STONITH to shut down the other node (watching out for
MAD :-)
- heartbeating across the shared disks.
- use disk-level SCSI reserves to 'lock out' other nodes, eliminates need
to use STONITH, but introduces various difficulties in that not all disks
support this, effects of booting on the reserves, breaking reserves to take
away a disk, etc.
- interpose code in the OS device layer, that mediates access to the disks
based on cluster controller (heartbeat, cluster manager) decisions, works
when the cluster software is coherent, so, not a guaranteed protection.
- a 'deadman switch' that halts a node if anything happens to the cluster
controller processes, shuts the node down 'right now.' Eliminates the need
for a STONITH device, when used in conjunction with some of these other
techniques, and assuming this is implemented as a low-level kernel service
based on timer interrupts. For example, if the heartbeat process gets
blocked and can't send out heartbeats, and the remote node decides that it
is dead, the deadman switch on the original node would guarantee this, as
the heartbeat process needs to heartbeat to the deadman switch.
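The deadman-switch idea above can be sketched as a pure user-space simulation; this is illustration only, since the message assumes a low-level kernel service driven by timer interrupts, which a Python thread cannot provide:

```python
# User-space simulation of the deadman-switch idea (a sketch; the real
# design would be a kernel-level timer, and on_expire would halt the
# node immediately rather than run a callback).
import threading

class DeadmanSwitch:
    """Runs on_expire unless the cluster software keeps petting it."""

    def __init__(self, timeout, on_expire):
        self.timeout = timeout
        self.on_expire = on_expire
        self._timer = None

    def pet(self):
        # Called each time the heartbeat process sends its heartbeat;
        # restarting the timer postpones the "halt".
        self.stop()
        self._timer = threading.Timer(self.timeout, self.on_expire)
        self._timer.daemon = True
        self._timer.start()

    def stop(self):
        if self._timer is not None:
            self._timer.cancel()
```

In the scenario described, a heartbeat process that gets blocked stops calling `pet()`, so the node halts itself shortly after the remote node declares it dead.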

These have been the opinions of:
Peter R. Badovinatz -- (503)578-5530 (TL 775)
Clusters and High Availability, Beaverton, OR
wombat@us.ibm.com
and in no way should be construed as official opinion of IBM, Corp., my
email id notwithstanding.
Re: Proposal: Heartbeat ping membership [ In reply to ]
I thought I'd mention how Kimberlite works.

In General:

It uses the shared disk for application data, cluster services status,
heartbeat, and a cluster lock to protect the cluster services status.

The cluster services status contains information about each defined
service (Oracle, NFS, etc) such as its state (running, stopped) and
for state = running, which node it is on.


Specifically related to membership, heartbeat, and shooting.

If node A sees that node B has stopped updating its heartbeat section
of the shared disk, that is, stopped pinging the disk, node A will
power cycle Node B via a remote power switch. A more detailed explanation
of this follows.

There exist two cluster daemons that relate to heartbeat.
The quorum daemon is the coordinating process and also performs the disk
ping and checks for disk pings from the other node.
The heartbeat daemon performs heartbeat over multiple IP and serial
channels.

When the quorum daemon notices that a time interval, T, has expired
under which the other node has not pinged the shared disk it queries
the heartbeat daemon to see if it thinks the other node is up.
If heartbeat says that the other node looks down over the various channels,
then quorum daemon power cycles the other node. If heartbeat says that the
other node looks good, the quorum daemon takes into account that the other
node may be busy with application I/O and the ping to the disk has been
held at bay. Thus, quorum daemon waits another time interval, T1, to see
if the other node pings the disk. At the expiration of T + T1, with
no ping of the disk by the other node, the other node is shot,
regardless of what heartbeat daemon thinks.

If a node cannot access the shared disk, the quorum daemon will reboot
the node.

The shared disk ping is the essential one. The other heartbeat channels
are only used to accelerate the failover process.
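The quorum daemon's decision sequence above can be sketched as follows (hypothetical function name; T and T1 as in the text):

```python
# Sketch of the Kimberlite decision sequence described above
# (hypothetical function; timings T and T1 as in the text).

def partner_action(disk_silence, T, T1, net_heartbeat_up):
    """Decide what the quorum daemon does about a silent partner.

    disk_silence: seconds since the partner last pinged the shared disk.
    net_heartbeat_up: what the heartbeat daemon reports over IP/serial.
    """
    if disk_silence < T:
        return "ok"              # partner is still pinging the disk
    if not net_heartbeat_up:
        return "power-cycle"     # silent on disk AND network: shoot it
    if disk_silence < T + T1:
        return "wait"            # maybe just busy with application I/O
    return "power-cycle"         # grace period exhausted: shoot anyway

T, T1 = 10, 5
assert partner_action(4,  T, T1, True)  == "ok"
assert partner_action(12, T, T1, False) == "power-cycle"
assert partner_action(12, T, T1, True)  == "wait"
assert partner_action(16, T, T1, True)  == "power-cycle"
```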

I like the method of having a node take out its partner via a remote power
switch, rather than a deadman switch. With the deadman switch method a
temporary kernel hang could cause the unhung node to take over services.
Then when the hang is released and before the deadman code gets executed,
I/Os could be issued which would corrupt data on the shared disk.

Regards,
Dave

wombat@us.ibm.com wrote:
>
> From Stephen Tweedie:
> Hi,
>
> On Fri, Aug 11, 2000 at 04:40:29PM -0600, Alan Robertson wrote:
>
> > You must choose the ping resource such that it's "impossible" for the two
> > machines both to be able to communicate with the pseudo-member, but not
> be able
> > to communicate with each other using either this interface or another
> one.
>
> This is the hard part. Ethernet makes such a failure mode quite
> possible. With thin-wire, cabling problems can easily cause this, and
> otherwise, arp problems can often cause two machines to lose sight of
> each other even though their physical media are fine.
>
> Cheers,
> Stephen
>
> =============
>
> Yes, this is a key issue! In a 3 element setup, 2 nodes + quorum device,
> if the 2 nodes lose direct connectivity, Stephen's note is but two of the
> many examples I've unfortunately seen, they may BOTH be able to ping the
> quorum device. Thus, both nodes consider that their partner is dead, but
> each still considers it has quorum. Without this capability, if they act
> via STONITH, perhaps they'll shut each other down :-)
>
> Which leaves you with some options:
> - nodes must heartbeat to each other through the quorum device, in addition
> to their "direct" network paths. This will tell the nodes that they are
> both still alive.
> - nodes must heartbeat through the shared data path, be it shared SCSI, FC,
> whatever. Since they will be fighting it out for this data, the fact that
> they can communicate with each other through this path means they know each
> other are up. Alternately, use of a serial path in addition to the IP
> paths can provide an element of this safety.
>
> This latter point can often actually BE the quorum device, e.g., a shared
> disk on which each node writes, so long as the current owner is writing on
> the proper basis, it continues to own the disk (or, set of disks 'locked'
> by that disk.) The backup only gets involved when this disk heartbeat, and
> all other heartbeat paths, go silent. This goes some way to meeting Alan's
> above-stated rules for the pseudo-member, although it requires the ability
> of the disks to be written by multiple systems, and for the heartbeat layer
> to "merge" the information from all of the paths to determine death of the
> other node.
>
> To protect the disks now, some options:
> - backup node uses STONITH to shut down the other node (watching out for
> MAD :-)
> - heartbeating across the shared disks.
> - use disk-level SCSI reserves to 'lock out' other nodes, eliminates need
> to use STONITH, but introduces various difficulties in that not all disks
> support this, effects of booting on the reserves, breaking reserves to take
> away a disk, etc.
> - interpose code in the OS device layer, that mediates access to the disks
> based on cluster controller (heartbeat, cluster manager) decisions, works
> when the cluster software is coherent, so, not a guaranteed protection.
> - a 'deadman switch' that halts a node if anything happens to the cluster
> controller processes, shuts the node down 'right now.' Eliminates the need
> for a STONITH device, when used in conjunction with some of these other
> techniques, and assuming this is implemented as a low-level kernel service
> based on timer interrupts. For example, if the heartbeat process gets
> blocked and can't send out heartbeats, and the remote node decides that it
> is dead, the deadman switch on the original node would guarantee this, as
> the heartbeat process needs to heartbeat to the deadman switch.
>
> These have been the opinions of:
> Peter R. Badovinatz -- (503)578-5530 (TL 775)
> Clusters and High Availability, Beaverton, OR
> wombat@us.ibm.com
> and in no way should be construed as official opinion of IBM, Corp., my
> email id notwithstanding.
>
> _______________________________________________________
> Linux-HA-Dev: Linux-HA-Dev@lists.tummy.com
> http://lists.tummy.com/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
Proposal: Heartbeat ping membership [ In reply to ]
Hi Alan,

> > "Mee too" :-) Seriously - having some sort of quorum code to
> > make a decision to give up resources if "the rest of the world"
> > goes away would be quite useful.
> >
> > I see a bit of a problem from the combination of ping-pseudo
> > members and a secondary heartbeat network: a two nodes + one
> > ping device cluster, a loss of network connectivity in one
> > of the nodes would not result in that node giving up any
> > resources as it can still see the 2nd node on the secondary
> > heartbeat medium.
>
> That's a different problem.
>
> All heartbeat does now is measure if the *node* goes away. It does NOT do
> resource monitoring, where an ethernet is an example of a kind of
> resource.

The way I see it, there are resources so fundamental to the functioning of a
node that the whole node could/should be considered to be down if the
resource is not available. Ethernet access to the internet gateway would be
a prime example of such a resource.

Bye, Martin
Re: Proposal: Heartbeat ping membership [ In reply to ]
wombat@us.ibm.com wrote:
>
> From Stephen Tweedie:
> Hi,
>
> On Fri, Aug 11, 2000 at 04:40:29PM -0600, Alan Robertson wrote:
>
> > You must choose the ping resource such that it's "impossible" for the two
> > machines both to be able to communicate with the pseudo-member, but not
> be able
> > to communicate with each other using either this interface or another
> one.
>
> This is the hard part. Ethernet makes such a failure mode quite
> possible. With thin-wire, cabling problems can easily cause this, and
> otherwise, arp problems can often cause two machines to lose sight of
> each other even though their physical media are fine.
>
> Cheers,
> Stephen
>
> =============
>
> Yes, this is a key issue! In a 3 element setup, 2 nodes + quorum device,
> if the 2 nodes lose direct connectivity, Stephen's note is but two of the
> many examples I've unfortunately seen, they may BOTH be able to ping the
> quorum device. Thus, both nodes consider that their partner is dead, but
> each still considers it has quorum. Without this capability, if they act
> via STONITH, perhaps they'll shut each other down :-)

If you use the BayTech switch, and this happens, then they'll at least both come
back up. Other switches aren't quite so smart and can result in both nodes
going down and not coming back up. Selection of hardware requires careful
thought. You can also make this much less likely with software timing
considerations.

> Which leaves you with some options:
> - nodes must heartbeat to each other through the quorum device, in addition
> to their "direct" network paths. This will tell the nodes that they are
> both still alive.

This seems hard to do in the case of a stupid quorum device. Perhaps you could
forge your return address on the ping ;-)

Would that work?

> - nodes must heartbeat through the shared data path, be it shared SCSI, FC,
> whatever. Since they will be fighting it out for this data, the fact that
> they can communicate with each other through this path means they know each
> other are up. Alternately, use of a serial path in addition to the IP
> paths can provide an element of this safety.

Kimberlite does this. I've been thinking about adding it to heartbeat. I don't
have any shared media devices. Perhaps someone in Mission Critical could try
this out for me. I'll drop by their booth today and ask...

> This latter point can often actually BE the quorum device, e.g., a shared
> disk on which each node writes, so long as the current owner is writing on
> the proper basis, it continues to own the disk (or, set of disks 'locked'
> by that disk.) The backup only gets involved when this disk heartbeat, and
> all other heartbeat paths, go silent. This goes some way to meeting Alan's
> above-stated rules for the pseudo-member, although it requires the ability
> of the disks to be written by multiple systems, and for the heartbeat layer
> to "merge" the information from all of the paths to determine death of the
> other node.
>
> To protect the disks now, some options:
> - backup node uses STONITH to shut down the other node (watching out for
> MAD :-)

Choose your device and/or technique carefully!

> - heartbeating across the shared disks.
> - use disk-level SCSI reserves to 'lock out' other nodes, eliminates need
> to use STONITH, but introduces various difficulties in that not all disks
> support this, effects of booting on the reserves, breaking reserves to take
> away a disk, etc.

This is very difficult to manage in Linux, given Linux's lack of commitment to
that capability, the wide variety of SCSI hardware, and the continually changing
kernel. This is what SteelEye's LifeKeeper does. I wish them good luck (very
sincerely)!

> - interpose code in the OS device layer, that mediates access to the disks
> based on cluster controller (heartbeat, cluster manager) decisions, works
> when the cluster software is coherent, so, not a guaranteed protection.
> - a 'deadman switch' that halts a node if anything happens to the cluster
> controller processes, shuts the node down 'right now.' Eliminates the need
> for a STONITH device, when used in conjunction with some of these other
> techniques, and assuming this is implemented as a low-level kernel service
> based on timer interrupts. For example, if the heartbeat process gets
> blocked and can't send out heartbeats, and the remote node decides that it
> is dead, the deadman switch on the original node would guarantee this, as
> the heartbeat process needs to heartbeat to the deadman switch.

You could also have heartbeat interact with the watchdog device. It does so to
some extent now, although the heartbeat/watchdog interaction isn't yet what it
needs to be.

What you need is:
The ability to set the watchdog timer slightly greater than the dead-node timer.
Right now, heartbeat tickles the timer whenever it hears its own heartbeat.

What we have now:
A fixed timer interval much greater than the dead-node timer, and no way to turn
the timer off when heartbeat is shut down. This is fixable, but not very good as
it stands...
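
The desired behavior can be sketched as a toy model (this is not the real watchdog driver interface, just the timing logic): the deadman timer is armed slightly beyond the dead-node interval and is tickled only when the node hears its own heartbeat come back.

```python
class DeadmanTimer:
    """Toy model of the heartbeat/watchdog interaction described above."""

    def __init__(self, deadtime_secs: float, margin: float = 1.2):
        # Fire only after the peers would already have declared us dead.
        self.timeout = deadtime_secs * margin
        self.last_tickle = 0.0

    def heard_own_heartbeat(self, now: float) -> None:
        # Tickle on our *own* packets: if the heartbeat process is
        # blocked and can't send, the tickles stop too.
        self.last_tickle = now

    def would_fire(self, now: float) -> bool:
        return (now - self.last_tickle) > self.timeout
```

The point of the margin is the ordering guarantee: by the time the watchdog pops, the remote node has already declared us dead, so the takeover is safe.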

-- Alan Robertson
alanr@suse.com
Proposal: Heartbeat ping membership [ In reply to ]
Martin Bene wrote:
>
> Hi Alan,
>
> > > "Mee too" :-) Seriously - having some sort of quorum code to
> > > make a decision to give up resources if "the rest of the world"
> > > goes away would be quite useful.
> > >
> > > I see a bit of a problem from the combination of ping-pseudo
> > > members and a secondary heartbeat network: a two nodes + one
> > > ping device cluster, a loss of network connectivity in one
> > > of the nodes would not result in that node giving up any
> > > resources as it can still see the 2nd node on the secondary
> > > heartbeat medium.
> >
> > That's a different problem.
> >
> > All heartbeat does now is measure if the *node* goes away. It does NOT do
> > resource monitoring, where an ethernet is an example of a kind of
> > resource.
>
> The way I see it, there are resources so fundamental to the functioning of a
> node that the whole node could/should be considered to be down if the
> resource is not available. Ethernet access to the internet gateway would be
> a prime example of such a resource.

Agreed. All I was doing is making it clear that this is how it operates.

-- Alan Robertson
alanr@suse.com
Re: Proposal: Heartbeat ping membership [ In reply to ]
From Alan Robertson:
> - use disk-level SCSI reserves to 'lock out' other nodes, eliminates need
> to use STONITH, but introduces various difficulties in that not all disks
> support this, effects of booting on the reserves, breaking reserves to take
> away a disk, etc.

This is very difficult to manage in Linux, given Linux's lack of commitment to
that capability, the wide variety of SCSI hardware, and the continually changing
kernel. This is what SteelEye's LifeKeeper does. I wish them good luck (very
sincerely)!
=========

Issues here about reserve understood, and accepted! HACMP on AIX is
strongly dependent on disk reserves, which has always limited the range of
disks that we are able to support to those that properly handled this.
Even then, ironing out all of the bugs has always been a difficult factor
in certifying new hardware and software releases. But, given how much the
customers were already paying for the 'extra' hardware and the software,
buying additional devices (e.g., STONITH controllers) was not generally
attractive to them so reserve was a reasonable choice (ah, the "good" old
proprietary days ;-) I seem also to remember anecdotal info about folks
sometimes being mistrustful of our software doing something like 'reaching
out' and shutting nodes down!

I included reserve here for completeness, and, as you mention, it is used
in places. Of course, it does highlight a point, to quote from something I
saved (although I have no attribution for it, it popped out of 'fortune'
one day :-)
"If you ever want to have a lot of fun, I recommend that you go off and
program an imbedded system. The salient characteristic of an imbedded
system is that it cannot be allowed to get into a state from which only
direct intervention will suffice to remove it. An imbedded system can't
permanently trust anything it hears from the outside world. It must sniff
around, adapt, consider, sniff around, and adapt again. I'm not talking
about ordinary modular programming carefulness here. No. Programming an
imbedded system calls for undiluted raging maniacal paranoia."

Change 'imbedded system' to 'high availability system' and I think it fits
rather well :-) My point being that whether it is reserve, or STONITH, you
need to count on a variety of software and hardware working within rather
strict timing parameters for everything to hold together properly. And the
paranoia about "what can go wrong NOW" can make you crazy!

These have been the opinions of:
Peter R. Badovinatz -- (503)578-5530 (TL 775)
Clusters and High Availability, Beaverton, OR
wombat@us.ibm.com
and in no way should be construed as official opinion of IBM, Corp., my
email id notwithstanding.
Proposal: Heartbeat ping membership [ In reply to ]
On Wed, 16 Aug 2000, Martin Bene wrote:

> The way I see it, there are resources so fundamental to the functioning of a
> node that the whole node could/should be considered to be down if the
> resource is not available. Ethernet access to the internet gateway would be
> a prime example of such a resource.

I'm currently working on a simple scheme of dependencies. It
will be implemented as a separate (wannabe) Service Manager. Using the
current heartbeat jargon, you have:

resource_group -> services -> dependencies

A resource group is atomic (in case of problems you have to move
the whole resource group) and is represented by an IP address.
A service belongs to that resource group and may have some
dependencies.
The dependencies can be services, links, files, weather, time of
day or anything else you can check somehow...

The rules I'm working on are:

- If one service fails, it can be restarted up to max_restarts
times... each service may have its own max_restarts value.
- If one service fails due to dependencies and the faulty
dependency is a service, we can try to restart that dependency
(service).
- If one service fails due to a link dependency or any other
resource that isn't a service, or if any service tries to go over
max_restarts, it's time to trigger a failover or a failback.
- If we need to fail over or fail back and the other node is
out-of-this-earth... shit, I mean, shout as loud as you can and try to
wake the sysadmin... yep, you're in trouble :)
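
Those rules might reduce to a decision function something like this (my own sketch of the rules above; all names are made up):

```python
def decide(failure_kind: str, restarts: int, max_restarts: int, peer_up: bool) -> str:
    """Map a service failure to an action, per the rules above.

    failure_kind: "service"      - the service itself died
                  "dep-service"  - a dependency that is itself a service failed
                  "dep-other"    - a link, file, or other non-service dependency failed
    """
    if failure_kind == "dep-other" or restarts >= max_restarts:
        # Non-restartable cause, or restart budget exhausted: move the group.
        return "failover" if peer_up else "wake-the-sysadmin"
    if failure_kind == "dep-service":
        return "restart-dependency"
    return "restart-service"
```

The interesting property is that every path terminates: either a bounded number of restarts, a failover, or a human.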

Hugs!

Luis

[ Luis Claudio R. Goncalves lclaudio@conectiva.com.br ]
[. MSc coming soon -- Conectiva HA Team -- Gospel User -- Linuxer -- :) ]
[. Fault Tolerance - Real-Time - Distributed Systems - IECLB - IS 40:31 ]
[. LateNite Programmer -- Jesus Is The Solid Rock On Which I Stand -- ]
Re: Proposal: Heartbeat ping membership [ In reply to ]
"Stephen C. Tweedie" wrote:
>
> Hi,
>
> On Fri, Aug 11, 2000 at 04:40:29PM -0600, Alan Robertson wrote:
>
> > You must choose the ping resource such that it's "impossible" for the two
> > machines both to be able to communicate with the pseudo-member, but not be able
> > to communicate with each other using either this interface or another one.
>
> This is the hard part. Ethernet makes such a failure mode quite
> possible. With thin-wire, cabling problems can easily cause this, and
> otherwise, arp problems can often cause two machines to lose sight of
> each other even though their physical media are fine.

Nobody uses thin-wire anymore, do they? Modern hardware (100mbit) doesn't
support it.

I was a little slow to tumble to this ARP issue, so sorry for the delayed
response...

If your cluster heartbeats over a single subnet using either broadcast or
multicast, this isn't an issue -- because ARPs don't enter into addressability
in this case. I hadn't thought about it before, but broadcast is more reliable
than point-to-point because there are fewer things to fail <:-S
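
For concreteness, a broadcast heartbeat socket might be set up like this (a sketch; the port is parameterized so the example runs unprivileged, and the function name is my own):

```python
import socket

def make_bcast_heartbeat_socket(port: int) -> socket.socket:
    # SO_BROADCAST lets us send to the subnet broadcast address, which
    # bypasses ARP entirely: there is no per-neighbor address resolution
    # to go stale or fail.
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", port))
    return s
```

Everything after this is just sendto() to the broadcast address and recvfrom() on the bound port; the delivery path never consults the ARP cache for a unicast neighbor.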

For the case of most concern here (2 nodes), I believe the following
configuration will not be subject to this failure in any meaningful way:

A public ethernet for heartbeating

Serial ring (or private ethernet if you insist) for heartbeating

ping heartbeats to something important and "near": a switch, router, or
STONITH device

The probability of the failure mode previously described is extremely low in
the first place when using broadcasts, and having it occur when both public and
private communication have failed is wildly improbable. This is especially true
when you are using broadcasts or multicasts, where the addressing is pretty much
foolproof on a single subnet. Your SCSI/FC controller is probably much more
likely to scrog the disk on its own for no apparent reason. If you really
mistrust your ethernet that much, use serial for your backup medium, or even
both serial and a secondary ethernet.

I am aware that the ping medium needs ARPs, but if it fails and the other one
works, we have no problem. If it fails and the other fails, we have no problem
either. If it succeeds while all other communication media fail (including the
broadcasts), we have a problem. This is the only case of interest, and it is
the one I think wildly improbable when normal heartbeats go through
broadcast/multicast on the same subnet. It is like a SCSI controller claiming
to have written something to disk and either not doing it or writing the wrong
thing. Of course, this *does* happen, but when it does, you're in trouble.
That's just how it is. Some troubles can't be avoided.

NOW...
It is NOT the case that this eliminates the need for redundant heartbeat media,
because the ping heartbeat is not independent of your normal eth0 heartbeat, but
instead is used to get a quorum vote when you think the other machine is down.
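
Concretely, the quorum arithmetic with a ping pseudo-member might look like this (a sketch of the proposal, not existing heartbeat code): every member, real or pseudo, gets one vote, and a partition survives only with a strict majority.

```python
def have_quorum(reachable_members: int, total_members: int) -> bool:
    """Strict majority of all members (real + pseudo), counting ourselves."""
    return 2 * reachable_members > total_members
```

In the two-node-plus-switch example: node1 sees itself and switch1 (2 of 3 votes) and keeps running; node2 with its ethernet pulled sees only itself (1 of 3) and gives up its resources.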

Comments, flames, etc.?

-- Alan Robertson
alanr@suse.com
Re: Proposal: Heartbeat ping membership [ In reply to ]
Hi,

On Mon, Aug 21, 2000 at 09:37:24PM -0600, Alan Robertson wrote:

> If your cluster heartbeats over a single subnet using either broadcast or
> multicast, this isn't an issue -- because ARPs don't enter into addressability
> in this case. I hadn't thought about it before, but broadcast is more reliable
> than point-to-point because there are fewer things to fail <:-S

However, if you have got a bridged subnet, there are LOTS of
interesting ways in which it can (and, eventually, will) fail!

--Stephen
Re: Proposal: Heartbeat ping membership [ In reply to ]
"Stephen C. Tweedie" wrote:
>
> Hi,
>
> On Mon, Aug 21, 2000 at 09:37:24PM -0600, Alan Robertson wrote:
>
> > If your cluster heartbeats over a single subnet using either broadcast or
> > multicast, this isn't an issue -- because ARPs don't enter into addressability
> > in this case. I hadn't thought about it before, but broadcast is more reliable
> > than point-to-point because there are fewer things to fail <:-S
>
> However, if you have got a bridged subnet, there are LOTS of
> interesting ways in which it can (and, eventually, will) fail!

Agreed.

But this is easily avoidable in the most interesting case:
2 nodes that need a third to break the tie.

Does it seem that this is a reasonable approach for this case, given that you
manage your network topology appropriately?

-- Alan Robertson
alanr@suse.com