Mailing List Archive

tracking resource groups in heartbeat
The conversation below is taken from email off the list. It seemed
generally interesting though...

Horms wrote:
> ... I have, however, noticed that as heartbeat keeps state
> for nodes, and not resource allocations, it is possible to
> get into a state where no node, or more than one node, owns
> a resource, in particular if there is a communication medium
> failure, or if heartbeat is started up on more than one node
> simultaneously. I have been thinking of some fairly simple
> mechanisms to resolve this, namely nodes requesting
> ownership of a resource. I am wondering what your thoughts
> are. I am most concerned about the (simple) two-node case,
> though something that extends beyond that would be nice.

The folks from Conectiva are doing something in a related area. In the
current code, the assumption is that if the master for a resource is up,
it has control of the resources it is listed as master for. They break
that assumption with a new feature (nice_failover?). It would be good
to add your thoughts and observations to that, and think about the right
way of thinking about this stuff. Once one has the right mental model,
the code is easy :-)

There is a mechanism right now for a node to make a cluster-request to
get ownership of a resource group. There is a way to tell if a node
owns a particular resource, but there is no cluster-request to ask the
cluster which node owns a particular resource. Obviously there is an
auditing problem that goes with it as well. In this case, every node
should answer "yes" or "no", rather than having only the owning node
answer "yes" while everyone else stays silent.

This is also related to the cluster partitioning problem, in that you
need resource auditing to recover from a partitioned cluster. So, these
three things are related to each other and the concept of resource
ownership.

More thoughts?

-- Alan Robertson
alanr@suse.com
tracking resource groups in heartbeat [ In reply to ]

On Tue, Mar 28, 2000 at 09:35:08PM -0700, Alan Robertson wrote:
> The conversation below is taken from email off the list. It seemed
> generally interesting though...
>
> Horms wrote:
> > ... I have however noticed that as heartbeat keeps state of nodes, and
> > not resource allocations it is possible to get into a state where no
> > nodes/more than one node have a resource. In particular if there is a
> > communication medium failure, or if heartbeat is started up on more
> > than one node simultaneously. I have been thinking of some fairly
> > simple mechanisms to resolve this, vis a vis nodes requesting ownership
> > of a resource. I am wondering what your thoughts are. I am most
> > concerned about the (simple) two-node case, though something that
> > extends beyond that would be nice.
>
> The folks from Conectiva are doing something in a related area. In the
> current code, the assumption is that if the master for a resource is up,
> it has control of the resources it is listed as master for. They break
> that assumption with a new feature (nice_failover?). It would be good to
> add your thoughts and observations to that, and think about the right way
> of thinking about this stuff. Once one has the right mental model, the
> code is easy :-)

It seems to me that the existing code will take control of a resource if
the master specified in haresources fails, but not necessarily give it up
when the master comes back up again. Again, in the case of a media failure,
or of nodes coming up at the same time, both nodes may take ownership of a
resource and neither will give it up.

I have attached a patch that I believe will fix this problem. If
nice_failover is in operation then this patch will cause both nodes to drop
the resource, which is bad, but they would both keep it otherwise, so it is
problematic in either case. Also, if a resource has more than one master,
then this patch results in resources being dropped by all nodes or by no
nodes, depending on your haresources file. This isn't very good either, but
if a resource has one master and one slave then it works.

As an aside, I notice that much of the manipulation of resources, and
reading of haresources, is done by shell scripts. I am thinking that it
would make more sense to merge this into the heartbeat C code, which should
make code paths easier to track.

> There is a mechanism right now for a node to make a cluster-request to
> get ownership of a resource group. There is a way to tell if a node owns
> a particular resource, but there is no cluster-request to ask the cluster
> which node owns a particular resource. Obviously there is an auditing
> problem that goes with it as well. In this case, every node should
> answer "yes" or "no", not just have the owning node answer "yes", and
> everyone else give silence.

True. And if more than one node says yes, there needs to be a mechanism
to decide who should relinquish the resource.
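
One candidate mechanism, sketched below under invented names (resolve_conflict is not a heartbeat function): prefer the node listed as master in haresources, and fall back to the lower node name so both sides independently reach the same verdict.

```shell
#!/bin/sh
# Hypothetical tie-break for the case where two nodes both answer "yes"
# for a resource. The names here are illustrative, not heartbeat APIs.
# resolve_conflict MASTER SELF OTHER -> "keep" or "release" (for SELF)
resolve_conflict() {
    master=$1 self=$2 other=$3
    if [ "$self" = "$master" ]; then
        echo keep                  # we are the haresources master
    elif [ "$other" = "$master" ]; then
        echo release               # the real master also claims it: back off
    else
        # Neither claimant is the master: let the lowest node name win,
        # so both sides compute the same answer independently.
        first=$(printf '%s\n%s\n' "$self" "$other" | sort | head -n 1)
        if [ "$self" = "$first" ]; then echo keep; else echo release; fi
    fi
}
```

Both nodes run the same rule with the roles swapped, so exactly one of them ends up keeping the resource.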

> This is also related to the cluster partitioning problem, in that you
> need resource auditing to recover from a partitioned cluster. So, these
> three things are related to each other and the concept of resource
> ownership.

Agreed.

--
Horms

[Attachment: heartbeat-0.4.6d-mach_up.patch]

diff -ruN heartbeat.orig/heartbeat/Makefile heartbeat/heartbeat/Makefile
--- heartbeat.orig/heartbeat/Makefile Mon Jan 31 07:37:14 2000
+++ heartbeat/heartbeat/Makefile Wed Mar 29 00:34:24 2000
@@ -80,7 +80,7 @@

PRODUCTS = $(LIBCMDS)

-LIBSCRIPTS = lib/mach_down lib/req_resource lib/ResourceManager
+LIBSCRIPTS = lib/mach_down lib/mach_up lib/req_resource lib/ResourceManager

RESOURCECMDS= resource.d/IPaddr resource.d/ldirectord

diff -ruN heartbeat.orig/heartbeat/lib/mach_up heartbeat/heartbeat/lib/mach_up
--- heartbeat.orig/heartbeat/lib/mach_up Wed Dec 31 16:00:00 1969
+++ heartbeat/heartbeat/lib/mach_up Wed Mar 29 00:25:34 2000
@@ -0,0 +1,17 @@
+#!/bin/sh
+#
+# This script will only work for a two machine setup...
+# More than that and you need to vote, or something...
+#
+#
+. /etc/ha.d/shellfuncs
+
+: Now running $0: $*
+
+mdown=$1; # The name of the downed machine...
+
+for groupkey in `$HA_BIN/ResourceManager listkeys $mdown`
+do
+ ha_log "Giving up resource group $groupkey"
+ $HA_BIN/ResourceManager givegroup $groupkey
+done
diff -ruN heartbeat.orig/heartbeat/rc.d/status heartbeat/heartbeat/rc.d/status
--- heartbeat.orig/heartbeat/rc.d/status Wed Nov 10 12:31:46 1999
+++ heartbeat/heartbeat/rc.d/status Wed Mar 29 00:33:40 2000
@@ -6,4 +6,5 @@

case $HA_st in
dead) $HA_BIN/mach_down $HA_src;;
+ up) $HA_BIN/mach_up $HA_src;;
esac

tracking resource groups in heartbeat [ In reply to ]
Hello!

On Wed, 29 Mar 2000, horms wrote:
> It seems to me that the existing code will take control of a resource if
> the master specified in haresources fails, but not necessarily give it up
> when the master comes back up again.

That's the idea. It's useful mainly for resyncing disks and other
operations like that, and the service doesn't stop for 15+ seconds just
because the master came back.
Depending on what you're doing with the cluster it may be good or
not... so you can turn this behavior on/off in ha.cf.

> Again, in the case of a media failure,
> or nodes coming up at the same time both nodes may take ownership of a
> resource and neither will give it up.

What do you mean by "media failure"?
We did lots of tests using nice_failback and we didn't get any
races, though it's possible. I'm working on a better protocol for
nice_failback. Something like: if both servers are starting right
now, the master takes control.

> As an aside. I notice that much of the manipulation of resources, and
> reading of haresources is done by shell scripts. I am thinking that it
> would make more sense to merge this into the heartbeat C code, which should
> make code paths easier to track.

But it is easier to add new resources, or to change the manipulation
behavior for some special services.
I don't believe everyone who uses heartbeat will dig into the
source code to modify something. IMHO it makes things easier.

Hugs!

Luis

[ Luis Claudio R. Goncalves lclaudio@conectiva.com.br ]
[. BSc in Computer Science -- MSc coming soon -- Gospel User -- Linuxer ]
[. Fault Tolerance - Real-Time - Distributed Systems - IECLB - IS 40:31 ]
[. LateNite Programmer -- Jesus Is The Solid Rock On Which I Stand -- ]
tracking resource groups in heartbeat [ In reply to ]
horms wrote:
>
> On Tue, Mar 28, 2000 at 09:35:08PM -0700, Alan Robertson wrote:
> > The conversation below is taken from email off the list. It seemed
> > generally interesting though...
> >
> > Horms wrote:
> > > ... I have however noticed that as heartbeat keeps state of nodes, and
> > > not resource allocations it is possible to get into a state where no
> > > nodes/more than one node have a resource. In particular if there is a
> > > communication medium failure, or if heartbeat is started up on more
> > > than one node simultaneously. I have been thinking of some fairly
> > > simple mechanisms to resolve this, vis a vis nodes requesting ownership
> > > of a resource. I am wondering what your thoughts are. I am most
> > > concerned about the (simple) two-node case, though something that
> > > extends beyond that would be nice.
> >
> > The folks from Conectiva are doing something in a related area. In the
> > current code, the assumption is that if the master for a resource is up,
> > it has control of the resources it is listed as master for. They break
> > that assumption with a new feature (nice_failover?). It would be good to
> > add your thoughts and observations to that, and think about the right way
> > of thinking about this stuff. Once one has the right mental model, the
> > code is easy :-)
>
> It seems to me that the existing code will take control of a resource if
> the master specified in haresources fails, but not necessarily give it up
> when the master comes back up again.

Without nice_failback (which isn't in the current code), this should not
happen. When the master comes back up, it asks the other node to give up
its resources, and in the case of no response takes them anyway.

> Again, in the case of a media failure,
> or nodes coming up at the same time both nodes may take ownership of a
> resource and neither will give it up.

I agree in the case of media failure. Have you observed it in the case
of both nodes coming up at the same time?

The current bringup sequence is: Start your own heartbeat. Wait until
you've heard someone else's heartbeat or about 10 seconds. Begin the
resource takeover sequence for those resources you master. If you've
heard someone else's heartbeat, then communications with the other end
are working. I think the problem right now is that there is no database
indicating the state of either resources or nodes. Everything depends
on the resource scripts to indicate resource status.
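
The bringup sequence above can be condensed into a small decision rule. This is only a sketch of the described behavior (the function name is invented; the 10-second figure comes from the description, not the code):

```shell
#!/bin/sh
# Sketch of the current bringup decision: wait until we have heard a
# peer heartbeat or ~10 seconds have elapsed, then begin taking over
# the resources we master.
# bringup_action HEARD_PEER(yes/no) SECONDS_WAITED -> next step
bringup_action() {
    heard=$1 waited=$2
    if [ "$heard" = yes ] || [ "$waited" -ge 10 ]; then
        echo take-over-mastered    # begin takeover of resources we master
    else
        echo keep-waiting          # still listening for a heartbeat
    fi
}
```

Note that the decision depends only on hearing a heartbeat or timing out; nothing here consults the actual state of the resources, which is exactly the missing database.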

Here's what I think the race condition might be: Node A is master and
is down. Node B is also down, but is the slave. Node B comes up, and
just about the time that it times out on Node A being down, Node A
begins to come up. Node B times out on the resources A is primary on,
and begins the process of taking them over. Node A comes up, and seeing
B's heartbeat, immediately requests its resources. Node B has started
the takeover scripts, but they aren't done, so it thinks it doesn't own
them, so it doesn't give them up. Node A then takes them over, while
Node B's scripts are in the process of doing the same.

> I have attached a patch that I believe will fix this problem. If
> nice_failover is in operation then this patch will cause both nodes to drop
> the resource, which is bad, but they would both keep it otherwise so it is
> problematic in either case. Also if a resource has more than one master -
> then this patch results in resources being dropped by all nodes or no nodes,
> depending on your haresources file. This isn't very good either but if a
> resource has a master and a slave then it works.

My guess is that we need to design a "good" bringup algorithm that has
the right kinds of sequencing and status changes such that it doesn't
have any race conditions. This is moderately complex, but is probably
the better approach. I started to write one here, but found it too hard
to write inline in email.

-- Alan Robertson
alanr@suse.com
tracking resource groups in heartbeat [ In reply to ]
Hello!

On Wed, 29 Mar 2000, Alan Robertson wrote:

> My guess is that we need to design a "good" bringup algorithm that has
> the right kinds of sequencing and status changes such that it doesn't
> have any race conditions. This is moderately complex, but is probably
> the better approach. I started to write one here, but found it too hard
> to write inline in email.

I've just created a "starting" message that is sent x times in the
first 10 seconds (RQSTDELAY). If both nodes see the "start" message
from each other, the master takes the services. If the slave is
already active, the master stays quiet.
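
As a sketch of that rule (hypothetical names; the actual patch may differ):

```shell
#!/bin/sh
# Sketch of the "starting" message rule: if both nodes are starting
# simultaneously, the master takes the services; if the slave is
# already active, the master stays quiet.
# start_decision ROLE PEER_STARTING(yes/no) PEER_ACTIVE(yes/no) -> action
start_decision() {
    role=$1 peer_starting=$2 peer_active=$3
    if [ "$role" = master ]; then
        if [ "$peer_active" = yes ]; then
            echo stay-quiet        # slave already runs the services
        else
            echo take-services     # simultaneous start: master wins
        fi
    elif [ "$peer_starting" = yes ]; then
        echo stand-by              # master is starting too; defer to it
    else
        echo take-services         # no master in sight after the delay
    fi
}
```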

Luis

[ Luis Claudio R. Goncalves lclaudio@conectiva.com.br ]
[. BSc in Computer Science -- MSc coming soon -- Gospel User -- Linuxer ]
[. Fault Tolerance - Real-Time - Distributed Systems - IECLB - IS 40:31 ]
[. LateNite Programmer -- Jesus Is The Solid Rock On Which I Stand -- ]
tracking resource groups in heartbeat [ In reply to ]
On Wed, Mar 29, 2000 at 09:44:26AM -0300, Luis Claudio R. Goncalves wrote:
> Hello!
>
> On Wed, 29 Mar 2000, horms wrote:
> > It seems to me that the existing code will take control of a resource if
> > the master specified in haresources fails, but not necessarily give it up
> > when the master comes back up again.
>
> That's the idea. It's useful mainly to resync disks and other
> operations like that. And the service wouldn't stop for 15+ seconds
> because the master came back.
> Depending on what you're doing with the cluster it may be good or
> not... so you can turn this behavior on/off in ha.cf.
>
> > Again, in the case of a media failure,
> > or nodes coming up at the same time both nodes may take ownership of a
> > resource and neither will give it up.
>
> What do you mean by "media failure"?
> We did lots of tests using nice_failback and we didn't get any
> races, though it's possible. I'm working on a better protocol for
> nice_failback. Something like: if both servers are starting right
> now, the master takes control.

I am testing with heartbeat out of cvs (yesterday) + the nice_failback and
link status patches.


Scenario 1

Host A and Host B are not running heartbeat, nice_failback is on
Host B is the master for a resource
Host A and Host B start heartbeat at the same time

Host A
Configuration validated. Starting heartbeat.
UDP heartbeat started on port 1001 interface eth0
Waiting for someone else...
heartbeat startup succeeded
node louise.su.valinux.com -- link eth0: status up
The cluster is already active
Requesting our resources.
No local resources [/usr/lib/heartbeat/ResourceManager listkeys louise.su.valinux.com]
node flim.su.valinux.com -- link eth0: status up

Host B
Configuration validated. Starting heartbeat.
UDP heartbeat started on port 1001 interface eth0
Waiting for someone else...
heartbeat startup succeeded
The cluster is already active
Requesting our resources.
Acting as standby for resource 192.168.0.0/24
node louise.su.valinux.com -- link eth0: status up
node flim.su.valinux.com -- link eth0: status up

But ifconfig on Host A and Host B shows that neither machine
has the resource (IP address).


Scenario 2

Host A and Host B are running heartbeat, nice_failback is on
Host B is the master for a resource
Host B has the resource
The ethernet (and serial and whatever) link between Host A and Host B is cut.
Both machines have the resource
The ethernet link is restored

Host A
Retransmitting pkt 40
Retransmitting pkt 39
Retransmitting pkt 38
Retransmitting pkt 37
Retransmitting pkt 36
Retransmitting pkt 35
Retransmitting pkt 34
Retransmitting pkt 33
Retransmitting pkt 32
Retransmitting pkt 31
Retransmitting pkt 30
Retransmitting pkt 29
Retransmitting pkt 28
Retransmitting pkt 27
Retransmitting pkt 26
Retransmitting pkt 25
Retransmitting pkt 24
Retransmitting pkt 23
Retransmitting pkt 22
Retransmitting pkt 21
Retransmitting pkt 20
21 lost packet(s) for [flim.su.valinux.com] [75:97]
node flim.su.valinux.com: status up
node flim.su.valinux.com -- link eth0: status up
INFO: Running /etc/ha.d/rc.d/status status
Running /etc/ha.d/rc.d/status: status
No pkts missing from flim.su.valinux.com!

Host B
21 lost packet(s) for [louise.su.valinux.com] [19:41]
node louise.su.valinux.com: status up
node louise.su.valinux.com -- link eth0: status up
INFO: Running /etc/ha.d/rc.d/status status
No pkts missing from louise.su.valinux.com!
Running /etc/ha.d/rc.d/status: status
Retransmitting pkt 96
Retransmitting pkt 95
Retransmitting pkt 94
Retransmitting pkt 93
Retransmitting pkt 92
Retransmitting pkt 91
Retransmitting pkt 90
Retransmitting pkt 89
Retransmitting pkt 88
Retransmitting pkt 87
Retransmitting pkt 86
Retransmitting pkt 85
Retransmitting pkt 84
Retransmitting pkt 83
Retransmitting pkt 82
Retransmitting pkt 81
Retransmitting pkt 80
Retransmitting pkt 79
Retransmitting pkt 78
Retransmitting pkt 77
Retransmitting pkt 76

But ifconfig on Host A and Host B shows that both machines
have the resource (IP address).


> > As an aside. I notice that much of the manipulation of resources, and
> > reading of haresources is done by shell scripts. I am thinking that it
> > would make more sense to merge this into the heartbeat C code, which should
> > make code paths easier to track.
>
> But it is easier to add new resources or change the behavior of
> manipulation for some special services.
> I don't believe everyone who uses heartbeat will challenge into the
> sourcecode to modify something. IMHO it makes things easy.

Good point, but I would argue that with the myriad of scripts that
are run, and the environment variables that are maintained, it
is more difficult to modify the shell scripts than the C code.

--
Horms
tracking resource groups in heartbeat [ In reply to ]
On Wed, Mar 29, 2000 at 06:46:24AM -0700, Alan Robertson wrote:
> horms wrote:
> >
> > On Tue, Mar 28, 2000 at 09:35:08PM -0700, Alan Robertson wrote:
> > > The conversation below is taken from email off the list. It seemed
> > > generally interesting though...
> > >
> > > Horms wrote:
> > > > ... I have however noticed that as heartbeat keeps state of nodes, and
> > > > not resource allocations it is possible to get into a state where no
> > > > nodes/more than one node have a resource. In particular if there is a
> > > > communication medium failure, or if heartbeat is started up on more
> > > > than one node simultaneously. I have been thinking of some fairly
> > > > simple mechanisms to resolve this, vis a vis nodes requesting ownership
> > > > of a resource. I am wondering what your thoughts are. I am most
> > > > concerned about the (simple) two-node case, though something that
> > > > extends beyond that would be nice.
> > >
> > > The folks from Conectiva are doing something in a related area. In the
> > > current code, the assumption is that if the master for a resource is up,
> > > it has control of the resources it is listed as master for. They break
> > > that assumption with a new feature (nice_failover?). It would be good to
> > > add your thoughts and observations to that, and think about the right way
> > > of thinking about this stuff. Once one has the right mental model, the
> > > code is easy :-)
> >
> > It seems to me that the existing code will take control of a resource if
> > the master specified in haresources fails, but not necessarily give it up
> > when the master comes back up again.
>
> Without nice_failback (which isn't in the current code), this should not
> happen. When the master comes back up, it asks the other node to give up
> its resources, and in the case of no response takes them anyway.

> > Again, in the case of a media failure,
> > or nodes coming up at the same time both nodes may take ownership of a
> > resource and neither will give it up.
>
> I agree in the case of media failure. Have you observed it in the case
> of both nodes coming up at the same time?

I thought I had, but I can't reproduce it using the latest code.

> The current bringup sequence is: Start your own heartbeat. Wait until
> you've heard someone else's heartbeat or about 10 seconds. Begin the
> resource takeover sequence for those resources you master. If you've
> heard someone else's heartbeat, then communications with the other end
> are working. I think the problem right now is that there is no database
> indicating the state of either resources or nodes. Everything depends
> on the resource scripts to indicate resource status.

Agreed.

> Here's what I think the race condition might be: Node A is master and
> is down. Node B is also down, but is the slave. Node B comes up, and
> just about the time that it times out on Node A being down, Node A
> begins to come up. Node B times out on the resources A is primary on,
> and begins the process of taking them over. Node A comes up, and seeing
> B's heartbeat, immediately requests its resources. Node B has started
> the takeover scripts, but they aren't done, so it thinks it doesn't own
> them, so it doesn't give them up. Node A then takes them over, while
> Node B's scripts are in the process of doing the same.

It should be easy enough to resolve this by a node having tighter
control over its resources. If takeover is commenced then it has
the resource. Perhaps there needs to be a state for resource
takeover in progress, or giveup in progress, which is somewhere
between having a resource and not having it.
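
Such intermediate states could look like the following sketch (state and event names are made up for illustration):

```shell
#!/bin/sh
# Sketch of a per-resource state machine with explicit in-progress
# states, so a node that has commenced takeover still counts as an
# owner when asked about the resource.
# next_state STATE EVENT -> new state
next_state() {
    case "$1:$2" in
        free:start-takeover)        echo taking-over ;;
        taking-over:takeover-done)  echo owned ;;
        owned:start-giveup)         echo giving-up ;;
        giving-up:giveup-done)      echo free ;;
        *)                          echo "$1" ;;  # ignore invalid transitions
    esac
}
```

With states like these, the race described above might close: a node in taking-over would answer an ownership request as an owner, instead of thinking it has nothing to give up.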

> > I have attached a patch that I believe will fix this problem. If
> > nice_failover is in operation then this patch will cause both nodes to drop
> > the resource, which is bad, but they would both keep it otherwise so it is
> > problematic in either case. Also if a resource has more than one master -
> > then this patch results in resources being dropped by all nodes or no nodes,
> > depending on your haresources file. This isn't very good either but if a
> > resource has a master and a slave then it works.
>
> My guess is that we need to design a "good" bringup algorithm that has
> the right kinds of sequencing and status changes such that it doesn't
> have any race conditions. This is moderately complex, but is probably
> the better approach. I started to write one here, but found it too hard
> to write inline in email.

--
Horms
tracking resource groups in heartbeat [ In reply to ]
On Wed, Mar 29, 2000 at 10:56:54AM -0300, Luis Claudio R. Goncalves wrote:
>
> Hello!
>
> On Wed, 29 Mar 2000, Alan Robertson wrote:
>
> > My guess is that we need to design a "good" bringup algorithm that has
> > the right kinds of sequencing and status changes such that it doesn't
> > have any race conditions. This is moderately complex, but is probably
> > the better approach. I started to write one here, but found it too hard
> > to write inline in email.
>
> I've just created a "starting" message that is sent x times in the
> first 10 seconds (RQSTDELAY). If both nodes see the "start" message
> from each other, the master takes the services. If the slave is
> already active, the master stays quiet.

Do you have a patch :)

--
Horms
tracking resource groups in heartbeat [ In reply to ]
Horms wrote:
>
> On Wed, Mar 29, 2000 at 10:56:54AM -0300, Luis Claudio R. Goncalves wrote:
> >
> > Hello!
> >
> > On Wed, 29 Mar 2000, Alan Robertson wrote:
> >
> > > My guess is that we need to design a "good" bringup algorithm that has
> > > the right kinds of sequencing and status changes such that it doesn't
> > > have any race conditions. This is moderately complex, but is probably
> > > the better approach. I started to write one here, but found it too hard
> > > to write inline in email.
> >
> > I've just created a "starting" message that is sent x times in the
> > first 10 seconds (RQSTDELAY). If both nodes see the "start" message
> > from each other, the master takes the services. If the slave is
> > already active, the master stays quiet.
>
> Do you have a patch :)

This obviously needs to be fixed, *however* before we put it in, let's
walk through the "new" startup sequence, and make sure that there aren't
still some holes in it. Let's not enter into a series of new init
sequences which are better, but still not right.

The current state of affairs rarely actually exhibits the problems in
practice with properly configured systems, so we can live a bit longer
while we look for the "right" approach.

-- Alan Robertson
alanr@suse.com
tracking resource groups in heartbeat [ In reply to ]
Howdy again!

> Scenario 1
>
> Host A and Host B are not running heartbeat, nice_failback is on
> Host B is the master for a resource
> Host A and Host B start heartbeat at the same time

I'm working right now to correct the race when both machines start
at the same time. When it occurs, the master disables nice_failback
and works as it used to.

> Scenario 2
>
> Host A and Host B are running heartbeat, nice_failback is on
> Host B is the master for a resource
> Host B has the resource
> The ethernet (and serial and whatever) link between Host A and Host B is cut.
> Both machines have the resource
> The ethernet link is restored

We're also working out an idea that may solve this problem. We
thought something like this:

              drbd
[NODE A] ---------------------- [NODE B]
    |                              |
    |                              |
    | services                     | services
    |                              |
    +-------------+  +-------------+
                  |  |
            [ SWITCH/HUB ] ------------- [Reference host]

There's a network for DRBD and another for the services. If the
nodes can't reach each other you'll see a "link xxx down" message
and we will try to find out who's out...
If node A can't reach node B but can reach the reference host (which
may be the switch or another non-stop machine), B's service network
interface may be down. In this case, B won't reach A nor the
reference host... so it may mark its link as down, shut down or release
the resources (and optionally may still mirror A's disk).
If drbd's network stops, we can reconfigure drbd to use the service
network.
There are some more obvious possibilities.
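
The reference-host rule above reduces to a small decision table. A sketch, assuming a single reference host (the function name is invented):

```shell
#!/bin/sh
# Sketch of the reference-host rule: if we lose the peer but can still
# reach the reference host, the peer is presumed isolated and we keep or
# take the resources; if we can reach neither, we are the isolated one.
# partition_action REACH_PEER(yes/no) REACH_REF(yes/no) -> action
partition_action() {
    peer=$1 ref=$2
    if [ "$peer" = yes ]; then
        echo normal                # cluster links are fine
    elif [ "$ref" = yes ]; then
        echo take-resources        # peer's service interface is likely down
    else
        echo release-resources     # we're cut off: shut down or release
    fi
}
```
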

Hope this helps!

Luis

[ Luis Claudio R. Goncalves lclaudio@conectiva.com.br ]
[. BSc in Computer Science -- MSc coming soon -- Gospel User -- Linuxer ]
[. Fault Tolerance - Real-Time - Distributed Systems - IECLB - IS 40:31 ]
[. LateNite Programmer -- Jesus Is The Solid Rock On Which I Stand -- ]
tracking resource groups in heartbeat [ In reply to ]
I have been thinking of a way of getting around the problem of multiple
nodes owning a resource after a media failure. It seems that for this
problem the main issue is that though nodes are aware of whether or not
they have contact with each other, they are not necessarily aware of what
resources the other nodes currently have.

I think that nodes having omniscient knowledge of what resources are on
what nodes of the cluster is bad: too much state, too much of the time.
Rather, I am thinking of a mechanism where, if there is a chance of a
resource being in an unknown state, the status of the resource can be
requested. This might occur when node A notices node B has just come back
up. In the case of a media failure, node B would also notice that node A
has just come up.

Here is a first cut.

To find out which node owns a resource a resource-request could be sent.
This would contain the name of the resource, as well as auth, timestamp and
sequence number information.

All nodes should reply (except the originating node - presumably it knows
the status of the resources it owns).

The resource-reply should contain the resource name and status, as well as
auth, timestamp and sequence number information. In addition, information
for tie-breaking should be included: which node the replying node thinks is
the master, the time since the resource was last obtained/given up, and
perhaps a random number as a last-resort tie-breaker.

A resource-request would be sent out for each resource a node is eligible
to own (from haresources) when it sees another node come online. Once
resource-replies are received from all nodes that are up (the node should
know which nodes it thinks are up), it should be able to decide whether
to give up or take over the resource. Once this decision is made, a
resource-reply should be sent out, so all nodes can know the state of the
resource. If the state of the resource is still inconsistent (in particular,
owned more than once), then the other nodes affected should notice this and
send a fresh resource-request.
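
The tie-break could then be a deterministic sort over the reply fields, so every node computes the same winner. A sketch (the one-line-per-reply layout is invented for illustration):

```shell
#!/bin/sh
# Sketch: each resource-reply is one line on stdin:
#   node_name is_master(yes/no) hold_seconds random_tiebreak
# Preference order: the configured master first, then the longest hold
# time, then the random number as a last-resort tie-breaker.
pick_owner() {
    sort -k2,2r -k3,3nr -k4,4nr | head -n 1 | cut -d' ' -f1
}
```

Because the ordering is total and every node sees the same replies, all nodes agree on the owner without a further round of messages.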

--
Horms
tracking resource groups in heartbeat [ In reply to ]
On Wed, Mar 29, 2000 at 04:24:05PM -0300, Luis Claudio R. Goncalves wrote:
> Howdy again!
>
> > Scenario 1
> >
> > Host A and Host B are not running heartbeat, nice_failback is on
> > Host B is the master for a resource
> > Host A and Host B start heartbeat at the same time
>
> I'm working right now to correct the race when both machines start
> at the same time. When it occurs, the master disables nice_failback
> and works as it used to.
>
> > Scenario 2
> >
> > Host A and Host B are running heartbeat, nice_failback is on
> > Host B is the master for a resource
> > Host B has the resource
> > The ethernet (and serial and whatever) link between Host A and Host B is cut.
> > Both machines have the resource
> > The ethernet link is restored
>
> We're also working out an idea that may solve this problem. We
> thought something like this:
>
>               drbd
> [NODE A] ---------------------- [NODE B]
>     |                              |
>     |                              |
>     | services                     | services
>     |                              |
>     +-------------+  +-------------+
>                   |  |
>             [ SWITCH/HUB ] ------------- [Reference host]
>
> There's a network for DRBD and another for the services. If the
> nodes can't reach each other you'll see a "link xxx down" message
> and we will try to find out who's out...
> If node A can't reach node B but can reach the reference host (which
> may be the switch or another non-stop machine), B's service network
> interface may be down. In this case, B won't reach A nor the
> reference host... so it may mark its link as down, shut down or release
> the resources (and optionally may still mirror A's disk).
> If drbd's network stops, we can reconfigure drbd to use the service
> network.
> There are some more obvious possibilities.

That would help a lot, but essentially you are relying on multiple links
(or information about multiple links) to ensure that the network
doesn't become partitioned. If (somehow) the network does become
partitioned - the drbd network fails and the SWITCH/HUB fails at the
same time - then you may still be in trouble.

--
Horms
tracking resource groups in heartbeat [ In reply to ]
Hi!

> That would help a lot, but essentially you are relying on multiple links
> (or information about multiple links) to ensure that the network
> doesn't become partitioned. If (somehow) the network does become
> partitioned - the drbd network fails and the SWITCH/HUB fails at the
> same time - then you may still be in trouble.

If one can't reach the reference host, the other drbd side, and the
other cluster node, it had better shut down... :) This is
straightforward in this solution.

Luis


[ Luis Claudio R. Goncalves lclaudio@conectiva.com.br ]
[. BSc in Computer Science -- MSc coming soon -- Gospel User -- Linuxer ]
[. Fault Tolerance - Real-Time - Distributed Systems - IECLB - IS 40:31 ]
[. LateNite Programmer -- Jesus Is The Solid Rock On Which I Stand -- ]