Mailing List Archive

Pseudo-quorum and nice failback: Was: The nice nice_failback :)
Marcelo Tosatti wrote:
>
> On Mon, 17 Apr 2000, Alan Robertson wrote:
>
> > Horms wrote:
> > >
> > > On Mon, Apr 17, 2000 at 09:39:47AM -0600, Alan Robertson wrote:
> > > > "Luis Claudio R. Goncalves" wrote:
> > > > >
> > > > > Hello!
> > > > >
> > > > > > I think (I *hope*, to be honest) this is the nicest nice failback
> > > > > > patch I ever did. It adds some features that I'll extend next Monday,
> > > > > > like periodic resources_held messages and so on.
> > > > > > I'd challenge the brave ones to test this code (I'm still hard-testing
> > > > > > it). If someone survives and gives me some feedback, I'll put it in
> > > > > > CVS on Monday :)
> > > >
> > > > Sorry I was out of commission for a few days, so this is coming late,
> > > > and sounds like a broken record. I am generally opposed to putting
> > > > support of resources into heartbeat, particularly if they restrict
> > > > things to only two machines.
> > >
> > > The problem is that heartbeat as it stands has a _serious_ flaw.
> >
> > Agreed.
> >
> > > If all links fail then resources become owned by more than one
> > > machine and will not be relinquished once links are re-established.
> >
> > Yes, but a better way might be to have pseudo-quorum based on the
> > reachability of something like a router or switch or hub. And do
> > something I'll outline below, ALSO.
> That was exactly our idea when we started to help with heartbeat.
> The "FIXME: do something useful here" on my patch is there because we
> need scripts to "do" the pseudo-quorum. Luis already explained this
> in a past message.
> (http://lists.tummy.com/pipermail/linux-ha-dev/2000-March/000460.html)

I knew it had been discussed, but forgot when. If you do this, then
doesn't this largely solve the problem?

I believe that the low-level protocol is capable of noticing the joining
together of two independent clusters, and activating a special "oops"
case when it detects that the systems have rejoined. This could be the
audit script we've talked about. This *can* happen with perverse enough
failures even with a pseudo-quorum device. For those cases where a
pseudo-quorum device isn't configured, it is much more likely.

> > > At the moment the hack is to have as many links used for heartbeat
> > > communication as possible and hope that you never run into a situation
> > > where nodes lose communication with each other and yet are fully
> > > functional. This is in my opinion an acceptable situation in the short
> > > term, as in the case of 2 nodes a serial link should give you a
> > > fair amount of security against all links failing.
> >
> > I don't think I'd call it a "hack", and would recommend it even if it
> > didn't help solve the problem. But, this is not to fundamentally
> > disagree with your assessment.
> >
> > > To my mind, the best way to get around this problem is to have
> > > nodes keep track of resources internally. When nodes change state they can
> > > check to see if a resource they have - or can potentially have - is owned by
> > > any other node on the network. Without this, assumptions have to be made
> > > about a node being accessible meaning that given resources are accessible.
> > > Especially in the case where there is no master for a resource, there
> > > is no way to make such assumptions without the possibility of situations
> > > where either resources are duplicated or disappear off the network.
> >
> > With drbd (for example), you MUST NEVER have both sides have the mirror
> > mounted read-write simultaneously, so your solution is insufficient for
> > this. The only way I know to handle this is to follow Stephen's
> > suggestion of having a pseudo-quorum resource that you have to "own" in
> > order to own the master side of the mirror. It should work like this:
> > If you can reach the hub and you can't reach the master, then you may
> > take over the drbd resource.
> > If you can't reach the hub, then you should probably shut down, and
> > await its becoming available again.
> >
> > This will fail to work in the following very unlikely situation:
> > Both sides can reach the hub/switch/router,
> > Neither side can talk to the other (including via alternate paths)
> Then both sit_and_cry().

No. Because each can reach the pseudo-quorum device, neither sits and
cries. This is a very unlikely failure, but not impossible. It would
require a particular type of failure inside the pseudo-quorum
hub/router, and also a simultaneous failure of the redundant heartbeat
link.
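
A minimal sketch of the pseudo-quorum rule quoted above, written from the
point of view of the standby node. The names quorum_decision, Action and
can_reach_peer are illustrative assumptions and are not part of heartbeat.
It only encodes the two stated rules (no hub: shut down; hub but no
master: take over).

```python
from enum import Enum


class Action(Enum):
    STAY_STANDBY = "stay standby"
    TAKE_OVER = "take over"
    SHUT_DOWN = "shut down"


def quorum_decision(can_reach_hub: bool, can_reach_peer: bool) -> Action:
    """Decide what a standby node does under the pseudo-quorum rule.

    can_reach_hub: the pseudo-quorum device (hub/switch/router) answers.
    can_reach_peer: the current master answers on some heartbeat link.
    """
    if not can_reach_hub:
        # Without the quorum device the node cannot serve anyone anyway.
        return Action.SHUT_DOWN
    if not can_reach_peer:
        # Hub reachable, master silent: the node may take over.
        return Action.TAKE_OVER
    return Action.STAY_STANDBY
```

In the double-failure case the standby gets TAKE_OVER, while the master
also still reaches the hub and so never shuts down. That is how both
sides can end up holding the resource.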

> > [This is at least a double failure]
> >
> > This also solves another important problem:
> > A side staying up when it can't serve its customers.
> >
> > Pardon me if I've forgotten, but does this solve the same problems as
> > you're trying to solve?

> Horms?


Let me make a proposal, and see if anyone is interested in implementing
pieces of it:

Someone (maybe me) should implement the code for detecting cluster
merge. We should activate an external script when it is
discovered. Name and arguments to be determined.

We ought to implement pseudo-quorum. I'm open as to the details.
Thoughts include:

A new resource type ping-quorum::135.9.214.51,
and make the takeover scripts actually check the
return code of one resource before taking over others :-)
In the current implementation, you'd have to list it last
on a line.

Including the ping resource as a pseudo-host, and have
it execute the status script whenever it comes and
goes. This would be a bit of a kludge, but not
so bad. You'd mangle ping responses to make
received packets. Not really *that* bad...
Something special would have to be done when
it came and went... I suppose giving up all
resources when it disappears, and getting them
back when it comes back.

Do it externally with a cron job and stop and start
heartbeat.

Are these complete?
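
A rough sketch of the first option (a ping-quorum resource whose return
code gates the rest of the takeover). Everything here is an assumption
made for illustration: the function names, the use of the Linux iputils
flags "-c 1 -W <seconds>", and the idea that the quorum check is simply
the first entry processed. It does not model the ordering within a
haresources line mentioned above.

```python
import subprocess
from typing import Callable, List, Tuple


def ping_quorum(address: str, timeout_s: int = 2) -> bool:
    """Return True if a single ICMP echo to address is answered.

    Assumes a Linux iputils-style ping (-c count, -W timeout in seconds).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def take_over_group(resources: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Acquire resources in order, stopping at the first one that fails.

    Each entry is (name, acquire), where acquire() returns True on success.
    Returns the names that were actually acquired.
    """
    acquired: List[str] = []
    for name, acquire in resources:
        if not acquire():
            # A failed check (for example, no quorum) blocks everything after it.
            break
        acquired.append(name)
    return acquired
```

A group would then look something like
[("ping-quorum", lambda: ping_quorum("135.9.214.51")), ("ip", acquire_ip)],
where acquire_ip is a hypothetical callable for the real resource.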

Now, in this context, it seems to me that the nice_failback still has to
worry about whether the other side has any of the resources (which may
be where we started this conversation). You could always add a message
type and ask... The only difference between nice_failback and normal
is if you ask whether the other side has the resources, or if you just
take them over anyway.

If you don't want to add any new message types, you could always
implement nice failback as a case where the side coming back up (the
"natural" master) gets a "no" response when it asks for the resources
from the other side. You could then even make nice_failback a special
resource, so that the nice-failback property is then a property of the
group, not the whole configuration. When asked to give up any group
with the nice-failback resource in it, the far end machine always says
"no".

Sorry to send things so far afield... But this does keep nice_failback
and resource handling in general outside the core code...
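
To make the group-level nice_failback idea concrete, here is a small
sketch of how the far end might answer a request to give up a group.
The marker name "nice_failback" as a resource and the function
answer_give_up_request are hypothetical, and the "OK"/"NO" strings stand
in for whatever the real reply would carry.

```python
from typing import Iterable

# Hypothetical marker resource; its presence makes the whole group "sticky".
NICE_FAILBACK = "nice_failback"


def answer_give_up_request(group: Iterable[str]) -> str:
    """Return "NO" if the group contains the nice_failback marker, else "OK"."""
    return "NO" if NICE_FAILBACK in set(group) else "OK"
```

With this shape, nice failback is decided per resource group instead of
per configuration: a group without the marker is handed back as before.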

-- Alan Robertson
alanr@suse.com
Pseudo-quorum and nice failback: Was: The nice nice_failback :)
On Mon, Apr 17, 2000 at 12:10:10PM -0600, Alan Robertson wrote:
> Marcelo Tosatti wrote:
> > > > > Sorry I was out of commission for a few days, so this is coming
> > > > > late, and sounds like a broken record. I am generally opposed to
> > > > > putting support of resources into heartbeat, particularly if they
> > > > > restrict things to only two machines.
> > > >
> > > > The problem is that heartbeat as it stands has a _serious_ flaw.
> > >
> > > Agreed.
> > >
> > > > If all links fail then resources become owned by more than one
> > > > machine and will not be relinquished once links are re-established.
> > >
> > > Yes, but a better way might be to have pseudo-quorum based on the
> > > reachability of something like a router or switch or hub. And do
> > > something I'll outline below, ALSO.
> > That was exactly our idea when we started to help with heartbeat. The
> > "FIXME: do something useful here" on my patch is there because we need
> > scripts to "do" the pseudo-quorum. Luis already explained this in a
> > past message.
> > (http://lists.tummy.com/pipermail/linux-ha-dev/2000-March/000460.html)
>
> I knew it had been discussed, but forgot when. If you do this, then
> doesn't this largely solve the problem?
>
> I believe that the low-level protocol is capable of noticing the joining
> together of two independent clusters, and activating a special "oops"
> case when it detects that the systems have rejoined. This could be the
> audit script we've talked about. This *can* happen with perverse enough
> failures even with a pseudo-quorum device. For those cases where a
> pseudo-quorum device isn't configured, it is much more likely.
>
> > > > At the moment the hack is to have as many links used for heartbeat
> > > > communication as possible and hope that you never run into a
> > > > situation where nodes lose communication with each other and yet
> > > > are fully functional. This is in my opinion an acceptable situation
> > > > > in the short term, as in the case of 2 nodes a serial link should
> > > > give you a fair amount of security against all links failing.
> > >
> > > I don't think I'd call it a "hack", and would recommend it even if it
> > > didn't help solve the problem. But, this is not to fundamentally
> > > disagree with your assessment.
> > >
> > > > To my mind to get around this problem the best way forward is to
> > > > have nodes keep track of resources internally, when nodes change
> > > > > state they can check to see if a resource they have - or can potentially
> > > > > have - is owned by any other node on the network. Without this,
> > > > assumptions have to be made about a node being accessible meaning
> > > > that given resources are accessible. Especially in the case where
> > > > there is no master for a resource, there is no way to make such
> > > > assumptions without the possibility of situations where either
> > > > resources are duplicated or disappear off the network.
> > >
> > > With drbd (for example), you MUST NEVER have both sides have the
> > > mirror mounted read-write simultaneously, so your solution is
> > > insufficient for this. The only way I know to handle this is to
> > > follow Stephen's suggestion of having a pseudo-quorum resource that
> > > you have to "own" in order to own the master side of the mirror. It
> > > should work like this: If you can reach the hub and you can't reach
> > > the master, then you may take over the drbd resource. If you can't
> > > > reach the hub, then you should probably shut down, and await its
> > > becoming available again.
> > >
> > > This will fail to work in the following very unlikely situation: Both
> > > sides can reach the hub/switch/router, Neither side can talk to the
> > > other (including via alternate paths)
> > Then both sit_and_cry().
>
> No. Because each can reach the pseudo-quorum device, neither sits and
> cries. This is a very unlikely failure, but not impossible. It would
> require a particular type of failure inside the pseudo-quorum hub/router,
> and also a simultaneous failure of the redundant heartbeat link.
>
> > > [This is at least a double failure]
> > >
> > > This also solves another important problem: A side staying up when it
> > > > can't serve its customers.
> > >
> > > Pardon me if I've forgotten, but does this solve the same problems as
> > > you're trying to solve?
>
> > Horms?


This should solve the problem. If nodes lose communication with each other
and, we assume, with the pseudo-quorum device, then no node should hold any
resources. Once communication is re-established, the startup protocol
would come into play, resulting in resources being owned by exactly one
node.

Of course, if there is ever a situation where multiple nodes have a resource
- if they lose communication with each other but not with the pseudo-quorum
device - then there will be no way to re-establish equilibrium without
a failure of some sort. Granted, this is a rather unlikely situation, but
I still think that resource management would help us here.

> Let me make a proposal, and see if anyone is interested in implementing
> pieces of it:
>
> Someone (maybe me) should implement the code for detecting cluster
> merge. We should activate an external script when it is
> discovered. Name and arguments to be determined.
>
> We ought to implement pseudo-quorum. I'm open as to the details.
> Thoughts include:
>
> A new resource type ping-quorum::135.9.214.51, and make the
> takeover scripts actually check the return code of one
> resource before taking over others :-) In the current
> implementation, you'd have to list it last on a line.

I think that this option sits most comfortably with me.

> Including the ping resource as a pseudo-host, and have it
> execute the status script whenever it comes and goes. This
> would be a bit of a kludge, but not so bad. You'd mangle
> ping responses to make received packets. Not really *that*
> bad... Something special would have to be done when it
> came and went... I suppose giving up all resources when it
> disappears, and getting them back when it comes back.
>
> Do it externally with a cron job and stop and start
> heartbeat.
>
> Are these complete?
>
> Now, in this context, it seems to me that the nice_failback still has to
> worry about whether the other side has any of the resources (which may be
> where we started this conversation). You could always add a message type
> and ask... The only difference between nice_failback and normal is if
> you ask whether the other side has the resources, or if you just take
> them over anyway.

I don't follow this, if you take over a resource (that has no master)
without checking to see if it is already owned by a node then aren't you
going to end up with two nodes owning the resource?

> If you don't want to add any new message types, you could always
> implement nice failback as a case where the side coming back up (the
> "natural" master) gets a "no" response when it asks for the resources
> from the other side. You could then even make nice_failback a special
> resource, so that the nice-failback property is then a property of the
> group, not the whole configuration. When asked to give up any group with
> the nice-failback resource in it, the far end machine always says "no".

I'm completely lost now. If it asks for the resource, don't we need
a new message type to do the asking?


--
Horms
Pseudo-quorum and nice failback: Was: The nice nice_failback :)
Horms wrote:
>
> On Mon, Apr 17, 2000 at 12:10:10PM -0600, Alan Robertson wrote:

> > it seems to me that the nice_failback still has to
> > worry about whether the other side has any of the resources (which may be
> > where we started this conversation). You could always add a message type
> > and ask... The only difference between nice_failback and normal is if
> > you ask whether the other side has the resources, or if you just take
> > them over anyway.
>
> I don't follow this, if you take over a resource (that has no master)
> without checking to see if it is already owned by a node then aren't you
> going to end up with two nodes owning the resource?
>
> > If you don't want to add any new message types, you could always
> > implement nice failback as a case where the side coming back up (the
> > "natural" master) gets a "no" response when it asks for the resources
> > from the other side. You could then even make nice_failback a special
> > resource, so that the nice-failback property is then a property of the
> > group, not the whole configuration. When asked to give up any group with
> > the nice-failback resource in it, the far end machine always says "no".
>
> I'm completely lost now. If it asks for the resource, don't we need
> a new message type to do the asking?

There is already a message type which says "give me the resources". It
just always assumes that it will get an OK from the other end. If you
change the semantics of the message slightly, then it would be
permissible to respond with "no-way-jose", in which case it should just
say "oh, well", and go on and *not* take the resources over.

Current scenario:

"natural master" sends "ip-request", and ALWAYS gets back either an
"ip-request-resp" with "ok=OK", or nothing (timeout).

New scenario:

"natural master" sends "ip-request", and gets back nothing (timeout)
or an "ip-request-resp" message with ok=OK or ok=NO.
If it gets a timeout or ok=OK, then it initiates takeover.

If it gets ok=NO, then it doesn't take over the resource.

The current protocol has the possibility of refusal implicitly designed
in, but the code implementing it doesn't know what to do if it gets a
refusal.

There is currently an issue if giving up a resource takes "too long":
the requester times out and takes over the resource, assuming the other
side is dead. This is neither fixed nor made worse by this change.
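
The requester side of the new scenario can be sketched as a single
decision on the reply. This is an illustration under stated assumptions,
not heartbeat code: should_take_over is a made-up name, the reply is
modeled as the value of the "ok" field, and a timeout is modeled as None.

```python
from typing import Optional


def should_take_over(ok_field: Optional[str]) -> bool:
    """Decide whether the "natural master" takes over after "ip-request".

    ok_field is None on timeout, otherwise the "ok" value carried by the
    "ip-request-resp" message ("OK" or "NO").
    """
    if ok_field is None:
        # Timeout: the other side is presumed dead, so take over.
        return True
    if ok_field == "OK":
        return True
    if ok_field == "NO":
        # Refusal: leave the resources where they are.
        return False
    raise ValueError(f"unexpected ok value: {ok_field!r}")
```

Note that a slow give-up and a dead peer both show up as None here,
which is the "too long" issue mentioned above.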


-- Alan Robertson
alanr@suse.com
Pseudo-quorum and nice failback: Was: The nice nice_failback :)
Howdy again!

On Tue, 18 Apr 2000, Alan Robertson wrote:
...
> New scenario:
>
> "natural master" sends "ip-request", and gets back nothing (timeout)
> or an "ip-request-resp" message with ok=OK or ok=NO.
> If it gets a timeout or ok=OK, then it initiates takeover.
>
> If it gets ok=NO, then it doesn't take over the resource.
>
> The current protocol has the possibility of refusal implicitly designed
> in, but the code implementing it doesn't know what to do if it gets a
> refusal.

It sounds good and may solve the startup race, in a nice fashion.
I'll take a look at this one.

Luis
[ Luis Claudio R. Goncalves lclaudio@conectiva.com.br ]
[. BSc in Computer Science -- MSc coming soon -- Gospel User -- Linuxer ]
[. Fault Tolerance - Real-Time - Distributed Systems - IECLB - IS 40:31 ]
[. LateNite Programmer -- Jesus Is The Solid Rock On Which I Stand -- ]
Pseudo-quorum and nice failback: Was: The nice nice_failback :)
On Tue, Apr 18, 2000 at 07:37:13AM -0600, Alan Robertson wrote:
> > > If you don't want to add any new message types, you could always
> > > implement nice failback as a case where the side coming back up (the
> > > "natural" master) gets a "no" response when it asks for the resources
> > > from the other side. You could then even make nice_failback a special
> > > resource, so that the nice-failback property is then a property of the
> > > group, not the whole configuration. When asked to give up any group with
> > > the nice-failback resource in it, the far end machine always says "no".
> >
> > I'm completely lost now. If it asks for the resource, don't we need
> > a new message type to do the asking?
>
> There is already a message type which says "give me the resources". It
> just always assumes that it will get an OK from the other end. If you
> change the semantics of the message slightly, then it would be
> permissible to respond with "no-way-jose", in which case it should just
> say "oh, well", and go on and *not* take the resources over.

Ok, I'm with you now. I guess I should just go away and read the code more
thoroughly :)

> Current scenario:
>
> "natural master" sends "ip-request", and gets back "ip-request-resp"
> ALWAYS with "ok=OK", or getting nothing back (timeout).
>
> New scenario:
>
> "natural master" sends "ip-request", and gets back nothing (timeout)
> or an "ip-request-resp" message with ok=OK or ok=NO.
> If it gets a timeout or ok=OK, then it initiates takeover.
>
> If it gets ok=NO, then it doesn't take over the resource.
>
> The current protocol has the possibility of refusal implicitly designed
> in, but the code implementing it doesn't know what to do if it gets a
> refusal.
>
> There is currently an issue if giving up a resource takes "too long".
>
> Then it takes over the resource assuming the other side is dead. This
> is not fixed nor made worse by this change.

As an aside, I believe there is also an issue if an active resource
is removed from the haresources file: the resource won't be deactivated
unless heartbeat is stopped and started.

--
Horms