Mailing List Archive

Nice_failback in heartbeat and drbd.
Hi,

This is just a note on what nice_failback is, and why it works so nicely
with drbd (DRBD was one of the main reasons nice_failback was added).

In heartbeat's old mode, a machine is designated the "natural master" of a
given resource, like drbd. This means that whenever that machine is up, it
takes over as the master of that resource.

In nice_failback mode, a resource transitions to another machine only when
the machine currently providing the resource goes down.
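As a sketch of how this is set up (directive and script names are from my
recollection of that era's heartbeat and drbd; the node name, IP address and
device are made up for illustration), the mode is chosen in /etc/ha.d/ha.cf,
and the "natural master" is whichever node is named first on the resource
line in /etc/ha.d/haresources:

```
# /etc/ha.d/ha.cf -- select the newer takeover behaviour
nice_failback on

# /etc/ha.d/haresources -- node1 is the natural master of these resources
node1 datadisk::drbd0 10.0.0.10 nfsserver
```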

When possible, it is better for drbd to use nice_failback takeover for the
following reasons:

When a machine comes back up, it has to resync, which is a potentially
expensive operation. Any failback has to wait until that resync completes
before it can occur. Nice_failback makes fewer transitions (failovers) than
the normal mode does, so there's less of this going on.

Making drbd stop one end from being master and forcing it to be a slave,
and vice versa on the other end, is messy and complicated, particularly
if the slave isn't yet in sync (usually the case just after it comes back up).
This doesn't *ever* happen with nice_failback. When the transition occurs,
either you already have good data, and you can just fail over, or you don't
and you can't get it at all. Either way, it's easy.
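The "fewer transitions" point can be illustrated with a toy model. This is
just a sketch, not heartbeat code; the two-node setup, event list, and the
`transitions` helper are all invented for illustration:

```python
# Count resource transitions for the two failback policies over a
# sequence of node up/down events.  Node "A" is the natural master.

def transitions(events, nice_failback):
    """events: list of (node, 'up'|'down'); returns how many times
    the resource moves between the two nodes."""
    up = {"A": True, "B": True}
    holder = "A"                      # A starts out holding the resource
    moves = 0
    for node, state in events:
        up[node] = (state == "up")
        if not up[holder]:            # current holder died: fail over
            other = "B" if holder == "A" else "A"
            if up[other]:
                holder = other
                moves += 1
        elif not nice_failback and node == "A" and state == "up" and holder != "A":
            holder = "A"              # old mode: natural master grabs it back
            moves += 1
    return moves

evts = [("A", "down"), ("A", "up")]   # A crashes, then rejoins
print(transitions(evts, nice_failback=False))  # old mode: 2 moves (over and back)
print(transitions(evts, nice_failback=True))   # nice_failback: 1 move
```

In the old mode, A's return forces a second transition (and hence a wait for
resync); with nice_failback the resource simply stays on B.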

So, it is fair to say that drbd and nice_failback get along pretty well in
some conceptual sense.

However, the startup scripts assume that one machine is going to be the
master whenever it comes up. Of course, with nice_failback that doesn't
happen. What I had written as Phase I is with nice_failback, and that's why
it's simpler.

-- Alan Robertson
alanr@example.com
Re: Nice_failback in heartbeat and drbd.
> However, the startup scripts assume that one machine is going to be the
> master whenever it comes up. Of course, with nice_failback that doesn't
> happen. What I had written as Phase I is with nice_failback, and that's
> why it's simpler.

Look at the CVS version of the script; one of the modes uses persistent
state information. Can it be modified to match heartbeat's nice_failback
needs?

I was assuming that heartbeat wanted total control of the resource;
was I wrong?

Is a hybrid mode between "force" and "restore" needed?

Thomas
Re: Nice_failback in heartbeat and drbd.
Thomas Mangin wrote:
>
> > However, the startup scripts assume that one machine is going to be the
> > master whenever it comes up. Of course, with nice_failback that doesn't
> > happen. What I had written as Phase I is with nice_failback, and that's
> > why it's simpler.
>
> Look at the CVS version of the script; one of the modes uses persistent
> state information. Can it be modified to match heartbeat's nice_failback
> needs?

I picked up the CVS tree, but haven't yet run across the persistent
information store. I'll look for it specifically.

> I was assuming that heartbeat wanted total control of the resource;
> was I wrong?

It *thinks* it is in control of the resource, BUT it does not know whether
what it is asking for is possible (or reasonable). Heartbeat does not
understand drbd's data integrity constraints, so it cannot be guaranteed to
do the right thing from drbd's perspective. I'm writing another email about
this.
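To make that concrete, the kind of check that only the drbd side can make
might look like this sketch. The function name and the /proc/drbd status
strings below are assumptions for illustration, not any particular drbd
release's exact format:

```python
# Hypothetical guard a resource script could apply before letting
# heartbeat promote this node to master.

def safe_to_promote(proc_drbd_line):
    """Return True only if the local replica's data looks usable.
    The status-field spellings here are assumed, not taken from
    a specific drbd version."""
    return "ds:Consistent" in proc_drbd_line or "UpToDate" in proc_drbd_line

print(safe_to_promote("0: cs:Connected st:Secondary/Primary ds:Consistent"))
print(safe_to_promote("0: cs:SyncTarget st:Secondary/Primary ds:Inconsistent"))
```

Heartbeat itself has no way to make this judgment; it has to live in (or
near) the drbd resource script.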
>
> Is a hybrid mode between "force" and "restore" needed?

Guess I need to understand this better to comment.

-- Alan Robertson
alanr@example.com
Re: Nice_failback in heartbeat and drbd.
* Alan Robertson <alanr@example.com> [001121 16:16]:
> [...]
> So, it is fair to say that drbd and nice_failback get along pretty well in
> some conceptual sense.
>

I totally agree with you, and I have a wish:

When I use nice_failback I want a command that allows
me to trigger a failback manually.
Something like:

heartbeat_ctl failback 123.234.345.456

...will bring the service on IP address 123.234.345.456 back to its
home node.

Is it possible to do this in heartbeat ?

-Philipp
Re: Nice_failback in heartbeat and drbd.
Philipp Reisner wrote:
>
> * Alan Robertson <alanr@example.com> [001121 16:16]:
> > [...]
>
> I totally agree with you, and I have a wish:
>
> When I use nice_failback I want a command that allows
> me to trigger a failback manually.
> Something like:
>
> heartbeat_ctl failback 123.234.345.456
>
> ...will bring the service on IP address 123.234.345.456 back to its
> home node.
>
> Is it possible to do this in heartbeat ?

Not now. It would be if you wrote the code ;-) I'm now accepting patches
;-)

I should put this on the TODO.

-- Alan Robertson
alanr@example.com