Hi,
Sorry if this has been discussed before (I'm pretty far behind on the list),
but it doesn't seem to be resolved (by reading the code and trying it out).
So, if I'm being an idiot, tell me right away, but be gentle ;-)
Last week in Germany I finally tried out drbd with heartbeat. As far as I
can tell, the two don't really get along too well. Heartbeat thinks it's
in charge, yet drbd really needs to have more control than it is
getting.
The problem is not drbd per se, but the scripts which start it and stop it.
Since I know heartbeat pretty well, and think I understand the basics of
what drbd does, I'll go ahead and make a proposal for how the
scripts ought to work. I slept too much on the plane to write the code, but
I meant to ;-) If people think this will work, I'll write it and test it
this week.
I have clear ideas on how to do this in 3 phases. I foolishly think they
might even work ;-) I asked Marius to go over this with me, and convinced
him it would work too.
drbd has two scripts:
  - An init script to activate the drbd service, and either become secondary
    or prepare to become primary.
  - A heartbeat start/stop script which instructs drbd to become primary,
    or to drop its primary status and become secondary.
I'll go through three phases in this discussion, for various levels of
solution. They are in order of complexity of implementation.
The first phase will *only* work with nice_failback in heartbeat.
The second phase will work with or without nice_failback.
The third phase will handle two servers going down at once and both coming
back up. It will not handle both going down, and only one coming back up.
I use these abbreviations for the state when you ask drbd:
NC: No connection to the other side
PRI: Other side is primary
SEC: Other side is secondary
I call this state of the other side "OtherState" in the text below.
For all phases, one needs to add an option to the init script where some
human being can force a machine to become primary. /etc/rc.d/init.d/drbd
primary! or something like that. This option sets the State := GOOD.
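As a minimal sketch, the "primary!" option might look like the following.
The state-file path and the force_primary helper are assumptions of mine,
and the actual DRBD command is left as a comment because it depends on the
DRBD version in use:

```shell
# Hypothetical sketch of the "primary!" init-script option.
STATEFILE="${STATEFILE:-/var/lib/drbd/state}"

force_primary() {
    # A human has declared this node's copy of the data authoritative.
    echo GOOD > "$STATEFILE"
    # drbdsetup /dev/nb0 primary    # placeholder; version-dependent
}
```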
==== First phase:=============================================
This phase requires a state file with one of three possible states stored in
it:
GOOD: we have a good copy of the data
BAD: we have no good data
SYNC: we are synchronizing with the other side right now.
I use the variable name "State" to refer to this state of our local data.
It need not persist between reboots.
INIT SCRIPT "start" logic:-------:
OtherState == NC:  State := BAD. GET HUMAN HELP. This is the case where
                   we cannot start because we don't have good data, and
                   we can't get it from the other side.
OtherState == PRI: State := SYNC. This is the case where the other side is
                   primary, and we're just coming up.
                   Try a quick sync.
                   If the quick sync succeeds,
                       State := GOOD
                   else try a full sync:
                       monitor the full sync in the background;
                       if it fails, State := BAD;
                       when it succeeds, State := GOOD.
OtherState == SEC: State := BAD. GET HUMAN HELP. This is the case where
                   neither of us knows it has good data. We need some
                   human to appoint one of us primary before we can
                   continue. This happens after both machines crash
                   and then come back up.
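The init-script "start" logic above could be sketched in shell like this.
drbd_other_state, quick_sync, full_sync, and get_human_help are all
hypothetical helpers I'm making up for illustration, not real DRBD
commands; set_state writes the state file described earlier:

```shell
# Sketch of the phase-1 init-script "start" logic; helper names are
# placeholders, not real DRBD commands.
STATEFILE="${STATEFILE:-/var/lib/drbd/state}"
set_state() { echo "$1" > "$STATEFILE"; }

drbd_init_start() {
    case "$(drbd_other_state)" in
      NC)
        set_state BAD
        get_human_help "no peer connection and no known-good local data"
        ;;
      PRI)
        set_state SYNC
        if quick_sync; then
            set_state GOOD
        else
            # Fall back to a full sync, monitored in the background.
            ( if full_sync; then set_state GOOD; else set_state BAD; fi ) &
        fi
        ;;
      SEC)
        set_state BAD
        get_human_help "both sides came up secondary; a human must pick one"
        ;;
    esac
}
```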
STARTUP SCRIPT logic:-------:
OtherState == NC or SEC: if State == GOOD, force the primary role;
                         else GET HUMAN HELP.
OtherState == PRI:       GET HUMAN HELP. This means the other side
                         is still trying to own the resource too.
                         This shouldn't happen.
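In shell, the phase-1 startup script might look like the sketch below.
become_primary, drbd_other_state, and get_human_help are hypothetical
helpers; the real "force primary" command depends on the DRBD version:

```shell
# Sketch of the phase-1 heartbeat start script; helper names are
# placeholders for illustration only.
STATEFILE="${STATEFILE:-/var/lib/drbd/state}"

hb_start() {
    state=$(cat "$STATEFILE")
    case "$(drbd_other_state)" in
      NC|SEC)
        if [ "$state" = GOOD ]; then
            become_primary
        else
            get_human_help "asked to take over without good data"
        fi
        ;;
      PRI)
        # Should not happen with nice_failback: the peer still
        # claims the resource.
        get_human_help "other side is still primary"
        ;;
    esac
}
```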
==== Second phase:=============================================
Same as first phase EXCEPT FOR
STARTUP SCRIPT logic:-------:
OtherState == NC or SEC: if State == GOOD, force primary role
else GET HUMAN HELP {same as before}
OtherState == PRI: If State == SYNC:
wait for sync to complete then
go on. If it fails GET HUMAN HELP
If State == BAD:
GET HUMAN HELP
                         If State == GOOD: send the other side a message
                             asking it to become secondary, then
                             force the local side into primary.
                             Ideally this would be done with a DRBD
                             command, but I'm not sure it can be.
                             Failing that, use the local cluster
                             manager API to send the message.
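The phase-2 startup logic could be sketched as follows. wait_for_sync,
ask_peer_to_become_secondary, become_primary, drbd_other_state, and
get_human_help are all hypothetical helpers; whether the peer message
goes through DRBD itself or the cluster manager API is an open question,
as noted above:

```shell
# Sketch of the phase-2 heartbeat start script; helper names are
# placeholders for illustration only.
STATEFILE="${STATEFILE:-/var/lib/drbd/state}"

hb_start_phase2() {
    state=$(cat "$STATEFILE")
    case "$(drbd_other_state)" in
      NC|SEC)
        if [ "$state" = GOOD ]; then become_primary
        else get_human_help "no good local data"; fi
        ;;
      PRI)
        case "$state" in
          SYNC)
            # Wait out an in-progress sync before taking over.
            if wait_for_sync; then become_primary
            else get_human_help "sync failed"; fi
            ;;
          BAD)
            get_human_help "peer is primary and local data is bad"
            ;;
          GOOD)
            # Ask the peer to step down, then take over ourselves.
            ask_peer_to_become_secondary
            become_primary
            ;;
        esac
        ;;
    esac
}
```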
==== Third phase:=============================================
For the third phase, one needs permanent generation tuples. They are an
ordered pair {manual, auto}. They must persist across reboots.
The manual number is incremented every time that a node is forced to become
master manually. When this happens, the auto number is reset to 1. The
auto element of the tuple is incremented every time a node becomes primary.
The slave keeps the same generation tuple as the primary. It only
increments it when it takes over from the primary.
There is a ">" relation on generation tuples that compares the
elements in (manual, auto) order: (1,5) > (1,4); (2,1) > (1,4); etc.
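The ordering is just a lexicographic comparison of the two elements,
which is small enough to sketch directly (gen_gt is a name I'm making up):

```shell
# Strictly-greater test on (manual, auto) generation tuples:
# compare manual numbers first, then auto numbers.
gen_gt() {  # usage: gen_gt MAN1 AUTO1 MAN2 AUTO2
    if [ "$1" -gt "$3" ]; then return 0; fi
    if [ "$1" -eq "$3" ] && [ "$2" -gt "$4" ]; then return 0; fi
    return 1
}
```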
This technique either requires rewriting DRBD to accommodate these
generation numbers, or writing code which uses the local cluster manager API
to send these messages around.
In an ideal world these numbers would be stored inside the partition,
because then they would not get confused when disks get replaced, etc. It
would be nice for DRBD to support them in this way at least as an option.
This would make it more bullet-proof in the real world.
On to how they're used...
The scripts above deal with every case except this one:
INIT SCRIPT "start" logic:-------:
OtherState == SEC:
In this case, the two sides exchange generation tuples, and if
one of them has a higher number, it changes its state to GOOD,
and the other one does a full sync from the GOOD side. When
it completes, the secondary also marks its state to GOOD.
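This exchange might be sketched like so. local_generation, peer_generation,
and full_sync_from_peer are hypothetical helpers (how the tuples actually
travel between the nodes is exactly the open DRBD-vs-cluster-manager
question above), and I'm arbitrarily letting the peer win ties:

```shell
# Sketch of the phase-3 "both sides secondary" case: the node with the
# larger generation tuple marks itself GOOD; the other pulls a full copy.
STATEFILE="${STATEFILE:-/var/lib/drbd/state}"

init_start_both_secondary() {
    # Tuples are "MANUAL AUTO" strings, e.g. "2 1".
    set -- $(local_generation); lman=$1 lauto=$2
    set -- $(peer_generation);  pman=$1 pauto=$2
    if [ "$lman" -gt "$pman" ] ||
       { [ "$lman" -eq "$pman" ] && [ "$lauto" -gt "$pauto" ]; }; then
        echo GOOD > "$STATEFILE"            # our copy wins
    else
        # The peer has the newer generation: take a full copy from it.
        if full_sync_from_peer; then
            echo GOOD > "$STATEFILE"
        else
            echo BAD > "$STATEFILE"
        fi
    fi
}
```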
There's a little more logic to making this work with heartbeat and making
sure the side which wants to be heartbeat master is also the DRBD master. I
leave these details as an exercise to the reader :-)
-- Alan Robertson
alanr@example.com