This is similar to a discussion we had a few months ago on the main HA
list. I didn't cross-post here, and I don't think Phillip was reading the
list at the time.
You cannot use configuration files to tell you who should be master and
who should be slave, because you have to determine at run time who has a
good copy of the data. This is not something you can configure statically.
The following portion of the discussion concerns the cases where at least
one machine stays up all the time - in other words, no double failures.
Double failures will be discussed later on...
If machine A is up and is master, it has the "good bits". If B is up and is
fully synced to "A", then it also has the "good bits". If either machine
goes down, then either can continue on because it has the "good bits".
If "A" goes down, and "B" takes over, then "A" doesn't have the "good bits"
any more and is ineligible to take over the service even if it comes back
up. If it syncs from "B", then it can take over, because it has the "good
bits" again.
The rule is that whenever a machine comes up automatically, it may only
sync from the other side; only after a successful sync is its data state
good, and only then can it take over.
It is *VITAL* for the scripts (or someone) to track this state, so that
false takeovers don't happen and bad data isn't used.
Heartbeat may *instruct* DRBD to take over, but drbd cannot do that just
because heartbeat ordered it to - it may not have any good data. Heartbeat
has no way of knowing that. It just wants to always bring the service up,
whether drbd is able to or not. Heartbeat is NOT authoritative in this
matter ;-) It has no idea whether what it's asking you to do is reasonable
or possible.
This is why my proposal has a state file for data integrity with three
states: "good", "bad" and "sync". "Good" means you have a copy of the
"good bits", "bad" means you don't, and "sync" means you're in the
process of getting them...
Now, this assumes that some human anoints one of the machines to be the
primary in the first place, and every time both go down, that they do so
again. There is no easy way to avoid the first case, but there is a way of
avoiding many occurrences of the second case. DOUBLE FAILURE DISCUSSION
BELOW...
This is where the generation tuples come in. Every time a drbd transition
occurs, the generation number of the partition is incremented. When a
machine comes up and it is the first to come up, it has to wait for the
other one, and then it can decide which of them has the latest data. If
only one side comes up, then a human will have to tell the other side to
come up manually.
The reason why they're tuples is that we need a generation number which is
incremented every time a human forces a lone machine to take over, so that
in the future, only the correct version of the data is used in this case as
well. In other words, we don't care WHAT automatic generation number you're
at if the human-generation is lower. This means that human overrides always
outrank the automatically generated generation numbers.
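As a sketch of how the comparison could work (assuming the tuple is
ordered as (human, automatic), which matches the examples below where
(2,1) outranks (1,2)), plain lexicographic ordering does the job:

```python
# Generation tuples as (human, automatic). Illustrative only; the
# function names are not part of any real drbd interface.

def newest(gen_a, gen_b):
    # Python's tuple ordering compares the human component first,
    # so a manual override always outranks any number of automatic
    # transitions, regardless of the automatic counter's value.
    return max(gen_a, gen_b)

def bump_auto(gen):
    # Automatic takeover: increment the automatic counter.
    human, auto = gen
    return (human, auto + 1)

def bump_human(gen):
    # A human forces a lone machine to take over: bump the human
    # counter and reset the automatic one.
    human, _auto = gen
    return (human + 1, 1)
```

Note that newest((2,1), (1,2)) picks (2,1): the node with the lower
human-generation loses no matter how high its automatic counter is.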
Here are a few examples:
-------------------------------------------------------
Machine A is master, and B is fully synced. Both have generation # (1,1)
B goes down and comes back up. "A" knows it still has good data, so it
stays primary. "B" syncs from "A" and continues. Both still have
generation # (1,1)
-------------------------------------------------------
Machine A is master, and B is fully synced. Both have generation # (1,1)
A goes down, and B takes over. B now has generation # (1,2).
B goes down. A comes up, and looks for B. B comes back up, and the two
machines compare generation numbers (1,1) for A, and (1,2) for B. B wins,
so "A" syncs off of B and life goes on. Both now have generation # (1,2).
-------------------------------------------------------------
Machine A is master, and B is fully synced. Both have generation # (1,1)
A goes down, and B takes over. B now has generation # (1,2).
B goes down. When it comes up, it starts looking for "A". It has the good
data but doesn't know for sure that "A" doesn't have newer data. A comes
back up, and the two machines compare generation numbers (1,1) for A, and
(1,2) for B. B wins, so "A" syncs off of B and life goes on. Both now have
generation # (1,2).
I guess this shows an area for possible improvement. If a machine is
primary, and all secondaries are down, it could record information which
would tell it which machine(s) have a copy of the data. When it came back
up, if no other machines were listed, it could safely take over without
further ado. This could be generalized to handle "n" nodes quite
nicely.
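A minimal sketch of that n-node generalization, with purely illustrative
names (nothing here is a real drbd interface; in practice the set would
have to be persisted to disk):

```python
# Each primary keeps a persistent record of which peers hold a
# current copy of the data ("the good bits").

def on_peer_synced(copy_holders, peer):
    # A secondary finished syncing; it now holds the good bits too.
    copy_holders.add(peer)

def on_peer_down(copy_holders, peer):
    # A peer that goes down while we keep writing falls behind,
    # so it no longer counts as holding a current copy.
    copy_holders.discard(peer)

def can_resume_alone(copy_holders):
    # After a reboot, if no peer could possibly hold newer data,
    # this node may take over without waiting for the others
    # or for a human.
    return not copy_holders
```

The safety argument is the same as in the two-node case: a node may only
come up unattended when it can prove nobody else has newer data.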
-----------------------------------------------------------------------
Machine A is master, and B is fully synced. Both have generation # (1,1)
A goes down, and B takes over, and runs for a few minutes. It has
generation # (1,2).
B crashes hard, and needs repair. Machine A comes back up, and cannot take
over, because it doesn't know if it has good data (which it actually
doesn't). A human being decides that it is better to run with data which
is 5 minutes old than not to be up at all (perhaps they think the hard
disk in "B" is bad). "A" now comes back up with generation # (2,1).
Eventually B may come back up with generation # (1,2) again. It sees that A
is master and has a higher generation number [ (2,1) > (1,2) ], so it syncs
up to A and goes on, with a copy of the "good bits".
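The comparison in this last example can be checked directly with
ordinary tuple ordering (Python used only as convenient notation):

```python
# Replay of the last scenario: B's automatic takeover left it at
# (1, 2); the forced manual takeover put A at (2, 1). Comparing the
# human counter first, A's data is authoritative even though B's was
# newer in wall-clock time.
gen_a = (2, 1)   # A, after the human override
gen_b = (1, 2)   # B, from before it crashed
assert gen_a > gen_b  # human component dominates the comparison
winner = max(gen_a, gen_b)
assert winner == gen_a  # so B syncs from A
```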
Does this make any sense?
-- Alan Robertson
alanr@example.com