Mailing List Archive

data integrity and drbd
This is similar to a discussion we had a few months ago on the main ha
list. I didn't cross-post here, and I don't think Philipp was reading the
list at the time.

You cannot tell from any configuration file who should be master and who
should be slave, because you have to determine at run time who has a good
copy of the data. This is not something you can configure statically.

The following portion of the discussion concerns the cases where at least
one machine stays up all the time - in other words, no double failures.
Double failures will be discussed later on...

If machine A is up and is master, it has the "good bits". If B is up and is
fully synced to "A", then it also has the "good bits". If either machine
goes down, then either can continue on because it has the "good bits".

If "A" goes down, and "B" takes over, then "A" doesn't have the "good bits"
any more and is ineligible to take over the service even if it comes back
up. If it syncs from "B", then it can take over, because it has the "good
bits" again.

The rule is that whenever a machine comes up automatically, it can only sync
from the other side, and then after a successful sync the data state is
good, and it can take over.

It is *VITAL* for the scripts (or someone) to track this state, so that
false takeovers don't happen and bad data isn't used.

Heartbeat may *instruct* DRBD to take over, but drbd cannot do that just
because heartbeat ordered it to - it may not have any good data. Heartbeat
has no way of knowing that. It just wants to always bring the service up,
whether drbd is able to or not. Heartbeat is NOT authoritative in this
matter ;-) It has no idea whether what it's asking you to do is reasonable
or possible.

This is why my proposal has a state file for data integrity, "good", "bad"
and "sync". Good meaning you have a copy of the "good bits", "bad" meaning
you don't, and "sync" meaning you're in the process of getting them...
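
As a rough illustration of such a state file and the rule it enforces, here is
a minimal Python sketch; the file location and function names are invented for
the example and are not part of drbd or heartbeat.

    # Sketch of the proposed per-device data-integrity state file.
    # "good" - this node holds a copy of the "good bits"
    # "bad"  - it does not, and must not be promoted
    # "sync" - it is in the process of getting them from the peer
    import os

    STATE_FILE = "/var/state/drbd/drbd0.state"   # hypothetical path
    VALID = ("good", "bad", "sync")

    def read_state():
        try:
            with open(STATE_FILE) as f:
                state = f.read().strip()
        except FileNotFoundError:
            return "bad"             # unknown history: assume we lack the good bits
        return state if state in VALID else "bad"

    def write_state(state):
        assert state in VALID
        tmp = STATE_FILE + ".tmp"
        with open(tmp, "w") as f:
            f.write(state + "\n")
        os.replace(tmp, STATE_FILE)  # atomic, so a crash can't leave garbage

    def may_become_primary():
        # Only a node known to hold the good bits may take over, no matter
        # what the cluster manager asks for.
        return read_state() == "good"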

Now, this assumes that some human anoints one of the machines to be the
primary in the first place, and every time both go down, that they do so
again. There is no easy way to avoid the first case, but there is a way of
avoiding many occurrences of the second case. DOUBLE FAILURE DISCUSSION
BELOW...

This is where the generation tuples come in. Every time a drbd transition
occurs, the generation number of the partition is incremented. When a
machine comes up and it is the first to come up, it has to wait for the
other one, and then it can decide which of them has the latest data. If
only one side comes up, then a human will have to tell the other side to
come up manually.

The reason why they're tuples is that we need a generation number which is
incremented every time a human forces a lone machine to take over, so that
in the future, only the correct version of the data is used in this case as
well. In other words, we don't care WHAT automatic generation number you're
at if the human-generation is lower. This means that human overrides always
outrank the automatically generated generation numbers.
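
To make the ordering concrete, here is a small Python sketch of how such
tuples could be compared; writing the tuple as (human, automatic) and
comparing lexicographically gives exactly the behaviour described above. The
function names are invented for the example.

    # Generation tuple: (human_gen, auto_gen).  human_gen is bumped when an
    # operator forces a lone machine to take over; auto_gen is bumped on an
    # ordinary automatic takeover.  Lexicographic comparison makes the human
    # element dominate, so human overrides always outrank automatic ones.

    def newer(gen_a, gen_b):
        """True if gen_a represents strictly newer data than gen_b."""
        return gen_a > gen_b

    def sync_source(gen_a, gen_b):
        """Which side ("A" or "B") the other must sync from, or "either"."""
        if gen_a == gen_b:
            return "either"                   # same generation, data is equivalent
        return "A" if newer(gen_a, gen_b) else "B"

    # Worked examples matching the scenarios below:
    assert sync_source((1, 1), (1, 2)) == "B"   # B took over automatically
    assert sync_source((2, 1), (1, 2)) == "A"   # human override on A outranks it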

Here are a few examples:
-------------------------------------------------------
Machine A is master, and B is fully synced. Both have generation # (1,1)

B goes down and comes back up. "A" knows it still has good data, so it
stays primary. "B" syncs from "A" and continues. Both still have
generation #(1,1)

-------------------------------------------------------
Machine A is master, and B is fully synced. Both have generation # (1,1)
A goes down, and B takes over. B now has generation # (1,2).
B goes down. A comes up, and looks for B. B comes back up, and the two
machines compare generation numbers (1,1) for A, and (1,2) for B. B wins,
so "A" syncs off of B and life goes on. Both now have generation # (1,2).

-------------------------------------------------------------
Machine A is master, and B is fully synced. Both have generation # (1,1)
A goes down, and B takes over. B now has generation # (1,2).
B goes down. When it comes up, it starts looking for "A". It has the good
data but doesn't know for sure that "A" doesn't have newer data. A comes
back up, and the two machines compare generation numbers (1,1) for A, and
(1,2) for B. B wins, so "A" syncs off of B and life goes on. Both now have
generation # (1,2).

I guess this shows an area for possible improvement. If a machine is
primary, and all secondaries are down, it could record information which
would tell it what machine(s) have a copy of the data. When it came back
up, if no other machines were on the list, it could safely take over
without further ado. This could be generalized to handle "n" nodes quite
nicely.

-----------------------------------------------------------------------
Machine A is master, and B is fully synced. Both have generation # (1,1)

A goes down, and B takes over, and runs for a few minutes. It has
generation # (1,2).

B crashes hard, and needs repair. Machine A comes back up, and cannot take
over, because it doesn't know if it has good data (which it actually
doesn't). A human being makes the decision that it is better to run with
data which is 5 minutes old, than to not be up at all (perhaps they think
the hard disk in "B" is bad). "A" now comes back up with generation (2,1).
Eventually B may come back up with generation # (1,2) again. It sees that A
is master and has a higher generation number [ (2,1) > (1,2) ], so it syncs
up to A and goes on, with a copy of the "good bits".

Does this make any sense?

-- Alan Robertson
alanr@example.com
Re: data integrity and drbd [ In reply to ]
[...]
> If machine A is up and is master, it has the "good bits". If B is up and
is
> fully synced to "A", then it also has the "good bits". If either machine
> goes down, then either can continue on because it has the "good bits".
[...]

Hi philipp / all,

Regarding this proposal, I am wondering whether a site can really be behind
on sync with protocol B or C and a filesystem mounted with the sync attribute?
My guess is that the remote filesystem is always a mirror of the master and it
cannot happen.

The only moments it is not are during a Quick Sync after a reboot and during a
full Sync, which can only be started manually. "datascript status" already
detects those cases and reports "non ready" while a sync is in progress, so no
takeover can be done by heartbeat.

As far as I understand, philipp is willing to implement a CSN system (which I
haven't fully understood yet, as I didn't find the time to read his mail
carefully) which may do what you are nicely trying to explain to us.

Can philipp or marcello confirm that, please?

Thomas
Re: data integrity and drbd [ In reply to ]
On Tue, 21 Nov 2000, Thomas Mangin wrote:

> [...]
> > If machine A is up and is master, it has the "good bits". If B is up and
> is
> > fully synced to "A", then it also has the "good bits". If either machine
> > goes down, then either can continue on because it has the "good bits".
> [...]
>
> Hi philipp / all,
>
> Regarding this proposal, I am wondering if a site can be really be late on
> sync with protocol B and C and a filesystem mounted with the sync
> attribute ?
>
> I guess is that the remote filesystem is always a mirror of the master and
> it can not happen.
>
>
> The only moment it is not is during a Quick Sync after a reboot and a full
> Sync which can only be started manually. "datascript status" already detect
> those case and report "non ready" if a sync is performed. so no takeover can
> be done by heartbeat.

Think about the following case:

1 Node A (primary) -> Node B (secondary)
2 Node A crashes -> Node B takeover (primary)
3 Node A crashed -> Node B crashes
4 Node A comes up and comes back to life. Since Node B is crashed,
Node A will mount the drbd partition and become primary.

During phase 3 (while Node A is crashed) Node B may write data to the
drbd partition.

If this happens, the data on Node A's partition is no longer the newest, and
it can't be used until it syncs with Node B's partition, which has the newer
data.
Re: data integrity and drbd [ In reply to ]
On Tue, 21 Nov 2000, Alan Robertson wrote:

> data but doesn't know for sure that "A" doesn't have newer data. A comes
> back up, and the two machines compare generation numbers (1,1) for A, and
> (1,2) for B. B wins, so "A" syncs off of B and life goes on.

<snip>

Just as a comment, failover behaviour must be selected by the admin.

> Does this make any sense?

The generation number looks fine to me.
Re: data integrity and drbd [ In reply to ]
Thomas Mangin wrote:
>
> [...]
> > If machine A is up and is master, it has the "good bits". If B is up and
> is
> > fully synced to "A", then it also has the "good bits". If either machine
> > goes down, then either can continue on because it has the "good bits".
> [...]
>
> Hi philipp / all,
>
> Regarding this proposal, I am wondering if a site can be really be late on
> sync with protocol B and C and a filesystem mounted with the sync attribute
> ? I guess is that the remote filesystem is always a mirror of the master and
> it can not happen.

In the discussion, I assumed you were using a protocol and a mount method
which would lead you to be happy to let the other node take over. If not,
you made a mistake, and neither heartbeat nor drbd is in any position to do
anything about it.

-- Alan Robertson
alanr@example.com
Re: data integrity and drbd [ In reply to ]
Marcelo Tosatti wrote:
>
> On Tue, 21 Nov 2000, Thomas Mangin wrote:
>
> > [...]
> > > If machine A is up and is master, it has the "good bits". If B is up and
> > is
> > > fully synced to "A", then it also has the "good bits". If either machine
> > > goes down, then either can continue on because it has the "good bits".
> > [...]
> >
> > Hi philipp / all,
> >
> > Regarding this proposal, I am wondering if a site can be really be late on
> > sync with protocol B and C and a filesystem mounted with the sync
> > attribute ?
> >
> > I guess is that the remote filesystem is always a mirror of the master and
> > it can not happen.
> >
> >
> > The only moment it is not is during a Quick Sync after a reboot and a full
> > Sync which can only be started manually. "datascript status" already detect
> > those case and report "non ready" if a sync is performed. so no takeover can
> > be done by heartbeat.
>
> Think about the following case:
>
> 1 Node A (primary) -> Node B (secondary)
> 2 Node A crashes -> Node B takeover (primary)
> 3 Node A crashed -> Node B crashes
> 4 Node A comes up and comes back to life. Since Node B is crashed,
> Node A will mount the drbd partition and become primary.
>
> During phase 3 (while Node A is crashed) Node B may write data to the
> drbd partition.

And, you can expect it to!

> If this happens, data on Node A partition is not the newer anymore and it
> can't be used until it sync with Node B partition, which has the newer
> data.

This is the whole point of the generation numbers, obviously ;-)

-- Alan Robertson
alanr@example.com
Re: data integrity and drbd [ In reply to ]
Hi Alan,

I like this tuple scheme. ( You explained it in far more
detail than I explained my CSN approach )

BTW have you read the thread beginning with the mail:
http://www.geocrawler.com/lists/3/SourceForge/3756/25/4661083/

So let's conclude:
There is no doubt that we need some kind of drbd-metadata store.

Where should we store that?

Is it possible to detect if QuickSync is sufficient?
(Is it possible to detect if the disk was replaced?)

-Philipp

* Alan Robertson <alanr@example.com> [001121 18:48]:
> This is similar to a discussion we had a few months ago on the main ha
> list. I didn't cross-post here, and I don't think Phillip was reading the
> list at the time.
>
> You cannot go from any configuration files to tell you who should be master,
> and who should be slave, because you have to determine at run time who has a
> good copy of the data. This is not something configurable statically.
>
> The following portion of the discussion concerns the cases where at least
> one machine stays up all the time - in other words, no double failures.
> Double failures will be discussed later on...
>
> If machine A is up and is master, it has the "good bits". If B is up and is
> fully synced to "A", then it also has the "good bits". If either machine
> goes down, then either can continue on because it has the "good bits".
>
> If "A" goes down, and "B" takes over, then "A" doesn't have the "good bits"
> any more and is inelegible to take over the service even if it comes back
> up. If it syncs from "B", then it can take over, because it has the "good
> bits" again.
>
> The rule is that whenever a machine comes up automatically, it can only sync
> from the other side, and then after a successful sync the data state is
> good, and it can take over.
>
> It is *VITAL* for the scripts (or someone) to track this state, so that
> false takeovers don't happen and bad data is used.
>
> Heartbeat may *instruct* DRBD to take over, but drbd cannot do that just
> because heartbeat ordered it to - it may not have any good data. Heartbeat
> has no way of knowing that. It just wants to always bring the service up,
> whether drbd is able to or not. Hearbeat is NOT authoritative in this
> matter ;-) It has no idea whether what it's asking you to do is reasonable
> or possible.
>
> This is why my proposal has a state file for data integrity, "good", "bad"
> and "sync". Good meaning you have a copy of the "good bits", "bad" meaning
> you don't, and "sync" meaning you're in the process of getting them...
>
> Now, this assumes that some human anoints one of the machines to be the
> primary in the first place, and every time both go down, that they do so
> again. There is no easy way to avoid the first case, but there is a way of
> avoiding many occurrances of the second case. DOUBLE FAILURE DISCUSSION
> BELOW...
>
> This is where the generation tuples come in. Every time a drbd transition
> occurs, the generation number of the partition is incremented. When a
> machine comes up and it is the first to come up, it has to wait for the
> other one, and then it can decide which of them has the latest data. If
> only one side comes up, then a human will have to tell the other side to
> come up manually.
>
> The reason why they're tuples is that we need a generation number which is
> incremented every time a human forces a lone machine to take over, so that
> in the future, only the correct version of the data is used in this case as
> well. In other words, we don't care WHAT automatic generation number you're
> at if the human-generation is lower. This means that human overrides always
> outrank the automatically generated generation numbers.
>
> Here are a few of examples:
> -------------------------------------------------------
> Machine A is master, and B is fully synced. Both have generation # (1,1)
>
> B goes down and comes back up. "A" Knows it still has good data, so it
> stays primary. "B" syncs from "A" and continues. Both still have
> generation #(1,1)
>
> -------------------------------------------------------
> Machine A is master, and B is fully synced. Both have generation # (1,1)
> A goes down, and B takes over. B now has generation # (1,2).
> B goes down. A comes up, and looks for B. B comes back up, and the two
> machines compare generation numbers (1,1) for A, and (1,2) for B. B wins,
> so "A" syncs off of B and life goes on. Both now have generation # (1,2).
>
> -------------------------------------------------------------
> Machine A is master, and B is fully synced. Both have generation # (1,1)
> A goes down, and B takes over. B now has generation # (1,2).
> B goes down. When it comes up, it starts looking for "A". It has the good
> data but doesn't know for sure that "A" doesn't have newer data. A comes
> back up, and the two machines compare generation numbers (1,1) for A, and
> (1,2) for B. B wins, so "A" syncs off of B and life goes on. Both now have
> generation # (1,2).
>
> I guess this shows an area for possible improvement. If a machine is
> primary, and all secondaries are down, it could record information which
> would tell it what machine(s) have a copy of the data. When it came back
> up, if no other machines were listed in the list, it could safely take over
> without further adieu. This could be generalized to handle "n" nodes quite
> nicely.
>
> -----------------------------------------------------------------------
> Machine A is master, and B is fully synced. Both have generation # (1,1)
>
> A goes down, and B takes over, and runs for a few minutes. It has
> generation # (1,2).
>
> B crashes hard, and needs repair. Machine A comes back up, and cannot take
> over, because it doesn't know if it has good data (which it actually
> doesn't). A human being makes the decision that it is better to run with
> data which is 5 minutes old, than to not be up at all (perhaps they think
> the hard disk in "B" is bad). "A" now comes back up with generation (2,1).
> Eventually B may come back up with generation # (1,2) again. It sees that A
> is master and has a higher generation number [ (2,1) > (1,2) ], so it syncs
> up to A and goes on, with a copy of the "good bits".
>
> Does this make any sense?
>
> -- Alan Robertson
> alanr@example.com
> _______________________________________________
> DRBD-devel mailing list
> DRBD-devel@example.com
> http://lists.sourceforge.net/mailman/listinfo/drbd-devel
>
Re: data integrity and drbd [ In reply to ]
Philipp Reisner wrote:
>
> Hi Alan,
>
> I like this tuple scheme. ( You explained it in far more
> detail than I explained my CSN approach )
>
> BTW have you read the thread beginning with the mail:
> http://www.geocrawler.com/lists/3/SourceForge/3756/25/4661083/

I read parts of it, but I was too focused on what was already in my mind
(sorry!). It might not have been too clear at the time, but this generation
number scheme was what I talked to you about at Miami a month or so ago.

> So let's conclude:
> It's of no doubt that we need some kind of drbd-metadata store.
>
> Where should we store that?

In the disk partition, if possible. (I reordered your questions). Marcelo
says at the end of the partition. I think he's right. But then you'll have
to be LVM aware... I suppose you ought to be anyway... Growing a DRBD
partition is an interesting thought (which makes my head hurt).

> (Is it possible to detect if the disk was replaced)

I would support a magic number, a generation number and perhaps a little
other info inside the partition. If it's inside the partition, then a replaced
disk is easy to detect - the magic number simply won't be there.

> Is it possible to detect if QuickSync is sufficient?

A very interesting question.

The most conservative assumption is that every time a node connects which
hasn't been synced in the past, you should do a full sync. Let me show an
example where this is necessary.

A is PRI
B is SEC
A fails.
B takes over.
A reboots -- and MUST have a full sync. Here's why:

A had disk blocks written which B never saw. B went on and overwrote some
but not all of those with different data. Now, unless A undoes all the disk
writes which B never ACKed, it will have an inconsistent set of disk blocks.

Possible cures to this include:

Implement a journal for A, so it can rollback changes never seen by B

Never write a disk block on A unless it was already ACKed by B

Put the bit map in NVRAM or on disk, and request all blocks that were in
it before.

OR Just give up and do a full sync.


Without rearchitecting drbd this means:

Any time the other side has a generation number higher than yours,
you have to do a full sync.
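
A literal encoding of that rule, as a sketch (whether a QuickSync is ever
sufficient is the open question above; this just writes down the conservative
answer):

    # Conservative resync decision: if the peer's generation tuple is ahead of
    # ours, we may hold writes the peer never ACKed, so only a full sync is safe.
    def sync_kind_needed(my_gen, peer_gen):
        if peer_gen > my_gen:
            return "full"     # peer has newer data; our un-ACKed writes are suspect
        if peer_gen < my_gen:
            return "none"     # we are the sync source, not the target
        return "quick"        # same generation: only blocks dirtied while
                              # disconnected need to be copied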

-- Alan Robertson
alanr@example.com
Re: data integrity and drbd [ In reply to ]
On 2000-11-21T15:32:16,
tony willoughby <twilloughby@example.com> said:

> We are using heartbeat/DRBD to achieve some level of high availability.
> As I understand this approach, if a node fails hard while the other node
> is booting then the booting node will not become operational without
> operator intervention.
>
> Am I correct? If so, then then HA/DRBD will no longer provide what I need.

You are correct.

However, what you currently have doesn't provide HA in this situation but data
corruption.

Sincerely,
Lars Marowsky-Brée <lmb@example.com>
Development HA

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl
Re: data integrity and drbd [ In reply to ]
On Tue, 21 Nov 2000, Alan Robertson wrote:

> tony willoughby wrote:
> >
> > Alan,
> > Thank you for the crystal clear description.
> >
> > This seems like a good approach. I do have one concern though.
> >
> Thanks!
> >
> > We are using heartbeat/DRBD to achieve some level of high availability.
> > As I understand this approach, if a node fails hard while the other node
> > is booting then the booting node will not become operational without
> > operator intervention.
> >
> > Am I correct?
>
> Mostly. As I wrote the description, I realized that in the case where the
> node knew that the other side didn't have good data, it could go ahead and
> come up anyway. However, you are correct that there are some circumstances
> where, because it cannot determine that the data is the right version, it will
> refuse to come up in order to keep from compromising data integrity.
>
> > If so, then then HA/DRBD will no longer provide what I need.
>
> The approach I used was to consider data integrity paramount. One can
> always compromise that for an application where availability is higher
> priority than data integrity. There are two or three different "Oh Golly!"
> circumstances where it is unable to continue. It would certainly be
> possible to create an error handler interface where you could provide a
> script to run when this happens. Your script could then issue the "manual
> intervention" command. In this case it would continue just as though a
> human had done it.
>
> I assumed that the error exits would be a nice idea. You're saying that for
> you they are necessary. I don't see this as a show-stopper. Do you think
> this would meet your needs?

Yes, I believe this will satisfy my concerns.
Perhaps a timeout, then invoke the error handler.

Tony Willoughby
ADC Telecommunications, Inc.
Broadband Access and Transport Group
mailto:tony_willoughby@example.com
Re: data integrity and drbd [ In reply to ]
On Tue, 21 Nov 2000, Lars Marowsky-Bree wrote:

> On 2000-11-21T15:32:16,
> tony willoughby <twilloughby@example.com> said:
>
> > We are using heartbeat/DRBD to achieve some level of high availability.
> > As I understand this approach, if a node fails hard while the other node
> > is booting then the booting node will not become operational without
> > operator intervention.
> >
> > Am I correct? If so, then then HA/DRBD will no longer provide what I need.
>
> You are correct.
>
> However, what you currently have doesn't provide HA in this situation but data
> corruption.

I understand. I'd like to have HA and good data.

:^)


Tony Willoughby
ADC Telecommunications, Inc.
Broadband Access and Transport Group
mailto:tony_willoughby@example.com
Re: data integrity and drbd [ In reply to ]
On 2000-11-21T13:44:43,
Alan Robertson <alanr@example.com> said:

> Mostly. As I wrote the description, I realized that in the case where the
> node knew that the other side didn't have good data, it could go ahead and
> come up anyway.

What are these cases?

How can a node know whether the operator did something to the other side? Ok,
if it can communicate, it may be able to - if both sides have bad data, and
they can communicate, they may decide to go "OK, you win, I lose".

If only one side comes up, it can STONITH the other side to make sure the
other side doesn't have good data ;) and go ahead and come up itself.

Sincerely,
Lars Marowsky-Brée <lmb@example.com>
Development HA

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl
AW: data integrity and drbd [ In reply to ]
Hi Alan,

> The approach I used was to consider data integrity paramount. One can
> always compromise that for an application where availability is higher
> priority than data integrity. There are two or three different
> "Oh Golly!"
> circumstances where it is unable to continue. It would certainly be
> possible to create an error handler interface where you could provide a
> script to run when this happens. Your script could then issue the "manual
> intervention" command. In this case it would continue just as though a
> human had done it.
>
> I assumed that the error exits would be a nice idea. You're
> saying that for
> you they are necessary. I don't see this as a show-stopper. Do you think
> this would meet your needs?

I think this will prove to be a rather common need and it'd be nice to have
it supported "out of the box".

The problem: a node starts up; it knows that it has a valid copy of the data
(i.e. drbd wasn't right in the middle of a sync when it went down). It needs
to contact the other side to find out which side has more recent data.

A probably desirable reaction for high-availability requirements (as opposed
to an emphasis on data integrity): have a configurable timeout for how long to
wait for info from the other node to become available; continue startup after
the timeout expires. For maximum data integrity, the timeout might be infinite.

While waiting for the other node, accept user interaction like pressing a key
(local console) or touching a file (remote).
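
A rough sketch of that startup wait, in Python; the peer-probe function is a
placeholder and the override-file path is an invented convention, not an
existing drbd or heartbeat option. A timeout of None stands for the "wait
forever" (maximum data integrity) setting.

    import os
    import time

    OVERRIDE_FILE = "/var/run/drbd0.force-up"    # hypothetical "touch a file" hook

    def peer_reachable():
        # Placeholder: the real check would ask drbd/heartbeat about the peer.
        return False

    def wait_for_peer(timeout=None, poll=5):
        """Wait for the peer at startup; return why the wait ended."""
        start = time.time()
        while True:
            if peer_reachable():
                return "peer"                  # compare generations and proceed
            if os.path.exists(OVERRIDE_FILE):
                return "operator-override"     # a human accepted the risk
            if timeout is not None and time.time() - start >= timeout:
                return "timeout"               # availability wins over integrity
            time.sleep(poll)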

Bye, Martin

"you have moved your mouse, please reboot to make this change take effect"
--------------------------------------------------
Martin Bene vox: +43-316-813824
simon media fax: +43-316-813824-6
Nikolaiplatz 4 e-mail: mb@example.com
8020 Graz, Austria
--------------------------------------------------
finger mb@example.com for PGP public key
Re: data integrity and drbd [ In reply to ]
Lars Marowsky-Bree wrote:
>
> On 2000-11-21T13:44:43,
> Alan Robertson <alanr@example.com> said:
>
> > Mostly. As I wrote the description, I realized that in the case where the
> > node knew that the other side didn't have good data, it could go ahead and
> > come up anyway.
>
> What are these cases?
>
> How can a node know whether the operator did something to the other side? Ok,
> if it can communicate, it may be able to - if both sides have bad data, and
> they can communicate, they may decide to go "OK, you win, I lose".

In the case where a node was "anointed" by an operator, it can't know that.
It *can* know that unless manual intervention occurred it is safe to come up
by itself. This requires a little more thought.

Basically, the idea was that if you know that the other side didn't have a
good copy of the data, then you could remember that. If you came up, and
that was the case, and you couldn't contact the other side, then you could
go ahead and start up without manual intervention.

As you point out, the other side *could* have a newer copy of the data, if
it had been manually appointed as the keeper of the good data.

> If only one side comes up, it can STONITH the other side to make sure the
> other side doesn't have good data ;) and go ahead and come up itself.

I don't think Stonith helps here, it just makes you feel better ;-).

The problem comes out of this:

You can have good data, but not quorum.

You can have quorum, but no good data.

This is why I talked about the tension between heartbeat and drbd.
[Heartbeat doesn't manage quorum, but it should ;-)]

If you have quorum, you might very well decide to stonith the other guy.
That isn't going to give you good data, unless rebooting him fixes your
OS/network problems. You still can't provide service.

-- Alan Robertson
alanr@example.com
Re: data integrity and drbd [ In reply to ]
On Tue, Nov 21, 2000 at 01:29:45PM -0700, Alan Robertson wrote:
> Philipp Reisner wrote:
> >
> > Hi Alan,
> >
> > I like this tuple scheme. ( You explained it in far more
> > detail than I explained my CSN approach )

I agree, the tuple scheme is quite good. I think there may be some more
information about deliberate state changes, as opposed to crashes/reboots,
that might be worth keeping to reduce the number of interventions needed.
I will try to come up with a proposal over the thanksgiving holidays.

> > So let's conclude:
> > It's of no doubt that we need some kind of drbd-metadata store.
> >
> > Where should we store that?
>
> In the disk partition, if possible. (I reordered your questions). Marcelo
> says at the end of the partition. I think he's right. But then you'll have
> to be LVM aware... I suppose you ought to be anyway... Growing a DRBD
> partition is an interesting thought (which makes my head hurt).

I really don't like this much. The problem is that drbd can be used by lvm,
various filesystems, and as raw space for things like databases. So unless
you fudge the partitioning scheme to make drbd devices appear smaller than
they are (good luck, there are lots of partition schemes), you are counting
on getting lucky and not having the metadata overwritten.

Also, one of the really nice things about drbd is that it is totally format
compatible with the lower devices. I can take an existing filesystem and make
it a drbd primary anytime, or shut down drbd anytime and continue to use the
filesystem on the lower device.

I don't see any reason not to store the drbd state in /var somewhere. We
need a / filesystem to read the configuration anyway. If we lose / the
admins will have to be involved already, so the lack of automatic recovery
is not a serious limitation.


Perhaps a little prioritization is in order: currently drbd has some
issues that make it marginal for most users:

- not SMP stable?

- resyncer very slow?

- protocol timeouts?

- config files difficult to manage for large numbers of drbd devices or
many nodes?

- easy to mess up and get primary/secondary confused?

- clustermanager interactions iffy?

- does not support serving blocks for multiple nodes, ie for GFS?

These may not be the right set of issues, but it is clear that there is
lots to do to drbd before it is "finished". It is not clear that spending
lots of effort to make it hide partition space from all the different
partition tools and LVM etc is the best thing to do right now.

I would like to see us get the stored state (probably with Alans tuples)
working soon. It will save a lot of users from potential disasters. Perhaps
we could start with it in /var, and then if an easy way to store it on
the disks themselves comes along, we can always move it.

-dg

--
David Gould dg@example.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
"I personally think Unix is "superior" because on LSD it tastes
like Blue." -- jbarnett
Re: data integrity and drbd [ In reply to ]
On Tue, Nov 21, 2000 at 10:08:20PM +0100, Lars Marowsky-Bree wrote:
> On 2000-11-21T15:32:16,
> tony willoughby <twilloughby@example.com> said:
>
> > We are using heartbeat/DRBD to achieve some level of high availability.
> > As I understand this approach, if a node fails hard while the other node
> > is booting then the booting node will not become operational without
> > operator intervention.
> >
> > Am I correct? If so, then then HA/DRBD will no longer provide what I need.
>
> You are correct.
>
> However, what you currently have doesn't provide HA in this situation but data
> corruption.

This is a little overstated. Currently drbd makes it easy to get data
corruption, and does little to protect admins from mistakes. But it is
not always a mistake to come up and it will not always lead to corruption.

-dg

--
David Gould dg@example.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
"I personally think Unix is "superior" because on LSD it tastes
like Blue." -- jbarnett
Re: data integrity and drbd [ In reply to ]
On Tue, Nov 21, 2000 at 01:44:43PM -0700, Alan Robertson wrote:
> tony willoughby wrote:
> >
> > Alan,
> > Thank you for the crystal clear description.
> >
> > This seems like a good approach. I do have one concern though.
> >
> Thanks!
> >
> > We are using heartbeat/DRBD to achieve some level of high availability.
> > As I understand this approach, if a node fails hard while the other node
> > is booting then the booting node will not become operational without
> > operator intervention.
> >
> > Am I correct?
>
> Mostly. As I wrote the description, I realized that in the case where the
> node knew that the other side didn't have good data, it could go ahead and
> come up anyway. However, you are correct that there are some circumstances
> where, because it cannot determine that the data is the right version, it will
> refuse to come up in order to keep from compromising data integrity.
>
> > If so, then then HA/DRBD will no longer provide what I need.

Hmmm, this is why I wanted to store a bit more data about state
transitions. For example:

A is pri, B is sec.

B crashes, A continues.

A is shutdown cleanly and then restarted. B is still dead.

At this point, A could start as pri if it knows:
A was pri and cleanly shutdown.
B was stale (since disconnected before A's last write)
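
A small sketch of the extra record this suggests a node would keep; the field
names are invented for the example.

    from dataclasses import dataclass

    @dataclass
    class NodeRecord:
        was_primary: bool       # we were primary when we last ran
        clean_shutdown: bool    # we went down via a clean shutdown, not a crash
        peer_stale: bool        # peer disconnected before our last write

    def may_start_alone(rec):
        # Safe only if all three hold - and, as discussed elsewhere in the
        # thread, only if nobody manually promoted the peer while we were down.
        return rec.was_primary and rec.clean_shutdown and rec.peer_stale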

> I assumed that the error exits would be a nice idea. You're saying that for
> you they are necessary. I don't see this as a show-stopper. Do you think
> this would meet your needs?

I dunno, an error exit without more information available to it is going to
have a hard time coming up with the "right answer". And if it is just going
to time out and force the system to come up anyway, welllllll, if you are
going to corrupt the data, why wait 10 minutes?

-dg
--
David Gould dg@example.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
"I personally think Unix is "superior" because on LSD it tastes
like Blue." -- jbarnett
Re: data integrity and drbd [ In reply to ]
David Gould wrote:
>

>
> - easy to mess up and get primary/secondary confused?
>
> - clustermanager interactions iffy?

I believe that this proposal will make these two problems go away.

-- Alan Robertson
alanr@example.com
Re: data integrity and drbd [ In reply to ]
David Gould wrote:
>
> On Tue, Nov 21, 2000 at 01:44:43PM -0700, Alan Robertson wrote:
> > tony willoughby wrote:
> > >
> > > Alan,
> > > Thank you for the crystal clear description.
> > >
> > > This seems like a good approach. I do have one concern though.
> > >
> > Thanks!
> > >
> > > We are using heartbeat/DRBD to achieve some level of high availability.
> > > As I understand this approach, if a node fails hard while the other node
> > > is booting then the booting node will not become operational without
> > > operator intervention.
> > >
> > > Am I correct?
> >
> > Mostly. As I wrote the description, I realized that in the case where the
> > node knew that the other side didn't have good data, it could go ahead and
> > come up anyway. However you are correct in that there some circumstances
> > where it cannot determine that the data is the right version it will refuse
> > to come up in order to keep from compromising data integrity.
> >
> > > If so, then then HA/DRBD will no longer provide what I need.
>
> Hmmm, this is why I wanted to store a bit more data about state
> transitions. For example:
>
> A is pri, B is sec.
>
> B crashes, A continues.
>
> A is shutdown cleanly and then restarted. B is still dead.
>
> At this point, A could start as pri if it knows:
> A was pri and cleanly shutdown.
> B was stale (since disconnected before A's last write)

This was discussed in other emails. You're right iff no one gave B a manual
override while "A" was down.. In that case, "B" has the good bits.

> > I assumed that the error exits would be a nice idea. You're saying that for
> > you they are necessary. I don't see this as a show-stopper. Do you think
> > this would meet your needs?
>
> I dunno, an error exit without more information availible to it is going to
> have a hard time coming up with the "right answer". And if it is just going
> to time out and force the system to come up anyway, welllllll,if you are
> going to corrupt the data, why wait 10 minutes?

What he was saying was "I know my application and my customers, and I want
to bring the service up anyway." For some applications, this is exactly the
right thing to do. For others (banking, etc.), this is a mistake. The reason
for waiting a few minutes is to let both machines recover and reboot from a
power outage. One machine will almost certainly come up first. It should wait
a few minutes to allow the other machine to come up, or to get diddled by an
admin and come up.

What if your application is like doubleclick.com? Losing some advertising
clickthrough information is certainly less important than not showing
advertisements!

This is a perfect example where you don't want to wait for a human before
continuing.

-- Alan Robertson
alanr@example.com
Re: data integrity and drbd [ In reply to ]
On Tue, Nov 21, 2000 at 06:47:34PM -0700, Alan Robertson wrote:
> David Gould wrote:
> >
> > On Tue, Nov 21, 2000 at 01:44:43PM -0700, Alan Robertson wrote:
> > > tony willoughby wrote:
> > Hmmm, this is why I wanted to store a bit more data about state
> > transitions. For example:
> >
> > A is pri, B is sec.
> >
> > B crashes, A continues.
> >
> > A is shutdown cleanly and then restarted. B is still dead.
> >
> > At this point, A could start as pri if it knows:
> > A was pri and cleanly shutdown.
> > B was stale (since disconnected before A's last write)
>
> This was discussed in other emails. You're right iff no one gave B a manual
> override while "A" was down.. In that case, "B" has the good bits.

Sure, it doesn't cover everything, but it catches the usual cases. If the
opers insist on fooling it, well ... let them.

> > I dunno, an error exit without more information availible to it is going to
> > have a hard time coming up with the "right answer". And if it is just going
> > to time out and force the system to come up anyway, welllllll,if you are
> > going to corrupt the data, why wait 10 minutes?
>
> What he was saying was "I know my application, and my customers and I want
> to bring For some applications, this is exactly the right thing to do.
> Others (banking, etc), this is a mistake. The reason for waiting a few
> minutes is to let both machines recover and reboot from a power outage. One
> machine will almost certainly come up first. It should wait a few minutes
> to allow the other machine to come up, or get diddled by an admin and come
> up.
>
> What if your application is like doubleclick.com. Losing some advertising
> clickthrough information is certainly less important than not showing
> advertisements!
>
> This is a perfect example where you don't want to wait for a human before
> continuing.

We agree.
-dg

--
David Gould dg@example.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
"I personally think Unix is "superior" because on LSD it tastes
like Blue." -- jbarnett
Re: data integrity and drbd [ In reply to ]
On 2000-11-21T15:24:26,
David Gould <dg@example.com> said:

> Perhaps a little prioritization is in order: currently drbd has some
> issues that make it marginal for most users:
>
> - not SMP stable?

Serious.

> - resyncer very slow?

Not _that_ serious, although I would love it if we could get some testing over
GigE done. If no one has the right hardware, I am sure SuSE will supply David
with it ;-)

> - protocol timeouts?
>
> - config files difficult to manage for large numbers of drbd devices or
> many nodes?

Annoying.

> - easy to mess up and get primary/secondary confused?
>
> - clustermanager interactions iffy?

Dangerous. Will be fixed - as far as I can see - by Alan's proposal I think.

> - does not support serving blocks for multiple nodes, ie for GFS?

Not that serious. This is a nice feature, and I definitely want it - not only
for GFS but for other apps which can use raw partitions too - but it has about the
same priority as "more than 2 nodes".

> I would like to see us get the stored state (probably with Alans tuples)
> working soon. It will save a lot users from potential disasters. Perhaps
> we could start with it in /var, and then if an easy way to store it on
> the disks themselves comes along, we can always move it.

I completely agree.

Having the status in a _human readable_ file under /var/state (I think that
would be the right directory according to the FSSTND) would make
implementation and debugging easier, and also do away with all those annoying
issues about interacting with the data on the drbd partition.
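
For illustration, a human-readable status file of the kind being proposed
might look something like this (every field, and the path, is made up for the
example):

    # /var/state/drbd/drbd0.status
    node:        nodea
    device:      /dev/nb0
    role:        primary
    data-state:  good          # good | bad | sync
    generation:  2 1           # (human, automatic)
    stale-peers: nodeb         # peers known not to have the good bits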

Sincerely,
Lars Marowsky-Brée <lmb@example.com>
Development HA

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl
Re: data integrity and drbd [ In reply to ]
On 2000-11-21T15:22:04,
Alan Robertson <alanr@example.com> said:

> The problem comes out of this:
>
> You can have good data, but not quorum.
>
> You can have quorum, but no good data.
>
> This is why I talked about the tension between heartbeat and drbd.
> [Heartbeat doesn't manage quorum, but it should ;-)]

Ok.

Hmmm. How can you have good data if you don't have quorum - you can't confirm
if you have good data, because that would require quorum.

You may _think_ you have better data than the rest of the cluster (which in
drbd's case boils down to the other node serving that drbd volume), but
without quorum, you can't decide on that.

> If you have quorum, you might very well decide to stonith the other guy.
> That isn't going to give you good data, unless rebooting him fixes your
> OS/network problems. You still can't provide service.

If you STONITH the other guy (and thus regain quorum in the case of a cluster
partition), the quorum sceptre allows you to postulate that your data is good.

If you don't have quorum, your data is not-good by definition.

Hmmmmmm. This problem boils down to "having had quorum the entire time since
last synced" (if the answer is yes, your data is good by definition), and what
to do in the case of two nodes where both sides lost quorum and come up with
two potentially different data sets - who has the _better_ data in that case?

Ok, I think I am looping and going back to the start ;-) But the interaction
between quorum and good data is IMHO definitely there.

Sincerely,
Lars Marowsky-Brée <lmb@example.com>
Development HA

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl
Re: data integrity and drbd [ In reply to ]
On 2000-11-21T15:49:30,
David Gould <dg@example.com> said:

> Hmmm, this is why I wanted to store a bit more data about state
> transitions. For example:
>
> A is pri, B is sec.
>
> B crashes, A continues.
>
> A is shutdown cleanly and then restarted. B is still dead.
>
> At this point, A could start as pri if it knows:
> A was pri and cleanly shutdown.
> B was stale (since disconnected before A's last write)

How do we confirm B is still dead in case of a cluster partition?

Especially in the case of drbd being used for long distance mirroring, this is
possible.

Ok, this essentially requires double failure. A reboot of "A" during a
potential crash of "B" or during a cluster partition, which we can't decide
between without a third vote.

I think the best we can do in that case is scream at the operator and demand
manual intervention.

Damn, I think we really need a big graphical flow chart with all the possible
cases, to see that they are covered and to help visualize the problem... xfig
or Dia anyone? ;-)

> I dunno, an error exit without more information availible to it is going to
> have a hard time coming up with the "right answer". And if it is just going
> to time out and force the system to come up anyway, welllllll,if you are
> going to corrupt the data, why wait 10 minutes?

Heh. ;-)

Sincerely,
Lars Marowsky-Brée <lmb@example.com>
Development HA

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl
Re: data integrity and drbd [ In reply to ]
On Tue, 21 Nov 2000, David Gould wrote:

> I really don't like this much. The problem is that drbd can be used by lvm,
> various filesystems, and as raw space for things like databases. So unless
> you fudge the partitioning scheme to make drbd devices appear smaller than
> they are (good luck, there are lots of partitions schemes), you are counting
> on getting lucky and not having the metadata overwritten.

LVM tools use the BLKGETSIZE ioctl to get the device size so it works
correctly with metadata at the end of the disk.

Things which deal with raw access should use BLKGETSIZE, too.

> Also, one of the really nice things about drbd is that it is totally format
> compatible with the lower devices. I can take an existing filesystem and make
> it a drbd primary anytime or shutdown drbd anytime and continue to use the
> filesystem on the lower device.

You can continue to use the filesystem directly (without drbd on top) with
metadata at the end of the disk.

The filesystem size is specified at creation time and stored in the
superblock. If you create the fs on top of drbd, it will report the device
size as the real size minus the drbd metadata. This way you can access the
device later without losing the metadata.
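
A tiny sketch of the size arithmetic being described; the size of the reserved
metadata area is an arbitrary figure chosen for illustration, not drbd's real
layout.

    SECTOR = 512
    META_RESERVED = 128 * 1024     # illustrative reservation at the end of the device

    def exported_sectors(lower_device_bytes):
        # Size drbd reports upward (what mkfs, LVM and BLKGETSIZE-style queries
        # see), so the filesystem can never overwrite the metadata area.
        return (lower_device_bytes - META_RESERVED) // SECTOR

    def metadata_offset(lower_device_bytes):
        # Where the drbd superblock area would begin on the lower device.
        return lower_device_bytes - META_RESERVED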

> I don't see any reason not to store the drbd state in /var somewhere. We
> need a / filesystem to read the configuration anyway. If we lose / the
> admins will have to be involved alread, so the lack of automatic recovery
> is not a serious limitation.
>
> Perhaps a little prioritization is in order: currently drbd has some
> issues that make it marginal for most users:
>
> - not SMP stable?

I used it with SMP some time ago and it was stable. Since the locking
schemes haven't changed, it should be ok now, too.

> - resyncer very slow?
>
> - protocol timeouts?
>
> - config files difficult to manage for large numbers of drbd devices or
> many nodes?
>
> - easy to mess up and get primary/secondary confused?
>
> - clustermanager interactions iffy?
>
> - does not support serving blocks for multiple nodes, ie for GFS?

You want the drbd superblock (at the end of the disk) for that, first.

The generation numbers, for example, should be stored on this superblock.

This way, the drbd partition is independent of the system on which it was
running.

> These may not be the right set of issues, but it is clear that there is
> lots to do to drbd before it is "finished".

Another important issue which must be fixed is IO error handling when an
error occurs in the low-level device:

1 - a "bad" bit is marked on the drbd superblock.

2 - Always panic or panic only if there is another "in sync" mirror. (user
selectable)

If the "bad" bit is set, the drbd may not allow to mount the partition
again. The "bad" bit can be cleaned by the admin later.

Of course we should not force this behaviour on the user. It must be configurable.
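
As a sketch of that policy (the option names and the non-panic fallback are
assumptions made for the example, not existing drbd behaviour):

    def on_lower_level_io_error(superblock, policy, peer_in_sync):
        superblock["bad"] = True          # persistent: refuse mounts until the
                                          # admin clears the bit
        if policy == "always-panic":
            return "panic"
        if policy == "panic-if-mirror" and peer_in_sync:
            return "panic"                # safe: the in-sync mirror takes over
        return "keep-running"             # otherwise limp along and report the error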

Comments?

> It is not clear that spending lots of effort to make it hide partition
> space from all the different partition tools and LVM etc is the best
> thing to do right now.

This hidden space is important for storing status information which is
critical (so it should not depend on the system on which it's running), such
as the generation numbers (as I previously said), status information (in
sync, bad, etc.) and maybe other stuff in the future like identification
numbers when we are using more than 2 mirrors, etc.

> I would like to see us get the stored state (probably with Alans tuples)
> working soon. It will save a lot users from potential disasters. Perhaps
> we could start with it in /var, and then if an easy way to store it on
> the disks themselves comes along, we can always move it.

Fine.
Re: data integrity and drbd [ In reply to ]
Lars Marowsky-Bree wrote:
>
> On 2000-11-21T15:22:04,
> Alan Robertson <alanr@example.com> said:
>
> > The problem comes out of this:
> >
> > You can have good data, but not quorum.
> >
> > You can have quorum, but no good data.
> >
> > This is why I talked about the tension between heartbeat and drbd.
> > [Heartbeat doesn't manage quorum, but it should ;-)]
>
> Ok.
>
> Hmmm. How can you have good data if you don't have quorum - you can't confirm
> if you have good data, because that would require quorum.

Here's what I mean by that: Imagine A has good data, B is down, so does
*not* have good data. A has the only good data around. If A subsequently
loses quorum, it *still* has the only good data. When it can contact the
other machine, they will find out it still has good data. Then it will have
quorum and can go on.

Now this is an interesting thought. If B is shut down because of loss of
quorum, and A didn't increment its generation number, then when they
contact each other after the loss of quorum, you can't tell that "A" had
better data. It sounds like incrementing the generation number is necessary
after the transition is complete, and you know you still have quorum. This
is symmetric with the other case, in which the former-secondary is made
primary.

> You may _think_ you have better data than the rest of the cluster (which in
> drbd's case boils down to the other node serving that drbd volume), but
> without quorum, you can't decide on that.
>
> > If you have quorum, you might very well decide to stonith the other guy.
> > That isn't going to give you good data, unless rebooting him fixes your
> > OS/network problems. You still can't provide service.
>
> If you STONITH the other guy (and thus regain quorum in the case of a cluster
> partition), the quorum ceptre allows you to postulate that your data is good.
>
> If you don't have quorum, your data is not-good by definition.

By theoretical definition, yes. I'm talking from the viewpoint of a
theoretical observer. As you said, the machine doesn't *know* it has the
best data, but only because the outside world might have anointed the other
machine to continue. Without this having occurred, you *know* you have the
best data.
>
> Hmmmmmm. This problem boils down to "having had quorum the entire time since
> last synced" (if the answer is yes, your data is good by definition), and what
> to do if in the case of two nodes, both sides lost quorum and come up with two
> potentially different data sets - who has the _better_ data in that case?
>
> Ok, I think I am looping and going bad to the start ;-) But the interaction
> between quorum and good data is IMHO definetely there.

There is an interaction, yes, but nothing like a guarantee. Having quorum
doesn't guarantee good data AT ALL. NOT having quorum guarantees that you
cannot serve data, because you MIGHT NOT have the best data. However, you
still can have the best data, and the quorum has to wait for you to return
or for a human to anoint another machine in order to continue.

-- Alan Robertson
alanr@example.com
