Mailing List Archive

Re: data integrity and drbd [ In reply to ]
On Tue, 21 Nov 2000, Alan Robertson wrote:

> David Gould wrote:
> >
> > On Tue, Nov 21, 2000 at 01:44:43PM -0700, Alan Robertson wrote:
> > > tony willoughby wrote:
> > > >
> > > > Alan,
> > > > Thank you for the crystal clear description.
> > > >
> > > > This seems like a good approach. I do have one concern though.
> > > >
> > > Thanks!
> > > >
> > > > We are using heartbeat/DRBD to achieve some level of high availability.
> > > > As I understand this approach, if a node fails hard while the other node
> > > > is booting then the booting node will not become operational without
> > > > operator intervention.
> > > >
> > > > Am I correct?
> > >
> > > Mostly. As I wrote the description, I realized that in the case where the
> > > node knew that the other side didn't have good data, it could go ahead and
> > > come up anyway. However, you are correct that there are some circumstances
> > > where it cannot determine that its data is the right version, and in those
> > > cases it will refuse to come up in order to avoid compromising data integrity.
> > >
> > > > If so, then HA/DRBD will no longer provide what I need.
> >
> > Hmmm, this is why I wanted to store a bit more data about state
> > transitions. For example:
> >
> > A is pri, B is sec.
> >
> > B crashes, A continues.
> >
> > A is shutdown cleanly and then restarted. B is still dead.
> >
> > At this point, A could start as pri if it knows:
> > A was pri and cleanly shutdown.
> > B was stale (since disconnected before A's last write)
>
> This was discussed in other emails. You're right iff no one gave B a manual
> override while "A" was down. In that case, "B" has the good bits.
>
> > > I assumed that the error exits would be a nice idea. You're saying that for
> > > you they are necessary. I don't see this as a show-stopper. Do you think
> > > this would meet your needs?
> >
> > I dunno, an error exit without more information available to it is going to
> > have a hard time coming up with the "right answer". And if it is just going
> > to time out and force the system to come up anyway, well, if you are going
> > to corrupt the data, why wait 10 minutes?
>
> What he was saying was "I know my application and my customers, and I want
> to bring the system up anyway." For some applications, this is exactly the
> right thing to do. For others (banking, etc.), this is a mistake. The
> reason for waiting a few
> minutes is to let both machines recover and reboot from a power outage. One
> machine will almost certainly come up first. It should wait a few minutes
> to allow the other machine to come up, or get diddled by an admin and come
> up.
>
> What if your application is like doubleclick.com? Losing some advertising
> clickthrough information is certainly less important than not showing
> advertisements!

Or, your box is in a closet and this failure occurs at 2:00
a.m. during a holiday weekend.

My concern is for unattended systems.


>
> This is a perfect example where you don't want to wait for a human before
> continuing.
>
> -- Alan Robertson
> alanr@example.com
>

Tony Willoughby
ADC Telecommunications, Inc.
Broadband Access and Transport Group
mailto:tony_willoughby@example.com
Re: data integrity and drbd [ In reply to ]
tony willoughby wrote:
>
> On Tue, 21 Nov 2000, Alan Robertson wrote:
>
> > David Gould wrote:
> > >
> > > On Tue, Nov 21, 2000 at 01:44:43PM -0700, Alan Robertson wrote:
> > > > tony willoughby wrote:
> > > > >
> > > > > Alan,
> > > > > Thank you for the crystal clear description.
> > > > >
> > > > > This seems like a good approach. I do have one concern though.
> > > > >
> > > > Thanks!
> > > > >
> > > > > We are using heartbeat/DRBD to achieve some level of high availability.
> > > > > As I understand this approach, if a node fails hard while the other node
> > > > > is booting then the booting node will not become operational without
> > > > > operator intervention.
> > > > >
> > > > > Am I correct?
> > > >
> > > > Mostly. As I wrote the description, I realized that in the case where the
> > > > node knew that the other side didn't have good data, it could go ahead and
> > > > come up anyway. However, you are correct that there are some circumstances
> > > > where it cannot determine that its data is the right version, and in those
> > > > cases it will refuse to come up in order to avoid compromising data integrity.
> > > >
> > > > > If so, then HA/DRBD will no longer provide what I need.
> > >
> > > Hmmm, this is why I wanted to store a bit more data about state
> > > transitions. For example:
> > >
> > > A is pri, B is sec.
> > >
> > > B crashes, A continues.
> > >
> > > A is shutdown cleanly and then restarted. B is still dead.
> > >
> > > At this point, A could start as pri if it knows:
> > > A was pri and cleanly shutdown.
> > > B was stale (since disconnected before A's last write)
> >
> > This was discussed in other emails. You're right iff no one gave B a manual
> > override while "A" was down. In that case, "B" has the good bits.
> >
> > > > I assumed that the error exits would be a nice idea. You're saying that for
> > > > you they are necessary. I don't see this as a show-stopper. Do you think
> > > > this would meet your needs?
> > >
> > > I dunno, an error exit without more information available to it is going to
> > > have a hard time coming up with the "right answer". And if it is just going
> > > to time out and force the system to come up anyway, well, if you are going
> > > to corrupt the data, why wait 10 minutes?
> >
> > What he was saying was "I know my application and my customers, and I want
> > to bring the system up anyway." For some applications, this is exactly the
> > right thing to do. For others (banking, etc.), this is a mistake. The
> > reason for waiting a few
> > minutes is to let both machines recover and reboot from a power outage. One
> > machine will almost certainly come up first. It should wait a few minutes
> > to allow the other machine to come up, or get diddled by an admin and come
> > up.
> >
> > What if your application is like doubleclick.com? Losing some advertising
> > clickthrough information is certainly less important than not showing
> > advertisements!
>
> Or, your box is in a closet and this failure occurs at 2:00
> a.m. during a holiday weekend.
>
> My concern is for unattended systems.

If you say what you said, then you're also saying "unattended systems where
we don't really care if we have old data because we say old data is better
than no data". Only some applications (probably less than half) can say
that. What you're saying to your customers is "I guarantee uptime. Data
integrity is best effort."
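
The boot-time decision being debated here can be sketched roughly as follows. This is a hypothetical illustration only; the flag names and return values are my assumptions, not drbd's actual implementation:

```python
# Hypothetical sketch of the boot-time decision discussed in this thread.
# The flags and return values are illustrative, not drbd's real code.
def decide_on_boot(was_primary, clean_shutdown, peer_known_stale, peer_reachable):
    # David's case: we were primary, shut down cleanly, and the peer was
    # already stale before our last write -- safe to come up alone.
    if was_primary and clean_shutdown and peer_known_stale:
        return "become-primary"
    # Peer is up: compare notes and resync in whichever direction is needed.
    if peer_reachable:
        return "negotiate-with-peer"
    # Otherwise we cannot prove our data is current; hold for the peer or an
    # operator (or time out, for applications that prefer uptime to integrity).
    return "hold-for-operator"
```

The "wait a few minutes after a power outage" policy is just the last branch with a timeout before giving up on the peer.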

-- Alan Robertson
alanr@example.com
Re: data integrity and drbd [ In reply to ]
On Wed, 22 Nov 2000, Lars Marowsky-Bree wrote:

===================== <snip> =====================================

>
> > - resyncer very slow?
>
> Not _that_ serious, although I would love it if we could get some testing over
> GigE done. If no one has the right hardware, I am sure SuSE will supply David
> with it ;-)

This becomes *very* serious if what Alan said is true:

AlanR-> > Is it possible to detect if QuickSync is sufficient?
AlanR->
AlanR-> A very interesting question.
AlanR->
AlanR-> The most conservative assumption is that every time a node connects which
AlanR-> hasn't been synced in the past, you should do a full sync. Let me show an
AlanR-> example where this is necessary.
AlanR->
AlanR-> A is PRI
AlanR-> B is SEC
AlanR-> A fails.
AlanR-> B takes over.
AlanR-> A reboots -- and MUST have a full sync. Here's why:
AlanR->
AlanR-> A had disk blocks written which B never saw. B went on and overwrote some
AlanR-> but not all of those with different data. Now, unless A undoes all the disk
AlanR-> writes which B never ACKed, it will have an inconsistent set of disk blocks.

My testing showed a 4Gig DRBD disk took about 44 minutes to
sync on a 100M ethernet. That means 44 minutes to failover?
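
As a sanity check on that number (assuming the 4Gig device is 4 GiB and taking Fast Ethernet's nominal 100 Mb/s), the effective throughput works out to a small fraction of wire speed:

```python
# Back-of-the-envelope: effective full-sync throughput vs. 100 Mb/s wire
# speed, using the 4 GiB size and 44-minute duration reported above.
size_bytes = 4 * 1024**3        # 4 GiB DRBD device
sync_seconds = 44 * 60          # observed full-sync time

effective_mb_s = size_bytes / sync_seconds / 1e6
wire_mb_s = 100e6 / 8 / 1e6     # Fast Ethernet, ignoring protocol overhead

print(f"effective: {effective_mb_s:.1f} MB/s")   # ~1.6 MB/s
print(f"wire:      {wire_mb_s:.1f} MB/s")        # 12.5 MB/s
```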

===================== <snip> =====================================

Tony Willoughby
ADC Telecommunications, Inc.
Broadband Access and Transport Group
mailto:tony_willoughby@example.com
Re: data integrity and drbd [ In reply to ]
> My testing showed a 4Gig DRBD disk took about 44 minutes to
> sync on a 100M ethernet. That means 44 minutes to failover?

This can REALLY be slower (3-4x) with a "bad" network card such as a
Realtek, even with a good SCSI HD.
Syncing my mail server's HD (10G) with the normal traffic occurring can take 18
hours!
(Yes, I am planning to change my cards ;*)

But whatever happens, when I start a full sync I know that if my master fails, I
have nothing.
And if I am spammed, my load average can go up to 50, which kills the drbd
connection.

I don't want to be spammed at those moments ;*)

Thomas
Re: data integrity and drbd [ In reply to ]
On Wed, 22 Nov 2000, Alan Robertson wrote:

> tony willoughby wrote:
> >
> > On Tue, 21 Nov 2000, Alan Robertson wrote:
> >
> > > David Gould wrote:
> > > >
> > > > On Tue, Nov 21, 2000 at 01:44:43PM -0700, Alan Robertson wrote:
> > > > > tony willoughby wrote:
> > > > > >
> > > > > > Alan,
> > > > > > Thank you for the crystal clear description.
> > > > > >
> > > > > > This seems like a good approach. I do have one concern though.
> > > > > >
> > > > > Thanks!
> > > > > >
> > > > > > We are using heartbeat/DRBD to achieve some level of high availability.
> > > > > > As I understand this approach, if a node fails hard while the other node
> > > > > > is booting then the booting node will not become operational without
> > > > > > operator intervention.
> > > > > >
> > > > > > Am I correct?
> > > > >
> > > > > Mostly. As I wrote the description, I realized that in the case where the
> > > > > node knew that the other side didn't have good data, it could go ahead and
> > > > > come up anyway. However, you are correct that there are some circumstances
> > > > > where it cannot determine that its data is the right version, and in those
> > > > > cases it will refuse to come up in order to avoid compromising data integrity.
> > > > >
> > > > > > If so, then HA/DRBD will no longer provide what I need.
> > > >
> > > > Hmmm, this is why I wanted to store a bit more data about state
> > > > transitions. For example:
> > > >
> > > > A is pri, B is sec.
> > > >
> > > > B crashes, A continues.
> > > >
> > > > A is shutdown cleanly and then restarted. B is still dead.
> > > >
> > > > At this point, A could start as pri if it knows:
> > > > A was pri and cleanly shutdown.
> > > > B was stale (since disconnected before A's last write)
> > >
> > > This was discussed in other emails. You're right iff no one gave B a manual
> > > override while "A" was down. In that case, "B" has the good bits.
> > >
> > > > > I assumed that the error exits would be a nice idea. You're saying that for
> > > > > you they are necessary. I don't see this as a show-stopper. Do you think
> > > > > this would meet your needs?
> > > >
> > > > I dunno, an error exit without more information available to it is going to
> > > > have a hard time coming up with the "right answer". And if it is just going
> > > > to time out and force the system to come up anyway, well, if you are going
> > > > to corrupt the data, why wait 10 minutes?
> > >
> > > What he was saying was "I know my application and my customers, and I want
> > > to bring the system up anyway." For some applications, this is exactly the
> > > right thing to do. For others (banking, etc.), this is a mistake. The
> > > reason for waiting a few
> > > minutes is to let both machines recover and reboot from a power outage. One
> > > machine will almost certainly come up first. It should wait a few minutes
> > > to allow the other machine to come up, or get diddled by an admin and come
> > > up.
> > >
> > > What if your application is like doubleclick.com? Losing some advertising
> > > clickthrough information is certainly less important than not showing
> > > advertisements!
> >
> > Or, your box is in a closet and this failure occurs at 2:00
> > a.m. during a holiday weekend.
> >
> > My concern is for unattended systems.
>
> If you say what you said, then you're also saying "unattended systems where
> we don't really care if we have old data because we say old data is better
> than no data". Only some applications (probably less than half) can say
> that. What you're saying to your customers is "I guarantee uptime. Data
> integrity is best effort."

I suppose that is true, although I'd like to have my cake and eat it
too. :^)

Can I guarantee non-corrupt (albeit old) data? Would I need to use a
journaling filesystem?

Tony Willoughby
ADC Telecommunications, Inc.
Broadband Access and Transport Group
mailto:tony_willoughby@example.com
Re: data integrity and drbd [ In reply to ]
On 2000-11-22T09:20:03,
tony willoughby <twilloughby@example.com> said:

> > Not _that_ serious, although I would love it if we could get some testing
> > over GigE done. If no one has the right hardware, I am sure SuSE will
> > supply David with it ;-)
>
> This becomes *very* serious if what Alan said is true:
>
>> > Is it possible to detect if QuickSync is sufficient?
>>
>> A very interesting question.
>>
>> The most conservative assumption is that every time a node connects which
>> hasn't been synced in the past, you should do a full sync. Let me show an
>> example where this is necessary.
>>
>> A is PRI B is SEC A fails. B takes over. A reboots -- and MUST have a
>> full sync. Here's why:
>>
>> A had disk blocks written which B never saw. B went on and overwrote some
>> but not all of those with different data. Now, unless A undoes all the
>> disk writes which B never ACKed, it will have an inconsistent set of disk
>> blocks.
>
> My testing showed a 4Gig DRBD disk took about 44 minutes to sync on a 100M
> ethernet. That means 44 minutes to failover?

No, in this case you are in fact dealing with a double failure. We will
detect it ;-), but drbd can't handle that yet.

Yes, I would say a speedup is necessary.

Sincerely,
Lars Marowsky-Brée <lmb@example.com>
Development HA

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl
Re: data integrity and drbd [ In reply to ]
On Wed, 22 Nov 2000, tony willoughby wrote:

<snip>

> I suppose that is true, although I'd like to have my cake and eat it
> too. :^)
>
> Can I guarantee non-corrupt (albeit old) data?

You can use an external disk accessible through shared SCSI or FC, so you
only have one copy of the data (if you use RAID, it's still only one copy from
the point of view of the nodes which are accessing it).

> Would I need to use a journaling filesystem?

Journalling guarantees metadata consistency, which is a different issue.
Re: data integrity and drbd [ In reply to ]
tony willoughby wrote:
>
> On Wed, 22 Nov 2000, Alan Robertson wrote:
>
> > tony willoughby wrote:
[snip]
> > > My concern is for unattended systems.
> >
> > If you say what you said, then you're also saying "unattended systems where
> > we don't really care if we have old data because we say old data is better
> > than no data". Only some applications (probably less than half) can say
> > that. What you're saying to your customers is "I guarantee uptime. Data
> > integrity is best effort."
>
> I suppose that is true, although I'd like to have my cake and eat it
> too. :^)
>
> Can I guarantee non-corrupt (albeit old) data? Would I need to use a
> journaling filesystem?

You get what you get. :-)

This depends on your application, your filesystem, the DRBD modes, mount
modes, etc. Journalling filesystems tend to help. What's your application?

It will be the same as a crash. If you can live with the data from a crash,
then you can live with this. It is the data that was on disk (approximately)
at some point in time. The only question is: when?

This depends on how old the data is. You *can* have your cake and eat it
too, but it might taste a little (or maybe a lot) stale. It might even be
moldy ;-)

-- Alan Robertson
alanr@example.com
Re: data integrity and drbd [ In reply to ]
On Wed, Nov 22, 2000 at 10:56:28AM +0100, Lars Marowsky-Bree wrote:
> On 2000-11-21T15:24:26,
> David Gould <dg@example.com> said:
>
> > Perhaps a little prioritization is in order: currently drbd has some
> > issues that make it marginal for most users:
> >
> > - not SMP stable?
>
> Serious.
>
> > - resyncer very slow?
>
> Not _that_ serious, although I would love it if we could get some testing over
> GigE done. If no one has the right hardware, I am sure SuSE will supply David
> with it ;-)

Network bandwidth is not the problem; the current resync implementation is.
drbd gets only about 3 MB/sec on resync on 100BaseT.
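
At that rate, full-sync time scales linearly with device size; a rough estimate (using the ~3 MB/sec figure above, and wire speed for comparison) shows why large devices hurt:

```python
# Rough full-sync time for a given device size and effective resync
# throughput. The ~3 MB/s figure is from the message above.
def sync_minutes(size_gb, mb_per_s):
    return size_gb * 1000 / mb_per_s / 60

print(f"4 GB at 3 MB/s:    {sync_minutes(4, 3):.0f} min")     # ~22 min
print(f"10 GB at 3 MB/s:   {sync_minutes(10, 3):.0f} min")    # ~56 min
print(f"4 GB at 12.5 MB/s: {sync_minutes(4, 12.5):.0f} min")  # ~5 min (wire speed)
```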

> > - config files difficult to manage for large numbers of drbd devices or
> > many nodes?
>
> Annoying.

More than annoying; manageability is a serious limitation for Linux clustering.

> > - easy to mess up and get primary/secondary confused?
> >
> > - clustermanager interactions iffy?
>
> Dangerous. Will be fixed - as far as I can see - by Alan's proposal I think.

Yes this is great.

> > - does not support serving blocks for multiple nodes, ie for GFS?
>
> Not that serious. This is a nice feature, and I definitely want it - not only
> for GFS but for other apps which can use raw partitions too - but it has about
> the same priority as "more than 2 nodes".

I don't see mirroring across more than two nodes as important, but a decent
block server would be very very useful.

-dg

--
David Gould dg@example.com
SuSE, Inc., 580 2cd St. #210, Oakland, CA 94607 510.628.3380
"I personally think Unix is "superior" because on LSD it tastes
like Blue." -- jbarnett
