On Tue, 21 Nov 2000, Alan Robertson wrote:
> David Gould wrote:
> >
> > On Tue, Nov 21, 2000 at 01:44:43PM -0700, Alan Robertson wrote:
> > > tony willoughby wrote:
> > > >
> > > > Alan,
> > > > Thank you for the crystal clear description.
> > > >
> > > > This seems like a good approach. I do have one concern though.
> > > >
> > > Thanks!
> > > >
> > > > We are using heartbeat/DRBD to achieve some level of high availability.
> > > > As I understand this approach, if a node fails hard while the other node
> > > > is booting then the booting node will not become operational without
> > > > operator intervention.
> > > >
> > > > Am I correct?
> > >
> > > Mostly. As I wrote the description, I realized that in the case where the
> > > node knew that the other side didn't have good data, it could go ahead and
> > > come up anyway. However, you are correct in that there are some circumstances
> > > where it cannot determine that the data is the right version, and it will then
> > > refuse to come up in order to keep from compromising data integrity.
> > >
> > > > If so, then HA/DRBD will no longer provide what I need.
> >
> > Hmmm, this is why I wanted to store a bit more data about state
> > transitions. For example:
> >
> > A is pri, B is sec.
> >
> > B crashes, A continues.
> >
> > A is shut down cleanly and then restarted. B is still dead.
> >
> > At this point, A could start as pri if it knows:
> > A was pri and cleanly shut down.
> > B was stale (since disconnected before A's last write)
>
> This was discussed in other emails. You're right iff no one gave B a manual
> override while "A" was down. If someone did, "B" has the good bits.
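
The rule discussed above can be sketched roughly as follows. This is only an
illustration in Python, not heartbeat or DRBD code: the NodeState record and
all of its field names are hypothetical, and how a restarting node would
actually learn that its peer received a manual override is an open question
in this thread.

```python
from dataclasses import dataclass


@dataclass
class NodeState:
    """Hypothetical record of what node A remembers across a restart."""
    was_primary: bool           # A was primary before it went down
    clean_shutdown: bool        # A was shut down cleanly
    peer_stale: bool            # B disconnected before A's last write
    peer_manual_override: bool  # an operator forced B up while A was down


def may_start_as_primary(state: NodeState) -> bool:
    """Return True only when A can be sure it holds the good data."""
    # If B was given a manual override while A was down, B has the good bits.
    if state.peer_manual_override:
        return False
    # Otherwise A may start alone only if all three conditions hold.
    return state.was_primary and state.clean_shutdown and state.peer_stale
```

With this sketch, A starts alone only in the A-was-pri, clean-shutdown,
B-was-stale case; any missing piece of knowledge means refusing to come up.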
>
> > > I assumed that the error exits would be a nice idea. You're saying that for
> > > you they are necessary. I don't see this as a show-stopper. Do you think
> > > this would meet your needs?
> >
> > I dunno, an error exit without more information available to it is going to
> > have a hard time coming up with the "right answer". And if it is just going
> > to time out and force the system to come up anyway, welllllll, if you are
> > going to corrupt the data, why wait 10 minutes?
>
> What he was saying was "I know my application, and my customers and I want
> to bring For some applications, this is exactly the right thing to do.
> Others (banking, etc), this is a mistake. The reason for waiting a few
> minutes is to let both machines recover and reboot from a power outage. One
> machine will almost certainly come up first. It should wait a few minutes
> to allow the other machine to come up, or get diddled by an admin and come
> up.
>
> What if your application is like doubleclick.com? Losing some advertising
> clickthrough information is certainly less important than not showing
> advertisements!
Or, your box is in a closet and this failure occurs at 2:00
a.m. during a holiday weekend.
My concern is for unattended systems.
>
> This is a perfect example where you don't want to wait for a human before
> continuing.
>
> -- Alan Robertson
> alanr@example.com
>
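
To make the trade-off concrete, here is a rough Python sketch of "wait a few
minutes for the peer, then let a site policy decide". Everything in it is an
assumption for illustration: the Policy names, the decide_after_wait function,
the peer_seen callback, and the 180-second default are hypothetical and do not
come from heartbeat or DRBD.

```python
import time
from enum import Enum
from typing import Callable


class Policy(Enum):
    INTEGRITY_FIRST = "integrity"        # e.g. banking: never guess
    AVAILABILITY_FIRST = "availability"  # e.g. unattended box in a closet


def decide_after_wait(
    peer_seen: Callable[[], bool],
    policy: Policy,
    wait_seconds: float = 180.0,   # assumed value for "a few minutes"
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Wait for the peer, then fall back to the configured policy."""
    deadline = clock() + wait_seconds
    while clock() < deadline:
        if peer_seen():
            # Both nodes are up; they can work out who has the good data.
            return "negotiate"
        sleep(poll_interval)
    # The peer never appeared; the site policy makes the call.
    if policy is Policy.AVAILABILITY_FIRST:
        return "come_up"
    return "wait_for_operator"
```

An unattended site would configure AVAILABILITY_FIRST and accept the risk of
stale data; a site that cannot tolerate that risk would keep INTEGRITY_FIRST
and wait for a human.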
Tony Willoughby
ADC Telecommunications, Inc.
Broadband Access and Transport Group
mailto:tony_willoughby@example.com