Mailing List Archive

remote reset
Fábio Olivé Leite wrote:
>
> Hi there,
>
> ) ps: suddenly I wonder why we need to reboot the computer?
>
> :)
>
> ) - it isn't to reset faulty hardware, as we decided not to deal
> ) with hardware faults.
>
> Hmmm... I don't think this has been said anywhere by anyone. Some hardware
> faults might be solved by rebooting, as this will [try to] reset things to
> a known state.
>
> If things can't be solved by rebooting, we can use that feature of reboot
> counting mentioned earlier to halt, instead of rebooting. Or power off,
> ring a bell, whatever.
>
> What _was_ said, as I recall, is that if you have, for example, memory
> that flips a bit and nobody notices (no parity or ECC), you can't do much
> about it, you will only see the computer misbehaving. Reboot might just
> solve it, unless the fault is not transient.
>
> BEEP! I think the terms transient and permanent, and also silent and
> detected need some explaining. :)
>
> Silent fault -> one that goes undetected. Will cause all kinds of
> unexpected behaviour (which _can_ be detected).

But probably not diagnosed. Whose fault is it? This means
it's a real bear to deal with.
>
> Detected fault -> one that is detected. This can be worked around,
> usually.
>
> Transient fault -> one that happens once in a while. A bit that flips
> because of some power supply fluctuation (spelling?), for example. Once
> corrected or masked, it will not bug again (for a while).
>
> Permanent fault -> a disk with landed heads. A thing that dies and stays
> dead.
>
> Rebooting simplifies the job of finding out which processes (or even the
> kernel) are faulty. Instead of testing every service and running a special
> kernel checker, you just reboot quickly, so whatever it was that was
> faulty will go away.

At the risk of seeming excessively controversial I'll offer this
comment.

I doubt that HA adds more than one or 1.5 digits to the availability of
a machine.
If you have 3 nines (99.9%), it can bring you to 4. If at 4, to 5, etc.

This means that it does the right thing to recover about 9 times in 10,
or maybe 95 times in 100. 1 time in ten or 20 it screws up. I suspect
that we'd be doing well to do that.
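
To make the arithmetic concrete, here is a minimal Python sketch. It
assumes a deliberately simple model that is not stated in this thread:
HA removes a fixed fraction ("coverage") of the downtime, and failover
itself takes no time. The function names are made up for illustration.

```python
import math

def digits_gained(coverage):
    """Nines added when a fraction `coverage` of failures is recovered cleanly."""
    if not 0.0 <= coverage < 1.0:
        raise ValueError("coverage must be in [0, 1)")
    return -math.log10(1.0 - coverage)

def availability_with_ha(base_availability, coverage):
    """Availability after HA masks `coverage` of the downtime (assumed model)."""
    unavailability = 1.0 - base_availability
    return 1.0 - unavailability * (1.0 - coverage)

for coverage in (0.90, 0.95):
    print(f"coverage {coverage:.2f}: +{digits_gained(coverage):.2f} digits, "
          f"99.9% -> {availability_with_ha(0.999, coverage):.5%}")
```

Under this model, recovering 9 times in 10 adds exactly one digit, and
95 times in 100 adds about 1.3.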

Once we can do that, and we've been in the field a long time, we will
have enough data to figure out what to do to add another digit. I would
claim that we would just be guessing wildly to figure out what to do to
add that second additional 9 to an availability figure.

I would suggest we worry about adding a 9, and if we're successful, we
can worry about adding another 9 then.

So... Let's not worry about the .01% failure cases for now.

-- Alan Robertson
alanr@suse.com
remote reset
dgould@suse.com wrote:
>
> On Fri, May 12, 2000 at 09:16:01AM -0600, Alan Robertson wrote:
> > David Brower wrote:
> >
> > Ahhh... Fencing... at least STONITH... Someone is working on that. I
> > guess it's me.
>
> Alan,
>
> I did take a brief look at some of the devices out there, perhaps I
> could help?

Yes. I looked at several manufacturers. I want a switch that several
machines can connect to, and which allows each machine to have a
separate UPS. Few switches allow this; most only switch a common power
source off and on for several different machines.

The company that VA Linux uses (whose name I've sadly forgotten at the
moment) has a switch which does that, and you can telnet into it. VA
has the code for operating another of their switches over a serial port,
so I suspect it's a good place to start. Unfortunately, their code is
heavily wired into their infrastructure, so it'll mainly serve to
document how to operate the switch. It was pretty simple, though.
Security is an issue with telnet.

-- Alan Robertson
alanr@suse.com
remote reset
> > ) ps: suddenly I wonder why we need to reboot the computer?
> > ) - it isn't to reset faulty hardware, as we decided not to deal
> > ) with hardware faults.
> >
> > Hmmm... I don't think this has been said anywhere by anyone. Some hardware
> > faults might be solved by rebooting, as this will [try to] reset things to
> > a known state.
> >
> > If things can't be solved by rebooting, we can use that feature of reboot
> > counting mentioned earlier to halt, instead of rebooting. Or power off,
> > ring a bell, whatever.
> >
> > What _was_ said, as I recall, is that if you have, for example, memory
> > that flips a bit and nobody notices (no parity or ECC), you can't do much
> > about it, you will only see the computer misbehaving. Reboot might just
> > solve it, unless the fault is not transient.
> >
> > BEEP! I think the terms transient and permanent, and also silent and
> > detected need some explaining. :)
> >
> > Silent fault -> one that goes undetected. Will cause all kinds of
> > unexpected behaviour (which _can_ be detected).
>
> But probably not diagnosed. Whose fault is it? This means
> it's a real bear to deal with.

This, to me, is a significant role of HA software. Fault detection is
only the beginning (or perhaps failure detection, if you accept the
distinction proposed earlier between faults (physical) and failures (user
space)). Following fault detection are two critical steps:

Fault isolation -- try to localize the faulty subsystem, both to minimize
damage caused directly by the fault and to minimize downtime for subsystems
that are unaffected by the fault. It is really useful to have a
resource dependency tree. Knowing which resources depend on each
other helps isolate the faulty subsystem. Having a problem with one
application on a machine doesn't necessarily require rebooting the whole
machine, for example. In my view, heartbeat has taken a stab at
this in haresources, but I don't think it is that extensible.
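
As a rough illustration of what a dependency tree buys you, here is a
minimal Python sketch. The resource names and the DEPENDS_ON table are
hypothetical, and this is not the haresources format; it only shows how
the set of subsystems touched by one fault can be computed, so that
everything else can be left running.

```python
from collections import defaultdict, deque

# Hypothetical resources: each key depends on the resources listed for it.
DEPENDS_ON = {
    "httpd": ["service_ip", "shared_fs"],
    "service_ip": ["eth0"],
    "shared_fs": ["scsi_disk"],
    "mail": ["eth0"],
}

def affected_by(faulty, depends_on):
    """Return every resource that directly or indirectly depends on `faulty`."""
    # Invert the table: for each resource, who depends on it?
    dependents = defaultdict(set)
    for resource, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(resource)
    # Breadth-first walk upward from the faulty resource.
    seen = set()
    queue = deque([faulty])
    while queue:
        current = queue.popleft()
        for resource in dependents[current]:
            if resource not in seen:
                seen.add(resource)
                queue.append(resource)
    return seen

print(sorted(affected_by("scsi_disk", DEPENDS_ON)))
```

In this made-up table a disk fault touches shared_fs and httpd but not
mail, so mail would not need to be restarted or failed over.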

Also, "hardened" drivers really help here...something that can run
card/board level diagnostics routinely and proactively report hardware
oddities/faults.

Fault recovery -- take appropriate actions to recover from the isolated
fault. Here we should have _configurable_ escalation policies.
Examples:
- if kill does not terminate a process (an example mentioned
earlier), escalating to a node reboot may be in order.
- maximum number of rolling boots before a powerdown
- maximum time to wait when using software to request a reboot before a
powerdown
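
The three examples above could be captured in a small policy object.
This is only a sketch in Python: EscalationPolicy, next_action and the
action names are invented here for illustration and do not correspond
to anything in heartbeat.

```python
from dataclasses import dataclass

@dataclass
class EscalationPolicy:
    max_reboots: int = 3            # rolling boots allowed before a powerdown
    reboot_timeout_s: float = 60.0  # wait for a software reboot before a powerdown

def next_action(policy, kill_succeeded, reboot_count, reboot_wait_s=None):
    """Pick the next recovery step for one faulty process on one node.

    reboot_wait_s is how long a requested reboot has been pending,
    or None if no reboot has been requested.
    """
    if reboot_wait_s is not None:
        # A software reboot was requested; give up on it after the timeout.
        if reboot_wait_s > policy.reboot_timeout_s:
            return "powerdown"
        return "wait"
    if kill_succeeded:
        return "restart_process"
    # kill did not terminate the process: escalate to the node level.
    if reboot_count >= policy.max_reboots:
        return "powerdown"
    return "reboot"

policy = EscalationPolicy(max_reboots=2, reboot_timeout_s=30.0)
print(next_action(policy, kill_succeeded=False, reboot_count=0))
```

Keeping the limits in one place is what makes the policy configurable;
the decision logic itself never has to change.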

> > Rebooting simplifies the job of finding out which processes (or even the
> > kernel) are faulty. Instead of testing every service and running a special
> > kernel checker, you just reboot quickly, so whatever it was that was
> > faulty will go away.

You need fault isolation. If you systematically reboot every time you
encounter a fault, you will never know how to improve the availability of
your system. A reboot is costly, as it will force all services to
either be restarted locally or failed over...which == downtime! Rebooting
should be a level in your escalation policy (perhaps the second to last,
with the last step being a poweroff).

> At the risk of seeming excessively controversial I'll offer this
> comment.
>
> I doubt that HA adds more than one or 1.5 digits to the availability of
> a machine.
> If you have 3 nines (99.9%), it can bring you to 4. If at 4, to 5, etc.

Hmmm...I agree with your sentiment...focusing on the areas that will give
us the best improvement in availability first, then tackling the less
likely and more subtle problems later.

I do not agree with your numbers. Each successive nine (3 nines, four
nines, 5 nines) is more difficult than the last...this is not a linear
scale.
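
For a sense of scale, this short Python sketch prints the yearly
downtime budget at each level. It assumes a 365-day year and nothing
else; the function name is made up.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years

def downtime_minutes_per_year(nines):
    """Allowed downtime per year at an availability of `nines` nines."""
    unavailability = 10.0 ** (-nines)
    return MINUTES_PER_YEAR * unavailability

for nines in (3, 4, 5):
    print(f"{nines} nines: {downtime_minutes_per_year(nines):8.2f} minutes/year")
```

That is roughly 8.8 hours, 53 minutes and 5 minutes per year: every
extra nine means removing 90% of whatever downtime is left.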

You are highlighting an important issue. Reliability modelling, the art
of applying well-known statistical methods to discover the aggregate
availability of a complex system, can quickly point at focus areas that
will give you the most bang for your buck in terms of improvements in
availability. For example, we may find that focusing on a heartbeat
protocol is less important than hardened drivers (or vice versa).
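
A toy version of such a model, sketched in Python. The component names
and availability figures are hypothetical placeholders, not measured
data, and the model assumes independent failures with every component
required (a pure series system).

```python
from math import prod

# Hypothetical per-component availabilities; placeholders, not measurements.
COMPONENTS = {
    "heartbeat_protocol": 0.9999,
    "network_driver": 0.999,
    "disk_driver": 0.9995,
}

def series_availability(components):
    """Availability of a system that needs every component up (independent failures)."""
    return prod(components.values())

def gain_from_halving_downtime(components, name):
    """System availability gained by halving one component's unavailability."""
    improved = dict(components)
    improved[name] = 1.0 - (1.0 - components[name]) / 2.0
    return series_availability(improved) - series_availability(components)

# Rank components by how much the whole system gains from improving each one.
ranked = sorted(
    COMPONENTS,
    key=lambda name: gain_from_halving_downtime(COMPONENTS, name),
    reverse=True,
)
print(f"system availability: {series_availability(COMPONENTS):.5f}")
print("best places to invest, in order:", ranked)
```

With real field data in place of the placeholders, the same ranking
would show where the effort pays off first.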

Regardless of what reliability modelling shows, I think the architectural
description Alan is proposing will help us understand heartbeat's goals.
This in turn will help us know what the next step is!

-chris
