Fábio Olivé Leite wrote:
>
> Hi there,
>
> ) ps: suddenly I wonder why we need to reboot the computer ?
>
> :)
>
> ) - it isn't to reset faulty hardware, as we decided not to deal
> ) with hardware faults.
>
> Hmmm... I don't think this has been said anywhere by anyone. Some hardware
> faults might be solved by rebooting, as this will [try to] reset things to
> a known state.
>
> If things can't be solved by rebooting, we can use that feature of reboot
> counting mentioned earlier to halt, instead of rebooting. Or power off,
> ring a bell, whatever.
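A minimal sketch of that reboot-counting idea, as I understand it. The
names (next_action, max_reboots) and the limit of 3 are made up for
illustration, and a real version would need to keep the count somewhere
that survives a reboot:

```python
def next_action(reboot_count, max_reboots=3):
    """Decide what to do after a fault, given how many reboots we already tried.

    Hypothetical policy: keep rebooting while under the limit, then halt.
    """
    if reboot_count < max_reboots:
        return "reboot"
    return "halt"


# Simulate a permanent fault: rebooting never clears it, so we end up halting.
count = 0
actions = []
while True:
    action = next_action(count)
    actions.append(action)
    if action == "halt":
        break
    count += 1  # in practice this counter would live in persistent storage

print(actions)
```

A transient fault would clear after the first reboot and the counter
could be reset; a permanent one runs into the limit and stops the loop.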
>
> What _was_ said, as I recall, is that if you have, for example, memory
> that flips a bit and nobody notices (no parity or ECC), you can't do much
> about it, you will only see the computer misbehaving. Reboot might just
> solve it, unless the fault is not transient.
>
> BEEP! I think the terms transient and permanent, and also silent and
> detected need some explaining. :)
>
> Silent fault -> one that goes undetected. Will cause all kinds of
> unexpected behaviour (which _can_ be detected).
But probably not diagnosed. Whose fault is it? This means
it's a real bear to deal with.
>
> Detected fault -> one that is detected. This can be worked around,
> usually.
>
> Transient fault -> one that happens once in a while. A bit that flips
> because of some power supply fluctuation (spelling?), for example. Once
> corrected or masked, it will not bug again (for a while).
>
> Permanent fault -> a disk with landed heads. A thing that dies and stays
> dead.
>
> Rebooting simplifies the job of finding out which processes (or even the
> kernel) are faulty. Instead of testing every service and running a special
> kernel checker, you just reboot quickly, so whatever it was that was
> faulty will go away.

At the risk of seeming excessively controversial, I'll offer this
comment.

I doubt that HA adds more than 1 or 1.5 digits to the availability of
a machine.
If you have 3 nines (99.9%), it can bring you to 4. If at 4, to 5, etc.
This means that it does the right thing to recover about 9 times in 10,
or maybe 95 times in 100. 1 time in 10 or 20 it screws up. I suspect
that we'd be doing well to do that.

Once we can do that, and we've been in the field a long time, we will
have enough data to figure out what to do to add another digit. I would
claim that we would just be guessing wildly if we tried to figure out
now what to do to add that second additional 9 to an availability figure.

I would suggest we worry about adding a 9, and if we're successful, we
can worry about adding another 9 then.

So... Let's not worry about the .01% failure cases for now.
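
For anyone who wants to check the arithmetic behind the "one or 1.5
digits" figure, here is a rough sketch. It assumes a simple model that
is mine, not measured data: HA masks a fixed fraction of outages
completely, and the rest are as bad as they would have been without HA.
The function names are made up for illustration.

```python
import math


def nines(availability):
    """Number of nines in an availability figure, e.g. 0.999 -> about 3."""
    return -math.log10(1.0 - availability)


def availability_with_ha(base, recovery_success):
    """Availability when HA masks a fraction `recovery_success` of outages.

    Assumed model: unavailability shrinks by the fraction of outages that
    recovery handles correctly; failover time itself is ignored.
    """
    unavailability = (1.0 - base) * (1.0 - recovery_success)
    return 1.0 - unavailability


base = 0.999  # 3 nines without HA
for success in (0.90, 0.95):
    improved = availability_with_ha(base, success)
    gain = nines(improved) - nines(base)
    print(f"recovery {success:.0%}: {nines(improved):.2f} nines (+{gain:.2f})")
```

Under this model, recovering 9 times in 10 adds one nine, and 95 times
in 100 adds about 1.3, which is where the range above comes from.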
-- Alan Robertson
alanr@suse.com