Mailing List Archive: Linux-HA diagnostics thoughts

Hi Jacob,

I've CCed this message to the development list, since it is good to discuss some
of these things with others interested in heartbeat development. I also changed
the subject line.

jacob.rief@tis.at wrote:
>
> Hello Alan,
>
> Alan Robertson wrote:
> > This is in the TODO list as "hardware diagnostics". The high-reliability
> > low-level communications is still desirable, if for no other reason than to
> > ensure a smooth transition.
>
> Probably it would be enough if heartbeat started a script every N seconds
> which itself returns a status value. If the returned value is non-0,
> heartbeat considers its own host as failed and hands over to the failover
> host. I would suggest putting another variable into ha.cf which is the path
> to that script. To check the interface card and cable I would put a few
> ping commands into that shell script, and if all of them fail the host is
> considered non-working. Other people could check whether the temperature
> of the CPU is too high or invent any other reason why a node should consider
> itself failed.
> If you could point me to the functions to modify, I could do the work if
> you want.

My concern is that failing the node is the *last* thing you want to do. It is
usually better to do an ifconfig down and ifconfig up on the interface to fix it
if you can. If a server process has died, then restarting it is better than
failover. Handling diagnostics properly is much more complex (in my view) than
simply declaring failure.
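
To make that ordering concrete, here is a minimal sketch in Python of "try to
repair locally, fail over only as a last resort". Nothing like this exists in
heartbeat today; the names diagnose, ping_ok and bounce_interface are
hypothetical, and the sketch assumes ping and ifconfig are on the PATH.

```python
import subprocess
from typing import Callable, Sequence


def diagnose(check: Callable[[], bool],
             repairs: Sequence[Callable[[], None]]) -> str:
    """Run a check; on failure try each repair in order before giving up.

    Returns "ok", "recovered", or "failover" (hypothetical status names).
    """
    if check():
        return "ok"
    for repair in repairs:
        try:
            repair()
        except Exception:
            # A repair that itself fails just means we try the next one.
            continue
        if check():
            return "recovered"
    # Every local repair was tried and the check still fails.
    return "failover"


def ping_ok(host: str) -> bool:
    """Send one ICMP echo request; exit status 0 means a reply arrived."""
    result = subprocess.run(["ping", "-c", "1", host],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def bounce_interface(name: str) -> None:
    """Take the interface down and bring it back up (usually needs root)."""
    subprocess.run(["ifconfig", name, "down"], check=True)
    subprocess.run(["ifconfig", name, "up"], check=True)
```

A caller might use it as
diagnose(lambda: ping_ok("192.168.1.1"), [lambda: bounce_interface("eth0")]),
where the address and interface name are placeholders. Only a "failover"
result would justify declaring the node failed.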

Also, some kinds of diagnostics need to be filtered so that false alarms are not
generated. Other kinds are 100% reliable. This kind of configurability (and
more) is necessary for really making these things work in the real world.
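
One simple way to filter a noisy diagnostic, sketched in Python under my own
assumptions, is to raise an alarm only after a configurable number of
consecutive failures. A 100% reliable diagnostic would use a threshold of 1.
The class name FilteredCheck and its threshold parameter are hypothetical and
not part of heartbeat.

```python
class FilteredCheck:
    """Suppress false alarms by requiring consecutive failures."""

    def __init__(self, threshold: int = 1) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.failures = 0  # consecutive failures seen so far

    def report(self, passed: bool) -> bool:
        """Record one result; return True if an alarm should be raised."""
        if passed:
            # Any success resets the streak, so isolated glitches are ignored.
            self.failures = 0
            return False
        self.failures += 1
        return self.failures >= self.threshold
```

With a threshold of 3, two failures followed by a success raise nothing, and
the alarm fires only on the third failure in a row.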

There isn't a framework in heartbeat for this at this point. There has been
some discussion of this, but no one has started it yet. I have been thinking
about hooking into Mon for this work.

-- Alan Robertson
alanr@bell-labs.com