Mailing List Archive

Re: Heartbeat
Hi Michael!

I've forwarded this mail to the ha-dev development mailing list, so that the
various people concerned with Linux-HA development can "hear" our discussion.

Michael Moerz wrote:
>
> Hi!
>
> Long ago I thought about doing something for linux-ha, and now I think
> that I have enough spare time to actually do it.

Great!

> I would think that it is really interesting to write some kind of
> diagnostic program for checking the network connection.

Yes, that's clearly a significant need.

> I assume that this would involve parsing some /proc files, because the kernel
> normally already does a lot of error logging and therefore provides a lot of
> information. Actually I don't know much about the implementation of
> networking in Linux, but I suppose that I will be able to read the code at
> least to get some basic understanding of that topic.
> So I think that I would start out by writing some monitoring for
> Ethernet connections via TCP/IP. Actually I don't understand what you mean
> by "dead ethernet check for serial ports".

I just reread the TODO item, and apparently neither do I :-)

Let me see if I can guess what I meant :-)

There are two possible meanings in my view:
1) The ethernet drivers are supposed to be implementing a check to see if they
are connected to a hub/switch. Alan Cox has more details.
This would allow us to test any and all ethernet interfaces.

2) One could implement checking in the heartbeat code which would tell us
whether or not we had received any packets from a particular interface
(serial or ethernet) in a particular period of time. This would
work for all kinds of links that have heartbeats. For example,
with redundant links, we can see that we are hearing heartbeats
from one interface (serial for example), but not another (ethernet
for example), and notify "someone".
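
As a rough illustration of option 2, here is a minimal Python sketch. It is
not heartbeat's actual code: the class name LinkMonitor, its methods, and the
injectable clock are all hypothetical, and it only assumes that something
calls heartbeat_received() whenever a heartbeat packet arrives on a link.

```python
import time


class LinkMonitor:
    """Remembers when a heartbeat was last heard on each link.

    Hypothetical helper for illustration only; not part of heartbeat.
    """

    def __init__(self, links, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence before a link is suspect
        self.clock = clock          # injectable so the logic can be tested
        now = clock()
        # Treat every link as "just heard" at startup.
        self.last_heard = {link: now for link in links}

    def heartbeat_received(self, link):
        """Record that a heartbeat packet arrived on this link."""
        self.last_heard[link] = self.clock()

    def silent_links(self):
        """Return the links (sorted by name) that have been quiet too long."""
        now = self.clock()
        return sorted(
            link
            for link, heard in self.last_heard.items()
            if now - heard > self.timeout
        )
```

With redundant links, a non-empty result from silent_links() while at least
one other link is still being heard is the case where we would want to notify
"someone".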


> Do you mean some udp/tcp port
> watcher or cua0 watcher? I suppose that should mean some udp/tcp port
> watcher that tries to connect to different ports to see if they are still
> up and working. As far as I know about TCP/IP and UDP, each service has
> its own commands to communicate, so each service will require another
> test program. That should be implemented in a modular way, so that it's
> easy to add services that are not yet supported.
>
> Mon is a good example of a system that is structured into modules, but it's
> written in perl (actually I think that perl is good for shell-scripting
> and some very specialized string-things)
>
> Therefore I don't think that adapting Mon would work for heartbeat, so
> that it tells heartbeat everything about the services that are run on the
> different nodes of the cluster. (actually I mean the time needed would be
> as much as if completely new software were implemented)
> :) Yeah, what I forgot to mention is that one of mon's primary purposes is
> to show status information via a web interface. Since perl is really good at
> creating webpages it's good that it's written in perl.
>
> Now I will stop cursing perl for being a shell-scripting language and move
> on to some other topic.

Of course, these monitors I described could be fed into a "mon" module. Also,
Perl is pretty good at starting independent programs, etc.

> Pinging would be a nice way to find out which node is up or down.
> To let a node know when it's up or down it needs at least 2 IP addresses.
> I would suggest taking the IP address of the other node and the gateway. (
> other node ;) I am just thinking in terms of one primary and one backup
> node since I currently just have two computers at home to play with)
> So if the other node fails, being able to ping the gateway would tell us
> that the network connection still works and that the other node's network
> isn't reliable anymore.

Yes, this would be a good thing. It's a little more tricky than this in
practice, but this would be good infrastructure to have.
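
To make the idea concrete, here is a small Python sketch of the peer-plus-
gateway check. Everything in it is an assumption for illustration: the
function names ping() and diagnose() are hypothetical, the ping flags assume
the Linux iputils "ping" command (-c count, -W timeout in seconds), and the
decision table is deliberately the simple version, without the trickier
cases that show up in practice.

```python
import subprocess


def ping(host, timeout_s=2):
    """Return True if one ICMP echo to host is answered.

    Assumes the Linux iputils ping command is installed and on PATH.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def diagnose(peer_reachable, gateway_reachable):
    """Map the two ping results to a rough diagnosis (simplified)."""
    if peer_reachable and gateway_reachable:
        return "all reachable"
    if peer_reachable:
        return "peer up, gateway unreachable"
    if gateway_reachable:
        # Our own link works, so the problem is on the peer's side.
        return "peer or peer's network down"
    # Nothing answers: most likely our own connection is the problem.
    return "local network suspect"
```

A caller would combine the two, for example
diagnose(ping(peer_addr), ping(gateway_addr)), where peer_addr and
gateway_addr are whatever addresses the cluster is configured with.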

> I think that I will have to come to an end, and I hope that you don't
> think that I am making too much fuss about checking networking. So I hope
> that you like the thought that I am trying to implement some kind of
> diagnostic tool for ha-linux.

I feel that a diagnostic interface would be a good thing.

A really good unanswered question is exactly what to do when you have a
diagnostic failure. Should you just notify "mon", or should you do something
else? Mon is a reasonably good piece of software, and can do other things,
but heartbeat itself might also need knowledge of the failure.

This is why this item (even if it is written up badly) is marked as depending
on the diagnostic framework. To do the job right, we really need a diagnostic
framework (or basic design). However, if you don't feel up to creating a
diagnostic framework, but want to go forward on this anyway, it can be useful
just as a "Mon" module.

-- Alan Robertson
alanr@bell-labs.com