Hi Alan, ha-developers,
I toyed around with the 0.45 release last week, and thought I would report
my results. Mostly, I am very happy with it; however, I did run into a
few snags of varying seriousness.
1. "/etc/rc.d/init.d/heartbeat restart" only stops heartbeat; it does
not actually start it again.
2. If heartbeat has failed over to the backup machine, and then the
heartbeat on the backup machine is cleanly stopped, the backup keeps
the resource even though it claims to have relinquished it (i.e., it
still has the IP address it took over from the original host).
3. One of my test machines is a laptop with a PCMCIA ethernet card.
When I yanked the card out, heartbeat failed over to the other
machine just fine, but when I put my NIC back in, the alias
interface was not recreated. Heartbeat was running on my laptop
the entire time, and was attempting to send out heartbeats on the
interface that no longer existed.
While I can see that laptops are unlikely HA hardware, I can
foresee using PCMCIA cards as hot-swappable devices. Something to
think about, though I can understand a response of "not our problem."
4. Situation: two machines, "good" and "bad". Bad has a failing disk, which
corrupts a few files on its filesystem. Good is the primary, bad is
the backup. On startup, good does not successfully grab the resource.
However, killing the heartbeat on good causes bad to successfully
take over. Restarting the heartbeat on good causes bad to relinquish,
but again good unsuccessfully attempts to take the resource.
Here's the typical sort of log on good:
heartbeat: 1999/10/14_15:24:53 info: ***********************
heartbeat: 1999/10/14_15:24:53 info: Configuration validated. Starting heartbeat.
heartbeat: 1999/10/14_15:24:53 notice: UDP heartbeat started on port 1001 interface eth0
heartbeat: 1999/10/14_15:24:53 error: Cannot open /proc/ha/.control: No such file or directory
heartbeat: 1999/10/14_15:24:59 warn: node bad.int.wirex.com: is dead
heartbeat: 1999/10/14_15:24:59 INFO: Running /etc/ha.d/rc.d/status status
and then nothing.
Occasionally, I would see something like this in good's logs when bad
would start up:
heartbeat: 1999/10/14_14:32:04 info: ***********************
heartbeat: 1999/10/14_14:32:04 info: Configuration validated. Starting heartbeat.
heartbeat: 1999/10/14_14:32:04 notice: UDP heartbeat started on port 1001 interface eth0
heartbeat: 1999/10/14_14:32:04 error: Cannot open /proc/ha/.control: No such file or directory
heartbeat: 1999/10/14_14:32:14 error: string2msg: no MSG_START
heartbeat: 1999/10/14_14:32:14 error: Bad message is: [9^FÞÍ]
Nothing of interest was found in the debug log.
All of this went away as soon as I stopped using bad, and started
using a third machine with good.
In this instance, the software failed me. It was unable to detect
that one of my machines was mildly insane; furthermore, given the way the
problems manifested themselves, I was starting to believe the problem
was with the machine good, until I looked in bad's /var/log/messages
and saw disk errors.
Also note that I am using md5 authentication, and good did not complain
about bad's packets failing authentication (these would have shown up
in the debug log). Which prompts me to ask: what is the security model
behind the authentication scheme? What sort of threats are you
attempting to prevent by using it?
If necessary, I can recreate the situation with bad, though I don't
have a lot of time that I can allocate to it.
5. Oh yeah, the proc module does not compile under 2.0.36, which is what
all the machines in my testbed are running.
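Regarding item 1, the behaviour I expected from "restart" is simply a stop
followed by a start. A minimal sketch of what I mean is below; it is not the
init script shipped with 0.45, and hb_ctl, start_hb and stop_hb are
placeholder names I made up to stand in for whatever the real script does.

```shell
#!/bin/sh
# Hypothetical sketch only, not the heartbeat 0.45 init script.
# start_hb and stop_hb are placeholders for the real start/stop logic.

start_hb() { echo "starting heartbeat"; }
stop_hb()  { echo "stopping heartbeat"; }

hb_ctl() {
    case "$1" in
        start)   start_hb ;;
        stop)    stop_hb ;;
        # restart must run both halves: stop first, then start again
        restart) stop_hb; start_hb ;;
        *)       echo "Usage: hb_ctl {start|stop|restart}" >&2; return 1 ;;
    esac
}
```

An init script built this way would end with something like hb_ctl "$1", so
that "restart" runs both steps rather than only the stop.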
Hope this is of use to you. Let me know if I can provide you with more
information. Thanks.
Steve