Mailing List Archive

2 problems about HeartBeat
Hi all,

I'm using heartbeat-0.4.6 on two RedHat 6.1 P-III 500 machines.
It works well most of the time, but I ran into two problems
a few days ago.

The first problem: I use both the serial port ttyS0 and ethernet
eth1 as heartbeat links. After I unplug both the serial link
and the ethernet link, the two machines cannot reach each other,
so each takes over the other's IP. But when I plug the two links
back in, neither machine releases the IP it took over. As a
result, both machines hold both IPs.

I think the code to handle this case is missing.
I'm reading the heartbeat source code and will try to solve this
problem myself.
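
As a rough illustration of the kind of logic I think is missing, here
is a minimal sketch. It is not taken from the heartbeat source: the
function name, the ownership table, and the second address 123.4.5.7
are all hypothetical, and it assumes each resource has one configured
owner that both nodes agree on.

```python
def resources_to_release(held, owned_by, local_node, peer_alive):
    """Return the resources this node should give back to its peer.

    held       -- resources this node currently holds (hypothetical list)
    owned_by   -- maps each resource to its configured owner (assumption)
    local_node -- name of the node running this check
    peer_alive -- True once heartbeats from the peer are seen again
    """
    if not peer_alive:
        # Peer is still silent: keep everything that was taken over.
        return []
    # Peer is back: give up whatever is not ours by configuration.
    return [r for r in held if owned_by[r] != local_node]


# Both nodes hold both IPs after the links were unplugged.
owned_by = {"123.4.5.6": "hostname1", "123.4.5.7": "hostname2"}
held = ["123.4.5.6", "123.4.5.7"]
to_release = resources_to_release(held, owned_by, "hostname1", True)
```

With this rule, each node would give back only the address that
belongs to the other node once the links are restored.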

The second problem is stranger: after heartbeat had run for
about 20 days, one machine began to write some /proc info,
but after it logged
"heartbeat: 2000/02/06_14:13:14 error: Cannot open /proc/ha/.control: No
such file or directory",
it did not log anything more (normally, it would log some MSG stats
information).
From then on, the other machine could not reach this machine, and its
heartbeat took over both IPs, while this machine still held only
one IP.

This happened only once; I haven't been able to reproduce it since.
It seems that the faulty machine's heartbeat entered an infinite
loop or an indefinite wait state.
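
To catch this if it happens again, a small check of the log could
flag when heartbeat goes silent. The sketch below assumes only the
timestamp format seen in my logs; the 25-hour threshold is my own
guess, based on the stats lines appearing about once a day, and the
function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp prefix seen in the logs, e.g.
# "heartbeat: 2000/02/06_14:13:14 error: ..."
STAMP = re.compile(r"^heartbeat: (\d{4}/\d{2}/\d{2}_\d{2}:\d{2}:\d{2}) ")


def last_timestamp(lines):
    """Return the time of the last timestamped log line, or None."""
    last = None
    for line in lines:
        m = STAMP.match(line)
        if m:  # wrapped continuation lines carry no timestamp; skip them
            last = datetime.strptime(m.group(1), "%Y/%m/%d_%H:%M:%S")
    return last


def is_silent(lines, now, max_gap=timedelta(hours=25)):
    """True if no log entry is found or the newest is older than max_gap."""
    last = last_timestamp(lines)
    return last is None or now - last > max_gap


sample = [
    "heartbeat: 2000/02/06_14:13:14 error: Cannot open /proc/ha/.control: No",
    "such file or directory",
]
```

For real use, the lines would be read from the log file and `now`
would be the current time.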

Does anyone have any ideas? Thank you very much.

I have not installed the /proc module. The logs follow:

heartbeat: 2000/02/04_14:13:13 error: Cannot open /proc/ha/.control: No
such file or directory
heartbeat: 2000/02/04_14:13:25 info: MSG stats: 100/691214 age 2
[pid2756/CONTROL]
heartbeat: 2000/02/04_14:13:25 info: ha_malloc stats: 2100/15206716
88000/49002 [pid2756/CONTROL]
heartbeat: 2000/02/04_14:13:25 info: RealMalloc stats: 89216 total
malloc bytes. pid 2756/CONTROL]
heartbeat: 2000/02/04_14:13:25 info: MSG stats: 0/2764799 age 0
[pid2759/MST_STATUS]
heartbeat: 2000/02/04_14:13:25 info: ha_malloc stats: 0/50457479 0/0
[pid2759/MST_STATUS]
heartbeat: 2000/02/04_14:13:25 info: RealMalloc stats: 1696 total malloc
bytes. pid 2759/MST_STATUS]
heartbeat: 2000/02/04_14:13:25 info: MSG stats: 0/691214 age 2
[pid2760/HBWRITE]
heartbeat: 2000/02/04_14:13:25 info: ha_malloc stats: 0/15206716 0/0
[pid2760/HBWRITE]
heartbeat: 2000/02/04_14:13:25 info: RealMalloc stats: 1216 total malloc
bytes. pid 2760/HBWRITE]
heartbeat: 2000/02/04_14:13:25 info: MSG stats: 0/1382371 age 0
[pid2761/HBREAD]
heartbeat: 2000/02/04_14:13:25 info: ha_malloc stats: 0/30412174 0/0
[pid2761/HBREAD]
heartbeat: 2000/02/04_14:13:25 info: RealMalloc stats: 1216 total malloc
bytes. pid 2761/HBREAD]
heartbeat: 2000/02/05_14:13:13 error: Cannot open /proc/ha/.control: No
such file or directory
heartbeat: 2000/02/05_14:13:26 info: MSG stats: 100/734415 age 1
[pid2756/CONTROL]
heartbeat: 2000/02/05_14:13:26 info: ha_malloc stats: 2100/16157138
88000/49003 [pid2756/CONTROL]
heartbeat: 2000/02/05_14:13:26 info: RealMalloc stats: 89216 total
malloc bytes. pid 2756/CONTROL]
heartbeat: 2000/02/05_14:13:26 info: MSG stats: 0/2937602 age 0
[pid2759/MST_STATUS]
heartbeat: 2000/02/05_14:13:26 info: ha_malloc stats: 0/53611142 0/0
[pid2759/MST_STATUS]
heartbeat: 2000/02/05_14:13:26 info: RealMalloc stats: 1696 total malloc
bytes. pid 2759/MST_STATUS]
heartbeat: 2000/02/05_14:13:26 info: MSG stats: 0/734415 age 1
[pid2760/HBWRITE]
heartbeat: 2000/02/05_14:13:26 info: ha_malloc stats: 0/16157138 0/0
[pid2760/HBWRITE]
heartbeat: 2000/02/05_14:13:26 info: RealMalloc stats: 1216 total malloc
bytes. pid 2760/HBWRITE]
heartbeat: 2000/02/05_14:13:26 info: MSG stats: 0/1468773 age 0
[pid2761/HBREAD]
heartbeat: 2000/02/05_14:13:26 info: ha_malloc stats: 0/32313018 0/0
[pid2761/HBREAD]
heartbeat: 2000/02/05_14:13:26 info: RealMalloc stats: 1216 total malloc
bytes. pid 2761/HBREAD]
heartbeat: 2000/02/06_14:13:14 error: Cannot open /proc/ha/.control: No
such file or directory
heartbeat: 2000/02/07_00:47:24 info: Heartbeat shutdown in progress.
heartbeat: 2000/02/07_00:47:24 info: Giving up all HA resources.
heartbeat: 2000/02/07_00:47:30 Releasing resource group: hostname1
123.4.5.6 daemon1 daemon2
heartbeat: 2000/02/07_00:47:32 INFO: Running /etc/ha.d/resource.d/IPaddr
123.4.5.6 stop
heartbeat: 2000/02/07_00:47:34 IP Address 123.4.5.6 released
heartbeat: 2000/02/07_00:47:36 INFO: Running
/etc/ha.d/resource.d/daemon1 stop
heartbeat: 2000/02/07_00:47:38 INFO: Running
/etc/ha.d/resource.d/daemon2 stop
heartbeat: 2000/02/07_00:47:38 info: All HA resources relinquished.
heartbeat: 2000/02/07_00:47:38 info: Heartbeat shutdown complete.
heartbeat: 2000/02/07_00:51:00 info: ***********************
heartbeat: 2000/02/07_00:51:00 info: Configuration validated. Starting
heartbeat.
heartbeat: 2000/02/07_00:51:00 error: Creating FIFO
/var/run/heartbeat-fifo.
heartbeat: 2000/02/07_00:51:00 notice: UDP heartbeat started on port
1002 interface eth0
heartbeat: 2000/02/07_00:51:00 error: Cannot open /proc/ha/.control: No
such file or directory
heartbeat: 2000/02/07_00:51:02 info: Requesting our resources.
heartbeat: 2000/02/07_00:51:03 INFO: Running /etc/ha.d/resource.d/IPaddr
123.4.5.6 status
heartbeat: 2000/02/07_00:51:03 INFO: Running /etc/ha.d/rc.d/ip-request
ip-request
heartbeat: 2000/02/07_00:51:04 INFO: Running
/etc/ha.d/rc.d/ip-request-resp ip-request-resp
heartbeat: 2000/02/07_00:51:04 received ip-request-resp 123.4.5.6 OK
heartbeat: 2000/02/07_00:51:04 Acquiring resource group: hostname1
123.4.5.6 daemon1 daemon2
heartbeat: 2000/02/07_00:51:04 INFO: Running
/etc/ha.d/resource.d/daemon2 start
heartbeat: 2000/02/07_00:51:05 INFO: Running
/etc/ha.d/resource.d/daemon1 start
heartbeat: 2000/02/07_00:51:05 INFO: Running /etc/ha.d/resource.d/IPaddr
123.4.5.6 start
heartbeat: 2000/02/07_00:51:06 INFO: ifconfig eth1:0 123.4.5.6 netmask
255.255.255.240 broadcast 202.106.169.255
heartbeat: 2000/02/07_00:51:06 Sending Gratuitous Arp for 123.4.5.6 on
eth1:0 [eth1]
heartbeat: 2000/02/07_01:02:09 info: Heartbeat shutdown in progress.
heartbeat: 2000/02/07_01:02:09 info: Giving up all HA resources.
heartbeat: 2000/02/07_01:02:09 Releasing resource group: hostname1
123.4.5.6 daemon1 daemon2
heartbeat: 2000/02/07_01:02:09 INFO: Running /etc/ha.d/resource.d/IPaddr
123.4.5.6 stop
heartbeat: 2000/02/07_01:02:09 IP Address 123.4.5.6 released
heartbeat: 2000/02/07_01:02:09 INFO: Running
/etc/ha.d/resource.d/daemon1 stop
heartbeat: 2000/02/07_01:02:10 INFO: Running
/etc/ha.d/resource.d/daemon2 stop
heartbeat: 2000/02/07_01:02:10 info: All HA resources relinquished.
heartbeat: 2000/02/07_01:02:10 info: Heartbeat shutdown complete.
heartbeat: 2000/02/07_01:07:46 info: ***********************
heartbeat: 2000/02/07_01:07:46 info: Configuration validated. Starting
heartbeat.
heartbeat: 2000/02/07_01:07:46 notice: UDP heartbeat started on port
1002 interface eth0
heartbeat: 2000/02/07_01:07:46 error: Cannot open /proc/ha/.control: No
such file or directory
heartbeat: 2000/02/07_01:07:48 info: Requesting our resources.
heartbeat: 2000/02/07_01:07:48 INFO: Running /etc/ha.d/resource.d/IPaddr
123.4.5.6 status
heartbeat: 2000/02/07_01:07:48 INFO: Running /etc/ha.d/rc.d/ip-request
ip-request
heartbeat: 2000/02/07_01:07:49 INFO: Running
/etc/ha.d/rc.d/ip-request-resp ip-request-resp
heartbeat: 2000/02/07_01:07:49 received ip-request-resp 123.4.5.6 OK
heartbeat: 2000/02/07_01:07:49 Acquiring resource group: hostname1
123.4.5.6 daemon1 daemon2
heartbeat: 2000/02/07_01:07:49 INFO: Running
/etc/ha.d/resource.d/daemon2 start
heartbeat: 2000/02/07_01:07:50 INFO: Running
/etc/ha.d/resource.d/daemon1 start
heartbeat: 2000/02/07_01:07:50 INFO: Running /etc/ha.d/resource.d/IPaddr
123.4.5.6 start
heartbeat: 2000/02/07_01:07:50 INFO: ifconfig eth1:0 123.4.5.6 netmask
255.255.255.240 broadcast 202.106.169.255
heartbeat: 2000/02/07_01:07:50 Sending Gratuitous Arp for 123.4.5.6 on
eth1:0 [eth1]
heartbeat: 2000/02/08_01:07:47 error: Cannot open /proc/ha/.control: No
such file or directory
heartbeat: 2000/02/08_01:07:48 info: MSG stats: 100/43203 age 1
[pid1317/CONTROL]
heartbeat: 2000/02/08_01:07:48 info: ha_malloc stats: 2100/950466
88000/48978 [pid1317/CONTROL]
heartbeat: 2000/02/08_01:07:48 info: RealMalloc stats: 89136 total
malloc bytes. pid 1317/CONTROL]
heartbeat: 2000/02/08_01:07:48 info: MSG stats: 0/172809 age 1
[pid1320/MST_STATUS]
heartbeat: 2000/02/08_01:07:48 info: ha_malloc stats: 0/3153771 0/0
[pid1320/MST_STATUS]
heartbeat: 2000/02/08_01:07:48 info: RealMalloc stats: 960 total malloc
bytes. pid 1320/MST_STATUS]
heartbeat: 2000/02/08_01:07:48 info: MSG stats: 0/43203 age 1
[pid1321/HBWRITE]
heartbeat: 2000/02/08_01:07:48 info: ha_malloc stats: 0/950466 0/0
[pid1321/HBWRITE]
heartbeat: 2000/02/08_01:07:48 info: RealMalloc stats: 1136 total malloc
bytes. pid 1321/HBWRITE]
2 problems about HeartBeat
Qiming Liang wrote:
>
> Hi all,
>
> I'm using heartbeat-0.4.6 on two RedHat 6.1 P-III 500 machines.
> It works well most of the time, but I ran into two problems
> a few days ago.
>
> The first problem: I use both the serial port ttyS0 and ethernet
> eth1 as heartbeat links. After I unplug both the serial link
> and the ethernet link, the two machines cannot reach each other,
> so each takes over the other's IP. But when I plug the two links
> back in, neither machine releases the IP it took over. As a
> result, both machines hold both IPs.

This is a known behavior of the heartbeat code. For testing, the proper way
to verify correct behavior is to shut down the daemons on one machine.


> I think the code to handle this case is missing.
> I'm reading the heartbeat source code and will try to solve this
> problem myself.

It is described in the TODO list. If you fix it, please base your fix
on the CVS repository and send me the patch.

> The second problem is stranger: after heartbeat had run for
> about 20 days, one machine began to write some /proc info,
> but after it logged
> "heartbeat: 2000/02/06_14:13:14 error: Cannot open /proc/ha/.control: No
> such file or directory",
> it did not log anything more (normally, it would log some MSG stats
> information).
> From then on, the other machine could not reach this machine, and its
> heartbeat took over both IPs, while this machine still held only
> one IP.
>
> This happened only once; I haven't been able to reproduce it since.
> It seems that the faulty machine's heartbeat entered an infinite
> loop or an indefinite wait state.

I would suggest upgrading to 0.4.6c. It has a few bugs fixed, and in
particular, if a machine's heart stops beating it will shut down, and the
other node will take over. There are also some problems with time jumps fixed
in the code which are of use in some situations. I know it's marked "bleeding
edge", but in reality, it's the best release with no known drawbacks over
0.4.6.


> Does anyone have any ideas? Thank you very much.
>
> I have not installed the /proc module. The logs follow:

Could you upgrade and let me know how it goes?

Thanks!

-- Alan Robertson
alanr@bell-labs.com