Mailing List Archive

Re: Heartbeat Problem
Hi Bill,

Thanks for giving heartbeat a try!

> Subject: Heartbeat Problem
> Date: Tue, 26 Oct 1999 15:15:29 -0500
> From: Bill Bacher <bill_bacher@inlet.com>
>
> Alan,
>
> We're testing Heartbeat on two machines. Both run Apache Web Servers, and the
> intent is they will offer up the same content, load sharing by IP address only.
> We're using heartbeat to have one carry the full load if the other should die
> for some reason. Both are configured to take over for the other under heartbeat
> control.

OK. This should work out fine, as long as you have your own mechanism for
synchronizing your web servers' data.


> We've tried heartbeat 0.4.5 and 0.4.5a and have seen a strange problem with both
> versions. We're seeing the IP address being re-assigned to the other machine
> without being under heartbeat control. With version 0.4.5, we stopped heartbeat
> altogether and still saw the IP addresses shifting between the two machines.
> Running 0.4.5a today, we're seeing IP transfers but nothing is showing up in the
> heartbeat logs indicating it is controlling things. It's almost as if the fake
> part is running on its own.

What do you mean by shifting on its own? Do you mean it's showing up in
ifconfig? If you run ifconfig, save the output, and run it again later, has
the output changed without anything showing up in the heartbeat
(ha-log/ha-debug) logs?
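
One way to run that comparison, as a minimal sketch: it assumes the old-style
"inet addr:" lines that ifconfig prints on these systems, and the names
list_addrs, addrs_changed and the snapshot file paths are made up for
illustration.

```shell
#!/bin/sh
# Sketch only: list_addrs, addrs_changed and the file names are hypothetical.

# Print the addresses found in a saved "ifconfig -a" snapshot, one per
# line, sorted. Assumes the old "inet addr:A.B.C.D" output format.
list_addrs() {
    sed -n 's/.*inet addr:\([0-9.]*\).*/\1/p' "$1" | sort
}

# Succeed (exit 0) if the set of addresses differs between two snapshots.
addrs_changed() {
    before=$(list_addrs "$1")
    after=$(list_addrs "$2")
    [ "$before" != "$after" ]
}

# Typical use (commented out so the sketch has no side effects):
#   ifconfig -a > /tmp/ifconfig.before
#   ... wait a while ...
#   ifconfig -a > /tmp/ifconfig.after
#   addrs_changed /tmp/ifconfig.before /tmp/ifconfig.after && echo "addresses changed"
```

If addrs_changed reports a difference and neither ha-log nor ha-debug has a
matching entry at that time, something other than heartbeat probably moved
the address.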

> We're running Red Hat 6.0, Kernel 2.2.5-15smp on Dell servers with dual Pentium
> II processors. We installed from the rpm.

Glad somebody does ;-)

> Any idea what might be going on? In one of your postings you mention looking at
> 4 logs. What can we be watching besides ha-log, ha-debug, and messages?

Two logs * two machines = 4 logs :-)
>
> Thanks in advance for your assistance.

This is a new one. If you've installed the "fake" package, you shouldn't have.
Heartbeat does it all...

Assuming that's not the case...

I have no idea where this might be happening, so I'll tell you a little about
what is going on so you can verify what part our code might have in it...

Everything related to IP address takeover and giveback takes place in
/etc/ha.d/resource.d/IPaddr. If it didn't happen there, then heartbeat didn't
do it.

In general, all resource scripts look a lot like /etc/rc.d/init.d
startup/shutdown scripts, except some of them (notably IPaddr) require another
argument. When you start up apache, you do this: /etc/rc.d/init.d/httpd start.
To take over an IP address, you do this: /etc/ha.d/resource.d/IPaddr ip-address
start. To give it up, you do the same thing with "stop" instead of "start".
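
As a rough sketch of trying this by hand: the script path is the one given
above, while the wrapper name run_ipaddr and the DRY_RUN variable are
hypothetical conveniences, not part of heartbeat.

```shell
#!/bin/sh
# Sketch only: run_ipaddr and DRY_RUN are hypothetical, not part of heartbeat.
IPADDR=${IPADDR:-/etc/ha.d/resource.d/IPaddr}

# Run the IPaddr resource script by hand:
#   run_ipaddr <ip-address> start|stop
run_ipaddr() {
    case $2 in
        start|stop) ;;
        *) echo "usage: run_ipaddr ip-address start|stop" >&2; return 2 ;;
    esac
    if [ -n "$DRY_RUN" ]; then
        echo "$IPADDR $1 $2"      # show the command instead of running it
    else
        "$IPADDR" "$1" "$2"
    fi
}

# Example (commented out; needs root and a real heartbeat install):
#   run_ipaddr 10.3.67.220 stop
```

Setting DRY_RUN=1 first prints the command line without touching the
interface, which is a safer way to check the arguments.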

If you look at the function ip_start(), you'll see that EVERY time we take
over an IP address, two messages should be logged: "INFO: ifconfig..." and
"Sending Gratuitous Arp for ...". If these don't occur, then there is an
extremely high probability that we didn't perform an IP address takeover.
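
A minimal sketch of that check. It assumes each log entry sits on one line in
the real ha-log file and that the two messages are worded as quoted above;
the helper name takeover_logged is hypothetical.

```shell
#!/bin/sh
# Sketch only: takeover_logged is a hypothetical helper name.

# Succeed if the log contains both takeover messages for the given address:
#   "INFO: ifconfig ... <ip> ..."  and  "Sending Gratuitous Arp for <ip> ..."
#   takeover_logged <logfile> <ip-address>
takeover_logged() {
    log=$1
    ip=$2
    # The surrounding spaces keep 10.3.67.22 from matching 10.3.67.220.
    grep -F "INFO: ifconfig" "$log" | grep -qF " $ip " &&
        grep -qF "Sending Gratuitous Arp for $ip " "$log"
}

# Example (commented out; the log path is the one used in this thread):
#   takeover_logged /var/log/ha-log 10.3.67.220 && echo "heartbeat took it over"
```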

Exactly what did happen to you is harder to say... But it seems unlikely that
we did the dirty deed.

Please let us know what you find out.

Thanks!!

-- Alan Robertson
alanr@bell-labs.com
Re: Heartbeat Problem
Alan,

Thanks for the quick response. Here's some more detail:

Quoting Alan Robertson <alanr@bell-labs.com>:

> Hi Bill,
>
> Thanks for giving heartbeat a try!
>
> > Subject: Heartbeat Problem
> > Date: Tue, 26 Oct 1999 15:15:29 -0500
> > From: Bill Bacher <bill_bacher@inlet.com>
> >
> > Alan,
> >
> > We're testing Heartbeat on two machines. Both run Apache Web Servers, and
> > the intent is they will offer up the same content, load sharing by IP
> > address only. We're using heartbeat to have one carry the full load if the
> > other should die for some reason. Both are configured to take over for the
> > other under heartbeat control.
>
> OK. This should work out fine, as long as you have your own mechanism for
> synchronizing your web servers' data.

We think we have that covered ;-)
>
>
> > We've tried heartbeat 0.4.5 and 0.4.5a and have seen a strange problem
> > with both versions. We're seeing the IP address being re-assigned to the
> > other machine without being under heartbeat control. With version 0.4.5,
> > we stopped heartbeat altogether and still saw the IP addresses shifting
> > between the two machines. Running 0.4.5a today, we're seeing IP transfers
> > but nothing is showing up in the heartbeat logs indicating it is
> > controlling things. It's almost as if the fake part is running on its own.
>
> What do you mean by shifting on its own? Do you mean it's showing up in
> ifconfig? If you run ifconfig, save the output, and run it again later, has
> the output changed without anything showing up in the heartbeat
> (ha-log/ha-debug) logs?

What appears to be happening is that the machine that takes over the IP address
when the first one fails is never actually giving it back up. I had a script on
both machines that ran an ifconfig -a every 30 seconds and dumped it to a log.
When one machine went down, the ha-log showed the second machine taking over and
the ifconfig on that machine showed the same thing. When the down machine came
back up 2 1/2 minutes later, the ha-log on the second machine indicated it saw
the first one back, but ifconfig still showed it servicing the other IP address.
Evidently, in this condition, when you hit the 1st machine either through ssh
or a web browser, it's more or less random which machine actually services the
request, which is what I was interpreting as random IP address capture. At the
risk of making this too long, I've included the logs and the results of ifconfig
at the end of this. The clocks on the two machines are off by a minute or two
if you're cross-referencing time stamps.
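
For anyone who wants to reproduce this kind of logging, here is a minimal
sketch. It is not the actual script used above; the function name
snapshot_loop, the SNAP_CMD variable and the log file path are all
hypothetical.

```shell
#!/bin/sh
# Sketch only: not the script from this thread. snapshot_loop, SNAP_CMD
# and the log file name are hypothetical.
SNAP_CMD=${SNAP_CMD:-"ifconfig -a"}

# Append a timestamp and the interface listing to a log file COUNT times,
# sleeping INTERVAL seconds between snapshots.
#   snapshot_loop <logfile> <interval> <count>
snapshot_loop() {
    i=0
    while [ "$i" -lt "$3" ]; do
        {
            date
            $SNAP_CMD       # unquoted on purpose: command plus arguments
            echo
        } >> "$1"
        i=$((i + 1))
        if [ "$i" -lt "$3" ]; then
            sleep "$2"
        fi
    done
}

# Example (commented out): one snapshot every 30 seconds for 24 hours.
#   snapshot_loop /tmp/ifconfig.log 30 2880
```

Running the same loop on both machines gives two timestamped records to line
up against ha-log, keeping the clock offset between the machines in mind.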
>
> > We're running Red Hat 6.0, Kernel 2.2.5-15smp on Dell servers with dual
> > Pentium II processors. We installed from the rpm.
>
> Glad somebody does ;-)
>
> > Any idea what might be going on? In one of your postings you mention
> > looking at 4 logs. What can we be watching besides ha-log, ha-debug, and
> > messages?
>
> Two logs * two machines = 4 logs :-)
Gotcha.
> >
> > Thanks in advance for your assistance.
>
> This is a new one. If you've installed the "fake" package, you shouldn't
> have. Heartbeat does it all...

Nope, only heartbeat.
>
> Assuming that's not the case...
>
> I have no idea where this might be happening, so I'll tell you a little
> about what is going on so you can verify what part our code might have in
> it...
>
> Everything related to IP address takeover and giveback takes place in
> /etc/ha.d/resource.d/IPaddr. If it didn't happen there, then heartbeat
> didn't do it.
>
> In general, all resource scripts look a lot like /etc/rc.d/init.d
> startup/shutdown scripts, except some of them (notably IPaddr) require
> another argument. When you start up apache, you do this:
> /etc/rc.d/init.d/httpd start. To take over an IP address, you do this:
> /etc/ha.d/resource.d/IPaddr ip-address start. To give it up, you do the
> same thing with "stop" instead of "start".
>
> If you look at the function ip_start(), you'll see that EVERY time we take
> over an IP address, two messages should be logged: "INFO: ifconfig..." and
> "Sending Gratuitous Arp for ...". If these don't occur, then there is an
> extremely high probability that we didn't perform an IP address takeover.
>
> Exactly what did happen to you is harder to say... But it seems unlikely
> that we did the dirty deed.
>
> Please let us know what you find out.
>
> Thanks!!
>
> -- Alan Robertson
> alanr@bell-labs.com
>



---
Bill Bacher
UNIX Network Administrator
McLeodUSA Internetworks
319.790.5056 Phone
319.369.3089 Fax



/var/log/ha-log from 10.3.67.211 (2nd machine):
heartbeat: 1999/10/26_08:56:47 info: ***********************
heartbeat: 1999/10/26_08:56:47 info: Configuration validated. Starting heartbeat.
heartbeat: 1999/10/26_08:56:47 notice: Starting serial heartbeat on tty /dev/ttyS0
heartbeat: 1999/10/26_08:56:47 notice: UDP heartbeat started on port 1001 interface eth0
heartbeat: 1999/10/26_08:56:47 error: Cannot open /proc/ha/.control: No such file or directory
heartbeat: 1999/10/26_09:01:27 info: Heartbeat shutdown in progress.
heartbeat: 1999/10/26_09:01:27 info: Giving up all HA resources.
heartbeat: 1999/10/26_09:01:27 info: All HA resources relinquished.
heartbeat: 1999/10/26_09:01:27 info: Heartbeat shutdown complete.
heartbeat: 1999/10/26_09:03:43 info: ***********************
heartbeat: 1999/10/26_09:03:43 info: Configuration validated. Starting heartbeat.
heartbeat: 1999/10/26_09:03:44 notice: Starting serial heartbeat on tty /dev/ttyS0
heartbeat: 1999/10/26_09:03:44 notice: UDP heartbeat started on port 1001 interface eth0
heartbeat: 1999/10/26_09:03:44 error: Cannot open /proc/ha/.control: No such file or directory
heartbeat: 1999/10/26_16:35:30 warn: node intranet002.iw.mcld.net: is dead
heartbeat: 1999/10/26_16:35:30 INFO: Running /etc/ha.d/rc.d/status status
heartbeat: 1999/10/26_16:35:30 Taking over resource group 10.3.67.220
heartbeat: 1999/10/26_16:35:30 Acquiring resource group: intranet002.iw.mcld.net 10.3.67.220
heartbeat: 1999/10/26_16:35:30 INFO: Running /etc/ha.d/resource.d/IPaddr 10.3.67.220 start
heartbeat: 1999/10/26_16:35:30 INFO: ifconfig eth0:0 10.3.67.220 netmask 255.255.255.128 broadcast 10.3.67.255
heartbeat: 1999/10/26_16:35:30 Sending Gratuitous Arp for 10.3.67.220 on eth0:0 [eth0]
heartbeat: 1999/10/26_16:38:00 notice: node intranet002.iw.mcld.net seq restart 1 vs 13760
heartbeat: 1999/10/26_16:38:00 info: node intranet002.iw.mcld.net: status unknown
heartbeat: 1999/10/26_16:38:00 INFO: Running /etc/ha.d/rc.d/status status

/var/log/ha-debug from 10.3.67.211 (2nd machine):
heartbeat: 1999/10/26_09:15:17 debug: Got an NS_rexmit.
heartbeat: 1999/10/26_09:15:17 debug: Got an NS_rexmit.
heartbeat: 1999/10/26_16:35:30 Running /etc/ha.d/rc.d/status: status
heartbeat: 1999/10/26_16:35:30 Starting /etc/ha.d/resource.d/IPaddr 10.3.67.220 start
heartbeat: 1999/10/26_16:35:40 /etc/ha.d/resource.d/IPaddr 10.3.67.220 start done. RC=0
heartbeat: 1999/10/26_16:38:00 Running /etc/ha.d/rc.d/status: status

Result of script that runs ifconfig -a every 30 seconds on 10.3.67.211 (2nd
machine):
Tue Oct 26 16:35:32 CDT 1999
eth0 Link encap:Ethernet HWaddr 00:20:35:E7:42:89
inet addr:10.3.67.211 Bcast:10.3.67.255 Mask:255.255.255.128
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:58047 errors:0 dropped:0 overruns:0 frame:0
TX packets:17086 errors:0 dropped:0 overruns:40 carrier:18
collisions:4693 txqueuelen:100
Interrupt:19 Base address:0xe4e0

eth0:0 Link encap:Ethernet HWaddr 00:20:35:E7:42:89
inet addr:10.3.67.220 Bcast:10.3.67.255 Mask:255.255.255.128
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
Interrupt:19 Base address:0xe4e0

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:3924 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0

This continued through the night and is still the case:

Wed Oct 27 08:19:52 CDT 1999
eth0 Link encap:Ethernet HWaddr 00:20:35:E7:42:89
inet addr:10.3.67.211 Bcast:10.3.67.255 Mask:255.255.255.128
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1896892 errors:0 dropped:0 overruns:0 frame:0
TX packets:964717 errors:0 dropped:0 overruns:44 carrier:189
collisions:1161224 txqueuelen:100
Interrupt:19 Base address:0xe4e0

eth0:0 Link encap:Ethernet HWaddr 00:20:35:E7:42:89
inet addr:10.3.67.220 Bcast:10.3.67.255 Mask:255.255.255.128
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
Interrupt:19 Base address:0xe4e0

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:3924 Metric:1
RX packets:2 errors:0 dropped:0 overruns:0 frame:0
TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0



/var/log/ha-log on 10.3.67.220 (1st machine):
heartbeat: 1999/10/26_08:55:19 info: ***********************
heartbeat: 1999/10/26_08:55:19 info: Configuration validated. Starting heartbeat.
heartbeat: 1999/10/26_08:55:19 notice: Starting serial heartbeat on tty /dev/ttyS0
heartbeat: 1999/10/26_08:55:19 notice: UDP heartbeat started on port 1001 interface eth0
heartbeat: 1999/10/26_08:55:19 error: Cannot open /proc/ha/.control: No such file or directory
heartbeat: 1999/10/26_09:00:10 warn: node freeweb.iw.mcld.net: is dead
heartbeat: 1999/10/26_09:00:10 INFO: Running /etc/ha.d/rc.d/status status
heartbeat: 1999/10/26_09:00:10 Taking over resource group 10.3.67.211
heartbeat: 1999/10/26_09:00:10 Acquiring resource group: freeweb.iw.mcld.net 10.3.67.211
heartbeat: 1999/10/26_09:00:10 INFO: Running /etc/ha.d/resource.d/IPaddr 10.3.67.211 start
heartbeat: 1999/10/26_09:00:10 INFO: ifconfig eth0:0 10.3.67.211 netmask 255.255.255.128 broadcast 10.3.67.255
heartbeat: 1999/10/26_09:00:10 Sending Gratuitous Arp for 10.3.67.211 on eth0:0 [eth0]
heartbeat: 1999/10/26_09:02:17 notice: node freeweb.iw.mcld.net seq restart 1 vs 141
heartbeat: 1999/10/26_09:02:17 info: node freeweb.iw.mcld.net: status unknown
heartbeat: 1999/10/26_09:02:17 INFO: Running /etc/ha.d/rc.d/status status
heartbeat: 1999/10/26_09:12:21 warn: node freeweb.iw.mcld.net: is dead
heartbeat: 1999/10/26_09:12:21 INFO: Running /etc/ha.d/rc.d/status status
heartbeat: 1999/10/26_09:12:21 Taking over resource group 10.3.67.211
heartbeat: 1999/10/26_09:13:50 error: 49 lost packet(s) for [freeweb.iw.mcld.net] [298:348]
heartbeat: 1999/10/26_09:13:50 info: node freeweb.iw.mcld.net: status unknown
heartbeat: 1999/10/26_09:13:50 INFO: Running /etc/ha.d/rc.d/status status
heartbeat: 1999/10/26_16:33:56 info: Heartbeat shutdown in progress.
heartbeat: 1999/10/26_16:33:56 info: Giving up all HA resources.
heartbeat: 1999/10/26_16:33:56 info: All HA resources relinquished.
heartbeat: 1999/10/26_16:33:56 info: Heartbeat shutdown complete.
heartbeat: 1999/10/26_16:36:27 info: ***********************
heartbeat: 1999/10/26_16:36:27 info: Configuration validated. Starting heartbeat.
heartbeat: 1999/10/26_16:36:27 notice: Starting serial heartbeat on tty /dev/ttyS0
heartbeat: 1999/10/26_16:36:27 notice: UDP heartbeat started on port 1001 interface eth0
heartbeat: 1999/10/26_16:36:27 error: Cannot open /proc/ha/.control: No such file or directory

/var/log/ha-debug from 10.3.67.220 (1st machine):
heartbeat: 1999/10/26_09:00:10 Running /etc/ha.d/rc.d/status: status
heartbeat: 1999/10/26_09:00:10 Starting /etc/ha.d/resource.d/IPaddr 10.3.67.211 start
heartbeat: 1999/10/26_09:00:20 /etc/ha.d/resource.d/IPaddr 10.3.67.211 start done. RC=0
heartbeat: 1999/10/26_09:02:17 Running /etc/ha.d/rc.d/status: status
heartbeat: 1999/10/26_09:12:21 Running /etc/ha.d/rc.d/status: status
heartbeat: 1999/10/26_09:13:50 debug: Got an NS_rexmit.
heartbeat: 1999/10/26_09:13:50 debug: Got an NS_rexmit.
heartbeat: 1999/10/26_09:13:51 Running /etc/ha.d/rc.d/status: status

ifconfig -a on 10.3.67.220 (1st machine)
eth0 Link encap:Ethernet HWaddr 00:90:27:3A:80:C4
inet addr:10.3.67.220 Bcast:10.3.67.255 Mask:255.255.255.128
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:50719 errors:0 dropped:0 overruns:0 frame:0
TX packets:32010 errors:0 dropped:0 overruns:0 carrier:17
collisions:7294 txqueuelen:100
Interrupt:17 Base address:0xdce0

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:3924 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0