Heartbeat giving up resources on high load
Hello

I still have the problem that heartbeat gives up its resources under
high load. During a test on my secondary node (florix), heartbeat wrote
the following to the logfile:

Jul 18 17:29:25 florix heartbeat[1050]: WARN: node florix: is dead
Jul 18 17:29:25 florix heartbeat[1050]: ERROR: No local heartbeat. Forcing shutdown.
Jul 18 17:29:25 florix heartbeat[1050]: info: Node florix: status active
Jul 18 17:29:25 florix heartbeat[1048]: info: Heartbeat shutdown in progress.
Jul 18 17:29:25 florix heartbeat[25949]: info: Giving up all HA resources.
Jul 18 17:29:28 florix heartbeat: info: Running /etc/ha.d/rc.d/status status
Jul 18 17:29:31 florix heartbeat: info: /usr/lib/heartbeat/mach_down: nice_failback: acquiring foreign resources
Jul 18 17:29:35 florix heartbeat[25949]: info: All HA resources relinquished.
Jul 18 17:29:36 florix heartbeat[1048]: info: Heartbeat shutdown complete.

This test was done with heartbeat 0.4.8 under SMP Linux 2.2.15. I notice
that some of the heartbeat processes are not locked into memory (no 'L'
in the STAT column):

USER       PID %CPU %MEM  VSZ  RSS TTY   STAT START TIME COMMAND
root     10680  0.2  0.2 1300  720 ttyS0 SL   06:49 0:00 /usr/lib/heartbeat/heartbeat
root     10682  0.0  0.2 1300  724 ttyS0 SL   06:49 0:00 /usr/lib/heartbeat/heartbeat
root     10683  0.0  0.2 1300  708 ttyS0 SL   06:49 0:00 /usr/lib/heartbeat/heartbeat
root     10684  0.0  0.2 1296  712 ttyS0 S    06:49 0:00 /usr/lib/heartbeat/heartbeat
root     10685  0.0  0.2 1300  708 ttyS0 SL   06:49 0:00 /usr/lib/heartbeat/heartbeat
root     10686  0.0  0.2 1296  704 ttyS0 S    06:49 0:00 /usr/lib/heartbeat/heartbeat

Could this be the problem?

Or could it be due to the very high disk load? Lots of very small files
are being written to disk and then deleted again. I am already using
software RAID5 spread across 5 disks to speed up the disk I/O, which did
help a lot. I notice heartbeat makes use of FIFOs. Does anyone know
whether FIFOs are affected by disk performance, i.e. will FIFOs block
when the disks are very busy?
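
My understanding so far -- and I may be wrong -- is that a FIFO's data
passes through an in-kernel pipe buffer rather than the filesystem, so
busy disks shouldn't block it directly; a writer only stalls once that
buffer fills because the reader cannot keep up. As a quick experiment
(the path here is hypothetical), a FIFO can be opened so that writes
fail instead of blocking:

#include <stdio.h>
#include <fcntl.h>

int main(void)
{
    /* Hypothetical FIFO path, for illustration only. */
    int fd = open("/tmp/example.fifo", O_WRONLY | O_NONBLOCK);
    if (fd < 0)
        perror("open");  /* ENXIO here just means no reader has it open */
    return 0;
}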

I plan to use this system as a file distribution system that can see
very high loads at certain times. If heartbeat terminates itself on one
of the nodes, the other node will take over the load, but will then
terminate as well, and my system will be dead to the outside world.

What can I do so that this does not happen?

Thanks,
Holger
Heartbeat giving up resources on high load [ In reply to ]
Holger Kiehl wrote:
>
> Hello
>
> I still have the problem that heartbeat gives up its resources under
> high load. During a test on my secondary node (florix), heartbeat wrote
> the following to the logfile:
>
> Jul 18 17:29:25 florix heartbeat[1050]: WARN: node florix: is dead
> Jul 18 17:29:25 florix heartbeat[1050]: ERROR: No local heartbeat. Forcing shutdown.
> Jul 18 17:29:25 florix heartbeat[1050]: info: Node florix: status active
> Jul 18 17:29:25 florix heartbeat[1048]: info: Heartbeat shutdown in progress.
> Jul 18 17:29:25 florix heartbeat[25949]: info: Giving up all HA resources.
> Jul 18 17:29:28 florix heartbeat: info: Running /etc/ha.d/rc.d/status status
> Jul 18 17:29:31 florix heartbeat: info: /usr/lib/heartbeat/mach_down: nice_failback: acquiring foreign resources
> Jul 18 17:29:35 florix heartbeat[25949]: info: All HA resources relinquished.
> Jul 18 17:29:36 florix heartbeat[1048]: info: Heartbeat shutdown complete.
>
> This test was done with heartbeat 0.4.8 under SMP Linux 2.2.15. I notice
> that some of the heartbeat processes are not locked into memory (no 'L'
> in the STAT column):
>
> USER       PID %CPU %MEM  VSZ  RSS TTY   STAT START TIME COMMAND
> root     10680  0.2  0.2 1300  720 ttyS0 SL   06:49 0:00 /usr/lib/heartbeat/heartbeat
> root     10682  0.0  0.2 1300  724 ttyS0 SL   06:49 0:00 /usr/lib/heartbeat/heartbeat
> root     10683  0.0  0.2 1300  708 ttyS0 SL   06:49 0:00 /usr/lib/heartbeat/heartbeat
> root     10684  0.0  0.2 1296  712 ttyS0 S    06:49 0:00 /usr/lib/heartbeat/heartbeat
> root     10685  0.0  0.2 1300  708 ttyS0 SL   06:49 0:00 /usr/lib/heartbeat/heartbeat
> root     10686  0.0  0.2 1296  704 ttyS0 S    06:49 0:00 /usr/lib/heartbeat/heartbeat
>
> Could this be the problem?
>
> Or could it be due to the very high disk load? Lots of very small files
> are being written to disk and then deleted again. I am already using
> software RAID5 spread across 5 disks to speed up the disk I/O, which did
> help a lot. I notice heartbeat makes use of FIFOs. Does anyone know
> whether FIFOs are affected by disk performance, i.e. will FIFOs block
> when the disks are very busy?
>
> I plan to use this system as a file distribution system that can see
> very high loads at certain times. If heartbeat terminates itself on one
> of the nodes, the other node will take over the load, but will then
> terminate as well, and my system will be dead to the outside world.
>
> What can I do so that this does not happen?

I have also observed that heartbeat processes don't always lock
themselves into memory successfully, even though the return code from
the system call always indicates success :-( I suppose I need to look
into this some more.
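
For reference, a minimal sketch of how a daemon typically pins itself in
core, assuming mlockall(2) is the call in question here. One thing worth
checking: POSIX memory locks are not inherited across fork(2), so any
child forked after a successful call would show up unlocked even though
the parent's mlockall() returned success -- which might explain the
mixed STAT flags in the ps listing above.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* Lock all current *and* future pages into RAM so the process
     * never has to be paged in under load. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");  /* any failure would be reported here */
        return 1;
    }
    /* NOTE: a child created with fork() after this point does NOT
     * inherit the lock; it must call mlockall() again itself. */
    return 0;
}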

In CVS there is a version of heartbeat which implements a new feature
"warntime" which might be of some help in your situation. The idea is
that you set your failover time (deadtime) up quite a bit higher, and
then set warntime to a value more like what you'd like to set deadtime
to. Then whenever a heartbeat packet comes in later than warntime, it
reports this fact, along with *how* late the heartbeat was. If you test
it under load (as you've been doing), this should give you a pretty good
idea of how low you can safely set your deadtime value.
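
By way of illustration only -- the directive names follow the CVS
feature described above, but the values here are invented for the
example -- the relevant lines in /etc/ha.d/ha.cf might look like this:

keepalive 2    # send a heartbeat every 2 seconds
warntime 10    # warn when a heartbeat arrives more than 10 seconds late
deadtime 60    # declare the peer dead only after 60 seconds of silence

Each warning also reports how late the packet actually was, which is
what makes the tuning possible.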

-- Alan Robertson
alanr@suse.com
Heartbeat giving up resources on high load [ In reply to ]
On Wed, 19 Jul 2000, Alan Robertson wrote:

> In CVS there is a version of heartbeat which implements a new feature
> "warntime" which might be of some help in your situation. The idea is
> that you set your failover time (deadtime) up quite a bit higher, and
> then set warntime to a value more like what you'd like to set deadtime
> to. Then whenever a heartbeat packet comes in later than warntime, it
> reports this fact, along with *how* late the heartbeat was. If you test
> it under load (as you've been doing), this should give you a pretty good
> idea of how low you can safely set your deadtime value.
>
Yes, with warntime I was able to determine a suitable deadtime (in my
case 60). A nice side effect is that you can now see when your machine
is very busy.

Thanks,
Holger
Heartbeat giving up resources on high load [ In reply to ]
Holger Kiehl wrote:
>
> On Wed, 19 Jul 2000, Alan Robertson wrote:
>
> > In CVS there is a version of heartbeat which implements a new feature
> > "warntime" which might be of some help in your situation. The idea is
> > that you set your failover time (deadtime) up quite a bit higher, and
> > then set warntime to a value more like what you'd like to set deadtime
> > to. Then whenever a heartbeat packet comes in later than warntime, it
> > reports this fact, along with *how* late the heartbeat was. If you test
> > it under load (as you've been doing), this should give you a pretty good
> > idea of how low you can safely set your deadtime value.
> >
> Yes, with warntime I was able to determine a suitable deadtime (in my
> case 60). A nice side effect is that you can now see when your machine
> is very busy.

Oooohhh! 60 seconds is a *long* time. Something is still really wrong
here. Of course, it *must* be something besides heartbeat ;-)

-- Alan Robertson
alanr@suse.com
Heartbeat giving up resources on high load [ In reply to ]
60 seconds may *not* be all that unreasonable under certain
circumstances, in my experience with similar systems. Holger, look at
your system especially for:
- paging activity.
- interrupt-level processing, e.g., a high rate of I/O and/or network
interrupts.
- other fixed-priority processes that may be spinning.

Also, do you have any programs installed that do processing at interrupt
level? Sometimes, apps such as firewalls or the like have a tendency to do
lots of work at interrupt level, wreaking havoc with process scheduling.
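
A rough way to watch for each of these while the box is under load,
using standard tools (nothing heartbeat-specific):

# paging: watch the si/so columns (swap-in/swap-out); sustained
# non-zero values under load mean real paging pressure
vmstat 5

# interrupt-level load: sample the per-device counters and compare
cat /proc/interrupts; sleep 5; cat /proc/interrupts

# spinning processes: see what is hogging the CPU (sort by %CPU)
ps aux | sort -rnk3 | head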

These have been the opinions of:
Peter R. Badovinatz -- (503)578-5530 (TL 775)
Clusters and High Availability, Beaverton, OR
wombat@us.ibm.com
and in no way should be construed as official opinion of IBM, Corp., my
email id notwithstanding.
Heartbeat giving up resources on high load [ In reply to ]
On Mon, 24 Jul 2000 wombat@us.ibm.com wrote:

>
> 60 seconds may *not* be all that unreasonable under certain
> circumstances, in my experience with similar systems. Holger, look at
> your system especially for:
>
At first I also thought this was a lot, but on second thought, I would
rather accept a one-minute downtime than risk the possibility of both
heartbeats going away. I think the default deadtime of 10 is a bit low
on servers that do a lot of I/O.

> - paging activity.
>
I haven't checked this. Next time I will monitor it. But how do I know
what counts as a lot of paging activity?

> - interrupt-level processing, e.g., a high rate of I/O and/or network
> interrupts.
>
I am currently only testing and have a very high disk I/O rate (with
peaks of 1600 files/second). Later, when these boxes become operational,
these files will arrive via the network, but at a lower rate.

> - other fixed-priority processes that may be spinning.
>
> Also, do you have any programs installed that do processing at interrupt
> level? Sometimes, apps such as firewalls or the like have a tendency to do
> lots of work at interrupt level, wreaking havoc with process scheduling.
>
None that I know of.

Holger