Mailing List Archive: heartbeat still terminates under heavy load

heartbeat still terminates under heavy load

May 17, 2000, 12:47 AM

Post #1 of 5 (1153 views)

Hello

Trying heartbeat-0.4.7b under 2.2.15 with heavy load it still terminates:

May 16 13:10:28 florix /usr/lib/heartbeat/heartbeat[1396]: node florix: is dead
May 16 13:10:28 florix /usr/lib/heartbeat/heartbeat[1396]: No local heartbeat. Forcing shutdown.
May 16 13:10:31 florix heartbeat: INFO: Running /etc/ha.d/rc.d/status status
May 16 13:10:35 florix /usr/lib/heartbeat/heartbeat[1394]: Heartbeat shutdown in progress.
May 16 13:10:36 florix /usr/lib/heartbeat/heartbeat[20067]: Giving up all HA resources.
May 16 13:10:42 florix /usr/lib/heartbeat/heartbeat[20067]: All HA resources relinquished.
May 16 13:10:42 florix /usr/lib/heartbeat/heartbeat[1394]: Heartbeat shutdown complete.

Some additional information: I ran this test on the secondary node.
The machine is a dual PIII-450 with software RAID 5 spread across five disks.

The load was around 34 when heartbeat gave up. However, the disk usage at that
time was very high (copying lots of small files via ftp locally). Besides,
does anyone know how to get disk/filesystem statistics in Linux, to see how
busy the disks/filesystems are?

But maybe this behaviour is correct after all: if the disk is that busy,
heartbeat must conclude that the disk, i.e. the node, is dead. Still, I think
that with nice_failback care should be taken that heartbeat does not terminate.
Giving away the resources is okay, but terminating is not.

Another thing I noticed is that two heartbeat processes are not locked
into memory:

20132 ttyS0 SL 0:00 /usr/lib/heartbeat/heartbeat
20134 ttyS0 SL 0:00 /usr/lib/heartbeat/heartbeat
20135 ttyS0 SL 0:00 /usr/lib/heartbeat/heartbeat
20136 ttyS0 S 0:00 /usr/lib/heartbeat/heartbeat
20137 ttyS0 SL 0:00 /usr/lib/heartbeat/heartbeat
20138 ttyS0 S 0:00 /usr/lib/heartbeat/heartbeat

Is this correct?

Then I still have two cosmetic wishes for heartbeat (I know Christmas
is still far away, but still ... ;-))

- When heartbeat starts, it should always print out its version number.

- In the logfile, don't always print out the full path of heartbeat, e.g.:

May 16 08:18:23 florix /usr/lib/heartbeat/heartbeat[1389]: Configuration ...

Just

May 16 08:18:23 florix heartbeat[1389]: Configuration ...

This would make reading the log files much easier.

Looking at /proc/<heartbeat-proc-id>/fd directory I notice that all of them
have /etc/ha.d/haresources open. Maybe this is another candidate for
close on exec.
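
The /proc check described above can be scripted. What follows is only a minimal sketch in Python, assuming a Linux /proc filesystem; the helper names open_files and holders_of are made up for illustration and are not part of heartbeat.

```python
import os


def open_files(pid="self"):
    """Map fd number -> link target for one process, via /proc/<pid>/fd.

    Assumes Linux; reading another user's process needs sufficient privileges.
    """
    fd_dir = "/proc/%s/fd" % pid
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return result


def holders_of(path, pid="self"):
    """Return the fd numbers in process `pid` whose target is exactly `path`."""
    return [fd for fd, target in open_files(pid).items() if target == path]
```

Calling holders_of("/etc/ha.d/haresources", pid) for each heartbeat PID would then show which of the processes still hold the file open.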

Holger

heartbeat still terminates under heavy load [ In reply to ]

olive at conectiva

May 17, 2000, 5:45 AM

Post #2 of 5 (1142 views)

Hi there,

) Looking at /proc/<heartbeat-proc-id>/fd directory I notice that all of them
) have /etc/ha.d/haresources open. Maybe this is another candidate for
) close on exec.

Or even closing it right after reading it into memory, since, as it is a
configuration file that does not get updated automagically by heartbeat,
it should not be kept open. Of course, correct me if there is a reason for
such behaviour. :)
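
To make the two options concrete, here is a rough sketch in Python (heartbeat itself is written in C, so this shows the pattern only, not heartbeat's code). load_config, set_close_on_exec and is_close_on_exec are hypothetical names chosen for this example.

```python
import fcntl


def load_config(path):
    """Read the whole file into memory; the descriptor is closed on return."""
    with open(path) as f:
        return f.read().splitlines()


def set_close_on_exec(fd):
    """Mark fd so the kernel closes it automatically across exec()."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


def is_close_on_exec(fd):
    """Return True if the close-on-exec flag is set on fd."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)
```

Closing after the read fixes the leak at its source. Close-on-exec only keeps the descriptor out of programs started via exec(); children created by a plain fork() still inherit it.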

[]!
Fábio
( Fábio Olivé Leite -* ConectivaLinux *- olive@conectiva.com[.br] )
( PPGC/UFRGS MSc candidate -*- Advisor: Taisy Silva Weber )
( Linux - Distributed Systems - Fault Tolerance - Security - /etc )

heartbeat still terminates under heavy load [ In reply to ]

alanr at suse

May 17, 2000, 6:02 AM

Post #3 of 5 (1145 views)

Holger Kiehl wrote:
>
> Hello
>
> Trying heartbeat-0.4.7b under 2.2.15 with heavy load it still terminates:
>
> May 16 13:10:28 florix /usr/lib/heartbeat/heartbeat[1396]: node florix: is dead
> May 16 13:10:28 florix /usr/lib/heartbeat/heartbeat[1396]: No local heartbeat. Forcing shutdown.
> May 16 13:10:31 florix heartbeat: INFO: Running /etc/ha.d/rc.d/status status
> May 16 13:10:35 florix /usr/lib/heartbeat/heartbeat[1394]: Heartbeat shutdown in progress.
> May 16 13:10:36 florix /usr/lib/heartbeat/heartbeat[20067]: Giving up all HA resources.
> May 16 13:10:42 florix /usr/lib/heartbeat/heartbeat[20067]: All HA resources relinquished.
> May 16 13:10:42 florix /usr/lib/heartbeat/heartbeat[1394]: Heartbeat shutdown complete.
>
> Some additional information: I ran this test on the secondary node.
> The machine is a dual PIII-450 with software RAID 5 spread across five disks.
>
> The load was around 34 when heartbeat gave up. However, the disk usage at that
> time was very high (copying lots of small files via ftp locally). Besides,
> does anyone know how to get disk/filesystem statistics in Linux, to see how
> busy the disks/filesystems are?
>
> But maybe this behaviour is correct after all: if the disk is that busy,
> heartbeat must conclude that the disk, i.e. the node, is dead. Still, I think
> that with nice_failback care should be taken that heartbeat does not terminate.
> Giving away the resources is okay, but terminating is not.
as such a dumb resource management model.

> Another thing I noticed is that two heartbeat processes are not locked
> into memory:
>
> 20132 ttyS0 SL 0:00 /usr/lib/heartbeat/heartbeat
> 20134 ttyS0 SL 0:00 /usr/lib/heartbeat/heartbeat
> 20135 ttyS0 SL 0:00 /usr/lib/heartbeat/heartbeat
> 20136 ttyS0 S 0:00 /usr/lib/heartbeat/heartbeat
> 20137 ttyS0 SL 0:00 /usr/lib/heartbeat/heartbeat
> 20138 ttyS0 S 0:00 /usr/lib/heartbeat/heartbeat
>
> Is this correct?

If 'L' means locked in memory, then it looks like a bug. Those should
be the read processes. I call make_realtime() in the read processes,
too, so I'm a little confused about why this is happening. I don't
appear to call make_normaltime() in them either. Are there any
suspicious messages in the logs? Could you start
/usr/lib/heartbeat/heartbeat with the -d option and send me the logs?
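
One possible cross-check on the 'L' flag: on Linux kernels that report a VmLck line in /proc/<pid>/status, that line gives the amount of memory the process has locked. The sketch below is Python, under that assumption; locked_kb and read_locked_kb are illustrative names, not heartbeat functions.

```python
def locked_kb(status_text):
    """Return the VmLck value in kB from /proc/<pid>/status text.

    Returns None when the text has no VmLck line (for example, kernel threads
    or kernels that do not report the field).
    """
    for line in status_text.splitlines():
        if line.startswith("VmLck:"):
            # The line looks like "VmLck:       128 kB".
            return int(line.split()[1])
    return None


def read_locked_kb(pid):
    """Read /proc/<pid>/status and return the locked memory in kB, or None."""
    with open("/proc/%d/status" % pid) as f:
        return locked_kb(f.read())
```

A process that has locked its pages should report a non-zero value, while one showing plain 'S' in ps would be expected to report 0.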

>
> Then I still have two cosmetic wishes for heartbeat (I know Christmas
> is still far away, but still ... ;-))
>
> - When heartbeat starts, it should always print out its version number.
>
> - In the logfile, don't always print out the full path of heartbeat, e.g.:
>
> May 16 08:18:23 florix /usr/lib/heartbeat/heartbeat[1389]: Configuration ...
>
> Just
>
> May 16 08:18:23 florix heartbeat[1389]: Configuration ...
>
> This would make reading the log files much easier.

This makes sense. I'm not sure where syslog gets this name, but I just
made a change which might help.
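
For reference, with the standard syslog API the tag printed before [pid] is the ident string handed to openlog(), so passing the basename of argv[0] rather than the full path gives the shorter form. A small sketch using Python's syslog module, which wraps the C calls; tag_from_argv0 and open_log are hypothetical helpers, and LOG_DAEMON is an arbitrary facility chosen for the example, not necessarily what heartbeat uses.

```python
import os
import syslog


def tag_from_argv0(argv0):
    """Strip the directory part so the log shows 'heartbeat', not the full path."""
    return os.path.basename(argv0)


def open_log(argv0):
    """Open the system log with a short tag and the PID appended to each entry."""
    syslog.openlog(
        ident=tag_from_argv0(argv0),
        logoption=syslog.LOG_PID,
        facility=syslog.LOG_DAEMON,  # assumption: facility picked for illustration
    )
```

After open_log("/usr/lib/heartbeat/heartbeat"), messages sent with syslog.syslog() would be tagged heartbeat[pid].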

> Looking at /proc/<heartbeat-proc-id>/fd directory I notice that all of them
> have /etc/ha.d/haresources open. Maybe this is another candidate for
> close on exec.
>

Maybe even for an fclose ;-). I fixed it in CVS.

Thanks!

-- Alan Robertson
alanr@suse.com

heartbeat still terminates under heavy load [ In reply to ]

lclaudio at conectiva

May 17, 2000, 7:44 AM

Post #4 of 5 (1150 views)

Hi!

> The load was around 34 when heartbeat gave up. However, the disk usage at that
> time was very high (copying lots of small files via ftp locally).

Under such a heavy load I sometimes see some misbehaviour on
my Linux box. Rik van Riel (who subscribes to linux-ha-dev) has
released "FairSched" and "Anti memory hogs" patches for the
kernel... maybe they can help you. Riel?

Luis

[ Luis Claudio R. Goncalves lclaudio@conectiva.com.br ]
[. BSc in Computer Science -- MSc coming soon -- Gospel User -- Linuxer ]
[. Fault Tolerance - Real-Time - Distributed Systems - IECLB - IS 40:31 ]
[. LateNite Programmer -- Jesus Is The Solid Rock On Which I Stand -- ]

heartbeat still terminates under heavy load [ In reply to ]

crispin at wirex

May 17, 2000, 9:20 AM

Post #5 of 5 (1137 views)

Holger Kiehl wrote:

> Besides,
> does anyone know how to get disk/filesystem statistics in Linux, to see how
> busy the disks/filesystems are?

vmstat gives you a bunch of statistics, including block I/O rates and CPU utilizations. Example:
here's vmstat for my mostly idle workstation ("vmstat 5" means report every 5 seconds). The
I/O spike in the middle is me doing wc on my inbox.

/home/crispin[17] vmstat 5
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 0  0  0  33676   1684   1064  43160   1   1    21     5   14    15  10  15  75
 1  0  0  33676   1716   1064  43128   0   0     0     1  391   574   5  12  83
 3  0  0  33676   1712   1064  43128   0   0     0     1  364   526   4  13  83
 2  0  0  33676    852    592  44392   5   0  1003     1 2378   529  17  26  57
 1  0  0  33676   1332    180  44428   0   0   765     0 1891   497  17  22  61
 1  0  0  33676   1636    156  44156   0   0    12     1  438   633   5  16  79
 1  0  0  33676   1700    156  44092   0   0     0     0  367   540   4  11  85
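
If you want to pick out the busy intervals automatically, a report like this one is easy to parse. Below is a minimal sketch in Python, assuming the column layout shown here (other vmstat versions print different columns); parse_vmstat and busiest_interval are names invented for the example, and SAMPLE is the report from this message.

```python
SAMPLE = """\
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 33676 1684 1064 43160 1 1 21 5 14 15 10 15 75
1 0 0 33676 1716 1064 43128 0 0 0 1 391 574 5 12 83
3 0 0 33676 1712 1064 43128 0 0 0 1 364 526 4 13 83
2 0 0 33676 852 592 44392 5 0 1003 1 2378 529 17 26 57
1 0 0 33676 1332 180 44428 0 0 765 0 1891 497 17 22 61
1 0 0 33676 1636 156 44156 0 0 12 1 438 633 5 16 79
1 0 0 33676 1700 156 44092 0 0 0 0 367 540 4 11 85
"""


def parse_vmstat(text):
    """Parse vmstat output into a list of {column name: int} dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if header is None:
            # The column-name row is the first one naming both bi and bo.
            if "bi" in fields and "bo" in fields:
                header = fields
            continue
        # Keep only all-numeric rows that match the header width.
        if len(fields) == len(header) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(header, (int(f) for f in fields))))
    return rows


def busiest_interval(rows):
    """Return the sample with the highest combined block I/O (bi + bo)."""
    return max(rows, key=lambda row: row["bi"] + row["bo"])
```

Run against SAMPLE, busiest_interval(parse_vmstat(SAMPLE)) picks the row with bi = 1003, i.e. the wc spike.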

Crispin
-----
Crispin Cowan, CTO, WireX Communications, Inc. http://wirex.com
Free Hardened Linux Distribution: http://immunix.org
JOBS! http://immunix.org/jobs.html