Mailing List Archive

strange behavior in heartbeat children
Hi there,

After days of chasing this bug, Luis Claudio and I found that some scripts,
and especially one small program, behave strangely when called by heartbeat.

When we got tired of seeing messages from heartbeat that "datadisk
something start" had failed, and even "httpd startup failed", we started
debugging some initscripts and drbd's datadisk script as well.

It turns out that anything that uses the "action" and "daemon" functions
defined in /etc/rc.d/init.d/functions ends up running a little program
called initlog, that runs a command and logs its output.

When run from heartbeat, this program works fine but seems to always
return 255 (or -1, if that suits you better), which causes the action or
daemon call to return that value. We use the function "action" in drbd's
datadisk, so it seems to always fail when run from heartbeat.

Can anybody confirm this, or even shed some light on it? I browsed the
sources of this initlog program, and there are lots of points where it
returns -1 when it finds an error. That shows up as 255 when the exit
status is read as an unsigned char.

What shouldn't be happening is having it work fine and return 0 when run
by hand in a shell, and return 255 when run by heartbeat... I feel really
messed up after trying to figure this out for a whole afternoon, so I
can't go on exploring this now... :)

See ya!
Fábio
( Fábio Olivé Leite -* ConectivaLinux *- olive@conectiva.com[.br] )
( PPGC/UFRGS MSc candidate -*- Advisor: Taisy Silva Weber )
( Linux - Distributed Systems - Fault Tolerance - Security - /etc )
strange behavior in heartbeat children
On 2000-05-23T18:37:27,
Fábio Olivé Leite <olive@conectiva.com.br> said:

> It turns out that anything that uses the "action" and "daemon" functions
> defined in /etc/rc.d/init.d/functions ends up running a little program
> called initlog, that runs a command and logs its output.

On a slightly different topic, those functions are probably pretty Red Hat
specific ;-) If you are using them, it may be better to ship a compatible
set of functions with heartbeat instead, to make the ha.d scripts
platform-generic.

Sincerely,
Lars Marowsky-Brée <lmb@suse.de>
Development HA

--
Perfection is our goal, excellence will be tolerated. -- J. Yahl
strange behavior in heartbeat children
Lars Marowsky-Bree wrote:
>
> On 2000-05-23T18:37:27,
> Fábio Olivé Leite <olive@conectiva.com.br> said:
>
> > It turns out that anything that uses the "action" and "daemon" functions
> > defined in /etc/rc.d/init.d/functions ends up running a little program
> > called initlog, that runs a command and logs its output.
>
> On a slightly different topic, those functions are probably pretty Red Hat
> specific ;-) If you are using them, it may be better to instead provide a
> compatible set of functions with heartbeat instead, to make the ha.d scripts
> platform generic.

I experienced the same problem on RedHat with the standard RedHat
scripts. I took a not-too-deep look into it but couldn't find a reason.
The services are actually started, but for some reason the script doesn't
return 0.

juri

--
juri.haberland@innominate.de
innominate AG
networking people
phone: +49-30-308806-45 fax: -77 web: http://innominate.de
strange behavior in heartbeat children
Hi there,

) On a slightly different topic, those functions are probably pretty Red Hat
) specific ;-) If you are using them, it may be better to instead provide a
) compatible set of functions with heartbeat instead, to make the ha.d scripts
) platform generic.

Yep... that's why drbd's datadisk is smart enough to provide its own
action routine if it can't find one somewhere else. :)

This morning I delved into it a little more, and saw that when initlog is
called by heartbeat, somehow its children get reaped without notice, so
the waitpid at the end of the RunCommand function (of initlog) returns an
error, which gets propagated until the final exit(). The error is ECHILD
("No child processes"), which is very strange, considering everything works
fine when called by hand in a shell, and we can see the child is really
spawned fine.

The only thing I can think of is that heartbeat may be changing its signal
dispositions to suit its needs, and those changes affect its children as
well. Maybe all signal handlers should be set to default before exec'ing...

Alan? Can you enlighten us on this? :)

If you want, I can provide some "incriminating" straces... heheh.

Cheers!
Fábio
( Fábio Olivé Leite -* ConectivaLinux *- olive@conectiva.com[.br] )
( PPGC/UFRGS MSc candidate -*- Advisor: Taisy Silva Weber )
( Linux - Distributed Systems - Fault Tolerance - Security - /etc )
strange behavior in heartbeat children
Fábio Olivé Leite wrote:
>
> Hi there,
>
> ) On a slightly different topic, those functions are probably pretty Red Hat
> ) specific ;-) If you are using them, it may be better to instead provide a
> ) compatible set of functions with heartbeat instead, to make the ha.d scripts
> ) platform generic.
>
> Yep... that's why drbd's datadisk is smart enough to provide its own
> action routine if it can't find one somewhere else. :)
>
> This morning I delved into it a little more, and saw that when initlog is
> called by heartbeat, somehow its children get reaped without notice, so
> the waitpid in the end of the RunCommand function (of initlog) returns an
> error, which gets propagated until the final exit(). The error is ECHILD,
> which is very strange, considering everything works fine when called by
> hand in a shell, and we can see the child is really spawned fine.
>
> The only thing I can think of is that heartbeat may be messing its signals
> to suit its needs, and those changes affect its children as well. Maybe
> all signal handlers should be set to default before exec'ing...

Most of what I do regarding signal handling is normal for daemon
processes. The only thing I do that might be unusual is that I
ignore SIGCHLD. This could be the problem. You could try changing it
before heartbeat starts the scripts, and see if that fixes the problem.

-- Alan Robertson
alanr@suse.com