Mailing List Archive

Re: things to look for in runaway server? (fwd)
(Tony, hope you don't mind me forwarding this)

This is a followup to the all-children-were-zombies problem I had this
morning. I'm positive the scoreboard wasn't getting nuked. This would seem
to be a fatal error for heavily-used systems. Thoughts?

Brian

--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--
brian@organic.com brian@hyperreal.com http://www.[hyperreal,organic].com/

---------- Forwarded message ----------
Date: Thu, 14 Sep 1995 22:59:37 -0500
From: Tony Sanders <sanders@bsdi.com>
To: Brian Behlendorf <brian@organic.com>
Subject: Re: things to look for in runaway server?

Brian Behlendorf writes:
> The machine just went south again - this time every process became a
> zombie. Not sure how sudden it was, since I didn't watch it get to this
> state, but it was not answering queries even though other services worked
> fine. Included here is the output of that script - see anything that
> could help? Should I take this to the list?

In this case, it's either an httpd bug or corruption of the scoreboard
file (or something of that nature). The parent, 1431, is hung in
a wait system call, so it's probably wait'ing on the wrong thing or
something.

httpd should probably be using WNOHANG so that it cannot get stuck.
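
A minimal sketch of a non-blocking reap loop, to illustrate the WNOHANG
suggestion. This is not the actual httpd code; the helper name
reap_children() is hypothetical, and it assumes a POSIX waitpid().

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap every child that has already exited, without ever blocking.
 * Returns the number of children reaped by this call.
 * (Hypothetical helper, for illustration only.) */
static int reap_children(void)
{
    int reaped = 0;
    int status;
    pid_t pid;

    /* -1 means "any child"; WNOHANG makes waitpid() return 0 at once
     * when children exist but none has exited yet. */
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        reaped++;
    }

    /* Here pid is 0 (children still running) or -1 (for example
     * ECHILD, meaning there are no children at all). Either way the
     * caller gets control back instead of hanging. */
    return reaped;
}
```

A parent could call something like this from its main loop, or after a
SIGCHLD, so that zombies are collected without the parent ever sleeping
in wait.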

I'm not really familiar with the code yet but from what I've just
seen, the process handling and scoreboard code looks pretty scary.
The process handling should probably be abstracted out into an API
and the code moved into its own file. This should allow for the
scoreboard to be moved into shared memory (e.g., w/mmap() on BSD/OS)
for performance reasons.

> ------------------ ps axlww
> UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND
> 103 167 1 0 2 0 664 532 select S ?? 2:33.06 /usr/local/sbar/cbd -n
> 0 1431 1 0 10 0 436 216 wait Is ?? 1:20.85 /usr/local/web/bin/httpd -f /usr/local/web/conf/cyborganic.conf
> 32767 26676 1431 0 28 0 0 0 - Z ?? 0:00.00 (httpd)
... bunch of these ...
> 32767 27156 1431 3 28 0 0 0 - Z ?? 0:00.00 (httpd)
> 32767 27166 1431 1 2 0 1024 464 netio I ?? 0:02.19 /usr/local/web/bin/httpd -f /usr/local/web/conf/cyborganic.conf
> 32767 27168 1431 0 28 0 0 0 - Z ?? 0:00.00 (httpd)
... bunch more ...