Mailing List Archive

0.73g + FreeBSD 2.0.5
On Thu, 22 Jun 1995, Chuck Murcko wrote:
>
> Built 0.7.3g on Solaris 2.3, started it with 15 children. Ran about 10
> minutes, then got a raft of SIGSEGVs and kernel panic with a data fault.

I had the same problem (bad page fault, kernel panic) with 0.73g
the first time I ran it on my FreeBSD machine, but not on the second
try. It looks like a FreeBSD issue, so I won't bother this list with
the details. 20 children, 50 clients, died in less than 20 minutes.

> [Thu Jun 22 14:17:27 1995] socket error: accept failed
> [Thu Jun 22 14:17:27 1995] could not get local address
> [Thu Jun 22 14:17:27 1995] httpd: caught SIGSEGV, dumping core

I've never seen any of those messages in my logs, despite crashes
for various other reasons. The pattern goes something like this:
within the first few minutes, two to three children are zombified.
Then one or two more go in the next ten minutes or so, and after that
more pop up at a more or less even pace. Note that this is with 10+
requests per second, so under less strenuous conditions, the zombies
may take much longer to show up.

Last night, I ran the same test as before (20 children, 50
benchmark clients) but with a cron job sighupping the parent server
once an hour. From the client logs, it looks like a SIGHUP kills any
connections currently in progress. Is this still true?

Came in this morning and the machine was still chugging away, no
fd leaks, CGI's were happily running, not too much swapping going on,
but had three zombies (this was 8 minutes after the last SIGHUP).

> And now, on a lighter note:
> Documentation is like sex: when it is good, it is very, very good; and
> when it is bad, it is better than nothing.
> -- Dick Brandon

Bahahahahaha!!!
--
Brian ("Though this be madness, yet there is method in't") Tao
taob@gate.sinica.edu.tw <-- work ........ play --> taob@io.org
Re: 0.73g + FreeBSD 2.0.5
Brian T writes,

> Last night, I ran the same test as before (20 children, 50
> benchmark clients) but with a cron job sighupping the parent server
> once an hour. From the client logs, it looks like a SIGHUP kills any
> connections currently in progress. Is this still true?

SIGHUP tells the parent to kill all the children immediately.

> Came in this morning and the machine was still chugging away, no
> fd leaks, CGI's were happily running, not too much swapping going on,
> but had three zombies (this was 8 minutes after the last SIGHUP).

I think I have a possible fix for the zombies. The problem seems to be
with the SIGCHLD signal handler: while it is replacing one dead child it
runs with interrupts disabled, so it can miss the signal from another
child that dies during that time.

The fix is to have the handler clean up all current dead children. That
still leaves a window: another child can die after the handler has checked
for more dead children but before ints are reenabled. To close it, the
parent will look for dead children after reenabling interrupts, but before
it goes to sleep.
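
For anyone who wants to see the shape of that idea, here is a rough sketch.
It is not the server's code (that is C, and 0.7.3h isn't out yet); it is
Python, and every name in it (reap_all_dead_children, on_sigchld, run_demo,
reaped) is made up for illustration. It assumes a POSIX system with fork()
and that it runs in the main thread. The handler drains every exited child
with a non-blocking waitpid() loop instead of handling just one, and the
parent loop does a second sweep right before it sleeps.

```python
import os
import signal
import time

reaped = []  # hypothetical bookkeeping: pids of children reaped so far


def reap_all_dead_children():
    """Collect every child that has already exited, without blocking."""
    pids = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children exist at all
        if pid == 0:
            break  # children exist, but none has exited yet
        pids.append(pid)
    return pids


def on_sigchld(signum, frame):
    # One SIGCHLD may stand for several deaths, so drain them all.
    reaped.extend(reap_all_dead_children())


def run_demo(n_children=5, timeout=5.0):
    """Fork short-lived children; return (forked pids, reaped pids) as sets."""
    reaped.clear()
    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        pids = set()
        for _ in range(n_children):
            pid = os.fork()
            if pid == 0:
                os._exit(0)  # child: exit immediately
            pids.add(pid)
        deadline = time.monotonic() + timeout
        while not pids.issubset(reaped) and time.monotonic() < deadline:
            # Second sweep in the parent loop, just before sleeping, to
            # catch any child whose signal the handler never saw.
            reaped.extend(reap_all_dead_children())
            time.sleep(0.01)
        return pids, set(reaped)
    finally:
        # Put back whatever handler was installed before the demo.
        restore = previous if previous is not None else signal.SIG_DFL
        signal.signal(signal.SIGCHLD, restore)
```

Calling run_demo(5) forks five children that exit at once and returns the
pids that were forked and the pids that were reaped. If the scheme works,
the first set is contained in the second and nothing is left as a zombie.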


0.7.3h to follow sometime today.


rob
--
http://nqcd.lanl.gov/~hartill/