Mailing List Archive: 0.72j vapourlock, FreeBSD and BSD/OS (Re: update ii)

On Sat, 17 Jun 1995, Brian Behlendorf wrote:
>
> On Sat, 17 Jun 1995, Chuck Murcko wrote:
> > Unfortunately, there are real
> > problems with 0.7.2j eating up all the file descriptors under FreeBSD and
> > BSDI under heavy load, and I have to look at those first.
>
> Make sure your MAXUSERS is at least as big as the number of children you
> expect to run. That was a problem too for links.net until that was done.
> fstat was showing *lots* of file descriptors, but on this machine it was
> also serving 4 virtual hosts, each of which had at least 3 fd's for log
> files and such. It did seem like there was some leakage going on as
> well, so I won't rule that out....

Although I'd have to redo the test to be 100% certain, I recall that
one of the many FreeBSD 2.0.5 kernel configs I tried had MAXUSERS set
to 128 and OPEN_MAX set to 256. The server still locked up within the
first twenty minutes of running with 50 children (serving 50 WebHound
clients) and practically nothing else in the process table.

Data point: If I reconfigure the 50 WebHound clients *not* to call
any CGI URL's, the server runs happily for at least two hours (I
terminated the test early). fstat shows 200-some open files belonging
to user "nobody". But with 5% of requests going to CGI scripts, I've
counted as many as 1154 open files belonging to user "nobody" before
the machine locks. Reducing the number of clients (and Apache
children) to 10 allows it to run for at least 10 hours (an overnight
test).
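
For anyone wanting to repeat the per-user count, here is a minimal
sketch in Python. It assumes fstat's usual layout (one header line,
user name in the first column). The function name and the sample text
are made up for illustration; the sample is not output captured from
the test machine.

```python
from collections import Counter


def count_open_files(fstat_output):
    """Count fstat rows per user (first column), skipping the header line."""
    counts = Counter()
    for line in fstat_output.splitlines()[1:]:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


# Hypothetical sample for illustration only, not real test output.
SAMPLE = """\
USER     CMD          PID   FD MOUNT      INUM MODE         SZ|DV R/W
nobody   httpd        261   wd /             2 drwxr-xr-x     512  r
nobody   httpd        261    3 /usr       1234 -rw-r--r--    2048  w
root     httpd        260    3 /usr       1234 -rw-r--r--    2048  w
"""

if __name__ == "__main__":
    for user, n in count_open_files(SAMPLE).items():
        print(user, n)
```

Sampling this once a minute during a run would show whether the count
for "nobody" climbs steadily or jumps just before the lockup.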

Data point: NCSA httpd 1.4 suffers from exactly the same problem...
after 15 to 20 minutes of heavy serving, the kernel no longer seems
able to spawn any new processes. In every single case (both in NCSA
and Apache), client logs show the last request submitted to the server
was for a CGI.

Data point: This does not happen under Apache 0.65 with the same
number of clients blasting away at the server. I didn't even need to
change MAXUSERS from the default setting of 16. This is both with and
without CGI's.

Nine CGI's are used as part of the test. Five are pretty
simple-minded ones (cal, date, finger, uptime and fortune). The other
four are wlint (a perl-based HTML syntax checker), greplog (something
I wrote which simply displays access_log lines matching a string),
webplot (a CGI front-end to gnuplot, which also calls ppmtogif) and a
glimpse index search of the FreeBSD kernel source tree.

The server was configured with StartServers = 50,
MaxRequestsPerChild = 30 and Timeout = 300. Fiddling with the timeout
value did not have any noticeable effect. So the problem appears to
be related to sustained heavy loads, a large number of pre-forked
children, and CGI processing. Something wonky in the way a child
httpd forks a CGI process, perhaps?
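
To make the suspicion concrete, here is a rough sketch of the
descriptor bookkeeping a fork-and-pipe step needs. This is not
Apache's or NCSA's actual code; it is Python rather than C for
brevity, and run_child and open_fd_count are hypothetical names. The
child here just writes to the pipe instead of exec'ing a real CGI.

```python
import os


def open_fd_count(limit=256):
    """Count descriptors below `limit` that are open in this process."""
    count = 0
    for fd in range(limit):
        try:
            os.fstat(fd)
        except OSError:
            continue
        count += 1
    return count


def run_child(payload):
    """Fork a child that writes `payload` to a pipe; return what it wrote."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: drop the end it does not need, write, and exit at once.
        try:
            os.close(read_end)
            os.write(write_end, payload)
        finally:
            os._exit(0)
    # Parent: close the write end first, otherwise read() never sees
    # end-of-file and one descriptor is leaked per request.
    os.close(write_end)
    chunks = []
    try:
        while True:
            chunk = os.read(read_end, 4096)
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        os.close(read_end)
        os.waitpid(pid, 0)  # reap the child so it does not stay a zombie
    return b"".join(chunks)
```

If either os.close() in the parent is skipped, open_fd_count() grows
by one on every call, which is the kind of slow climb that would only
bite under sustained load with many children.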

> 2) all the children die, only the parent is left, and no files are being
> served. Is there a condition whereby a parent might not detect the death
> of a child?

I see something like this even when ignoring CGI requests. After
2 hours 53 minutes today (93035 requests served), I terminated a
50-client, no-CGI benchmark run and got this:

  PID TT  STAT      TIME COMMAND
    0 ??  DLs    0:00.00 (swapper)
[...]
  260 ??  IWs    0:26.28 httpd-root (apache)
  261 ??  Z      0:00.00 (apache)
  262 ??  Z      0:00.00 (apache)
  263 ??  Z      0:00.00 (apache)
  264 ??  Z      0:00.00 (apache)
  265 ??  Z      0:00.00 (apache)
  266 ??  Z      0:00.00 (apache)
  267 ??  Z      0:00.00 (apache)
  268 ??  Z      0:00.00 (apache)
  269 ??  Z      0:00.00 (apache)
  270 ??  Z      0:00.00 (apache)
  271 ??  Z      0:00.00 (apache)
  272 ??  Z      0:00.00 (apache)
  399 ??  Z      0:00.00 (apache)
  658 ??  Z      0:00.00 (apache)
  755 ??  Z      0:00.00 (apache)
 1664 ??  Z      0:00.00 (apache)
 2433 ??  Z      0:00.00 (apache)
 2697 ??  Z      0:00.00 (apache)
 3135 ??  Z      0:00.00 (apache)
 3334 ??  S      0:01.08 apache-40-19 (apache)
 3366 ??  S      0:00.98 apache-30-16 (apache)
 3390 ??  S      0:00.86 apache-49-6 (apache)
 3397 ??  S      0:00.96 apache-16-21 (apache)
 3418 ??  S      0:00.76 apache-35-11 (apache)
 3434 ??  S      0:00.93 apache-15-27 (apache)
 3470 ??  S      0:01.00 apache-39-30 (apache)
 3474 ??  S      0:00.83 apache-13-21 (apache)
 3475 ??  S      0:00.48 apache-14-13 (apache)
 3488 ??  S      0:00.82 apache-23-24 (apache)
 3489 ??  S      0:00.60 apache-32-24 (apache)
 3499 ??  S      0:00.59 apache-26-20 (apache)
 3500 ??  S      0:00.49 apache-45-16 (apache)
 3502 ??  S      0:00.55 apache-43-23 (apache)
 3503 ??  S      0:00.60 apache-19-26 (apache)
 3504 ??  S      0:00.58 apache-37-26 (apache)
 3505 ??  S      0:00.50 apache-28-23 (apache)
 3506 ??  S      0:00.52 apache-22-22 (apache)
 3507 ??  S      0:00.43 apache-34-16 (apache)
 3508 ??  S      0:00.43 apache-18-18 (apache)
 3509 ??  S      0:00.39 apache-48-16 (apache)
 3510 ??  S      0:00.28 apache-33-8 (apache)
 3512 ??  S      0:00.33 apache-41-14 (apache)
 3513 ??  S      0:00.25 apache-24-11 (apache)
 3514 ??  S      0:00.29 apache-29-11 (apache)
 3516 ??  S      0:00.24 apache-12-11 (apache)
 3517 ??  S      0:00.17 apache-42-7 (apache)
 3518 ??  S      0:00.19 apache-38-7 (apache)
 3519 ??  S      0:00.18 apache-17-7 (apache)
 3520 ??  S      0:00.11 apache-20-4 (apache)
 3521 ??  S      0:00.13 apache-47-2 (apache)

19 out of 50 children are zombies, but with zero accumulated CPU
time and zero resident/virtual memory size (which I didn't show here).
Note the PID's too... there is a run of 12 successive zombies
immediately following the root process. Is httpd-root trying to
restart a child, but somehow failing? The blank argv[0] must be a
clue.
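
For reference, a zombie is only cleared once its parent collects the
exit status with wait() or waitpid(). Below is a small sketch of a
non-blocking reap loop, again in Python rather than C, and not taken
from the Apache source. The function names are hypothetical, and note
that waitpid(-1, ...) collects any child of the calling process, not
just the ones forked here.

```python
import os
import time


def spawn_short_lived(n):
    """Fork n children that exit immediately; return their PIDs."""
    pids = []
    for _ in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(0)
        pids.append(pid)
    return pids


def reap_exited():
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def reap_all(pids, timeout=5.0):
    """Poll reap_exited() until all of `pids` are collected or time is up.

    Returns the set of PIDs that were NOT reaped (empty on success).
    """
    pending = set(pids)
    deadline = time.monotonic() + timeout
    while pending and time.monotonic() < deadline:
        pending.difference_update(reap_exited())
        if pending:
            time.sleep(0.01)
    return pending
```

A parent that calls wait() only once per death notification can miss
children that exit close together; looping until waitpid() reports
nothing left avoids that, which may or may not be what is going wrong
here.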
--
Brian ("Though this be madness, yet there is method in't") Tao
taob@gate.sinica.edu.tw <-- work ........ play --> taob@io.org