Mailing List Archive

leaking fd's
The "403" problem definitely seems to be "too many open files" in the file
system, according to ktrace. Apparently there is an fd leak happening
somewhere... and killing the children doesn't seem to free them, only
killing the parent. "fstat | wc -l" linearly increases with time... RobH,
when you added that extra file descriptor to the CGI environment (so the
script can write directly to the client)... could that be adding to the
problem? Where would I find that to try commenting it out?
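
For reference, a leak like this can also be watched from inside the
process itself. Below is a minimal sketch (not part of httpd; it assumes
a POSIX system where fcntl(F_GETFD) fails on closed descriptors) that
counts a process's open fds; logging that number once per request would
show whether a given child is the one accumulating descriptors:

    #include <fcntl.h>
    #include <unistd.h>

    /* Count this process's open file descriptors by probing every
     * possible fd; fcntl(F_GETFD) succeeds only on open ones. */
    static int count_open_fds(void)
    {
        long fd, max = sysconf(_SC_OPEN_MAX);
        int count = 0;

        for (fd = 0; fd < max; fd++)
            if (fcntl((int)fd, F_GETFD) != -1)
                count++;
        return count;
    }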

Brian

--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--
brian@organic.com brian@hyperreal.com http://www.[hyperreal,organic].com/
Re: leaking fd's
> The "403" problem definitely seems to be "too many open files" in the file
> system, according to ktrace. Apparently there is an fd leak happening
> somewhere... and killing the children doesn't seem to free them, only
> killing the parent. "fstat | wc -l" linearly increases with time... RobH,
> when you added that extra file descriptor to the CGI environment (so the
> script can write directly to the client)... could that be adding to the
> problem? Where would I find that to try commenting it out?

Open a URL which does a printenv; you should see
CLIENT_FD=x

For each request, "x" should be the lowest free FD. On 0.7.3 here, it stays
at a constant 12, even after timeout/lost connections.

If you want to comment it out, it's in httpd_script.c; look for
CLIENT_FD, then comment out the entire if{} block around it.
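
For readers without the source handy, the block in question presumably
looks something like this hypothetical sketch (names are illustrative,
not the actual httpd_script.c code): the client socket is dup()'d and
the copy's number is exported to the CGI environment. Since dup()
returns the lowest free descriptor, a healthy server reports the same
CLIENT_FD on every request; a creeping value is a leak signature:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical sketch: export a copy of the client socket so the
     * CGI can write directly to the browser. "sock" is the connection
     * descriptor. */
    static void export_client_fd(int sock)
    {
        char buf[32];
        int client_fd = dup(sock);     /* lowest free fd, per POSIX */

        if (client_fd == -1)
            return;
        sprintf(buf, "CLIENT_FD=%d", client_fd);
        putenv(strdup(buf));           /* visible to the exec'd script */
        /* If the server never close()s client_fd once the CGI is done,
         * each request pins one extra descriptor - a classic fd leak. */
    }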


--
http://nqcd.lanl.gov/~hartill/
Re: leaking fd's
On Tue, 20 Jun 1995, Rob Hartill wrote:
> > The "403" problem definitely seems to be "too many open files" in the file
> > system, according to ktrace. Apparently there is an fd leak happening
> > somewhere... and killing the children doesn't seem to free them, only
> > killing the parent. "fstat | wc -l" linearly increases with time... RobH,
> > when you added that extra file descriptor to the CGI environment (so the
> > script can write directly to the client)... could that be adding to the
> > problem? Where would I find that to try commenting it out?
>
> Open a URL which does a printenv, you should see
> CLIENT_FD=x
>
> For each request, "x" should be the lowest free FD. On 0.7.3 here, it stays
> at a constant 12, even after timeout/lost connections.

It's crept up steadily throughout the day here, from in the low-20's to
mid-50's now.

I strongly suspect it's in the virtual host code now. fstat showed that
of the 54 fd's open by a particular child, 40 or so were "/var/log/httpd",
which is the default log file. I think it's in the virtual host code
because that would explain why Rob hasn't seen it on his systems. I don't
have a heavily-visited web site which *doesn't* use virtualhost these days,
so I can't test that theory, and I also don't know the vhost code and the
per-request structures well enough to debug this. It must also be some
particular condition that triggers it, as it doesn't increase with every
vhost access or anything else predictable.
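
If that suspicion is right, the pattern would be something like the
following speculative sketch (not the actual vhost code; names are
invented): each request that resolves to a virtual host re-open()s the
log path without noticing that a descriptor for it is already open, so
duplicates pile up until the child hits its fd limit:

    #include <fcntl.h>
    #include <unistd.h>

    static int log_fd = -1;        /* per-child log descriptor */

    /* Speculative sketch, called once per request. The leaky version
     * would be a bare open() each time, orphaning the old descriptor;
     * reusing or closing it first is the fix. */
    static void open_request_log(const char *path)
    {
        if (log_fd != -1)
            close(log_fd);         /* without this, fds accumulate */
        log_fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    }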

This is all with 0.7.2h - I'll be trying out 0.7.3c tonight to see how
things go.

Brian

--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--
brian@organic.com brian@hyperreal.com http://www.[hyperreal,organic].com/
Re: leaking fd's
On Tue, 20 Jun 1995, Rob Hartill wrote:
...
> which means that the parent SIGSEGV handler is getting in the way. Grrrr.

Oh yeah! I also forgot to mention that recently we had another one of
those run-away-with-the-CPU httpd's that wasn't leaving any trace in
ktrace while burning up CPU in the kernel - however, attaching ktrace and
sending the child a kill -HUP returned:

22547 httpd PSIG SIGHUP caught handler=0x5ff0 mask=0x1400 code=0x0
22547 httpd CALL sigprocmask(0x3,0)
22547 httpd PSIG SIGSEGV caught handler=0x5e18 mask=0x0 code=0x0
22547 httpd RET sigprocmask 5121/0x1401
22547 httpd CALL gettimeofday(0xefbd572c,0)
22547 httpd RET gettimeofday 0

... at which point it went back to running away with the CPU, not dying.
A kill -HUP to the parent did clear it away, though...
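
That refusal to die is consistent with a handler that catches SIGSEGV
and returns: on most Unix systems, returning from a handler for a
hardware fault simply re-executes the faulting instruction, which
faults again, forever. A minimal illustration of the pattern (not the
httpd handler, just the failure mode):

    #include <signal.h>

    /* Returning from a SIGSEGV handler restarts the faulting
     * instruction, so the process spins in a fault/handler loop
     * instead of dying. Illustration only - do not ship this. */
    static void bad_handler(int sig)
    {
        (void)sig;    /* anything done here must be async-signal-safe */
    }

    int main(void)
    {
        signal(SIGSEGV, bad_handler);
        *(volatile int *)0 = 1;   /* faults, handler returns, repeats */
        return 0;
    }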

Looking forward to 0.7.3d :)

Brian

--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--
brian@organic.com brian@hyperreal.com http://www.[hyperreal,organic].com/
Re: leaking fd's
> It's crept up steadily throughout the day here, from in the low-20's to
> mid-50's now.

: : : :

> This is all with 0.7.2h - I'll be trying out 0.7.3c tonight to see how
> things go.

I put 0.7.3c on Cardiff, and noticed occasional SIGSEGVs in the error log.
gdb says..

(gdb) where
#0 0xef7939a0 in kill ()
#1 0xef7abc2c in abort ()
#2 0x77bc in handle_sig_seg_fault () at httpd.c:212
#3 0x87e0 in standalone_main (argc=133120, argv=0x20800) at httpd.c:729
#4 0x8a1c in main (argc=2, argv=0xefffff7c) at httpd.c:784

which means that the parent SIGSEGV handler is getting in the way. Grrrr.
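
Since the handler frames (#0 through #2) have replaced the stack of the
actual fault, the core points at abort() rather than at the offending
code. A common workaround while debugging (a sketch, assuming
handle_sig_seg_fault takes the usual int argument; this is not a patch
to the real httpd.c) is to leave SIGSEGV at its default in debugging
builds so the kernel's core dump preserves the faulting frame:

    #include <signal.h>

    extern void handle_sig_seg_fault(int);  /* assumed prototype */

    /* Sketch: only install the crash handler in production builds, so
     * a debugging build dies at the fault and gdb sees the real frame
     * instead of handler + abort(). */
    static void install_crash_handler(void)
    {
    #ifdef DEBUG_SEGV
        signal(SIGSEGV, SIG_DFL);   /* default: core dump at the fault */
    #else
        signal(SIGSEGV, handle_sig_seg_fault);
    #endif
    }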


I haven't been able to reproduce the problem on my local HP or on a
local Sparc 20.

I have a couple of minor fixes so I'll upload 0.7.3d in a moment..
one of them might have fixed the kill problem, but I'm not sure.
There are two SIGSEGVs in the error log, but they coincide with the
restart at reboot time, and they come after an "accept failed" message.

Shit, just as I wrote that, one appeared in the log, at the exact moment
a "not found" was also logged... more log watching for me.

--
http://nqcd.lanl.gov/~hartill/
Re: leaking fd's
> Looking forward to 0.7.3d :)

I just had to pull it from Cardiff... it was SIGSEGV'ing left, right,
and center now.. the latest gdb report is,

#0 0xef7939a0 in kill ()
#1 0xef7abc2c in abort ()
#2 0x77bc in handle_sig_seg_fault () at httpd.c:212
#3 0xef76acdc in _sigtramp ()
#4 0x18610 in add_accept (request=0x369a8, recs=0x20678,
accept_info=0xeffefb68) at http_mime_db.c:272
#5 0x18b64 in do_header_line (request=0x369a8, accept_recs=0x20678,
accept_line=0xeffefc8b "") at http_mime_db.c:433
#6 0x18b94 in note_client_accept (request=0x369a8,
accept_line=0xeffefc88 "*/*") at http_mime_db.c:444
#7 0x110fc in get_mime_headers (request=0x369a8) at http_mime.c:595
#8 0x93f8 in process_request (request=0x369a8) at http_request.c:273
#9 0x81f0 in child_main (sa_server=0x20968, sa_client=0x2115c) at httpd.c:561
#10 0x82f4 in make_child (argc=2, argv=0xefffff7c, childnum=30,
sa_server=0x20968, sa_client=0x20978) at httpd.c:596
#11 0x86f4 in standalone_main (argc=133120, argv=0x20800) at httpd.c:711
#12 0x8a2c in main (argc=2, argv=0xefffff7c) at httpd.c:786
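
One detail worth noting in that trace: do_header_line is entered with
an empty accept_line (""). Purely as a speculative illustration (this
is not the actual http_mime_db.c code, and the real function takes
request and accept_recs arguments as well), a guard against empty input
is the kind of defensive check that would rule out one class of crash
here:

    #include <stddef.h>

    /* Speculative sketch: skip empty or missing Accept header values
     * before tokenizing them into media types. */
    static void parse_accept_line(const char *accept_line)
    {
        if (accept_line == NULL || *accept_line == '\0')
            return;               /* nothing to parse */
        /* ...otherwise split on ',' and record each media type... */
    }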


I hate computers.
Re: leaking fd's
On Tue, 20 Jun 1995, Brian Behlendorf wrote:
>
> It's crept up steadily throughout the day here, from in the low-20's to
> mid-50's now.

Is that per child or in total (and with how many children)? I get
5 or 6 per httpd child, plus whatever else belongs to the CGI, but for
some reason, that number held steady last night when I ran the test
again. I'm still trying to figure out what I did differently this time.

I still have the problem with a lot of zombie children. I thought
perhaps setting MaxRequestsPerChild to 1 might help, but it doesn't...
after only ten minutes (about 6000 requests), 47 of the 50 children
are in a zombie state. SIGHUP/STOPping the parent zaps only the
zombies, but leaves the three remaining httpd processes. Turning
MaxRequestsPerChild back up to 30 and running for ten hours leaves 5
out of 50 children still running. Any idea why the parent does not
restart a child process? The benchmark includes artificial
"connection drops", which may be causing this... I'll try taking that
out to see what happens.
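
For what it's worth, the textbook way to keep zombies from piling up
(a generic sketch, independent of whatever Apache 0.7 actually does)
is a SIGCHLD handler that reaps every exited child without blocking.
A zombie only means the parent hasn't wait()ed on the child yet, which
fits the symptom of the parent failing to notice and replace dead
children:

    #include <signal.h>
    #include <sys/wait.h>

    /* Reap all finished children without blocking, so none linger as
     * zombies in the process table. Generic sketch, not httpd code. */
    static void reap_children(int sig)
    {
        (void)sig;
        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;   /* collect every child that has exited so far */
    }

    /* installed once at startup:  signal(SIGCHLD, reap_children);  */
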
--
Brian ("Though this be madness, yet there is method in't") Tao
taob@gate.sinica.edu.tw <-- work ........ play --> taob@io.org
Re: leaking fd's
> On Tue, 20 Jun 1995, Brian Behlendorf wrote:
> >
> > It's crept up steadily throughout the day here, from in the low-20's to
> > mid-50's now.
>
> Is that per child or in total (and with how many children)? I get
> 5 or 6 per httpd child, plus whatever else belongs to the CGI, but for
> some reason, that number held steady last night when I ran the test
> again. I'm still trying to figure out what I did differently this time.

I'd be interested to see if 0.7.3 does this.

Re: zombies, after running unattended for 12 hours, 0.7.3 at Cardiff
now has 3 zombies. I remember seeing 50+ the other day, but these two
observations might not have any real meaning.. we'll see.

--
http://nqcd.lanl.gov/~hartill/
Re: leaking fd's
On Wed, 21 Jun 1995, Rob Hartill wrote:
>
> Re: zombies, after running unattended for 12 hours, 0.7.3 at Cardiff
> now has 3 zombies. I remember seeing 50+ the other day, but these two
> observations might not have any real meaning.. we'll see.

Okay, I ran 0.7.3f for about 8.5 hours last night, 123340 requests
served. I started off with 20 servers instead of 50 (it finally
dawned on me that 50 might not be such a good idea on a machine with
just 16 megabytes ;-)), but still had 50 WebHound clients blasting
away at it from other machines. There was no fd leakage (even with
CGIs), and I was running my standard kernel, which makes no special
provision for children with a zillion open files or processes, but I
still had the zombie children.

This morning there were 7 zombies and 3 active children... the
other 10 simply weren't there. The error_log did not record any
SIGSEGVs or core dumps, and there appears to be a normal progression
of "parent replaced dead child" messages several times a minute.
Looking at the benchmark logs, there is a noticeable decrease in the
rate at which requests were being honoured (it started out at between
400 and 600 per minute and dropped to around 100 per minute by the end
of the 8 hours).

The number of connection timeouts also increased by a factor of 10
at the end of the run relative to the start (with 50 clients fighting
for the remaining handful of servers, no wonder...). Anyone else
seeing this kind of behaviour? Anyone else putting Apache under this
type of artificial load?
--
Brian ("Though this be madness, yet there is method in't") Tao
taob@gate.sinica.edu.tw <-- work ........ play --> taob@io.org