Mailing List Archive: Re: conserver eventually goes catatonic after SIGPIPE (on NetBSD)

Jun 4, 2002, 7:34 PM

Post #1 of 3

[. On Wednesday, May 29, 2002 at 11:59:43 (-0700), Bryan Stansell wrote: ]
> Subject: Re: conserver eventually goes catatonic after SIGPIPE (on NetBSD)
>
> well, that's true, it does only initialize things one at a time, but in
> the case of '|' syntax consoles, all that requires is a fork of a
> process. could a similar mode of thinking be used for the
> chat-based consoles?

So far I've only looked at the "chat" feature as something done
immediately after the "port" (TTY, TELNET or other TCP, pipe, etc.)
connection has been opened, and which is in some way necessary for the
successful ongoing functioning of the console connection. It's only done
once at initialisation time, and presumably the connection should be
closed and reopened at some later time if the chat fails.

Anything beyond that, such as periodic commands, response to certain
output at any time, etc., should be handled externally, which is what
I'm doing now to collect data from the serial ports of my UPS systems:

(printf "format\r"; sleep 2) | console -f best-ups-1 | process

> > BTW, nobody should be afraid of having one process per console port
> > either, even if they have a thousand console ports to manage. This is
> > unix, after all! (even if some implementations are a bit stupid about
> > fork(), it's not as if this application would be constantly re-forking
> > processes continuously)
>
> well, actually, maybe they should.

well, no, they shouldn't have to be afraid of one process per
connection.....

> in the past, i've seen multi-gig
> systems run low on vm because of conserver's footprint.

(hmmm.... I wonder if this could have been part of the problem I was
seeing too.....)

Normally my conserver processes run at only 100 kilobytes or so when
they first start, but they seem to grow.... The other day I
spotted one that was 21 megabytes. Here's proof in that the child is
now over five times as big as its parent and it's done nothing but
handle bunches of "console" client connections (and this is with sixteen
ports, one for a remote conserver and the rest doing telnet connections):

$ ps -auxc | sed -n -e '1p' -e '/conserver/p'
USER   PID %CPU %MEM  VSZ RSS TT STAT STARTED    TIME COMMAND
root 27231  0.0  0.2  492 104 ?? Is    4:42PM 0:22.80 conserver
root 27232  0.0  0.2 2548 104 ?? I     4:42PM 2:14.83 conserver

Running a few more manual client connections repeatedly, and then
disconnecting again, leaves it nearly 100KB larger already!

$ ps -auxc | sed -n -e '1p' -e '/conserver/p'
USER   PID %CPU %MEM  VSZ RSS TT STAT STARTED    TIME COMMAND
root 27231  0.0  0.2  492 104 ?? Ss    4:42PM 0:23.82 conserver
root 27232  0.8  0.2 2640 104 ?? R     4:42PM 2:19.62 conserver

Once this leak is fixed for good there's no issue with running a
thousand or more such processes on a properly provisioned server --
remember that all modern Unix systems will share the code segment (and
most page it from the original executable too, not from swap).

(Note, I haven't started to look for this leak yet.....)

> i personally
> like the idea of being able to use a relatively low-end machine as a
> console server (i can't count the number of times i've heard of
> infrastructure machines like this left out of budgets or seen as
> unimportant by management or just added to an already loaded admin
> box).

Sure, but in this case the solution I pose will actually reduce total
resource requirements, not increase them, at least not until you get
into thousands of connections at which point you've gone way beyond the
capabilities of your low-end machine (e.g. old SS2 w/32MB) anyway.

It's probably best to split the port process out into a separate binary
though -- that way the overhead of the master isn't replicated for every
connection.

Also, think about all the advantages w.r.t. configuration reloads and
long-running connections. No more need to kill clients when all you're
doing is adding or changing a single console connection.....

--
Greg A. Woods

+1 416 218-0098; <gwoods@acm.org>; <g.a.woods@ieee.org>; <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>

Re: conserver eventually goes catatonic after SIGPIPE (on NetBSD)

woods at weird

Jun 4, 2002, 7:37 PM

Post #2 of 3

[. On Tuesday, May 28, 2002 at 18:06:18 (-0700), Bryan Stansell wrote: ]
> Subject: Re: conserver eventually goes catatonic after SIGPIPE (on NetBSD)
>
> On Tue, May 28, 2002 at 06:10:56PM -0400, Greg A. Woods wrote:
> yep. guess it's just never come up before 'cause it's mostly
> processing a read() when it notices broken things (a client would have
> to be sending data at just the right time). i didn't even look for a
> SIGPIPE handler until this came up, actually. regardless, yes, this
> needs to be added.

So far I've been doing well with just setting SIGPIPE to SIG_IGN, though
there's not a lot of error checking on write() calls, and that's
necessary to clean up properly (EPIPE should be returned on a write() to
a closed socket when SIGPIPE has been ignored).

--
Greg A. Woods

+1 416 218-0098; <gwoods@acm.org>; <g.a.woods@ieee.org>; <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>

Re: conserver eventually goes catatonic after SIGPIPE (on NetBSD)

woods at weird

Jun 4, 2002, 7:55 PM

Post #3 of 3

[. On Sunday, May 26, 2002 at 00:10:06 (-0700), Bryan Stansell wrote: ]
> Subject: Re: conserver eventually goes catatonic after SIGPIPE (on NetBSD)
>
> looking at #2, you see it's calling waitpid() from ConsChat().
> ConsChat() is part of your patch. the problem, i'm guessing, is that
> the waitpid() inside the while loop has a little bad logic.
> specifically, what happens when the waitpid() returns an error that
> isn't EINTR? it'll come around for another waitpid() and, i suppose,
> lock up like this. at least, that's my guess - i haven't done any real
> testing - just scanned the code quickly.

Hmmm... but there's never been any errno value other than EINTR -- there
would be a "ConsChat: error waiting for chat process:" message in my log
if there had.....

I've added a whole lot more careful error checking, including blocking
SIGCHLD before calling waitpid(), setting an alarm(), and checking that
the process still exists when the alarm expires and EINTR is returned.
I've also added a break out of the loop if ECHILD is returned. I don't
know what to do if either EFAULT or EINVAL is returned --
something's drastically wrong in that case and it should probably
abort()....

So far the deadlock hasn't occurred again, though perhaps the blocking of
SIGCHLD has prevented it. The problem without the blocking (or
ignoring) of SIGCHLD is that its delivery (and catch) caused waitpid()
to be interrupted and to return EINTR. I don't know why the
second call didn't work though -- perhaps there's a race condition in my
kernel that loses the status information if it's waitpid() itself that
is interrupted....

--
Greg A. Woods

+1 416 218-0098; <gwoods@acm.org>; <g.a.woods@ieee.org>; <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>