Mailing List Archive

Timeout connecting to server
We are using Conserver 7.2.7 to serve about 900 console lines on 25
Cyclades TS terminal concentrators, from a Sun Ultra 5 running Solaris
8. We had no problems at levels of 500-600 lines, but the recent
expansion to 900 lines appears to have led to the following interesting
behavior:

After the server has been up for ten days or so, a few users begin
experiencing timeouts when connecting to a small number of console
lines, viz:

--------------------------------------------------------
myhost$ console beta-15-1
< --- Three minutes of silence --- >
console: connect: 61897@conserver: Connection timed out
--------------------------------------------------------

Logging into the Conserver server, I notice a number of connections to
port 61897 in CLOSE_WAIT state. These entries tend to hang around for a
LONG time (e.g. days):

--------------------------------------------------------------------
conserver# netstat -a|grep 61897
*.61897 *.* 0 0 32768 0 LISTEN
lyell.panasas.com.61897 kinsman.2458 1 0 33304 0 ESTABLISHED
lyell.panasas.com.61897 build-bsd6.1851 57920 0 33304 0 CLOSE_WAIT
lyell.panasas.com.61897 build-bsd6.1855 57920 0 33304 0 CLOSE_WAIT
lyell.panasas.com.61897 build-bsd6.1863 57920 0 33304 0 CLOSE_WAIT
lyell.panasas.com.61897 rack-bsd2.2776 57920 0 33304 0 CLOSE_WAIT
lyell.panasas.com.61897 rack-bsd2.2778 57920 0 33304 0 CLOSE_WAIT
lyell.panasas.com.61897 rack-bsd2.2781 57920 0 33304 0 CLOSE_WAIT
lyell.panasas.com.61897 rack-bsd2.2783 57920 0 33304 0 CLOSE_WAIT
lyell.panasas.com.61897 kinsman.1984 57920 0 33304 0 CLOSE_WAIT
--------------------------------------------------------------------
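
For context: CLOSE_WAIT means the remote client has closed its end of the connection but the local process has not yet closed its socket, which fits a daemon that has stopped servicing its connections. The following is a minimal sketch (not part of Conserver) for tallying connection states on one port from Solaris-style "netstat -a" output. The function name count_states and the SAMPLE text are illustrative assumptions; the sample rows are copied from the listing above.

```python
from collections import Counter


def count_states(netstat_text, port):
    """Count TCP states for sockets whose local address ends in .<port>.

    Assumes Solaris-style rows: the local address is the first field and
    the connection state is the last field.
    """
    counts = Counter()
    suffix = "." + str(port)
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank lines and headers
        local, state = fields[0], fields[-1]
        if local.endswith(suffix) and state.isupper():
            counts[state] += 1
    return counts


# A few rows taken from the netstat listing in the post.
SAMPLE = """\
*.61897 *.* 0 0 32768 0 LISTEN
lyell.panasas.com.61897 kinsman.2458 1 0 33304 0 ESTABLISHED
lyell.panasas.com.61897 build-bsd6.1851 57920 0 33304 0 CLOSE_WAIT
lyell.panasas.com.61897 rack-bsd2.2776 57920 0 33304 0 CLOSE_WAIT
"""

if __name__ == "__main__":
    for state, n in sorted(count_states(SAMPLE, 61897).items()):
        print(state, n)
```

A steadily growing CLOSE_WAIT count on a single port would point at the one daemon listening there rather than at the server as a whole.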

One also sees timeouts when using commands such as "console -x"... the
list of connections pauses at a certain point, and eventually times out.
It seems likely that a single Conserver daemon (out of the 55 or so
that are spawned to handle 900 lines) is being affected.

Restarting Conserver is sometimes (but not always) effective in clearing
this up. In many cases, though, the only solution is to reboot the server.

I had previously bumped up certain values in /etc/system (e.g.
"maxusers", "tcp:tcp_conn_hash_size") to better handle the large number
of connections to Conserver, and I'm also planning to install the latest
Solaris patch cluster, in case this is a Solaris TCP/IP issue...

... but I thought I ought to ask the List as well, in case others have
seen this before.

TIA,
S


--
steve lammert software engineer voice: +1-412-323-3500
slammert@panasas.com panasas, inc fax: +1-412-323-3511
Re: Timeout connecting to server
So, I solved my immediate problem by using "lsof" (Sol8 binary obtained
via freshmeat.net) to obtain the pid of the conserver daemon which was
not responding. I killed that pid, then sent the "reconnect" signal to
Conserver, and I'm back in business.
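
The lsof step above could be scripted roughly as follows. This is a sketch under assumptions: it relies on lsof's "-t" (print pids only) and "-i TCP:<port>" options, and the helper names lsof_command, parse_pids, and pids_on_port are made up for illustration. It only reports pids; killing the process and signalling Conserver are left as manual steps, and the exact signal to use should be checked against the conserver man page for your version.

```python
import subprocess


def lsof_command(port):
    """Build the lsof invocation that lists pids with a TCP socket on port."""
    return ["lsof", "-t", "-i", "TCP:%d" % port]


def parse_pids(output):
    """Turn lsof -t output (one pid per line) into a sorted, de-duplicated list."""
    return sorted({int(tok) for tok in output.split() if tok.isdigit()})


def pids_on_port(port):
    """Run lsof and return the pids it reports for the given port."""
    result = subprocess.run(lsof_command(port), capture_output=True, text=True)
    return parse_pids(result.stdout)


if __name__ == "__main__":
    # 61897 is the port from the timeout message in the original post.
    print(pids_on_port(61897))
```

Clients connected to that port may also appear if they run on the same host, so check each pid with ps before killing anything.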

... but I'd still like to know why this happens ...

Cheers,
S




--
steve lammert software engineer voice: +1-412-323-3500
slammert@panasas.com panasas, inc fax: +1-412-323-3511
Re: Timeout connecting to server
On Mon, 28 Apr 2003, Steve Lammert wrote:
>
> So, I solved my immediate problem by using "lsof" (Sol8 binary obtained
> via freshmeat.net) to obtain the pid of the conserver daemon which was
> not responding. I killed that pid, then sent the "reconnect" signal to
> Conserver, and I'm back in business.
>
> ... but I'd still like to know why this happens ...

I've seen similar behavior when a terminal server port is refusing
connections or some such. The daemon that handles the port sleeps and
retries, but while it's sleeping, all its consoles are unresponsive.

There isn't an easy, permanent fix for the problem, unless one could fork
off a misbehaving console into its own daemon, perhaps using some of the
tricks developed for dynamic reconfig.
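
To illustrate the stall described above, here is a toy model in Python. It is not Conserver's code and not a proposed patch; the function names, the simulated clock, and the cost of one time unit per serviced console are all assumptions made for the example. It compares a loop that sleeps in place when one console's port refuses connections with a loop that merely records the next retry time and moves on.

```python
def serviced_blocking(consoles, bad, retry_delay, budget):
    """Count how often healthy consoles get serviced when the loop sleeps
    in place for retry_delay time units on the misbehaving console."""
    served = {c: 0 for c in consoles if c != bad}
    clock = 0
    while served and clock < budget:
        for c in consoles:
            if c == bad:
                clock += retry_delay  # sleeping: nothing else runs meanwhile
            else:
                served[c] += 1
                clock += 1
            if clock >= budget:
                break
    return served


def serviced_deadline(consoles, bad, retry_delay, budget):
    """Same loop, but the bad console only records its next retry time,
    so the healthy consoles keep being serviced."""
    served = {c: 0 for c in consoles if c != bad}
    clock, next_retry = 0, 0
    while served and clock < budget:
        for c in consoles:
            if c == bad:
                if clock >= next_retry:
                    next_retry = clock + retry_delay  # retry attempt fails again
                continue
            served[c] += 1
            clock += 1
            if clock >= budget:
                break
    return served


if __name__ == "__main__":
    consoles = ["a", "bad", "b"]
    print(serviced_blocking(consoles, "bad", 10, 24))
    print(serviced_deadline(consoles, "bad", 10, 24))
```

With these numbers, the healthy consoles are each serviced 2 times in the sleeping variant and 12 times in the deadline variant over the same simulated interval.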