Mailing List Archive

console -u stops, gets Connection timed out
I'm running a fairly large installation with conserver (1100+ consoles).

Doing a console -u will print out a bunch of consoles, but at some point it
stops for a bit and then prints:

console.i386: connect(): 52044@192.168.9.2: Connection timed out

and then prints out the rest of the consoles.

192.168.9.2 is our master conserver server. If 52044 is a process id, there
is no such process on that host.

I realize I could probably restart the master processes, but that would
probably result in a fair number of upset users, so just trying to figure out
what this error is, how to fix it, and how to avoid having it happen again.

Thanks.

_______________________________________________
users mailing list
users@conserver.com
https://www.conserver.com/mailman/listinfo/users
Re: console -u stops, gets Connection timed out
It seems to me that the server the client is connecting to is not
accept()'ing the connection. My guess is that for some reason it is too busy
or blocking somewhere.


On Fri, 2007-01-05 at 11:52 -0800, Mark Wedel wrote:
> I'm running a fairly large installation with conserver (1100+ consoles).
>
> Doing a console -u will print out a bunch of consoles, but at some point it
> stops for a bit and then prints:
>
> console.i386: connect(): 52044@192.168.9.2: Connection timed out
>
> and then prints out the rest of the consoles.
>
> 192.168.9.2 is our master conserver server. If 52044 is a process id, there
> is no such process on that host.
>
> I realize I could probably restart the master processes, but that would
> probably result in a fair number of upset users, so just trying to figure out
> what this error is, how to fix it, and how to avoid having it happen again.
>
> Thanks.

Re: console -u stops, gets Connection timed out
On Fri, Jan 05, 2007 at 11:52:05AM -0800, Mark Wedel wrote:
> Doing a console -u will print out a bunch of consoles, but at some point it
> stops for a bit and then prints:
>
> console.i386: connect(): 52044@192.168.9.2: Connection timed out
>
> and then prints out the rest of the consoles.
>
> 192.168.9.2 is our master conserver server. If 52044 is a process id, there
> is no such process on that host.

52044 is the port number that the master process expects a
sub-process to be listening on (one that actually handles console
connections). for whatever reason, that sub-process is either not
picking up the connection, or the master process hasn't realized
something went wrong and removed it from its list of sub-processes (many
possibilities here - a bug dealing with SIGHUP, a bug dealing with reaping
children, etc). if any of the conserver processes is still lingering in
a bad state (say, you find the one that has that socket open but it's
wedged or looping), killing it off should clear things up (the master
would reap it, clean up its list, respawn another, etc). it would be
interesting to know if any consoles are missing from the -u output...it
could help narrow down how it got into the broken state.
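
From the client side, a sub-process that is gone and one that is wedged can look different to a plain TCP connect with a timeout. The sketch below is only an illustration and is not part of conserver: `probe_port` and its return strings are hypothetical names. A refused connection usually suggests that nothing is listening on the port, while a timeout suggests that something still holds the socket but is not accepting connections (or that the packets are being dropped on the way).

```python
import socket

def probe_port(host: str, port: int, timeout: float = 5.0) -> str:
    """Try a single TCP connect and classify the outcome."""
    try:
        # create_connection() opens the socket and connects in one step;
        # the "with" block closes it again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return "accepted"
    except ConnectionRefusedError:
        # The host answered, but no process is listening on that port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer within the timeout: wedged listener or dropped packets.
        return "timed out"
    except OSError as exc:
        # Anything else (unreachable network, name lookup failure, ...).
        return f"error: {exc}"
```

Calling `probe_port("192.168.9.2", 52044)` with the address and port from the error message would be one way to repeat the client's connect without going through console -u.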

> I realize I could probably restart the master processes, but that would
> probably result in a fair number of upset users, so just trying to figure out
> what this error is, how to fix it, and how to avoid having it happen again.

that would certainly clean things up. if you can't find a process to
kill off, then the server probably got into a broken state because of a
bug and there's really no other choice than this.

out of curiosity, is this 8.1.14 or 8.1.15?

Bryan
Re: console -u stops, gets Connection timed out
Bryan Stansell wrote:

> 52044 is the port number that the master process expects a
> sub-process to be listening on (one that actually handles console
> connections). for whatever reason, that sub-process is either not
> picking up the connection, or the master process hasn't realized
> something went wrong and removed it from its list of sub-processes (many
> possibilities here - a bug dealing with SIGHUP, a bug dealing with reaping
> children, etc). if any of the conserver processes is still lingering in
> a bad state (say, you find the one that has that socket open but it's
> wedged or looping), killing it off should clear things up (the master
> would reap it, clean up its list, respawn another, etc). it would be
> interesting to know if any consoles are missing from the -u output...it
> could help narrow down how it got into the broken state.

Ok - found some more details. Found the process that is responsible for that
port:

8211: conserver -d
ff21fe5c write (21, 10c428, 200)
00030220 FileWrite (11d688, 3a400, 10c428, 400, 1, 0) + 2e0
0001ee74 FlushConsole (c7c80, ffbffbf8, ffbffb78, ffffffff, ffbffbf8, 0) + 728
0001fee0 Kiddie (b0e08, 4cde8, 4c354, 4c2d4, 3, 4cc00) + dec
00020660 Spawn (11d4d0, ffffffff, 11d4d0, cb37, 0, 4d90d) + 3e4
00022d94 main (4b400, ffbffdec, ffbffdf8, 4ce04, 0, 0) + db8
000152e4 _start (0, 0, 0, 0, 0, 0) + 5c
# ksh -o vi
# truss -f -p 8211
8211: write(33, 0x0010C428, 512) (sleeping...)
8211: Received signal #1, SIGHUP, in write() [caught]
8211: write(33, "1B [ 2 5 ; 7 5 H1B [ 2 5".., 512) Err#4 EINTR
8211: setcontext(0xFFBFF768)
8211: write(33, 0x0010C428, 512) (sleeping...)
...

The SIGHUP is most likely there because I have an automated script that
regenerates the conserver console database (pulling the information from
another database).

FD 33:
conserver 8211 root 33u VCHR 23,159 0t27975504 641133 /devices/pseudo/clone@0:ptmx->ptm

I am running 8.1.14 on SPARC Solaris 9.

I can see which consoles are being served by that process, and console -u
<host> on them also times out. I presume they are all missing from the
output of console -u (with no console specified).

It sounds like just killing 8211 should fix the problem (the master process
will see that it died and restart it anew). I don't know whether this is a
problem you want further debugging data from.


Re: console -u stops, gets Connection timed out
On Fri, Jan 05, 2007 at 02:22:51PM -0800, Mark Wedel wrote:
> It sounds like just killing 8211 should fix the problem (the master
> process will see that it died and restart it anew). I don't know whether
> this is a problem you want further debugging data from.

yep, that should fix it. from the output it looks like the console on
fd 33 is defined to be a program of some sort (since it's talking to a
pseudo-tty). looks like that code doesn't set O_NONBLOCK on the fd,
where sockets, etc. would. could be an oversight - i thought i had added
O_NONBLOCK to everything a while back. anyway, that's probably the
issue...as the FileWrite() code is supposed to hide/deal with that.
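
To illustrate the difference O_NONBLOCK makes, here is a small Python sketch. It is not conserver's FileWrite() code (conserver is written in C), and `write_some` and `fill_until_full` are hypothetical helper names. A pipe that nobody reads stands in for the pseudo-tty whose program has stopped accepting data. With the descriptor set non-blocking, a write to a full buffer fails immediately instead of putting the process to sleep in write() the way the truss output shows.

```python
import os

def write_some(fd: int, data: bytes) -> int:
    """Write without blocking; return bytes accepted, or 0 if the buffer is full."""
    try:
        return os.write(fd, data)
    except BlockingIOError:
        # EAGAIN/EWOULDBLOCK: the other end is not draining the buffer.
        return 0

def fill_until_full(fd: int, chunk: bytes = b"x" * 512,
                    limit: int = 10_000_000) -> int:
    """Keep writing 512-byte chunks until the kernel buffer refuses more."""
    total = 0
    while total < limit:
        n = write_some(fd, chunk)
        if n == 0:
            break
        total += n
    return total

if __name__ == "__main__":
    read_end, write_end = os.pipe()
    # Sets the O_NONBLOCK flag on the descriptor.
    os.set_blocking(write_end, False)
    queued = fill_until_full(write_end)
    print(f"buffer full after {queued} bytes; the caller can go serve other consoles")
```

Without the `os.set_blocking(write_end, False)` call, the same loop would simply hang in `os.write()` once the buffer filled, which is the state process 8211 appears to be in.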

if my assumption about the console type is wrong, please let me know.
otherwise, i believe that's the issue. now, why the console is not
accepting the data from the write() call, that's interesting in itself
(and finding the right program connected to that pseudo-tty and killing
it is probably a more graceful way than killing that conserver
sub-process). conserver should gracefully handle that situation, that's
for sure...but there could be another issue lurking around that you
might want to investigate.

Bryan
Re: console -u stops, gets Connection timed out
Bryan Stansell wrote:
> On Fri, Jan 05, 2007 at 02:22:51PM -0800, Mark Wedel wrote:
>> It sounds like just killing 8211 should fix the problem (the master
>> process will see that it died and restart it anew). I don't know whether
>> this is a problem you want further debugging data from.
>
> yep, that should fix it. from the output it looks like the console on
> fd 33 is defined to be a program of some sort (since it's talking to a
> pseudo-tty). looks like that code doesn't set O_NONBLOCK on the fd,
> where sockets, etc. would. could be an oversight - i thought i had added
> O_NONBLOCK to everything a while back. anyway, that's probably the
> issue...as the FileWrite() code is supposed to hide/deal with that.

That makes sense. For a lot of our consoles, we use various scripts that log
into different types of service processors (SPs) and then get the console. I
could certainly believe that some of those connections could go away, an SP
could hang, etc. ptree shows:

# ptree 8211
7712  conserver -d
  8211  conserver -d
    8227  <defunct>
    8221  /bin/sh -ce ssh -l admin stingtest-sp.sfbay
      8222  ssh -l admin stingtest-sp.sfbay
    4575  /bin/sh -ce ssh -l admin blower-sp
      4577  ssh -l admin blower-sp

My guess is that process 8227 is the one causing problems, as the other two
look fine, but I'm not sure.

In any case, killing off 8211 fixed the problem - Thanks!

Re: console -u stops, gets Connection timed out
On Jan 5 14:22, Mark Wedel wrote:
> # truss -f -p 8211
> 8211: write(33, 0x0010C428, 512) (sleeping...)
> 8211: Received signal #1, SIGHUP, in write() [caught]
> 8211: write(33, "1B [ 2 5 ; 7 5 H1B [ 2 5".., 512) Err#4 EINTR
> 8211: setcontext(0xFFBFF768)
> 8211: write(33, 0x0010C428, 512) (sleeping...)
> ...


I see something like this on my Linux server running conserver (8.1.14)
every now and again. I think I narrowed it down to when a conserver
client is connected over a VPN connection and the VPN goes away. It
usually shows up as a write back to the conserver client being stuck.
Sometimes I can ping the dead client's address and the conserver process
will free up and continue.

Nate
Re: console -u stops, gets Connection timed out
On Fri, 2007-01-05 at 17:22 -0600, Nathan Straz wrote:
> I see something like this on my Linux server running conserver (8.1.14)
> every now and again. I think I narrowed it down to when a conserver
> client is connected over a VPN connection and the VPN goes away. It
> usually shows up as a write back to the conserver client being stuck.
> Sometimes I can ping the dead client's address and the conserver process
> will free up and continue.

One thing I've done (and you can debate the drawbacks) is adjust the TCP
keepalive times in the Linux kernel so that the "client" will drop
within a few minutes of a downed connection. When I say "client" I'm
referring to my Console.pm Perl module, which applications use to
connect to remote consoles. I don't use the official client that
often :). Since the client, much like telnet, can wait on read() until the
keepalive timer expires, adjusting those values allows it to notice a bad
VPN connection and attempt to reopen it. The VPNs _do_ go down, and I need
the client to see this quickly and do a new connect.
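
The kernel-wide keepalive settings act as defaults for every socket that enables keepalive. As an alternative illustration, the same timers can be set per socket. This is not the Console.pm code, which is not shown in this thread. The sketch assumes Linux, where the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT socket options exist; `enable_fast_keepalive` and the timer values are hypothetical choices.

```python
import socket

def enable_fast_keepalive(sock: socket.socket, idle: int = 60,
                          interval: int = 10, probes: int = 6) -> None:
    """Turn on TCP keepalive and, where supported, shorten its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The three options below are Linux-specific; skip any that this
    # platform's socket module does not define.
    for name, value in (("TCP_KEEPIDLE", idle),        # seconds idle before first probe
                        ("TCP_KEEPINTVL", interval),   # seconds between probes
                        ("TCP_KEEPCNT", probes)):      # unanswered probes before giving up
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
```

With the example values, a dead peer on an otherwise idle connection would be noticed roughly 60 + 10 * 6 = 120 seconds after the last traffic, at which point a blocked read() returns an error and the client can reconnect.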


