Mailing List Archive

Reconnect causes slowdown in 7.2.1
Well, I'd been trying for quite a while to find the cause of random
slowdowns on some (but not all) of my console connections.

The slowness is in the response time of console connections. This
includes things like the time it takes to make the initial connection
(time from typing "console dev4-001" to getting a "[Enter `^Ec?' for
help]") or, once you're connected, from hitting a key and getting a
response from the machine you're connected to.

The slowness affects all 16 consoles managed by a given Conserver process.

What I found was that if a terminal server was rejecting reverse-telnet
connections on even one port managed by a Conserver process, it'd slow 'em
all down.

Now the reinit wasn't happening immediately. It'd be attempted every
few seconds, so the load should be negligible. And it didn't seem to bog
the rest of the host down, just that one Conserver process.

So, is this problem addressed by the reinit changes in 7.2.3?

Or does the sleep in the reinit loop hold up the whole Conserver process?


Thanks in advance,
Aaron Burt
Open Source Development Lab
Re: Reconnect causes slowdown in 7.2.1
On Mon, Jan 06, 2003 at 04:22:37PM -0800, Aaron Burt wrote:
> The slowness is in the response time of console connections. This
> includes things like the time it takes to make the initial connection
> (time from typing "console dev4-001" to getting a "[Enter `^Ec?' for
> help]") or, once you're connected, from hitting a key and getting a
> response from the machine you're connected to.

are we talking about a second or more like 10 second delays? see below
for more.

> The slowness affects all 16 consoles managed by a given Conserver process.

that makes sense...when a process is "busy" waiting, it'll hang the
group of consoles it's managing.

> So, is this problem addressed by the reinit changes in 7.2.3?

i'd have to say no. i believe most things are the same in regards to
this issue, but, of course, there are other things the new stuff has
that could be useful. but, i'm digressing.

> Or does the sleep in the reinit loop hold up the whole Conserver process?

there are a couple of sleep() calls that would hold up a conserver
process. each is less than a second, however, and i'd really be
surprised if they were "stacking up" and giving you long pauses
(although a second can seem like a long time too). if you're seeing
very long pauses, it's more likely the call to connect() that's
hanging. was your terminal server actively rejecting the
reverse-telnet connections, or is it just half-opening the socket? if
it's not getting an active rejection, you'll see a 10 second delay
before the conserver process gives up and decides to move on - any
other consoles managed by that process will hang since it's not
multi-threaded or anything (this would be an ideal place to optimize
conserver's functionality with threads). if you think it's a connect()
issue, you can use the --with-timeout flag to reduce the delay to a
smaller value (like 1) as a work-around - just make sure your normal
terminal server response time is below that value.

if none of this seems to help, you could try tracing the process and
seeing where, exactly, these long delays occur. or even running in
debug mode could help track it down. if we can narrow down what part
of the code is actually the culprit, perhaps it can be changed to be
less devastating.

Bryan
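
The difference between an active rejection and a hanging connect() can be sketched as follows. This is an illustration in Python, not code from conserver (which is written in C); the helper name try_connect and its 1-second default are assumptions chosen to mirror the work-around value mentioned above.

```python
import socket
import time


def try_connect(host, port, timeout=1.0):
    """Attempt one TCP connection and report how it ended and how long it took.

    Hypothetical helper for illustration only; not part of conserver.
    """
    start = time.monotonic()
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        # Active rejection: the peer answered with a reset, so we return fast.
        status = "refused"
    except socket.timeout:
        # No answer at all: the caller was blocked for the full timeout.
        status = "timeout"
    except OSError:
        # Anything else (unreachable network, name lookup failure, ...).
        status = "error"
    else:
        sock.close()
        status = "connected"
    return status, time.monotonic() - start
```

A "refused" result should come back almost immediately, while a "timeout" result costs the caller the whole timeout. In a single-threaded process, that blocked time is shared by every console the process manages.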
Re: Reconnect causes slowdown in 7.2.1
On Mon, 6 Jan 2003, Bryan Stansell wrote:
> On Mon, Jan 06, 2003 at 04:22:37PM -0800, Aaron Burt wrote:
> > The slowness is in the response time of console connections. This
> > includes things like the time it takes to make the initial connection
> > (time from typing "console dev4-001" to getting a "[Enter `^Ec?' for
> > help]") or, once you're connected, from hitting a key and getting a
> > response from the machine you're connected to.
>
> are we talking about a second or more like 10 second delays? see below
> for more.

From 1+ up to about 18 seconds to connect, depending on how many consoles
are in reinit. Keystroke-to-host response times are typically about half
the connect time. I timed by counting seconds, so the accuracy leaves
something to be desired.

Strangely, Conserver commands and responses took around 2 seconds
consistently whether 1 or 7 consoles were in reinit. This was true for
^Ec commands and for "console down" messages when sending keystrokes to
downed consoles.

> > The slowness affects all 16 consoles managed by a given Conserver process.
>
> that makes sense...when a process is "busy" waiting, it'll hang the
> group of consoles it's managing.

That's a shame.

> > So, is this problem addressed by the reinit changes in 7.2.3?
>
> i'd have to say no. i believe most things are the same in regards to
> this issue, but, of course, there are other things the new stuff has
> that could be useful. but, i'm digressing.

Indeed. The ability to turn off auto-reinit, for one. I'll have to see
if I can find a way to dump a list of consoles in reinit and to force a
console down/up. With that, I should be able to automagically find and
fix blocked ports, which is a common problem after network outages and
suchlike.

> > Or does the sleep in the reinit loop hold up the whole Conserver process?
>
> there are a couple of sleep() calls that would hold up a conserver
> process. each is less than a second, however, and i'd really be
> surprised if they were "stacking up" and giving you long pauses

That's what they appear to be doing. I found a group that had around 7
ports in reinit, and the delay decreased in a linear fashion as I brought
ports out of reinit.

The reinit retries also happened faster as fewer ports were in reinit.

> (although a second can seem like a long time too). if you're seeing
> very long pauses, it's more likely the call to connect() that's
> hanging. was your terminal server actively rejecting the
> reverse-telnet connections, or is it just half-opening the socket?

It was sending "port in use" or some such and then dropping the
connection.
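
The linear relationship reported above is what one would expect if each console in reinit adds a fixed blocking pause to a single-threaded loop. A toy model of that, with a simulated clock instead of real sleeps; the function name and the 1.0-second pause are assumptions for illustration, not values measured from conserver:

```python
def blocked_time_per_pass(consoles_in_reinit, pause_per_retry):
    """Total time one pass of a single-threaded loop spends blocked on retries.

    Toy model: every console in reinit costs one fixed pause, and nothing
    else (keystrokes, new connections) is serviced until the pass is over.
    """
    clock = 0.0
    for _ in range(consoles_in_reinit):
        clock += pause_per_retry  # stands in for a blocking sleep() or connect()
    return clock


# Hypothetical pause of 1.0 s per retry; the real figure is not known here.
for n in (1, 3, 7):
    print(n, "in reinit ->", blocked_time_per_pass(n, 1.0), "s blocked per pass")
```

Under this model the wait grows in step with the number of ports in reinit and shrinks as ports are brought back, which matches the behaviour described, though it does not by itself show where in the code the pauses come from.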
Re: Reconnect causes slowdown in 7.2.1
On Tue, Jan 07, 2003 at 03:49:33PM -0800, Aaron Burt wrote:
> From 1+ up to about 18 seconds to connect, depending on how many consoles
> are in reinit. Keystroke-to-host response times are typically about half
> the connect time. I timed by counting seconds, so the accuracy leaves
> something to be desired.
>
> Strangely, Conserver commands and responses took around 2 seconds
> consistently whether 1 or 7 consoles were in reinit. This was true for
> ^Ec commands and for "console down" messages when sending keystrokes to
> downed consoles.

ok...this is very bizarre (well, unexpected in my mind), but there has
to be a good reason for it. what, i don't know, but maybe we can track
it down.

> Indeed. The ability to turn off auto-reinit, for one. I'll have to see
> if I can find a way to dump a list of consoles in reinit and to force a
> console down/up. With that, I should be able to automagically find and
> fix blocked ports, which is a common problem after network outages and
> suchlike.

the 'console -i' output shows lots of data and is there for just this
purpose. hopefully it has enough for what you need.

> > (although a second can seem like a long time too). if you're seeing
> > very long pauses, it's more likely the call to connect() that's
> > hanging. was your terminal server actively rejecting the
> > reverse-telnet connections, or is it just half-opening the socket?
>
> It was sending "port in use" or some such and then dropping the
> connection.

that helps. then, yeah, you're hitting just about every sleep call and
probably in rapid succession. it still would be interesting to see
truss/strace/whatever output of a child process that was busily trying
to bring up a port. and you would benefit from the new code in that
you could turn off the immediate auto-bringup of the console and let it
kick in only every minute or longer.

so, i *think*, now that i'm at the end of your message, that i
understand what's going on. at least, i have a good guess (your note
of it getting a 'port in use' and being dropped was the key). if
you're logging all those ports, you should have a heck of a set of
large logfiles, huh?

aside from the "work-around" possibilities (like the new-for-you
auto-retry options), i don't have much else that can help. but it is
making me think of ways to redo the code so i can get rid of the sleep
statements and, hopefully, reduce or remove the noticeable delays.
dunno if or when it'll become code, so try the work-arounds for now.

Bryan
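
One common way to drop blocking sleeps from a single-threaded loop is to record when each retry is due and let the loop's select() timeout wake it at the right moment. The sketch below shows that general technique in Python; it is not how conserver was or is implemented, and the class name RetryScheduler and the 60-second interval are assumptions for illustration.

```python
import heapq


class RetryScheduler:
    """Track when each down console is next due for a reconnect attempt.

    Hypothetical helper: instead of sleeping between retries, the main loop
    asks how long it may wait in select() and which consoles are due now.
    """

    def __init__(self, interval):
        self.interval = interval  # seconds between attempts for one console
        self._heap = []           # min-heap of (due_time, console_name)

    def schedule(self, name, now):
        """Queue a retry for `name`, due `interval` seconds after `now`."""
        heapq.heappush(self._heap, (now + self.interval, name))

    def next_timeout(self, now):
        """Seconds until the earliest retry is due, or None if none pending."""
        if not self._heap:
            return None
        return max(0.0, self._heap[0][0] - now)

    def pop_due(self, now):
        """Remove and return the names of all consoles whose retry is due."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due
```

In a loop built this way, the value from next_timeout() would be passed as the timeout to select.select(), so keystrokes on healthy consoles are handled as soon as they arrive and retries run only when pop_due() returns them. Each retry would still need a non-blocking or short-timeout connect() to avoid stalling the loop.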