Mailing List Archive: question about conserver scaling out ability

Nov 17, 2009, 11:29 PM

Post #1 of 6 (3803 views)

Hi,

We are using conserver to handle the consoles in our cluster. Everything
worked perfectly until several months ago, when our cluster started growing
larger and larger. Our cluster now has 2,000 nodes and will grow to
16,000 nodes in the near future, and we are already seeing problems at
2,000 nodes.

1. conserver starts responding slowly after it has been running for a
while (maybe several days, I am not sure). When it is slow, it can take
more than 10 seconds to open a node console, and occasionally the console
cannot be opened at all. We have to restart conserver to fix the problem.

2. Restarting conserver takes a very long time, about 5 minutes, to
finish initialization with 2,000 nodes. During initialization, rcons
gets a "Connection refused" error.

We tried some scaling tuning for conserver, but it does not seem to help
much. Could you give me some further advice on tuning conserver for
scale? Thank you.

1. Hierarchy: we set up several conserver hosts in the cluster and use the
"master" keyword on the central management node to specify which conserver
host each console goes to.

#xCAT BEGIN aixcn1 CONS
console aixcn1 {
    type exec;
    master aixsn1;
}
#xCAT END aixcn1 CONS

2. Changed the number of consoles each daemon can handle: we set it to 64
by passing -m 64 to the conserver daemon.
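
As a rough back-of-the-envelope check on the -m setting, the sketch below
counts how many daemon processes would be needed at a given per-process
console limit. It assumes the limit applies per process and that all
consoles sit on one conserver host; the function name is made up for
illustration.

```python
def processes_needed(consoles: int, max_per_process: int) -> int:
    """Processes required if each handles at most max_per_process consoles."""
    if consoles < 0 or max_per_process <= 0:
        raise ValueError("consoles must be >= 0 and max_per_process > 0")
    # Integer ceiling division, so a partly filled process still counts.
    return (consoles + max_per_process - 1) // max_per_process


if __name__ == "__main__":
    for nodes in (2000, 16000):
        print(nodes, "consoles ->", processes_needed(nodes, 64), "processes")
```

With the hierarchy from item 1, the same arithmetic applies per conserver
host rather than to the whole cluster.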

Thanks,
-------------------------------------------------------------------------
Li, Guang Cheng (李光成)
IBM China Software Development Laboratory

Re: question about conserver scaling out ability [ In reply to ]

wernli at in2p3

Nov 17, 2009, 11:58 PM

Post #2 of 6 (3708 views)

On Wed, Nov 18, 2009 at 03:29:44PM +0800, Guang Cheng Li wrote:
> We tried some scaling tuning for conserver, but it does not seem to help
> much. Could you give me some further advice on tuning conserver for
> scale? Thank you.

Whatever optimization you'll end up using, you will have to use more than
one conserver.

The neat thing about conserver.cf is that you can keep exactly the same
file on different servers, and when one fails, if your consoles are SOL
controlled, you can just s/conserver1/conserver2/ for these and reload
the service.
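
The s/conserver1/conserver2/ step could be sketched roughly as follows.
This is only an illustration: it assumes each "master" statement sits on
its own line in conserver.cf, and the function name and sample host names
are hypothetical.

```python
import re


def switch_master(config_text: str, old: str, new: str) -> str:
    """Rewrite 'master <old>;' statements to 'master <new>;'.

    Assumes one statement per line. Other statements, and masters whose
    names merely start with <old>, are left untouched.
    """
    pattern = re.compile(
        r"^(\s*master\s+)" + re.escape(old) + r"(\s*;)", re.MULTILINE
    )
    # A function replacement avoids backslash surprises in the new name.
    return pattern.sub(lambda m: m.group(1) + new + m.group(2), config_text)


if __name__ == "__main__":
    sample = (
        "console node1 {\n"
        "    master conserver1;\n"
        "}\n"
        "console node2 {\n"
        "    master conserver10;\n"
        "}\n"
    )
    print(switch_master(sample, "conserver1", "conserver2"))
```

The rewritten file would still need to be distributed to every server and
the service reloaded, as described above.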

We, for one, sync configuration changes using CVS.

_______________________________________________
users mailing list
users@conserver.com
https://www.conserver.com/mailman/listinfo/users

Re: question about conserver scaling out ability [ In reply to ]

Andras.Horvath at cern

Nov 18, 2009, 12:23 AM

Post #3 of 6 (3706 views)

On Wed, Nov 18, 2009 at 08:58:47AM +0100, Fabien Wernli wrote:
>
> Whatever optimization you'll end up using, you will have to use more than
> one conserver.

FWIW, we're using one conserver node for every 200-300 machines right now,
and an external method to identify each client's "headnode". We'll
probably not scale up into the thousands of clients per machine simply
because downtime on a "headnode" would then mean thousands of
inaccessible consoles.
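
To put numbers on that ratio, here is a small sketch of the sizing
arithmetic plus one possible way to map a client to its "headnode". The
block assignment and the "conserver<N>" naming are assumptions made for
illustration, not a description of our external method.

```python
def headnodes_needed(nodes: int, consoles_per_headnode: int) -> int:
    """Headnodes required if each serves at most consoles_per_headnode."""
    if nodes < 0 or consoles_per_headnode <= 0:
        raise ValueError("nodes must be >= 0 and consoles_per_headnode > 0")
    return (nodes + consoles_per_headnode - 1) // consoles_per_headnode


def headnode_for(node_index: int, consoles_per_headnode: int,
                 prefix: str = "conserver") -> str:
    """Assign nodes to headnodes in contiguous blocks (hypothetical scheme)."""
    if node_index < 0 or consoles_per_headnode <= 0:
        raise ValueError("node_index must be >= 0 and block size > 0")
    return f"{prefix}{node_index // consoles_per_headnode + 1}"


if __name__ == "__main__":
    for nodes in (2000, 16000):
        low = headnodes_needed(nodes, 300)
        high = headnodes_needed(nodes, 200)
        print(f"{nodes} nodes: between {low} and {high} headnodes")
```

A fixed block size also bounds how many consoles become unreachable when
a single headnode goes down.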

Andras

--
Andras HORVATH
Systems engineer, CERN IT FIO
Tel: +41 22 767 4290
Fax: +41 22 766 9154


Re: question about conserver scaling out ability [ In reply to ]

john at iastate

Nov 18, 2009, 5:30 AM

Post #4 of 6 (3709 views)

2009/11/18 Guang Cheng Li <liguangc@cn.ibm.com>

> We are using conserver to handle the consoles in our cluster. Everything
> worked perfectly until several months ago, when our cluster started growing
> larger and larger. Our cluster now has 2,000 nodes and will grow to
> 16,000 nodes in the near future, and we are already seeing problems at
> 2,000 nodes.
>
> 1. conserver starts responding slowly after it has been running for a
> while (maybe several days, I am not sure). When it is slow, it can take
> more than 10 seconds to open a node console, and occasionally the console
> cannot be opened at all. We have to restart conserver to fix the problem.
>

I would suspect that perhaps you have started swapping, either due to a
memory leak or just memory consumption.
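
One rough way to test the memory theory is to sample the daemon's
resident memory periodically and see whether it only ever grows. The
sketch below assumes "PID RSS COMMAND" lines such as Linux
`ps -eo pid,rss,comm` prints; ps options and columns differ on other
platforms (AIX included), and the function names and growth threshold
are arbitrary choices for illustration.

```python
import os


def total_rss_kb(ps_output: str, name: str = "conserver") -> int:
    """Sum the RSS column (KB) over processes whose command is `name`."""
    total = 0
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) != 3 or not fields[1].isdigit():
            continue  # header or malformed line
        if os.path.basename(fields[2].strip()) == name:
            total += int(fields[1])
    return total


def looks_like_leak(samples_kb: list, min_growth: float = 1.5) -> bool:
    """True if usage never shrank and ended at least min_growth times higher."""
    if len(samples_kb) < 2 or samples_kb[0] <= 0:
        return False
    never_shrank = all(b >= a for a, b in zip(samples_kb, samples_kb[1:]))
    return never_shrank and samples_kb[-1] >= min_growth * samples_kb[0]
```

Steady growth across days would point toward a leak; a large but flat
figure would point toward plain memory consumption.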

> 2. Restarting conserver takes a very long time, about 5 minutes, to
> finish initialization with 2,000 nodes. During initialization, rcons
> gets a "Connection refused" error.
>

What method are you using to connect to the nodes? Our conserver (only
~500 nodes) connects via Cyclades ACS-48 boxes, and we quickly found out
that 'raw socket' connections scaled vastly better than 'ssh' ones.

John

Re: question about conserver scaling out ability [ In reply to ]

liguangc at cn

Nov 19, 2009, 3:10 AM

Post #5 of 6 (3708 views)

Hi,

We are not using terminal servers to connect to the nodes; we are using
the IBM Hardware Management Console to open the consoles. The IBM Hardware
Management Console has its own utility for opening consoles to all the
nodes that it manages.

Thanks,
-------------------------------------------------------------------------
Li, Guang Cheng (李光成)
IBM China Software Development Laboratory


Re: question about conserver scaling out ability [ In reply to ]

cpz at tuunq

Nov 19, 2009, 8:54 AM

Post #6 of 6 (3698 views)

Guang Cheng Li wrote:
> We are not using terminal servers to connect to the nodes; we are using
> the IBM Hardware Management Console to open the consoles. The IBM Hardware
> Management Console has its own utility for opening consoles to all the
> nodes that it manages.

Have you looked at the overhead of the IBM console app? IME these sorts of
apps try to be all things to all people, whereas in your case you simply
need it to be a conduit from the conserver daemon to the target host. If
the IBM app has a "be really stupid" mode, you might try that.

z!