Mailing List Archive: Improving Rancid's processing speed when having 1k+ devices

vv.corto at gmail

Jul 25, 2019, 5:29 AM

Post #1 of 9 (1347 views)

Well, as per the title, is there any way to improve rancid's speed with so many
devices? At the moment I set PAR_COUNT to 300, so it will connect to 300
devices in parallel, but in reality most of the time does not seem to be
taken by connecting and retrieving configs but by what happens next: the
file processing and git-committing.

To give you some stats, with current settings it takes around 9 minutes to
do 1200 devices. I have only 1 group with all devices under the same group.

Any trick you might have, please let me know!

Thanks,

Vlad

Re: Improving Rancid's processing speed when having 1k+ devices

emille at abccommunications

Jul 25, 2019, 8:14 AM

Post #2 of 9 (1346 views)

I've seen/heard stories of people pre-empting rancid with an snmp-get of the config-last-changed / last committed OID, to generate a list of devices to run against.

Have always wanted to set that up for our instance as we are approaching the 500 device mark, but it's not become a big enough problem for us... yet.
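
A minimal sketch of the selection step in that idea, assuming a separate poller has already read a "config last changed" value per device over SNMP (the polling itself is not shown, and the right OID is vendor-specific; on Cisco IOS the CISCO-CONFIG-MAN-MIB object ccmHistoryRunningLastChanged is one candidate). The function name and the sample hosts and values are hypothetical.

```python
def changed_devices(previous, current):
    """Return hosts whose last-changed value is new or differs from the last poll.

    Both arguments map hostname -> last-changed value. Inequality is used
    instead of "greater than" because uptime-based counters restart after a
    reboot, and a rebooted device is worth collecting again anyway.
    """
    changed = []
    for host, value in current.items():
        if previous.get(host) != value:
            changed.append(host)
    return sorted(changed)


# Hypothetical values, e.g. timeticks gathered by an external SNMP poller.
previous = {"core1": 1200, "edge1": 5000, "edge2": 700}
current = {"core1": 1200, "edge1": 5300, "edge2": 40, "edge3": 10}

print(changed_devices(previous, current))
```

The resulting list could then drive a run against only those devices (for instance one device at a time through rancid-run's -r option, if your version has it), with a periodic full run as a safety net.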


Re: Improving Rancid's processing speed when having 1k+ devices

heas at shrubbery

Jul 25, 2019, 9:38 AM

Post #3 of 9 (1346 views)

Thu, Jul 25, 2019 at 08:14:28AM -0700, Emille Blanc:
> I've seen/heard stories of people pre-empting rancid with an snmp-get of the config-last-changed / last committed OID, to generate a list of devices to run against.

a building block for that is in the FAQ S3 Q10; using syslog ....

_______________________________________________
Rancid-discuss mailing list
Rancid-discuss@shrubbery.net
http://www.shrubbery.net/mailman/listinfo/rancid-discuss

Re: Improving Rancid's processing speed when having 1k+ devices

heas at shrubbery

Jul 25, 2019, 9:55 AM

Post #4 of 9 (1346 views)

Thu, Jul 25, 2019 at 02:29:37PM +0200, Florin Vlad Olariu:
> Well, as per title, is there any way to improve rancid's speed with so many
> devices? At the moment I set PAR_COUNT to 300, so it will connect in
> parallel to 300 devices at a time, but the reality is that most time does
> not seem to be taken by connecting and retrieving config but by what
> happens next in the file processing and git-committing.
>
> To give you some stats, with current settings it takes around 9 minutes to
> do 1200 devices. I have only 1 group with all devices under the same group.
>
> Any trick you might have, please let me know!

Typically, the network and, more so, the devices are the slow part. Some
devices are much slower than others. More parallelism helps a lot, hence your
high PAR_COUNT. Other thoughts:

- cvs is slow. Use svn or git. svn is probably faster, but I have not
  benchmarked the two for the functions that rancid uses.
- Make sure that the rancid user is not process rlimited to less than ~605
  processes (PAR_COUNT * 2 + 5 or so).
- perl is a memory pig. If the host/vm has memory pressure, this would be
  something to address.
- Retrieving device output does not require much cpu, but processing does use
  some. Don't starve it.
- Use rancid.conf:NOPIPE=YES; I think this is faster because perl is a pig.
- If you only need configs, then reduce what is collected to just show version
  and show running. Or have one hourly group that collects that, and a daily
  group that collects everything. Less processing, and especially many fewer
  regexes.
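
The process-limit rule of thumb in the list above can be checked with a short sketch like the following. It assumes a Unix host where Python's resource module exposes RLIMIT_NPROC, and it must be run as the rancid user to be meaningful; the function names are made up for illustration.

```python
import resource


def min_process_limit(par_count):
    """Rule of thumb from the post: PAR_COUNT * 2 + 5 processes."""
    return par_count * 2 + 5


def nproc_limit_ok(par_count):
    """Compare the soft per-user process limit with the rule of thumb."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    # RLIM_INFINITY means "no limit", which always satisfies the check.
    return soft == resource.RLIM_INFINITY or soft >= min_process_limit(par_count)


print(min_process_limit(300))  # 605, the figure quoted for PAR_COUNT=300
print(nproc_limit_ok(300))
```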

Multiple groups might help, at least for the SCM part. Split your one large
group into a few. Make sure to use a separate cron entry for each so that they
run in parallel.
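
One possible way to do the split, sketched under the assumption that router.db holds one "name;type;state" entry per line (the rancid 3.x form) and that an even round-robin split is acceptable. The function name, host names and group count are hypothetical.

```python
def split_round_robin(lines, group_count):
    """Distribute router.db entries over group_count lists, one entry at a time.

    Blank lines and comment lines (starting with '#') are skipped.
    """
    groups = [[] for _ in range(group_count)]
    entries = [
        line for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    ]
    for index, entry in enumerate(entries):
        groups[index % group_count].append(entry)
    return groups


# Hypothetical router.db content.
router_db = [
    "# lab devices",
    "core1.example.net;cisco;up",
    "core2.example.net;cisco;up",
    "edge1.example.net;juniper;up",
    "edge2.example.net;juniper;up",
    "edge3.example.net;juniper;down",
]

for number, group in enumerate(split_round_robin(router_db, 2), start=1):
    print(f"group{number}: {len(group)} devices")
```

Each resulting list would become the router.db of its own group (added to LIST_OF_GROUPS in rancid.conf), with one cron entry per group so the runs overlap. Splitting by site or device type instead of round-robin may be easier to maintain.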

I haven't attempted to benchmark or optimize any parts for a while. There was
a complaint about the start-up time for control_rancid, which seems to me to
be inconsequential, but I do not know what the users were attempting to do
with rancid that made this matter. There are other benefits to this, so I've
started to re-write it; this is not ready yet.

9 minutes for 1200 devices seems reasonable to me. :)


Re: Improving Rancid's processing speed when having 1k+ devices

scott.granados at gmail

Jul 25, 2019, 10:16 AM

Post #5 of 9 (1346 views)

I would also recommend running multiple rancid servers, maybe scattered geographically, so it's not a single machine pulling all the weight. Break the workloads up among them.


Re: Improving Rancid's processing speed when having 1k+ devices

vv.corto at gmail

Jul 26, 2019, 2:34 AM

Post #6 of 9 (1345 views)

On 25 July 2019 at 18:16:48, Scott Granados (scott.granados@gmail.com) wrote:

> I would also recommend running multiple rancid servers maybe scatter them geographically so it’s not a single machine pulling all the weight. Break the work loads up among them.

Great advice, which hadn't crossed my mind. I might have to resort to this
if I want ~1 minute poll times.

On 25 July 2019 at 17:55:31, john heasley (heas@shrubbery.net) wrote:

> - cvs is slow. use svn or git. svn is probably faster; but I have not
> benchmarked the two for the functions that rancid uses.

I do use git already. Not sure git itself is to blame for the slowdown though.

> - make sure that the rancid user is not process rlimited to less than ~605
> processes; or PAR_COUNT * 2 + 5 or so.

My `ulimit -u` gives "4096". I don't think this is a factor?

> - perl is a memory pig. if the host/vm has memory pressure, this would be
> something to address.
> - retrieving device output does not require much cpu, but process does use
> some - don't starve it

I have a Xeon 8-core box, and when running with PAR_COUNT=400 the load
goes to 50+, but only for a short period (the time it takes to connect to
the devices). After ~2 minutes it goes back to normal, so I don't think
CPU is really the problem. Furthermore, I have 32G of RAM, and running
`watch free -h` it does not look like rancid uses *that* much memory,
maybe ~5G.

> - use rancid.conf:NOPIPE=YES; i think this is faster because perl is a pig.

Tried this, but no difference in time :(

> - if you only need configs, then reduce what is collected to just show version
> and show running. or have one hourly group that collects that, and a daily
> group that collects everything. less processing, and esp many fewer regexes.

I only need configs, and the way rancid is configured it already only
polls "show run" (or equivalent).

It seems the only real solution might be to split the hosts across
different machines.

Thanks John.


Re: Improving Rancid's processing speed when having 1k+ devices

Jul 26, 2019, 4:29 AM

Post #7 of 9 (1345 views)

> 9 minutes for 1200 devices seems reasonable to me. :)

Heh - I've got around 3,000. I'm having an issue with PAR that I haven't fully addressed, so I'm still only doing 5 at a time and getting 4- to 5-hour run times. We made a choice at one point to put all "do-diff" groups on one line in cron, which didn't help at all, but we haven't yet backed that out. If we were to break that up appropriately, we'd have around 1200 in the largest group, several hundred in a few, and a number of groups (about 15 altogether) with <10. We could break things up further, but at some point you have to just accept large router.db files, because there's managerial overhead in maintaining a large number of rancid groups and keeping them synchronized against CDP and LLDP discoveries and the CMDB in a dynamic environment.

Our old server, which we stood up in 2002 using rancid 1.2, was set to PAR=100 and took about 45 min for the entire suite. We never actually hit 100 simultaneous connections; we maxed out at around 60-70 because by the time the 71st connection was opened the 1st was completing. Of course, that was a server stood up in 2002, so take that for whatever it's worth.
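
The ceiling described above can be illustrated with a back-of-the-envelope model (essentially Little's law): the number of sessions in flight is roughly the session duration divided by the interval between session starts, capped by the configured limit. This is only a sketch; the function name and the timing figures below are hypothetical and were not measured on any server in this thread.

```python
import math


def effective_parallelism(par_limit, start_interval_s, session_s):
    """Sessions in flight at steady state, capped by the configured limit."""
    in_flight = math.ceil(session_s / start_interval_s)
    return min(par_limit, in_flight)


# A new session every 0.5 s, each lasting 35 s: the limit of 100 is never hit.
print(effective_parallelism(100, 0.5, 35))

# Much slower devices (120 s per session): the limit becomes the constraint.
print(effective_parallelism(100, 0.5, 120))
```

Under this model, raising the limit only helps once sessions last long enough, or start fast enough, to reach it.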

Is 9 min too long?

weylin


Re: Improving Rancid's processing speed when having 1k+ devices

heas at shrubbery

Jul 29, 2019, 11:06 AM

Post #8 of 9 (1335 views)

Fri, Jul 26, 2019 at 02:34:49AM -0700, Florin Vlad Olariu:
> On 25 July 2019 at 18:16:48, Scott Granados (scott.granados@gmail.com) wrote:
>
> > I would also recommend running multiple rancid servers maybe scatter them geographically so it’s not a single machine pulling all the weight. Break the work loads up among them.
>
> Great advice which didn't cross my mind. Might have to resort to this
> if I want ~ 1m poll times.

Topologically close servers can help, but I would just run more processes
instead. Less management overhead.

> > - make sure that the rancid user is not process rlimited to less than ~605
> > processes; or PAR_COUNT * 2 + 5 or so.
>
> My `ulimit -u` gives "4096". I don't think this is a factor?

Unlikely. Make sure it's not one of the other limits (ulimit -n, -d). You'd
see processes being killed in the logs.
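
As a rough sketch, the two limits named above can be read from Python's resource module on a Unix host; this assumes the script is run as the rancid user, since limits are per user and per session. The mapping and function name are illustrative only.

```python
import resource

# ulimit flag -> (description, resource constant)
LIMITS = {
    "-n": ("open files", resource.RLIMIT_NOFILE),
    "-d": ("data segment size", resource.RLIMIT_DATA),
}


def describe_limits():
    """Return {flag: (description, soft, hard)} for the current process."""
    report = {}
    for flag, (description, constant) in LIMITS.items():
        soft, hard = resource.getrlimit(constant)
        report[flag] = (description, soft, hard)
    return report


for flag, (description, soft, hard) in describe_limits().items():
    # resource.RLIM_INFINITY marks an unlimited value.
    print(flag, description, soft, hard)
```

Note that getrlimit reports the data segment size in bytes, while the shell's `ulimit -d` typically prints kilobytes.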

...

Are your configs very large? I have one group of 252 devices that are
scattered around the globe, totaling 1.2G of on-disk rancid output, which
takes about 28 minutes to collect with 16 processes.


Re: Improving Rancid's processing speed when having 1k+ devices

Jul 29, 2019, 2:01 PM

Post #9 of 9 (1335 views)

> topologically close servers can help, but I would just run more processes instead.

Agreed in 99% of cases. There are, though, rare niche scenarios where geographically co-located servers can help: slow WAN connections ("dial-up"), high-latency or high-packet-loss connections (satellite), unreliable WAN links (ship at sea), and so forth.

weylin
