LACP drops simultaneously across multiple switches/products/versions
We're having a strange issue where LACP will bounce on multiple switches
simultaneously, typically several times in a row.

We previously saw this on our S50n stack when it was our core switch,
but it hadn't happened in over a year. Now that we have a C300 in addition to
the S50n stack, we've seen it 5 times in 4 days.

What we've seen so far is that the two LACP groups from the C300 to our only
two S55s bounce first, then all the LACP groups on the C300 bounce, as well as
all the LACP groups on the S50n stack.

We don't get any CPU watchdog notices, and traces don't show that the LACP
process has restarted.

Has anyone experienced these types of problems? I currently have an open TAC
case but want to hear about others' experiences here.

-Doug
Re: LACP drops simultaneously across multiple switches/products/versions
You don't go into detail as to the log messages you see during the
failure, so it's certainly hard to diagnose with anything but
anecdote. However, here's my anecdote....

I have encountered similar sporadic LACP issues across numerous
switches on an extremely large scale. The best Force10 could suggest
was to try 30-second LACP heartbeat timers, presumably so their
control plane had sufficient time to reply to heartbeat messages. To
be honest, that workaround was not acceptable to us, so I didn't
even bother to validate whether it actually "fixed" anything.
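
For a sense of what the longer timers cost, here is a rough sketch of the
arithmetic. It assumes the standard IEEE 802.1AX defaults (LACPDUs every 1
second in fast mode or every 30 seconds in slow mode, with the partner
expiring the link after three missed intervals). The constant and function
names are illustrative and are not taken from any Force10 implementation.

```python
# Sketch: how long a LACP partner waits before declaring a link expired,
# assuming the IEEE 802.1AX default timer values (not Force10-specific).

FAST_PERIODIC_S = 1     # LACPDU interval with the short ("fast") timeout
SLOW_PERIODIC_S = 30    # LACPDU interval with the long ("slow") timeout
TIMEOUT_MULTIPLIER = 3  # partner expires the link after three missed intervals


def lacp_timeout_seconds(periodic_interval_s: int) -> int:
    """Return how long a partner waits without LACPDUs before expiring the link."""
    return periodic_interval_s * TIMEOUT_MULTIPLIER


if __name__ == "__main__":
    for label, interval in (("fast", FAST_PERIODIC_S), ("slow", SLOW_PERIODIC_S)):
        timeout = lacp_timeout_seconds(interval)
        print(f"{label}: LACPDU every {interval}s, link expires after {timeout}s")
```

Under those assumptions, slow timers give a stalled control plane far more
slack, but a link that fails without dropping carrier can also stay in the
bundle for up to 90 seconds instead of 3.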

This is pretty much why we dropped all our layer 2 link aggregation
and moved to L3 ECMP load balancing across links.
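
As a rough illustration of the ECMP approach, the sketch below picks one of
several equal-cost next hops per flow by hashing the flow's 5-tuple. Real
switches do this in hardware with vendor-specific hash functions; the `Flow`
type, the `pick_next_hop` helper, the SHA-256 hash, and the addresses are all
assumptions made for the example.

```python
# Sketch: per-flow ECMP next-hop selection. The fields and hash function are
# illustrative only; hardware implementations differ by vendor and platform.
import hashlib
from typing import NamedTuple, Sequence


class Flow(NamedTuple):
    src_ip: str
    dst_ip: str
    protocol: int
    src_port: int
    dst_port: int


def pick_next_hop(flow: Flow, next_hops: Sequence[str]) -> str:
    """Hash the flow's 5-tuple and map it onto one of the equal-cost next hops."""
    if not next_hops:
        raise ValueError("no next hops available")
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


if __name__ == "__main__":
    hops = ["192.0.2.1", "192.0.2.5"]  # hypothetical routed uplinks
    flow = Flow("198.51.100.10", "203.0.113.20", 6, 49152, 443)
    print(pick_next_hop(flow, hops))
```

Packets of the same flow always hash to the same link, so ordering is
preserved, and each link is an independent routed interface with no LACP
state machine involved.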

In my opinion, a lot of these problems are fundamental design issues
with regard to control plane management.

On Thu, Mar 1, 2012 at 6:34 AM, Doug Warner <doug@warner.fm> wrote:
> We're having a strange issue where LACP will bounce on multiple switches
> simultaneously, typically several times in a row.

_______________________________________________
force10-nsp mailing list
force10-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/force10-nsp
Re: LACP drops simultaneously across multiple switches/products/versions
So far we've received the same suggestion from F10 to increase the LACP timers
and I agree that it basically means losing the feature we're trying to use.

Unfortunately I don't really have a whole lot of additional logging; I see the
LACP groups ungroup, RSTP changes, then LACP regroups, more RSTP changes, etc.
I *finally* got a CPU interrupt watchdog notice on my S50n stack, but I've
seen this over half a dozen times now with no other error messages.
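
One way to get more out of sparse logs like these is to merge the syslog
output from all the switches and check which device logs the first event in
each episode. This is a minimal sketch, assuming the messages have already
been parsed into (timestamp, switch, message) tuples and that the switch
clocks are synchronized. The tuple layout, the 5-second gap, and the helper
names are assumptions, not an actual FTOS log format.

```python
# Sketch: group merged log events into bursts and report which switch logged
# first in each burst. Event layout and gap threshold are hypothetical.
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Event = Tuple[datetime, str, str]  # (timestamp, switch name, message)


def group_bursts(
    events: Iterable[Event], gap: timedelta = timedelta(seconds=5)
) -> List[List[Event]]:
    """Split events into bursts; a new burst starts when the quiet time exceeds `gap`."""
    bursts: List[List[Event]] = []
    for event in sorted(events, key=lambda e: e[0]):
        if bursts and event[0] - bursts[-1][-1][0] <= gap:
            bursts[-1].append(event)
        else:
            bursts.append([event])
    return bursts


def first_switch_per_burst(bursts: List[List[Event]]) -> List[str]:
    """Return the name of the switch that logged the earliest event in each burst."""
    return [burst[0][1] for burst in bursts]
```

If the same switch keeps showing up first, that at least narrows down where
to point the TAC case.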

I appreciate the anecdotal support that others are seeing the same thing.

-Doug

On 03/05/2012 03:37 PM, Matt Hite wrote:
> You don't go into detail as to the log messages you see during the
> failure, so it's certainly hard to diagnose with anything but
> anecdote. However, here's my anecdote....