Mailing List Archive

High LP CPU After Upgrade 4001a to 54c Multicast
We are seeing high CPU on our LPs after upgrading from 4001a to 54c on two
MLXs.

We are using PIM-SM and the mcast process is using a large amount of LP
CPU, but only after the upgrade. We are stable on the same config prior to
the upgrade. Also, the MLX that is the RP for networks with a large number
of multicast streams is the one that has a high CPU. The other core doesn't
have an issue (aside from being unstable because of the other MLX with high
CPU). We are pretty sure it has something to do with multicast routing; we
just can't figure out why.

We do have a large number of group/OIF entries spanning multiple physical
interfaces and VEs, but this shouldn't be an issue because of the OIF
optimization feature on the platform...right? On 4001a and 54c we have a
shareability coefficient / optimization of 98%, so it doesn't seem like a
resource problem. But we can't figure out why the traffic is hitting the CPU.

Has anyone seen mcast problems after upgrading or have any troubleshooting
tips?
Re: High LP CPU After Upgrade 4001a to 54c Multicast [ In reply to ]
We have seen issues when our MLXes receive multicast traffic for which
there have been no IGMP join messages sent (on edge ports). I'm
assuming that not getting any PIM joins would have the same effect.
There are some applications that do not send IGMP messages if they
expect their traffic to remain on the same L2 domain. Apparently if the
MLX doesn't have an entry for it, it punts it to the LP CPU.

To get an idea of which traffic is hitting the CPU, you can connect to
the LP (rconsole <slot_number>, then enable) and run 'debug packet
capture'. That will show you a few packets as they hit the LP CPU, and
should at least tell you the source IP, interface, and multicast group
for the offending traffic.
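
For anyone who hasn't done this before, a session looks roughly like the
sketch below (the slot number, prompts, and placeholder output line are
illustrative only; the commands are just the ones described above):

    MLX# rconsole 2
    LP-2> enable
    LP-2# debug packet capture
    ... a handful of punted packets, showing source IP, ingress
        interface, and destination multicast group ...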

HTH,

--
Eldon Koyle
--
BOFH excuse #319:
Your computer hasn't been returning all the bits it gets from the Internet.

On Jun 03 10:32-0400, Walter Meyer wrote:
> We are seeing high CPU on our LPs after upgrading from 4001a to 54c on two
> MLXs.
>
> We are using PIM-SM and the mcast process is using a large amount of LP
> CPU, but only after the upgrade. We are stable on the same config prior to
> the upgrade. Also, the MLX that is the RP for networks with a large number
> of multicast streams is the one that has a high CPU. The other core doesn't
> have an issue (aside from being unstable because of the other MLX with high
> CPU). We are pretty sure it has something to do with multicast routing; we
> just can't figure out why.
>
> We do have a large number of group/OIF entries spanning multiple physical
> ints and ves, but this shouldn't be an issue because of the OIF
> optimization feature on the platform...right? On 4001a and 54c we have a
> shareability coefficient / optimization of 98%...So it doesn't seem like a
> resource problem...But we can't figure out why the traffic is hitting CPU.
>
> Has anyone seen mcast problems after upgrading or have any troubleshooting
> tips?

_______________________________________________
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp
Re: High LP CPU After Upgrade 4001a to 54c Multicast [ In reply to ]
The problem for us was so severe that both MLX MPs were running at 99% CPU and the LPs were flooding unicast.

After a lot of work testing in a lab environment looking for an issue in multicast routing that fit the symptoms (lol... no, it wasn't easy), I confirmed that the source of the
problem was in 5.2 and above (5.2.00 through 5.4.00d), in the processing of IGMP reports. Brocade's code updated
mcache entries for every IGMP report, even when a matching mcache OIF entry already existed.

In the problem code, the number of updates in a given IGMP query window scales as O(M*N^2),
where M is the number of OIFs and N is the number of group members in a single
group. For example, an environment with 100 OIFs and 300 group members works out to
100 * 300^2 = 9,000,000 updates per IGMP query window. In previous code releases the
updates scaled as O(M*N), or 100 * 300 = 30,000 updates per query window for the same
environment.

Many may not have noticed the issue because they don't have a large number of OIFs or a large number of group members in a single group.
Some may have run into this previously and just filtered the UPnP/SSDP IPv4 group (239.255.255.250) to resolve it. If you are running PIM-SM,
have upgraded to 5.2.00 or above and have since noted periods of abnormally high MP/LP CPU, or you attempted the upgrade
but had to revert due to high MP CPU usage and unicast flooding (as we were seeing), then this may be
the root of your issue.
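
For reference, that group filtering can be done with a plain inbound ACL on
the relevant interfaces. The sketch below is from memory rather than a tested
config, so treat the exact syntax (and the ACL number/interface) as an
assumption and check the config guide for your code train:

    access-list 150 deny ip any host 239.255.255.250
    access-list 150 permit ip any any
    interface ethernet 1/1
     ip access-group 150 in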

After we reported the problem, Brocade provided a fix build and incorporated the fix into 5.4.00e, so the problem "should be" resolved in 5.4.00e.
The problem is not specific to running PIM-SM with VRFs.

Related closed defect information from 5.4.00e:

Defect ID: DEFECT000468056
Technical Severity: Medium
Summary: High MP CPU utilization from IGMP reports after upgrade
Symptom: After upgrading from 4.x to 5.4, high CPU utilization from IGMP reports in VRF
Feature: IPv4-MC PIM-SM Routing
Function: PERFORMANCE
Reported In Release: NI 05.4.00

--JK

We have seen issues when our MLXes receive multicast traffic for which
there have been no IGMP join messages sent (on edge ports). I'm
assuming that not getting any PIM joins would have the same effect.
There are some applications that do not send IGMP messages if they
expect their traffic to remain on the same L2 domain. Apparently if the
MLX doesn't have an entry for it, it punts it to the LP CPU.

To get an idea of which traffic is hitting the CPU, you can connect to
the LP (rconsole <slot_number>, then enable) and run 'debug packet
capture'. That will show you a few packets as they hit the LP CPU, and
should at least tell you the source IP, interface, and multicast group
for the offending traffic.

HTH,

--
Eldon Koyle
--
BOFH excuse #319:
Your computer hasn't been returning all the bits it gets from the Internet.

On Jun 03 10:32-0400, Walter Meyer wrote:
> We are seeing high CPU on our LPs after upgrading from 4001a to 54c on two
> MLXs.
>
> We are using PIM-SM and the mcast process is using a large amount of LP
> CPU, but only after the upgrade. We are stable on the same config prior to
> the upgrade. Also, the MLX that is the RP for networks with a large number
> of multicast streams is the one that has a high CPU. The other core doesn't
> have an issue (aside from being unstable because of the other MLX with high
> CPU). We are pretty sure it has something to do with multicast routing; we
> just can't figure out why.
>
> We do have a large number of group/OIF entries spanning multiple physical
> ints and ves, but this shouldn't be an issue because of the OIF
> optimization feature on the platform...right? On 4001a and 54c we have a
> shareability coefficient / optimization of 98%...So it doesn't seem like a
> resource problem...But we can't figure out why the traffic is hitting CPU.
>
> Has anyone seen mcast problems after upgrading or have any troubleshooting
> tips?
Re: High LP CPU After Upgrade 4001a to 54c Multicast [ In reply to ]
I think we've probably experienced the issue you described. We have an
MLX core, with other platforms at the distribution layer, and would
experience peaks of very high CPU for a second or two at a time, which
would disrupt OSPF and MRP at least. It appears we were running 5.4.0d at
the time.

However, since upgrading to 5.5.0c I've been working with Brocade on another
issue, wherein the standby management card would reset every few minutes
or so.

Filtering unwanted multicast groups, first at the distribution layer and
then later directly at the core interfaces, helped a bit. However, the
most effective fix was to remove "ip multicast-nonstop-routing"; as it was
described to me: "The problem is seen when in a specific pattern the
outgoing ports for the groups (239.255.255.250) added and removed ...
Engineering team performed troubleshooting and determined that for some
reason, the OIF tree that is rooted at certain forwarding entries is being
corrupted, either in the middle of traversal, or when there is database
update."

At the moment Brocade are trying to replicate it in their lab environment
to work on a fix. If they sort that, and merge in your defect fix, maybe
we'll finally see the back of the CPU/multicast issues we've been plagued
with.

Jethro.



On Tue, 5 Nov 2013, Kennedy, Joseph wrote:

> The problem for us was so severe that both MLX MPs were running at 99%
> CPU and the LPs were flooding unicast.
>
> After a lot of work testing in a lab environment looking for an issue in
> multicast routing that fit the symptoms(lol...no it wasn't easy), I
> confirmed that the source of the problem was in 5.2 and above (5.2.00 to
> 5.4.00d) and processing of IGMP reports. Brocade's code updated mcache
> entries for every IGMP report even when a matching mcache OIF entry
> already existed.
>
> All updates in a given IGMP query window in the problem code could be
> represented as O(M(N^2)) where M is the number of OIF's and N is the
> number of group members in a single group. For example, in an
> environment with 100 OIF's and 300 group members this equates to
> 9,000,000 updates per IGMP query window. This is in relation to previous
> code releases where the updates could be represented by O(MN) or given
> the same environment values as above, 30,000 updates per query window.
>
> Many may not have noticed the issue because they don't have a large
> number of OIFs or large number of group members in a single group. Some
> may have run into this previously and just filtered the UPnP/SSDP IPv4
> group (239.255.255.250) to resolve it. If you are running PIM-SM, have
> upgraded to 5.2.00 or above and afterwards noted periods of abnormally
> high MP/LP CPU, or you attempted the upgrade but had to revert due to
> high MP CPU usage and unicast flooding (as we were seeing) then this may
> be the root of your issue.
>
> After reporting the problem to Brocade they provided a fix build and
> incorporated the fix into 5.4.00e. This problem "should be" resolved in
> 5.4.00e. The problem is not specific to running PIM-SM with VRFs.
>
> Related closed defect information from 5.4.00e:
>
> Defect ID: DEFECT000468056
> Technical Severity: Medium
> Summary: High MP CPU utilization from IGMP reports after upgrade
> Symptom: After upgrading from 4.x to 5.4, high CPU utilization from IGMP reports in VRF
> Feature: IPv4-MC PIM-SM Routing
> Function: PERFORMANCE
> Reported In Release: NI 05.4.00
>
> --JK
>
> We have seen issues when our MLXes receive multicast traffic for which
> there have been no IGMP join messages sent (on edge ports). I'm
> assuming that not getting any PIM joins would have the same effect.
> There are some applications that do not send IGMP messages if they
> expect their traffic to remain on the same L2 domain. Apparently if the
> MLX doesn't have an entry for it, it punts it to the LP CPU.
>
> To get an idea of which traffic is hitting the CPU, you can connect to
> the LP (rconsole <slot_number>, then enable) and run 'debug packet
> capture'. That will show you a few packets as they hit the LP CPU, and
> should at least tell you the source IP, interface, and multicast group
> for the offending traffic.
>
> HTH,
>
> --
> Eldon Koyle
> --
> BOFH excuse #319:
> Your computer hasn't been returning all the bits it gets from the Internet.
>
> On Jun 03 10:32-0400, Walter Meyer wrote:
> > We are seeing high CPU on our LPs after upgrading from 4001a to 54c on two
> > MLXs.
> >
> > We are using PIM-SM and the mcast process is using a large amount of LP
> > CPU, but only after the upgrade. We are stable on the same config prior to
> > the upgrade. Also, the MLX that is the RP for networks with a large number
> > of multicast streams is the one that has a high CPU. The other core doesn't
> > have an issue (aside from being unstable because of the other MLX with high
> > CPU). We are pretty sure it has something to do with multicast routing; we
> > just can't figure out why.
> >
> > We do have a large number of group/OIF entries spanning multiple physical
> > ints and ves, but this shouldn't be an issue because of the OIF
> > optimization feature on the platform...right? On 4001a and 54c we have a
> > shareability coefficient / optimization of 98%...So it doesn't seem like a
> > resource problem...But we can't figure out why the traffic is hitting CPU.
> >
> > Has anyone seen mcast problems after upgrading or have any troubleshooting
> > tips?
>
>

. . . . . . . . . . . . . . . . . . . . . . . . .
Jethro R Binks, Network Manager,
Information Services Directorate, University Of Strathclyde, Glasgow, UK

The University of Strathclyde is a charitable body, registered in
Scotland, number SC015263.