Mailing List Archive

MLX throughput issues
We are having a strange issue on our MLX running code 5.6.00c. We are
encountering some throughput issues that seem to be randomly impacting
specific networks.



We use the MLX to handle both external BGP and internal VLAN routing. Each
FLS648 is used for Layer 2 VLANs only.



From a server connected by a 1 Gbps uplink to a Foundry FLS648 switch, which
is in turn connected to the MLX on a 10 Gbps port, a speed test to an
external network gets 20MB/s.



Connecting the same server directly to the MLX is getting 70MB/s.



Connecting the same server to one of my customer's Juniper EX3200 (which BGP
peers with the MLX) also gets 70MB/s.



Testing to another external network, all three scenarios get 110MB/s.



The path to both test network locations goes through the same IP transit
provider.



We are running an NI-MLX-MR with 2GB RAM. An NI-MLX-10Gx4 connects to the
Foundry FLS648 by XFP-10G-LR, and an NI-MLX-1Gx20-GC was used for directly
connecting the server. A separate NI-MLX-10Gx4 connects to our upstream BGP
providers.
Customer's Juniper EX3200 connects to the same NI-MLX-10Gx4 as the FLS648.
We take default routes plus full tables from three providers by BGP, but
filter out most of the routes.



The fiber and optics on everything look fine. CPU usage is less than 10% on
the MLX and all line cards, and CPU usage is at 1% on the FLS648. The ARP
table on the MLX is about 12K entries, and the BGP table is about 308K routes.



Any assistance would be appreciated. I suspect there is a setting that
we're missing on the MLX that is causing this issue.
Re: MLX throughput issues [ In reply to ]
Based on what you described it seems more to be the case that the FLS648 is
dropping throughput from ~70 Mbps to 20 Mbps (I presume you mean bits, not
bytes when you write MB/s).



How do you know that the remote speed server is not maxed out? Or that your
uplink is not maxed out?



Frank



From: foundry-nsp [mailto:foundry-nsp-bounces@puck.nether.net] On Behalf Of
nethub@gmail.com
Sent: Thursday, February 12, 2015 11:38 AM
To: foundry-nsp@puck.nether.net
Subject: [f-nsp] MLX throughput issues



We are having a strange issue on our MLX running code 5.6.00c. We are
encountering some throughput issues that seem to be randomly impacting
specific networks.



We use the MLX to handle both external BGP and internal VLAN routing. Each
FLS648 is used for Layer 2 VLANs only.



From a server connected by 1 Gbps uplink to a Foundry FLS648 switch, which
is then connected to the MLX on a 10 Gbps port, running a speed test to an
external network is getting 20MB/s.



Connecting the same server directly to the MLX is getting 70MB/s.



Connecting the same server to one of my customer's Juniper EX3200 (which BGP
peers with the MLX) also gets 70MB/s.



Testing to another external network, all three scenarios get 110MB/s.



The path to both test network locations goes through the same IP transit
provider.



We are running NI-MLX-MR with 2GB RAM, NI-MLX-10Gx4 connect to the Foundry
FLS648 by XFP-10G-LR, NI-MLX-1Gx20-GC was used for directly connecting the
server. A separate NI-MLX-10Gx4 connects to our upstream BGP providers.
Customer's Juniper EX3200 connects to the same NI-MLX-10Gx4 as the FLS648.
We take default routes plus full tables from three providers by BGP, but
filter out most of the routes.



The fiber and optics on everything look fine. CPU usage is less than 10% on
the MLX and all line cards and CPU usage at 1% on the FLS648. ARP table on
the MLX is about 12K, and BGP table is about 308K routes.



Any assistance would be appreciated. I suspect there is a setting that
we're missing on the MLX that is causing this issue.
Re: MLX throughput issues [ In reply to ]
Thanks for your response, Frank.



I do mean megabytes per second (i.e. 20MB/s = 160 Mbps, 70MB/s = 560 Mbps,
110MB/s = 880 Mbps).



I am thinking that the FLS648 switches are not likely responsible since I
was able to get 110MB/s to another external network with all three scenarios
(server to FLS648 to MLX, server to MLX direct, server to EX3200 to MLX).
The FLS648 is layer 2 only, so I don't see how it would be interfering with
the throughput to one network and not to another. The problem is also
occurring on servers attached to multiple FLS648s that are each directly
connected to the MLX, so it is across different 10G cards, optics, slots on
the MLX chassis, etc.



The remote server doesn't seem to be having any issues since I was able to
get 70MB/s to it from connecting directly to the MLX and from connecting
through the EX3200. It is only from behind the FLS648 that I run into
issues.



As I stated in the first message, the Juniper EX3200 is a downstream BGP
customer that is single homed to our network, so it is on a different ASN
and the communication between my network and his network is layer 3.



Any additional insight would be appreciated.





From: Frank Bulk [mailto:frnkblk@iname.com]
Sent: Thursday, February 12, 2015 6:48 PM
To: nethub@gmail.com; foundry-nsp@puck.nether.net
Subject: RE: [f-nsp] MLX throughput issues



Based on what you described it seems more to be the case that the FLS648 is
dropping throughput from ~70 Mbps to 20 Mbps (I presume you mean bits, not
bytes when you write MB/s).



How do you know that the remote speed server is not maxed out? Or that your
uplink is not maxed out?



Frank



From: foundry-nsp [mailto:foundry-nsp-bounces@puck.nether.net] On Behalf Of
nethub@gmail.com
Sent: Thursday, February 12, 2015 11:38 AM
To: foundry-nsp@puck.nether.net
Subject: [f-nsp] MLX throughput issues



We are having a strange issue on our MLX running code 5.6.00c. We are
encountering some throughput issues that seem to be randomly impacting
specific networks.



We use the MLX to handle both external BGP and internal VLAN routing. Each
FLS648 is used for Layer 2 VLANs only.



From a server connected by 1 Gbps uplink to a Foundry FLS648 switch, which
is then connected to the MLX on a 10 Gbps port, running a speed test to an
external network is getting 20MB/s.



Connecting the same server directly to the MLX is getting 70MB/s.



Connecting the same server to one of my customer's Juniper EX3200 (which BGP
peers with the MLX) also gets 70MB/s.



Testing to another external network, all three scenarios get 110MB/s.



The path to both test network locations goes through the same IP transit
provider.



We are running NI-MLX-MR with 2GB RAM, NI-MLX-10Gx4 connect to the Foundry
FLS648 by XFP-10G-LR, NI-MLX-1Gx20-GC was used for directly connecting the
server. A separate NI-MLX-10Gx4 connects to our upstream BGP providers.
Customer's Juniper EX3200 connects to the same NI-MLX-10Gx4 as the FLS648.
We take default routes plus full tables from three providers by BGP, but
filter out most of the routes.



The fiber and optics on everything look fine. CPU usage is less than 10% on
the MLX and all line cards and CPU usage at 1% on the FLS648. ARP table on
the MLX is about 12K, and BGP table is about 308K routes.



Any assistance would be appreciated. I suspect there is a setting that
we're missing on the MLX that is causing this issue.
Re: MLX throughput issues [ In reply to ]
* nethub@gmail.com (nethub@gmail.com) [Fri 13 Feb 2015, 01:45 CET]:
>As I stated in the first message, the Juniper EX3200 is a downstream
>BGP customer that is single homed to our network, so it is on a
>different ASN and the communication between my network and his
>network is layer 3.

Are you running that MLX with a full BGP table? 20 MB/sec sounds like
you're forwarding packets over its CPU, perhaps because it ran out of
CAM space.


-- Niels.

--
_______________________________________________
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp
Re: MLX throughput issues [ In reply to ]
We are only accepting about 300k IPv4 routes currently (we filter to reduce
the table size). We are on the multi-service-2 CAM partition profile and we
have the system-max values for ip-route and ip-cache set to 445K.

Also, we upgraded to 5.6f today to see if that would help but it did not
change anything.

CPU usage is very low across the board (under 10% use on everything), so if
it is routing in software, it isn't causing a jump in CPU load.
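
A quick way to sanity-check the CAM theory on a NetIron box, for anyone
following along: the command names below are from memory and may differ
slightly between IronWare releases, so treat this as a sketch rather than an
exact transcript.

  show cam-partition usage     (how full each CAM partition is)
  show default values          (current system-max settings, incl. ip-route and ip-cache)
  show ip route summary        (how many routes are actually installed)

An IPv4 CAM partition running near 100% alongside a ~300K-route table would
point to traffic being punted and forwarded in software, which is the failure
mode Niels describes.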


-----Original Message-----
From: foundry-nsp [mailto:foundry-nsp-bounces@puck.nether.net] On Behalf Of
Niels Bakker
Sent: Thursday, February 12, 2015 8:38 PM
To: foundry-nsp@puck.nether.net
Subject: Re: [f-nsp] MLX throughput issues

* nethub@gmail.com (nethub@gmail.com) [Fri 13 Feb 2015, 01:45 CET]:
>As I stated in the first message, the Juniper EX3200 is a downstream
>BGP customer that is single homed to our network, so it is on a
>different ASN and the communication between my network and his network
>is layer 3.

Are you running that MLX with a full BGP table? 20 MB/sec sounds like
you're forwarding packets over its CPU, perhaps because it ran out of CAM
space.


-- Niels.

--
_______________________________________________
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp

_______________________________________________
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp
Re: MLX throughput issues [ In reply to ]
Wondering if you might have some imbalance in a LAG somewhere, where too much
traffic is being hashed to one link of the LAG. By default the hash uses the
MAC address of the next Layer 2 hop, so traffic going to a gateway will all
hash to the same link. Are there any LAGs involved? Is there a major imbalance
of traffic on a LAG in the traffic path?
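
If there were LAGs in the path, the imbalance is easy to eyeball from the
CLI. The commands below are standard NetIron ones as far as I recall, and the
port number is only an example, so treat the exact syntax as an assumption:

  show lag                       (each LAG and its member ports)
  show interface ethernet 2/1    (per-member traffic rates; repeat per member)

One member port running hot while the others sit idle would match the
single-next-hop-MAC hashing behaviour described above. (The original poster
confirms further down the thread that no LAGs are in use.)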

On Thu, Feb 12, 2015 at 9:43 PM, <nethub@gmail.com> wrote:

> We are only accepting about 300k IPv4 routes currently (we filter to reduce
> the table size). We are on the multi-service-2 CAM partition profile and
> we
> have the system-max values for ip-route and ip-cache set to 445K.
>
> Also, we upgraded to 5.6f today to see if that would help but it did not
> change anything.
>
> CPU usage is very low across the board (under 10% use on everything), so if
> it is routing in software, it isn't causing a jump in CPU load.
>
>
> -----Original Message-----
> From: foundry-nsp [mailto:foundry-nsp-bounces@puck.nether.net] On Behalf
> Of
> Niels Bakker
> Sent: Thursday, February 12, 2015 8:38 PM
> To: foundry-nsp@puck.nether.net
> Subject: Re: [f-nsp] MLX throughput issues
>
> * nethub@gmail.com (nethub@gmail.com) [Fri 13 Feb 2015, 01:45 CET]:
> >As I stated in the first message, the Juniper EX3200 is a downstream
> >BGP customer that is single homed to our network, so it is on a
> >different ASN and the communication between my network and his network
> >is layer 3.
>
> Are you running that MLX with a full BGP table? 20 MB/sec sounds like
> you're forwarding packets over its CPU, perhaps because it ran out of CAM
> space.
>
>
> -- Niels.
>
> --
> _______________________________________________
> foundry-nsp mailing list
> foundry-nsp@puck.nether.net
> http://puck.nether.net/mailman/listinfo/foundry-nsp
>
> _______________________________________________
> foundry-nsp mailing list
> foundry-nsp@puck.nether.net
> http://puck.nether.net/mailman/listinfo/foundry-nsp
>
Re: MLX throughput issues [ In reply to ]
The FLS switches do something weird with packets. I've noticed they
somehow interfere with changing the MSS window size dynamically,
resulting in destinations further away having very poor speed results
compared to destinations close by.

We got rid of those a while ago.
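
The latency dependence described above is what you would expect if the TCP
window stops growing: throughput is bounded by window size divided by
round-trip time. As a rough illustration (not measurements from this thread):

  64 KB window,  2 ms RTT  ->  ~32 MB/s
  64 KB window, 80 ms RTT  ->  ~0.8 MB/s
  filling 1 Gbps at 80 ms RTT needs roughly 125 MB/s x 0.08 s = 10 MB of window

So anything that interferes with window scaling shows up first, and worst, on
tests to distant destinations.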


On 12/02/15 17:37, nethub@gmail.com wrote:
>
> We are having a strange issue on our MLX running code 5.6.00c. We are
> encountering some throughput issues that seem to be randomly impacting
> specific networks.
>
> We use the MLX to handle both external BGP and internal VLAN routing.
> Each FLS648 is used for Layer 2 VLANs only.
>
> From a server connected by 1 Gbps uplink to a Foundry FLS648 switch,
> which is then connected to the MLX on a 10 Gbps port, running a speed
> test to an external network is getting 20MB/s.
>
> Connecting the same server directly to the MLX is getting 70MB/s.
>
> Connecting the same server to one of my customer's Juniper EX3200
> (which BGP peers with the MLX) also gets 70MB/s.
>
> Testing to another external network, all three scenarios get 110MB/s.
>
> The path to both test network locations goes through the same IP
> transit provider.
>
> We are running NI-MLX-MR with 2GB RAM, NI-MLX-10Gx4 connect to the
> Foundry FLS648 by XFP-10G-LR, NI-MLX-1Gx20-GC was used for directly
> connecting the server. A separate NI-MLX-10Gx4 connects to our
> upstream BGP providers. Customer’s Juniper EX3200 connects to the
> same NI-MLX-10Gx4 as the FLS648. We take default routes plus full
> tables from three providers by BGP, but filter out most of the routes.
>
> The fiber and optics on everything look fine. CPU usage is less than
> 10% on the MLX and all line cards and CPU usage at 1% on the FLS648.
> ARP table on the MLX is about 12K, and BGP table is about 308K routes.
>
> Any assistance would be appreciated. I suspect there is a setting
> that we’re missing on the MLX that is causing this issue.
>
>
>
> _______________________________________________
> foundry-nsp mailing list
> foundry-nsp@puck.nether.net
> http://puck.nether.net/mailman/listinfo/foundry-nsp


--

Jeroen Wunnink
IP NOC Manager - Hibernia Networks
Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
jeroen.wunnink@hibernianetworks.com
www.hibernianetworks.com
Re: MLX throughput issues [ In reply to ]
Hey,

this sounds like a good tip. We are seeing an issue very similar to the one
reported in this thread.
Speed to local servers is fine, but to remote servers the speed decreases with
the latency to them (with no overloaded links or anything like that).
In our case the core devices are also MLX(e) boxes, but the servers do not
terminate directly on the MLX; the path to them includes a Cisco with a
sup720 for routing and HP switching gear on the L2 path to the servers.
Jeroen, could you share a bit more insight on the issue you had with dynamic
MSS adjustments? Have you been able to find a way to change the behaviour of
the switches? Have you seen such an issue with your other Brocade equipment at
Hibernia?

Thanks,
Chris

On Friday, 13 February 2015, Jeroen Wunnink | Hibernia Networks wrote:

> The FLS switches do something weird with packets. I've noticed they
> somehow interfere with changing the MSS window size dynamically, resulting
> in destinations further away having very poor speed results compared to
> destinations close by.
>
> We got rid of those a while ago.
>
>
> On 12/02/15 17:37, nethub@gmail.com wrote:
>
> We are having a strange issue on our MLX running code 5.6.00c. We are
> encountering some throughput issues that seem to be randomly impacting
> specific networks.
>
>
>
> We use the MLX to handle both external BGP and internal VLAN routing.
> Each FLS648 is used for Layer 2 VLANs only.
>
>
>
> From a server connected by 1 Gbps uplink to a Foundry FLS648 switch, which
> is then connected to the MLX on a 10 Gbps port, running a speed test to an
> external network is getting 20MB/s.
>
>
>
> Connecting the same server directly to the MLX is getting 70MB/s.
>
>
>
> Connecting the same server to one of my customer's Juniper EX3200 (which
> BGP peers with the MLX) also gets 70MB/s.
>
>
>
> Testing to another external network, all three scenarios get 110MB/s.
>
>
>
> The path to both test network locations goes through the same IP transit
> provider.
>
>
>
> We are running NI-MLX-MR with 2GB RAM, NI-MLX-10Gx4 connect to the Foundry
> FLS648 by XFP-10G-LR, NI-MLX-1Gx20-GC was used for directly connecting the
> server. A separate NI-MLX-10Gx4 connects to our upstream BGP providers.
> Customer’s Juniper EX3200 connects to the same NI-MLX-10Gx4 as the FLS648.
> We take default routes plus full tables from three providers by BGP, but
> filter out most of the routes.
>
>
>
> The fiber and optics on everything look fine. CPU usage is less than 10%
> on the MLX and all line cards and CPU usage at 1% on the FLS648. ARP table
> on the MLX is about 12K, and BGP table is about 308K routes.
>
>
>
> Any assistance would be appreciated. I suspect there is a setting that
> we’re missing on the MLX that is causing this issue.
>
>
> _______________________________________________
> foundry-nsp mailing list
> foundry-nsp@puck.nether.net
> http://puck.nether.net/mailman/listinfo/foundry-nsp
>
>
>
> --
>
> Jeroen Wunnink
> IP NOC Manager - Hibernia Networks
> Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
> Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
> jeroen.wunnink@hibernianetworks.com
> www.hibernianetworks.com
>
>
Re: MLX throughput issues [ In reply to ]
We are not using LAG anywhere in our network.





From: G B [mailto:georgeb@gmail.com]
Sent: Friday, February 13, 2015 1:16 AM
To: nethub@gmail.com
Cc: Niels Bakker; foundry-nsp
Subject: Re: [f-nsp] MLX throughput issues



Wondering if you might have some imbalance in a LAG somewhere. Where it is hashing too much traffic to one link of a lag. By default it uses the mac address of the next layer 2 hop and traffic going to a gateway will all hash to the same link. Are there any LAGs involved? Is there a major imbalance of traffic on a LAG in the traffic path?



On Thu, Feb 12, 2015 at 9:43 PM, <nethub@gmail.com> wrote:

We are only accepting about 300k IPv4 routes currently (we filter to reduce
the table size). We are on the multi-service-2 CAM partition profile and we
have the system-max values for ip-route and ip-cache set to 445K.

Also, we upgraded to 5.6f today to see if that would help but it did not
change anything.

CPU usage is very low across the board (under 10% use on everything), so if
it is routing in software, it isn't causing a jump in CPU load.


-----Original Message-----
From: foundry-nsp [mailto:foundry-nsp-bounces@puck.nether.net] On Behalf Of

Niels Bakker
Sent: Thursday, February 12, 2015 8:38 PM
To: foundry-nsp@puck.nether.net
Subject: Re: [f-nsp] MLX throughput issues

* nethub@gmail.com (nethub@gmail.com) [Fri 13 Feb 2015, 01:45 CET]:
>As I stated in the first message, the Juniper EX3200 is a downstream
>BGP customer that is single homed to our network, so it is on a
>different ASN and the communication between my network and his network
>is layer 3.

Are you running that MLX with a full BGP table? 20 MB/sec sounds like
you're forwarding packets over its CPU, perhaps because it ran out of CAM
space.


-- Niels.

--
_______________________________________________
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp

_______________________________________________
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp
Re: MLX throughput issues [ In reply to ]
We have three switch fabrics installed; all are under 1% utilized.





From: Jeroen Wunnink | Hibernia Networks [mailto:jeroen.wunnink@atrato.com]
Sent: Friday, February 13, 2015 12:27 PM
To: nethub@gmail.com; 'Jeroen Wunnink | Hibernia Networks'
Subject: Re: [f-nsp] MLX throughput issues



How many switch fabrics do you have in that MLX, and how high is the
utilization on them?

On 13/02/15 18:12, nethub@gmail.com wrote:

We also tested with a spare Quanta LB4M that we have and are seeing about the
same speeds as with the FLS648 (around 20MB/s, or 160Mbps).



I also reduced the number of routes we are accepting down to about 189K and
that did not make a difference.





From: foundry-nsp [mailto:foundry-nsp-bounces@puck.nether.net] On Behalf Of
Jeroen Wunnink | Hibernia Networks
Sent: Friday, February 13, 2015 3:35 AM
To: foundry-nsp@puck.nether.net
Subject: Re: [f-nsp] MLX throughput issues



The FLS switches do something weird with packets. I've noticed they somehow
interfere with changing the MSS window size dynamically, resulting in
destinations further away having very poor speed results compared to
destinations close by.

We got rid of those a while ago.


On 12/02/15 17:37, nethub@gmail.com wrote:

We are having a strange issue on our MLX running code 5.6.00c. We are
encountering some throughput issues that seem to be randomly impacting
specific networks.



We use the MLX to handle both external BGP and internal VLAN routing. Each
FLS648 is used for Layer 2 VLANs only.



From a server connected by 1 Gbps uplink to a Foundry FLS648 switch, which
is then connected to the MLX on a 10 Gbps port, running a speed test to an
external network is getting 20MB/s.



Connecting the same server directly to the MLX is getting 70MB/s.



Connecting the same server to one of my customer's Juniper EX3200 (which BGP
peers with the MLX) also gets 70MB/s.



Testing to another external network, all three scenarios get 110MB/s.



The path to both test network locations goes through the same IP transit
provider.



We are running NI-MLX-MR with 2GB RAM, NI-MLX-10Gx4 connect to the Foundry
FLS648 by XFP-10G-LR, NI-MLX-1Gx20-GC was used for directly connecting the
server. A separate NI-MLX-10Gx4 connects to our upstream BGP providers.
Customer's Juniper EX3200 connects to the same NI-MLX-10Gx4 as the FLS648.
We take default routes plus full tables from three providers by BGP, but
filter out most of the routes.



The fiber and optics on everything look fine. CPU usage is less than 10% on
the MLX and all line cards and CPU usage at 1% on the FLS648. ARP table on
the MLX is about 12K, and BGP table is about 308K routes.



Any assistance would be appreciated. I suspect there is a setting that
we're missing on the MLX that is causing this issue.







_______________________________________________
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp







--

Jeroen Wunnink
IP NOC Manager - Hibernia Networks
Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
jeroen.wunnink@hibernianetworks.com
www.hibernianetworks.com






--

Jeroen Wunnink
IP NOC Manager - Hibernia Networks
Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
jeroen.wunnink@hibernianetworks.com
www.hibernianetworks.com
Re: MLX throughput issues [ In reply to ]
I also tested with an FESX448 and got the same results as the FLS648 and
Quanta LB4M switches.





From: foundry-nsp [mailto:foundry-nsp-bounces@puck.nether.net] On Behalf Of
Jeroen Wunnink | Hibernia Networks
Sent: Friday, February 13, 2015 3:35 AM
To: foundry-nsp@puck.nether.net
Subject: Re: [f-nsp] MLX throughput issues



The FLS switches do something weird with packets. I've noticed they somehow
interfere with changing the MSS window size dynamically, resulting in
destinations further away having very poor speed results compared to
destinations close by.

We got rid of those a while ago.


On 12/02/15 17:37, nethub@gmail.com wrote:

We are having a strange issue on our MLX running code 5.6.00c. We are
encountering some throughput issues that seem to be randomly impacting
specific networks.



We use the MLX to handle both external BGP and internal VLAN routing. Each
FLS648 is used for Layer 2 VLANs only.



From a server connected by 1 Gbps uplink to a Foundry FLS648 switch, which
is then connected to the MLX on a 10 Gbps port, running a speed test to an
external network is getting 20MB/s.



Connecting the same server directly to the MLX is getting 70MB/s.



Connecting the same server to one of my customer's Juniper EX3200 (which BGP
peers with the MLX) also gets 70MB/s.



Testing to another external network, all three scenarios get 110MB/s.



The path to both test network locations goes through the same IP transit
provider.



We are running NI-MLX-MR with 2GB RAM, NI-MLX-10Gx4 connect to the Foundry
FLS648 by XFP-10G-LR, NI-MLX-1Gx20-GC was used for directly connecting the
server. A separate NI-MLX-10Gx4 connects to our upstream BGP providers.
Customer's Juniper EX3200 connects to the same NI-MLX-10Gx4 as the FLS648.
We take default routes plus full tables from three providers by BGP, but
filter out most of the routes.



The fiber and optics on everything look fine. CPU usage is less than 10% on
the MLX and all line cards and CPU usage at 1% on the FLS648. ARP table on
the MLX is about 12K, and BGP table is about 308K routes.



Any assistance would be appreciated. I suspect there is a setting that
we're missing on the MLX that is causing this issue.






_______________________________________________
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp






--

Jeroen Wunnink
IP NOC Manager - Hibernia Networks
Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
jeroen.wunnink@hibernianetworks.com
www.hibernianetworks.com
Re: MLX throughput issues [ In reply to ]
Over the years we’ve seen odd issues where one of the switch-fabric links will “wig out” and some of the data moving between cards will get corrupted. When this happens we power cycle each switch fab one at a time using this process:
1) Shutdown SFM #3
2) Wait 1 minute
3) Power SFM #3 on again
4) Verify all SFM links are up to SFM#3
5) Wait 1 minute
6) Perform steps 1-5 for SFM #2
7) Perform steps 1-5 for SFM #3

Not sure you’re seeing the same issue that we see, but the “SFM Dance” (as we call it) is a once-every-four-months thing somewhere across our 16 XMR4000 boxes. It can be done with little to no impact if you are patient and verify status before moving to the next SFM.


> On Feb 13, 2015, at 11:41 AM, nethub@gmail.com wrote:
>
> We have three switch fabrics installed, all are under 1% utilized.
>
>
> From: Jeroen Wunnink | Hibernia Networks [mailto:jeroen.wunnink@atrato.com <mailto:jeroen.wunnink@atrato.com>]
> Sent: Friday, February 13, 2015 12:27 PM
> To: nethub@gmail.com <mailto:nethub@gmail.com>; 'Jeroen Wunnink | Hibernia Networks'
> Subject: Re: [f-nsp] MLX throughput issues
>
> How many switchfabrics do you have in that MLX and how high is the utilization on them
>
> On 13/02/15 18:12, nethub@gmail.com <mailto:nethub@gmail.com> wrote:
>> We also tested with a spare Quanta LB4M we have and are seeing about the same speeds as we are seeing with the FLS648 (around 20MB/s or 160Mbps).
>>
>> I also reduced the number of routes we are accepting down to about 189K and that did not make a difference.
>>
>>
>> From: foundry-nsp [mailto:foundry-nsp-bounces@puck.nether.net <mailto:foundry-nsp-bounces@puck.nether.net>] On Behalf Of Jeroen Wunnink | Hibernia Networks
>> Sent: Friday, February 13, 2015 3:35 AM
>> To: foundry-nsp@puck.nether.net <mailto:foundry-nsp@puck.nether.net>
>> Subject: Re: [f-nsp] MLX throughput issues
>>
>> The FLS switches do something weird with packets. I've noticed they somehow interfere with changing the MSS window size dynamically, resulting in destinations further away having very poor speed results compared to destinations close by.
>>
>> We got rid of those a while ago.
>>
>>
>> On 12/02/15 17:37, nethub@gmail.com <mailto:nethub@gmail.com> wrote:
>>> We are having a strange issue on our MLX running code 5.6.00c. We are encountering some throughput issues that seem to be randomly impacting specific networks.
>>>
>>> We use the MLX to handle both external BGP and internal VLAN routing. Each FLS648 is used for Layer 2 VLANs only.
>>>
>>> From a server connected by 1 Gbps uplink to a Foundry FLS648 switch, which is then connected to the MLX on a 10 Gbps port, running a speed test to an external network is getting 20MB/s.
>>>
>>> Connecting the same server directly to the MLX is getting 70MB/s.
>>>
>>> Connecting the same server to one of my customer's Juniper EX3200 (which BGP peers with the MLX) also gets 70MB/s.
>>>
>>> Testing to another external network, all three scenarios get 110MB/s.
>>>
>>> The path to both test network locations goes through the same IP transit provider.
>>>
>>> We are running NI-MLX-MR with 2GB RAM, NI-MLX-10Gx4 connect to the Foundry FLS648 by XFP-10G-LR, NI-MLX-1Gx20-GC was used for directly connecting the server. A separate NI-MLX-10Gx4 connects to our upstream BGP providers. Customer’s Juniper EX3200 connects to the same NI-MLX-10Gx4 as the FLS648. We take default routes plus full tables from three providers by BGP, but filter out most of the routes.
>>>
>>> The fiber and optics on everything look fine. CPU usage is less than 10% on the MLX and all line cards and CPU usage at 1% on the FLS648. ARP table on the MLX is about 12K, and BGP table is about 308K routes.
>>>
>>> Any assistance would be appreciated. I suspect there is a setting that we’re missing on the MLX that is causing this issue.
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> foundry-nsp mailing list
>>> foundry-nsp@puck.nether.net <mailto:foundry-nsp@puck.nether.net>
>>> http://puck.nether.net/mailman/listinfo/foundry-nsp <http://puck.nether.net/mailman/listinfo/foundry-nsp>
>>
>>
>>
>> --
>>
>> Jeroen Wunnink
>> IP NOC Manager - Hibernia Networks
>> Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
>> Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
>> jeroen.wunnink@hibernianetworks.com <mailto:jeroen.wunnink@hibernianetworks.com>
>> www.hibernianetworks.com <http://www.hibernianetworks.com/>
>
>
> --
>
> Jeroen Wunnink
> IP NOC Manager - Hibernia Networks
> Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
> Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
> jeroen.wunnink@hibernianetworks.com <mailto:jeroen.wunnink@hibernianetworks.com>
> www.hibernianetworks.com <http://www.hibernianetworks.com/>
> _______________________________________________
> foundry-nsp mailing list
> foundry-nsp@puck.nether.net <mailto:foundry-nsp@puck.nether.net>
> http://puck.nether.net/mailman/listinfo/foundry-nsp <http://puck.nether.net/mailman/listinfo/foundry-nsp>
Re: MLX throughput issues [ In reply to ]
We already tried a full system reboot last night and it didn’t seem to help. I’ll definitely keep your switch fabric reboot procedure in mind in case we run into that in the future.



I think we may have figured out at least a short-term solution. On the FLS648, we ran the command “buffer-sharing-full” and immediately we were able to get better speeds. It seems as though the FLS648’s buffers may have been causing the issue. We’ll continue to monitor over the next few days and see if this actually solves the issue.



Thanks everyone for your feedback thus far.
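
For anyone hitting the same symptom: the change described above is a single
global configuration command on the FLS. The session below is illustrative
(the prompt and the need for a write are assumptions; only the
buffer-sharing-full command itself comes from this thread):

  FLS648# configure terminal
  FLS648(config)# buffer-sharing-full
  FLS648(config)# write memory

The improvement was reported as immediate, so no reload appears to have been
required.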







From: Brad Fleming [mailto:bdflemin@gmail.com]
Sent: Friday, February 13, 2015 4:24 PM
To: nethub@gmail.com
Cc: Jeroen Wunnink | Hibernia Networks; foundry-nsp@puck.nether.net
Subject: Re: [f-nsp] MLX throughput issues



Over the years we’ve seen odd issues where one of the switch-fabric-links will “wigout” and some of the data moving between cards will get corrupted. When this happens we power cycle each switch fab one at a time using this process:

1) Shutdown SFM #3

2) Wait 1 minute

3) Power SFM #3 on again

4) Verify all SFM links are up to SFM#3

5) Wait 1 minute

6) Perform steps 1-5 for SFM #2

7) Perform steps 1-5 for SFM #3



Not sure you’re seeing the same issue that we see but the “SFM Dance” (as we call it) is a once-every-four-months thing somewhere across our 16 XMR4000 boxes. It can be done with little to no impact if you are patient verify status before moving to the next SFM.





On Feb 13, 2015, at 11:41 AM, nethub@gmail.com wrote:



We have three switch fabrics installed, all are under 1% utilized.





From: Jeroen Wunnink | Hibernia Networks [mailto:jeroen.wunnink@atrato.com]
Sent: Friday, February 13, 2015 12:27 PM
To: nethub@gmail.com; 'Jeroen Wunnink | Hibernia Networks'
Subject: Re: [f-nsp] MLX throughput issues



How many switchfabrics do you have in that MLX and how high is the utilization on them

On 13/02/15 18:12, nethub@gmail.com wrote:

We also tested with a spare Quanta LB4M we have and are seeing about the same speeds as we are seeing with the FLS648 (around 20MB/s or 160Mbps).



I also reduced the number of routes we are accepting down to about 189K and that did not make a difference.





From: foundry-nsp [mailto:foundry-nsp-bounces@puck.nether.net] On Behalf Of Jeroen Wunnink | Hibernia Networks
Sent: Friday, February 13, 2015 3:35 AM
To: foundry-nsp@puck.nether.net
Subject: Re: [f-nsp] MLX throughput issues



The FLS switches do something weird with packets. I've noticed they somehow interfere with changing the MSS window size dynamically, resulting in destinations further away having very poor speed results compared to destinations close by.

We got rid of those a while ago.


On 12/02/15 17:37, nethub@gmail.com wrote:

We are having a strange issue on our MLX running code 5.6.00c. We are encountering some throughput issues that seem to be randomly impacting specific networks.



We use the MLX to handle both external BGP and internal VLAN routing. Each FLS648 is used for Layer 2 VLANs only.



From a server connected by 1 Gbps uplink to a Foundry FLS648 switch, which is then connected to the MLX on a 10 Gbps port, running a speed test to an external network is getting 20MB/s.



Connecting the same server directly to the MLX is getting 70MB/s.



Connecting the same server to one of my customer's Juniper EX3200 (which BGP peers with the MLX) also gets 70MB/s.



Testing to another external network, all three scenarios get 110MB/s.



The path to both test network locations goes through the same IP transit provider.



We are running NI-MLX-MR with 2GB RAM, NI-MLX-10Gx4 connect to the Foundry FLS648 by XFP-10G-LR, NI-MLX-1Gx20-GC was used for directly connecting the server. A separate NI-MLX-10Gx4 connects to our upstream BGP providers. Customer’s Juniper EX3200 connects to the same NI-MLX-10Gx4 as the FLS648. We take default routes plus full tables from three providers by BGP, but filter out most of the routes.



The fiber and optics on everything look fine. CPU usage is less than 10% on the MLX and all line cards and CPU usage at 1% on the FLS648. ARP table on the MLX is about 12K, and BGP table is about 308K routes.



Any assistance would be appreciated. I suspect there is a setting that we’re missing on the MLX that is causing this issue.








_______________________________________________
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp








--

Jeroen Wunnink
IP NOC Manager - Hibernia Networks
Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
jeroen.wunnink@hibernianetworks.com
www.hibernianetworks.com







--

Jeroen Wunnink
IP NOC Manager - Hibernia Networks
Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
jeroen.wunnink@hibernianetworks.com
www.hibernianetworks.com

_______________________________________________
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp
Re: MLX throughput issues [ In reply to ]
On Fri, 13 Feb 2015, Brad Fleming wrote:

> Over the years we’ve seen odd issues where one of the
> switch-fabric-links will “wigout” and some of the data moving between
> cards will get corrupted. When this happens we power cycle each switch
> fab one at a time using this process:
>
> 1) Shutdown SFM #3
> 2) Wait 1 minute
> 3) Power SFM #3 on again
> 4) Verify all SFM links are up to SFM#3
> 5) Wait 1 minute
> 6) Perform steps 1-5 for SFM #2
> 7) Perform steps 1-5 for SFM #3
>
> Not sure you’re seeing the same issue that we see but the “SFM Dance”
> (as we call it) is a once-every-four-months thing somewhere across our
> 16 XMR4000 boxes. It can be done with little to no impact if you are
> patient verify status before moving to the next SFM.

That's all interesting. What code version is this? Also, how do you
shut down the SFMs? I don't recall seeing documentation for that.

Jethro.


>
> > On Feb 13, 2015, at 11:41 AM, nethub@gmail.com wrote:
> >
> > We have three switch fabrics installed, all are under 1% utilized.
> >
> >
> > From: Jeroen Wunnink | Hibernia Networks [mailto:jeroen.wunnink@atrato.com <mailto:jeroen.wunnink@atrato.com>]
> > Sent: Friday, February 13, 2015 12:27 PM
> > To: nethub@gmail.com <mailto:nethub@gmail.com>; 'Jeroen Wunnink | Hibernia Networks'
> > Subject: Re: [f-nsp] MLX throughput issues
> >
> > How many switchfabrics do you have in that MLX and how high is the utilization on them
> >
> > On 13/02/15 18:12, nethub@gmail.com <mailto:nethub@gmail.com> wrote:
> >> We also tested with a spare Quanta LB4M we have and are seeing about the same speeds as we are seeing with the FLS648 (around 20MB/s or 160Mbps).
> >>
> >> I also reduced the number of routes we are accepting down to about 189K and that did not make a difference.
> >>
> >>
> >> From: foundry-nsp [mailto:foundry-nsp-bounces@puck.nether.net <mailto:foundry-nsp-bounces@puck.nether.net>] On Behalf Of Jeroen Wunnink | Hibernia Networks
> >> Sent: Friday, February 13, 2015 3:35 AM
> >> To: foundry-nsp@puck.nether.net <mailto:foundry-nsp@puck.nether.net>
> >> Subject: Re: [f-nsp] MLX throughput issues
> >>
> >> The FLS switches do something weird with packets. I've noticed they somehow interfere with changing the MSS window size dynamically, resulting in destinations further away having very poor speed results compared to destinations close by.
> >>
> >> We got rid of those a while ago.
> >>
> >>
> >> On 12/02/15 17:37, nethub@gmail.com <mailto:nethub@gmail.com> wrote:
> >>> We are having a strange issue on our MLX running code 5.6.00c. We are encountering some throughput issues that seem to be randomly impacting specific networks.
> >>>
> >>> We use the MLX to handle both external BGP and internal VLAN routing. Each FLS648 is used for Layer 2 VLANs only.
> >>>
> >>> From a server connected by 1 Gbps uplink to a Foundry FLS648 switch, which is then connected to the MLX on a 10 Gbps port, running a speed test to an external network is getting 20MB/s.
> >>>
> >>> Connecting the same server directly to the MLX is getting 70MB/s.
> >>>
> >>> Connecting the same server to one of my customer's Juniper EX3200 (which BGP peers with the MLX) also gets 70MB/s.
> >>>
> >>> Testing to another external network, all three scenarios get 110MB/s.
> >>>
> >>> The path to both test network locations goes through the same IP transit provider.
> >>>
> >>> We are running NI-MLX-MR with 2GB RAM, NI-MLX-10Gx4 connect to the Foundry FLS648 by XFP-10G-LR, NI-MLX-1Gx20-GC was used for directly connecting the server. A separate NI-MLX-10Gx4 connects to our upstream BGP providers. Customer’s Juniper EX3200 connects to the same NI-MLX-10Gx4 as the FLS648. We take default routes plus full tables from three providers by BGP, but filter out most of the routes.
> >>>
> >>> The fiber and optics on everything look fine. CPU usage is less than 10% on the MLX and all line cards and CPU usage at 1% on the FLS648. ARP table on the MLX is about 12K, and BGP table is about 308K routes.
> >>>
> >>> Any assistance would be appreciated. I suspect there is a setting that we’re missing on the MLX that is causing this issue.
> >>>
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> foundry-nsp mailing list
> >>> foundry-nsp@puck.nether.net <mailto:foundry-nsp@puck.nether.net>
> >>> http://puck.nether.net/mailman/listinfo/foundry-nsp <http://puck.nether.net/mailman/listinfo/foundry-nsp>
> >>
> >>
> >>
> >> --
> >>
> >> Jeroen Wunnink
> >> IP NOC Manager - Hibernia Networks
> >> Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
> >> Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
> >> jeroen.wunnink@hibernianetworks.com <mailto:jeroen.wunnink@hibernianetworks.com>
> >> www.hibernianetworks.com <http://www.hibernianetworks.com/>
> >
> >
> > --
> >
> > Jeroen Wunnink
> > IP NOC Manager - Hibernia Networks
> > Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
> > Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
> > jeroen.wunnink@hibernianetworks.com <mailto:jeroen.wunnink@hibernianetworks.com>
> > www.hibernianetworks.com <http://www.hibernianetworks.com/>
> > _______________________________________________
> > foundry-nsp mailing list
> > foundry-nsp@puck.nether.net <mailto:foundry-nsp@puck.nether.net>
> > http://puck.nether.net/mailman/listinfo/foundry-nsp <http://puck.nether.net/mailman/listinfo/foundry-nsp>
>

. . . . . . . . . . . . . . . . . . . . . . . . .
Jethro R Binks, Network Manager,
Information Services Directorate, University Of Strathclyde, Glasgow, UK

The University of Strathclyde is a charitable body, registered in
Scotland, number SC015263.
Re: MLX throughput issues [ In reply to ]
We’ve seen it since installing the high-capacity switch fabrics into our XMR4000 chassis roughly 4 years ago. We saw it through IronWare 5.4.00d. I’m not sure what software we were using when they were first installed; probably whatever would have been stable/popular around December 2010.

Command is simply “power-off snm [1-3]” then “power-on snm [1-3]”.

Note that the power-on process causes your management session to hang for a few seconds. The device isn’t broken and packets aren’t getting dropped; it’s just going through checks and echoing back status.

-brad
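
Putting Brad's procedure and the command names together, one pass of the
"SFM Dance" for fabric module 3 would look roughly like the following. The
waits and the link check are from his earlier message; the thread does not
name a command for verifying fabric-link status, so that step is left as a
placeholder:

  power-off snm 3
  (wait one minute)
  power-on snm 3
  (verify all fabric links to SNM 3 are up, using whatever show command your
   IronWare release provides for fabric link status)
  (wait one minute, then repeat the same steps for each remaining module)

Expect the management session to hang for a few seconds after power-on, as
noted above.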


> On Feb 16, 2015, at 7:07 AM, Jethro R Binks <jethro.binks@strath.ac.uk> wrote:
>
> On Fri, 13 Feb 2015, Brad Fleming wrote:
>
>> Over the years we’ve seen odd issues where one of the
>> switch-fabric-links will “wigout” and some of the data moving between
>> cards will get corrupted. When this happens we power cycle each switch
>> fab one at a time using this process:
>>
>> 1) Shutdown SFM #3
>> 2) Wait 1 minute
>> 3) Power SFM #3 on again
>> 4) Verify all SFM links are up to SFM#3
>> 5) Wait 1 minute
>> 6) Perform steps 1-5 for SFM #2
>> 7) Perform steps 1-5 for SFM #3
>>
>> Not sure you’re seeing the same issue that we see but the “SFM Dance”
>> (as we call it) is a once-every-four-months thing somewhere across our
>> 16 XMR4000 boxes. It can be done with little to no impact if you are
>> patient verify status before moving to the next SFM.
>
> That's all interesting. What code versions is this? Also, how do you
> shutdown the SFMs? I don't recall seeing documentation for that.
>
> Jethro.
>
>
>>
>>> On Feb 13, 2015, at 11:41 AM, nethub@gmail.com wrote:
>>>
>>> We have three switch fabrics installed, all are under 1% utilized.
>>>
>>>
>>> From: Jeroen Wunnink | Hibernia Networks [mailto:jeroen.wunnink@atrato.com <mailto:jeroen.wunnink@atrato.com>]
>>> Sent: Friday, February 13, 2015 12:27 PM
>>> To: nethub@gmail.com <mailto:nethub@gmail.com>; 'Jeroen Wunnink | Hibernia Networks'
>>> Subject: Re: [f-nsp] MLX throughput issues
>>>
>>> How many switchfabrics do you have in that MLX and how high is the utilization on them
>>>
>>> On 13/02/15 18:12, nethub@gmail.com <mailto:nethub@gmail.com> wrote:
>>>> We also tested with a spare Quanta LB4M we have and are seeing about the same speeds as we are seeing with the FLS648 (around 20MB/s or 160Mbps).
>>>>
>>>> I also reduced the number of routes we are accepting down to about 189K and that did not make a difference.
>>>>
>>>>
>>>> From: foundry-nsp [mailto:foundry-nsp-bounces@puck.nether.net <mailto:foundry-nsp-bounces@puck.nether.net>] On Behalf Of Jeroen Wunnink | Hibernia Networks
>>>> Sent: Friday, February 13, 2015 3:35 AM
>>>> To: foundry-nsp@puck.nether.net <mailto:foundry-nsp@puck.nether.net>
>>>> Subject: Re: [f-nsp] MLX throughput issues
>>>>
>>>> The FLS switches do something weird with packets. I've noticed they somehow interfere with changing the MSS window size dynamically, resulting in destinations further away having very poor speed results compared to destinations close by.
>>>>
>>>> We got rid of those a while ago.
>>>>
>>>>
>>>> On 12/02/15 17:37, nethub@gmail.com <mailto:nethub@gmail.com> wrote:
>>>>> We are having a strange issue on our MLX running code 5.6.00c. We are encountering some throughput issues that seem to be randomly impacting specific networks.
>>>>>
>>>>> We use the MLX to handle both external BGP and internal VLAN routing. Each FLS648 is used for Layer 2 VLANs only.
>>>>>
>>>>> From a server connected by 1 Gbps uplink to a Foundry FLS648 switch, which is then connected to the MLX on a 10 Gbps port, running a speed test to an external network is getting 20MB/s.
>>>>>
>>>>> Connecting the same server directly to the MLX is getting 70MB/s.
>>>>>
>>>>> Connecting the same server to one of my customer's Juniper EX3200 (which BGP peers with the MLX) also gets 70MB/s.
>>>>>
>>>>> Testing to another external network, all three scenarios get 110MB/s.
>>>>>
>>>>> The path to both test network locations goes through the same IP transit provider.
>>>>>
>>>>> We are running NI-MLX-MR with 2GB RAM, NI-MLX-10Gx4 connect to the Foundry FLS648 by XFP-10G-LR, NI-MLX-1Gx20-GC was used for directly connecting the server. A separate NI-MLX-10Gx4 connects to our upstream BGP providers. Customer’s Juniper EX3200 connects to the same NI-MLX-10Gx4 as the FLS648. We take default routes plus full tables from three providers by BGP, but filter out most of the routes.
>>>>>
>>>>> The fiber and optics on everything look fine. CPU usage is less than 10% on the MLX and all line cards and CPU usage at 1% on the FLS648. ARP table on the MLX is about 12K, and BGP table is about 308K routes.
>>>>>
>>>>> Any assistance would be appreciated. I suspect there is a setting that we’re missing on the MLX that is causing this issue.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> foundry-nsp mailing list
>>>>> foundry-nsp@puck.nether.net <mailto:foundry-nsp@puck.nether.net>
>>>>> http://puck.nether.net/mailman/listinfo/foundry-nsp <http://puck.nether.net/mailman/listinfo/foundry-nsp>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Jeroen Wunnink
>>>> IP NOC Manager - Hibernia Networks
>>>> Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
>>>> Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
>>>> jeroen.wunnink@hibernianetworks.com <mailto:jeroen.wunnink@hibernianetworks.com>
>>>> www.hibernianetworks.com <http://www.hibernianetworks.com/>
>>>
>>>
>>> --
>>>
>>> Jeroen Wunnink
>>> IP NOC Manager - Hibernia Networks
>>> Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
>>> Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
>>> jeroen.wunnink@hibernianetworks.com <mailto:jeroen.wunnink@hibernianetworks.com>
>>> www.hibernianetworks.com <http://www.hibernianetworks.com/>
>>> _______________________________________________
>>> foundry-nsp mailing list
>>> foundry-nsp@puck.nether.net <mailto:foundry-nsp@puck.nether.net>
>>> http://puck.nether.net/mailman/listinfo/foundry-nsp <http://puck.nether.net/mailman/listinfo/foundry-nsp>
>>
>
> . . . . . . . . . . . . . . . . . . . . . . . . .
> Jethro R Binks, Network Manager,
> Information Services Directorate, University Of Strathclyde, Glasgow, UK
>
> The University of Strathclyde is a charitable body, registered in
> Scotland, number SC015263.


_______________________________________________
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp
Re: MLX throughput issues [ In reply to ]
What kind of wigout? And how do you diagnose the corruption? I'm intrigued.

On Mon, Feb 16, 2015 at 8:43 AM, Brad Fleming <bdflemin@gmail.com> wrote:

> We’ve seen it since installing the high-capacity switch fabrics into our
> XMR4000 chassis roughly 4 years ago. We saw it through IronWare 5.4.00d.
> I’m not sure what software we were using when they were first installed;
> probably whatever would have been stable/popular around December 2010.
>
> Command is simply “power-off snm [1-3]” then “power-on snm [1-3]”.
>
> Note that the power-on process causes your management session to hang for
> a few seconds. The device isn’t broken and packets aren’t getting dropped;
> it’s just going through checks and echoing back status.
>
> -brad
>
>
> > On Feb 16, 2015, at 7:07 AM, Jethro R Binks <jethro.binks@strath.ac.uk>
> wrote:
> >
> > On Fri, 13 Feb 2015, Brad Fleming wrote:
> >
> >> Over the years we’ve seen odd issues where one of the
> >> switch-fabric-links will “wigout” and some of the data moving between
> >> cards will get corrupted. When this happens we power cycle each switch
> >> fab one at a time using this process:
> >>
> >> 1) Shutdown SFM #3
> >> 2) Wait 1 minute
> >> 3) Power SFM #3 on again
> >> 4) Verify all SFM links are up to SFM#3
> >> 5) Wait 1 minute
> >> 6) Perform steps 1-5 for SFM #2
> >> 7) Perform steps 1-5 for SFM #3
> >>
> >> Not sure you’re seeing the same issue that we see but the “SFM Dance”
> >> (as we call it) is a once-every-four-months thing somewhere across our
> >> 16 XMR4000 boxes. It can be done with little to no impact if you are
> >> patient verify status before moving to the next SFM.
> >
> > That's all interesting. What code versions is this? Also, how do you
> > shutdown the SFMs? I don't recall seeing documentation for that.
> >
> > Jethro.
> >
> >
> >>
> >>> On Feb 13, 2015, at 11:41 AM, nethub@gmail.com wrote:
> >>>
> >>> We have three switch fabrics installed, all are under 1% utilized.
> >>>
> >>>
> >>> From: Jeroen Wunnink | Hibernia Networks [mailto:
> jeroen.wunnink@atrato.com <mailto:jeroen.wunnink@atrato.com>]
> >>> Sent: Friday, February 13, 2015 12:27 PM
> >>> To: nethub@gmail.com <mailto:nethub@gmail.com>; 'Jeroen Wunnink |
> Hibernia Networks'
> >>> Subject: Re: [f-nsp] MLX throughput issues
> >>>
> >>> How many switchfabrics do you have in that MLX and how high is the
> utilization on them
> >>>
> >>> On 13/02/15 18:12, nethub@gmail.com <mailto:nethub@gmail.com> wrote:
> >>>> We also tested with a spare Quanta LB4M we have and are seeing about
> the same speeds as we are seeing with the FLS648 (around 20MB/s or 160Mbps).
> >>>>
> >>>> I also reduced the number of routes we are accepting down to about
> 189K and that did not make a difference.
> >>>>
> >>>>
> >>>> From: foundry-nsp [mailto:foundry-nsp-bounces@puck.nether.net
> <mailto:foundry-nsp-bounces@puck.nether.net>] On Behalf Of Jeroen Wunnink
> | Hibernia Networks
> >>>> Sent: Friday, February 13, 2015 3:35 AM
> >>>> To: foundry-nsp@puck.nether.net <mailto:foundry-nsp@puck.nether.net>
> >>>> Subject: Re: [f-nsp] MLX throughput issues
> >>>>
> >>>> The FLS switches do something weird with packets. I've noticed they
> somehow interfere with changing the MSS window size dynamically, resulting
> in destinations further away having very poor speed results compared to
> destinations close by.
> >>>>
> >>>> We got rid of those a while ago.
> >>>>
> >>>>
> >>>> On 12/02/15 17:37, nethub@gmail.com <mailto:nethub@gmail.com> wrote:
> >>>>> We are having a strange issue on our MLX running code 5.6.00c. We
> are encountering some throughput issues that seem to be randomly impacting
> specific networks.
> >>>>>
> >>>>> We use the MLX to handle both external BGP and internal VLAN
> routing. Each FLS648 is used for Layer 2 VLANs only.
> >>>>>
> >>>>> From a server connected by 1 Gbps uplink to a Foundry FLS648 switch,
> which is then connected to the MLX on a 10 Gbps port, running a speed test
> to an external network is getting 20MB/s.
> >>>>>
> >>>>> Connecting the same server directly to the MLX is getting 70MB/s.
> >>>>>
> >>>>> Connecting the same server to one of my customer's Juniper EX3200
> (which BGP peers with the MLX) also gets 70MB/s.
> >>>>>
> >>>>> Testing to another external network, all three scenarios get 110MB/s.
> >>>>>
> >>>>> The path to both test network locations goes through the same IP
> transit provider.
> >>>>>
> >>>>> We are running NI-MLX-MR with 2GB RAM, NI-MLX-10Gx4 connect to the
> Foundry FLS648 by XFP-10G-LR, NI-MLX-1Gx20-GC was used for directly
> connecting the server. A separate NI-MLX-10Gx4 connects to our upstream
> BGP providers. Customer’s Juniper EX3200 connects to the same NI-MLX-10Gx4
> as the FLS648. We take default routes plus full tables from three
> providers by BGP, but filter out most of the routes.
> >>>>>
> >>>>> The fiber and optics on everything look fine. CPU usage is less
> than 10% on the MLX and all line cards and CPU usage at 1% on the FLS648.
> ARP table on the MLX is about 12K, and BGP table is about 308K routes.
> >>>>>
> >>>>> Any assistance would be appreciated. I suspect there is a setting
> that we’re missing on the MLX that is causing this issue.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> foundry-nsp mailing list
> >>>>> foundry-nsp@puck.nether.net <mailto:foundry-nsp@puck.nether.net>
> >>>>> http://puck.nether.net/mailman/listinfo/foundry-nsp <
> http://puck.nether.net/mailman/listinfo/foundry-nsp>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>>
> >>>> Jeroen Wunnink
> >>>> IP NOC Manager - Hibernia Networks
> >>>> Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
> >>>> Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
> >>>> jeroen.wunnink@hibernianetworks.com <mailto:
> jeroen.wunnink@hibernianetworks.com>
> >>>> www.hibernianetworks.com <http://www.hibernianetworks.com/>
> >>>
> >>>
> >>> --
> >>>
> >>> Jeroen Wunnink
> >>> IP NOC Manager - Hibernia Networks
> >>> Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
> >>> Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
> >>> jeroen.wunnink@hibernianetworks.com <mailto:
> jeroen.wunnink@hibernianetworks.com>
> >>> www.hibernianetworks.com <http://www.hibernianetworks.com/
> >_______________________________________________
> >>> foundry-nsp mailing list
> >>> foundry-nsp@puck.nether.net <mailto:foundry-nsp@puck.nether.net>
> >>> http://puck.nether.net/mailman/listinfo/foundry-nsp <
> http://puck.nether.net/mailman/listinfo/foundry-nsp>
> >>
> >
> > . . . . . . . . . . . . . . . . . . . . . . . . .
> > Jethro R Binks, Network Manager,
> > Information Services Directorate, University Of Strathclyde, Glasgow, UK
> >
> > The University of Strathclyde is a charitable body, registered in
> > Scotland, number SC015263.
>
>
> _______________________________________________
> foundry-nsp mailing list
> foundry-nsp@puck.nether.net
> http://puck.nether.net/mailman/listinfo/foundry-nsp
Re: MLX throughput issues [ In reply to ]
On Mon, 16 Feb 2015, Brad Fleming wrote:

> We’ve seen it since installing the high-capacity switch fabrics into our
> XMR4000 chassis roughly 4 years ago. We saw it through IronWare 5.4.00d.
> I’m not sure what software we were using when they were first installed;
> probably whatever would have been stable/popular around December 2010.
>
> Command is simply “power-off snm [1-3]” then “power-on snm [1-3]”.

Ah I see it ... I was looking for "SFM"s not "SNM"s!

I also echo the poster's questions about how you notice the corruption.
I have a suspicion I may be seeing similar things, particularly with
UDP-based transactions like NTP and RADIUS, which could pass through such a
chassis. But I also suffer from CPU spikes with mcast traffic on that
chassis, which has always been an issue for me.

Thanks.

Jethro.



>
> Note that the power-on process causes your management session to hang
> for a few seconds. The device isn’t broken and packets aren’t getting
> dropped; it’s just going through checks and echoing back status.
>
> -brad
>
>
> > On Feb 16, 2015, at 7:07 AM, Jethro R Binks <jethro.binks@strath.ac.uk> wrote:
> >
> > On Fri, 13 Feb 2015, Brad Fleming wrote:
> >
> >> Over the years we’ve seen odd issues where one of the
> >> switch-fabric-links will “wigout” and some of the data moving between
> >> cards will get corrupted. When this happens we power cycle each switch
> >> fab one at a time using this process:
> >>
> >> 1) Shutdown SFM #3
> >> 2) Wait 1 minute
> >> 3) Power SFM #3 on again
> >> 4) Verify all SFM links are up to SFM#3
> >> 5) Wait 1 minute
> >> 6) Perform steps 1-5 for SFM #2
> >> 7) Perform steps 1-5 for SFM #3
> >>
> >> Not sure you’re seeing the same issue that we see but the “SFM Dance”
> >> (as we call it) is a once-every-four-months thing somewhere across our
> >> 16 XMR4000 boxes. It can be done with little to no impact if you are
> >> patient and verify status before moving to the next SFM.
> >
> > That's all interesting. What code versions is this? Also, how do you
> > shutdown the SFMs? I don't recall seeing documentation for that.
> >
> > Jethro.
> >
> >
> >>
> >>> On Feb 13, 2015, at 11:41 AM, nethub@gmail.com wrote:
> >>>
> >>> We have three switch fabrics installed, all are under 1% utilized.
> >>>
> >>>
> >>> From: Jeroen Wunnink | Hibernia Networks [mailto:jeroen.wunnink@atrato.com]
> >>> Sent: Friday, February 13, 2015 12:27 PM
> >>> To: nethub@gmail.com; 'Jeroen Wunnink | Hibernia Networks'
> >>> Subject: Re: [f-nsp] MLX throughput issues
> >>>
> >>> How many switch fabrics do you have in that MLX and how high is the utilization on them?
> >>>
> >>> On 13/02/15 18:12, nethub@gmail.com wrote:
> >>>> We also tested with a spare Quanta LB4M we have and are seeing about the same speeds as we are seeing with the FLS648 (around 20MB/s or 160Mbps).
> >>>>
> >>>> I also reduced the number of routes we are accepting down to about 189K and that did not make a difference.
> >>>>
> >>>>
> >>>> From: foundry-nsp [mailto:foundry-nsp-bounces@puck.nether.net] On Behalf Of Jeroen Wunnink | Hibernia Networks
> >>>> Sent: Friday, February 13, 2015 3:35 AM
> >>>> To: foundry-nsp@puck.nether.net
> >>>> Subject: Re: [f-nsp] MLX throughput issues
> >>>>
> >>>> The FLS switches do something weird with packets. I've noticed they somehow interfere with changing the MSS window size dynamically, resulting in destinations further away having very poor speed results compared to destinations close by.
> >>>>
> >>>> We got rid of those a while ago.
> >>>>
> >>>>
> >>>> On 12/02/15 17:37, nethub@gmail.com wrote:
> >>>>> We are having a strange issue on our MLX running code 5.6.00c. We are encountering some throughput issues that seem to be randomly impacting specific networks.
> >>>>>
> >>>>> We use the MLX to handle both external BGP and internal VLAN routing. Each FLS648 is used for Layer 2 VLANs only.
> >>>>>
> >>>>> From a server connected by 1 Gbps uplink to a Foundry FLS648 switch, which is then connected to the MLX on a 10 Gbps port, running a speed test to an external network is getting 20MB/s.
> >>>>>
> >>>>> Connecting the same server directly to the MLX is getting 70MB/s.
> >>>>>
> >>>>> Connecting the same server to one of my customer's Juniper EX3200 (which BGP peers with the MLX) also gets 70MB/s.
> >>>>>
> >>>>> Testing to another external network, all three scenarios get 110MB/s.
> >>>>>
> >>>>> The path to both test network locations goes through the same IP transit provider.
> >>>>>
> >>>>> We are running NI-MLX-MR with 2GB RAM, NI-MLX-10Gx4 connect to the Foundry FLS648 by XFP-10G-LR, NI-MLX-1Gx20-GC was used for directly connecting the server. A separate NI-MLX-10Gx4 connects to our upstream BGP providers. Customer’s Juniper EX3200 connects to the same NI-MLX-10Gx4 as the FLS648. We take default routes plus full tables from three providers by BGP, but filter out most of the routes.
> >>>>>
> >>>>> The fiber and optics on everything look fine. CPU usage is less than 10% on the MLX and all line cards and CPU usage at 1% on the FLS648. ARP table on the MLX is about 12K, and BGP table is about 308K routes.
> >>>>>
> >>>>> Any assistance would be appreciated. I suspect there is a setting that we’re missing on the MLX that is causing this issue.

. . . . . . . . . . . . . . . . . . . . . . . . .
Jethro R Binks, Network Manager,
Information Services Directorate, University Of Strathclyde, Glasgow, UK

The University of Strathclyde is a charitable body, registered in
Scotland, number SC015263.
Re: MLX throughput issues [ In reply to ]
The common symptoms for us are alarms of TM errors / resets. We’ve been told on multiple TAC cases that logs indicating transmit TM errors are likely caused by problems in one of the SFM links / lanes. We’ve been told that resetting the SFMs one at a time will clear the issue.

The symptom during the issue is that about one-third of the traffic moving from one TM to another TM simply gets dropped. So we see TCP globally start to throttle like crazy, and if enough errors accumulate the TM will simply reset. After the TM reset it seems to be a 50/50 chance the box will remain stable or go back to dropping packets within ~20 minutes. So when we see a TM reset we simply do the SFM Dance no matter what.
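
For anyone wanting to try the same thing, here is a minimal sketch of one pass of the SFM Dance on a three-fabric chassis. The power-off/power-on commands are the ones quoted earlier in this thread; the link-verification command is an assumption on my part, so check the exact show command in the NetIron config guide for your release.

  power-off snm 3
  ! wait roughly one minute before bringing the fabric back
  power-on snm 3
  ! assumed verification step: confirm all fabric links to SFM 3 are up
  ! before moving on (the exact command may differ per release)
  show sfm-links 3
  ! repeat the same sequence for the remaining SFMs, one at a time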


> On Feb 16, 2015, at 10:08 PM, Josh Galvez <josh@zevlag.com> wrote:
>
> What kind of wigout? And how do you diagnose the corruption? I'm intrigued.
>
Re: MLX throughput issues [ In reply to ]
So don’t errors like this suggest replacing the hardware?



Frank



From: foundry-nsp [mailto:foundry-nsp-bounces@puck.nether.net] On Behalf Of Brad Fleming
Sent: Tuesday, February 17, 2015 3:10 PM
To: Josh Galvez
Cc: foundry-nsp@puck.nether.net
Subject: Re: [f-nsp] MLX throughput issues



Re: MLX throughput issues [ In reply to ]
TAC replaced hSFMs and line cards the first couple of times, but we've seen this issue at least once on every node in the network. The ones where we replaced every module (SFM, mgmt, port cards, even PSUs) have still had at least one event. So I'm not even sure what hardware we'd replace at this point. That led us to suspect a config problem, since each box uses the same template, but after a lengthy audit with TAC nobody could find anything. It happens infrequently enough that we grew to just live with it.



Re: MLX throughput issues [ In reply to ]
Try physically pulling the SFMs one by one rather than just
power-cycling them. Yes, there is a difference there :-)

Also, 5.4.00d is fairly buggy; there were some major issues with CRC
checksums in hSFMs and on 100G cards. We have fairly good results with
5.5.00e.





--

Jeroen Wunnink
IP NOC Manager - Hibernia Networks
Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
jeroen.wunnink@hibernianetworks.com
www.hibernianetworks.com
Re: MLX throughput issues [ In reply to ]
Just a general 'watch out' with regards to 5.5e that I'd like to share, as we've had very bad results with 5.5 (including 5.5e)
with regards to IPv6 in combination with IS-IS
(though TAC also mentioned it affects OSPF as well, but I don't see that in the release notes)


DEFECT000500944


Router-A has, for example, a path towards the IPv6 loopback address of Router-B via eth 1/4.
Now eth 1/4 on Router-A goes down.

IS-IS picks for example eth 3/4 as the new best path towards the loopback.
Now, a 'show ipv6 route <loopback-of-Router-B>' correctly shows the new interface (eth 3/4)...

Yet the IPv6 cache keeps a stale entry towards eth 1/4... and traffic from Router-A towards
Router-B's loopback is now blackholed.... as it still tries to send it out eth 1/4 (even when it's down)

In our case this affected, for example, IPv6 iBGP sessions... so much for redundant links, etc.

Physical interface or VE does not matter.
Also, if I recall correctly, nasty stuff also happened when you simply made a metric change on the IS-IS path
(e.g. when eth 3/4 is towards Router-C, and Router-C has a best path towards Router-B's loopback via the link to Router-A).

I believe it was only problematic for 'local' traffic from Router-A, and not for transit traffic - but not 100% sure anymore.

Fixed in 5.6.something, not sure if they ever fixed it in 5.5

The simplest workaround once you are affected is executing the 'hidden' command (DEFECT000503937): clear ipv6 cache
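
A minimal sketch of how to spot the mismatch and apply the workaround. The loopback prefix is just an example, 'show ipv6 route' and 'clear ipv6 cache' are the commands mentioned above, and 'show ipv6 cache' is an assumption on my part; verify it against the config guide for your release.

  ! the routing table already points at the new interface after the reroute
  show ipv6 route 2001:db8::2/128
  ! assumed command: compare with the forwarding cache, which on affected
  ! releases can keep a stale entry towards the old (down) interface
  show ipv6 cache
  ! workaround for DEFECT000503937: flush the stale entries
  clear ipv6 cache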


Best regards,

Wouter


From: foundry-nsp [mailto:foundry-nsp-bounces@puck.nether.net] On Behalf Of Jeroen Wunnink | Hibernia Networks
Sent: Wednesday, February 18, 2015 17:32
To: Brad Fleming; Frank Bulk
Cc: foundry-nsp@puck.nether.net
Subject: Re: [f-nsp] MLX throughput issues

Try physically pulling the SFM's one by one rather then just powercycling them. Yes there is a difference there :-)

Also, 5400d is fairly buggy, there were some major issues with CRC checksums in hSFM's and on 100G cards. We have fairly good results with 5500e



On 18/02/15 16:50, Brad Fleming wrote:
TAC replaced hSFMs and line cards the first couple times but we've seen this issue at least once on every node in the network. The ones where we replaced every module (SFM, mgmt, port cards, even PSUs) have still had at least one event. So I'm not even sure what hardware we'd replace at this point. That lead us to thinking a config problem since each box uses the same template but after a lengthy audit with TAC nobody could find anything. It happens infrequently enough that we grew to just live with it.



On Feb 18, 2015, at 12:45 AM, Frank Bulk <frnkblk@iname.com<mailto:frnkblk@iname.com>> wrote:

So don't errors like this suggest replacing the hardware?

Frank

From: foundry-nsp [mailto:foundry-nsp-bounces@puck.nether.net] On Behalf Of Brad Fleming
Sent: Tuesday, February 17, 2015 3:10 PM
To: Josh Galvez
Cc: foundry-nsp@puck.nether.net<mailto:foundry-nsp@puck.nether.net>
Subject: Re: [f-nsp] MLX throughput issues

The common symptoms for us are alarms of TM errors / resets. We've been told on multiple TAC cases that logs indicating transmit TM errors are likely caused by problems in one of the SFM links / lanes. We've been told that resetting the SFMs one at a time will clear the issue.

Symptoms during the issue is that 1/3rd of the traffic moving from one TM to another TM will simply get dropped. So we see TCP globally start to throttle like crazy and if enough errors count up the TM will simply reset. After the TM reset is seems a 50/50 chance the box will remain stable or go back to dropping packets within ~20mins. So when we see a TM reset we simply do the SFM Dance no matter what.


On Feb 16, 2015, at 10:08 PM, Josh Galvez <josh@zevlag.com<mailto:josh@zevlag.com>> wrote:

Why kind of wigout? And how do you diagnose the corruption? I'm intrigued.

On Mon, Feb 16, 2015 at 8:43 AM, Brad Fleming <bdflemin@gmail.com<mailto:bdflemin@gmail.com>> wrote:
We've seen it since installing the high-capacity switch fabrics into our XMR4000 chassis roughly 4 years ago. We saw it through IronWare 5.4.00d. I'm not sure what software we were using when they were first installed; probably whatever would have been stable/popular around December 2010.

Command is simply "power-off snm [1-3]" then "power-on snm [1-3]".

Note that the power-on process causes your management session to hang for a few seconds. The device isn't broken and packets aren't getting dropped; it's just going through checks and echoing back status.

-brad


> On Feb 16, 2015, at 7:07 AM, Jethro R Binks <jethro.binks@strath.ac.uk<mailto:jethro.binks@strath.ac.uk>> wrote:
>
> On Fri, 13 Feb 2015, Brad Fleming wrote:
>
>> Over the years we've seen odd issues where one of the
>> switch-fabric-links will "wigout" and some of the data moving between
>> cards will get corrupted. When this happens we power cycle each switch
>> fab one at a time using this process:
>>
>> 1) Shutdown SFM #3
>> 2) Wait 1 minute
>> 3) Power SFM #3 on again
>> 4) Verify all SFM links are up to SFM#3
>> 5) Wait 1 minute
>> 6) Perform steps 1-5 for SFM #2
>> 7) Perform steps 1-5 for SFM #3
>>
>> Not sure you're seeing the same issue that we see but the "SFM Dance"
>> (as we call it) is a once-every-four-months thing somewhere across our
>> 16 XMR4000 boxes. It can be done with little to no impact if you are
>> patient verify status before moving to the next SFM.
>
> That's all interesting. What code versions is this? Also, how do you
> shutdown the SFMs? I don't recall seeing documentation for that.
>
> Jethro.
>
>
>>
>>> On Feb 13, 2015, at 11:41 AM, nethub@gmail.com<mailto:nethub@gmail.com> wrote:
>>>
>>> We have three switch fabrics installed, all are under 1% utilized.
>>>
>>>
>>> From: Jeroen Wunnink | Hibernia Networks [mailto:jeroen.wunnink@atrato.com<mailto:jeroen.wunnink@atrato.com> <mailto:jeroen.wunnink@atrato.com<mailto:jeroen.wunnink@atrato.com>>]
>>> Sent: Friday, February 13, 2015 12:27 PM
>>> To: nethub@gmail.com<mailto:nethub@gmail.com> <mailto:nethub@gmail.com<mailto:nethub@gmail.com>>; 'Jeroen Wunnink | Hibernia Networks'
>>> Subject: Re: [f-nsp] MLX throughput issues
>>>
>>> How many switchfabrics do you have in that MLX and how high is the utilization on them
>>>
>>> On 13/02/15 18:12, nethub@gmail.com<mailto:nethub@gmail.com> <mailto:nethub@gmail.com<mailto:nethub@gmail.com>> wrote:
>>>> We also tested with a spare Quanta LB4M we have and are seeing about the same speeds as we are seeing with the FLS648 (around 20MB/s or 160Mbps).
>>>>
>>>> I also reduced the number of routes we are accepting down to about 189K and that did not make a difference.
>>>>
>>>>
>>>> From: foundry-nsp [mailto:foundry-nsp-bounces@puck.nether.net<mailto:foundry-nsp-bounces@puck.nether.net> <mailto:foundry-nsp-bounces@puck.nether.net<mailto:foundry-nsp-bounces@puck.nether.net>>] On Behalf Of Jeroen Wunnink | Hibernia Networks
>>>> Sent: Friday, February 13, 2015 3:35 AM
>>>> To: foundry-nsp@puck.nether.net<mailto:foundry-nsp@puck.nether.net> <mailto:foundry-nsp@puck.nether.net<mailto:foundry-nsp@puck.nether.net>>
>>>> Subject: Re: [f-nsp] MLX throughput issues
>>>>
>>>> The FLS switches do something weird with packets. I've noticed they somehow interfere with changing the MSS window size dynamically, resulting in destinations further away having very poor speed results compared to destinations close by.
>>>>
>>>> We got rid of those a while ago.

--

Jeroen Wunnink
IP NOC Manager - Hibernia Networks
Main numbers (Ext: 1011): USA +1.908.516.4200 | UK +44.1704.322.300
Netherlands +31.208.200.622 | 24/7 IP NOC Phone: +31.20.82.00.623
jeroen.wunnink@hibernianetworks.com
www.hibernianetworks.com
Re: MLX throughput issues [ In reply to ]
Hello All,

Sorry to reply late, but it seems like you were hitting the buffer limit for a port domain (group of ports). I don’t have an FLS in front of me (flying at the moment) so I can’t confirm, but I think we’re breaking up the buffer space into reserved segments for each port group. The reasoning behind this is that it keeps “slow drain” devices on a single interface from using up all available buffer space for the switch. The downside is that if a port exhausts its allotted buffers, it can cause slowdowns.
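
To put rough numbers on it (purely illustrative, not measured on your gear): a TCP flow needs about RTT x rate of data in flight, so filling 1 Gbps to a destination 70 ms away takes roughly 0.07 s x 125 MB/s = ~9 MB in flight, versus well under 1 MB for a 5 ms destination. Every burst that arrives on the 10G uplink and has to drain out a 1G server port sits in that port group's buffer allotment in the meantime, and on an FLS-class box that allotment is a small, fixed slice of the packet memory. When the slice overflows you get tail drops, and a drop costs a long-RTT flow many round trips to recover, which is one way a far-away destination can collapse to 20 MB/s while a nearby one still sees line rate.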

Over the years we’ve gone back and forth over whether it’s better to ship with shared buffers enabled; I think it would generate the same amount of TAC requests no matter what we do. Although the FLS isn’t as beefy as the FCX or ICX, it should still have some knobs you can turn to increase performance. This should be in the config guide.

I’d try to narrow down which device or devices are causing buffer pressure on the switch, and consider enabling Ethernet pause frames (flow control) on the switch and neighboring devices. There are also different QoS settings that can switch from strict queues to weighted round-robin (and other types) to help make better use of the buffers on the uplink ports.
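
From memory (I’m travelling, so please verify the exact syntax against the FastIron config guide for your release before relying on it), the knobs I have in mind look roughly like this:

  buffer-sharing-full                    (global config; lets port groups borrow from the shared pool)
  interface ethernet <server-port>
   flow-control                          (802.3x pause toward the attached device)
  qos mechanism <strict | weighted>      (pick the scheduler that suits the uplink; check your release's default)

Treat that as a sketch rather than a recommendation; command names and defaults shift a bit between FastIron releases.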

Sorry you’re running into this. The FLS is a very good campus access switch platform (good latency and minimal oversubscription for the cost), but my view is that it’s not the best switch to front-end server connections or heavy I/O. Others may disagree with me on this, though.

Wilbur

From: "nethub@gmail.com<mailto:nethub@gmail.com>"
Date: Friday, February 13, 2015 at 4:13 PM
To: Brad Fleming
Cc: 'Jeroen Wunnink | Hibernia Networks', "foundry-nsp@puck.nether.net<mailto:foundry-nsp@puck.nether.net>"
Subject: Re: [f-nsp] MLX throughput issues

We already tried a full system reboot last night and it didn’t seem to help. I’ll definitely keep your switch fabric reboot procedure in mind in case we run into that in the future.

I think we may have figured out at least a short-term solution. On the FLS648, we ran the command “buffer-sharing-full” and immediately we were able to get better speeds. It seems as though the FLS648’s buffers may have been causing the issue. We’ll continue to monitor over the next few days and see if this actually solves the issue.

Thanks everyone for your feedback thus far.



From: Brad Fleming [mailto:bdflemin@gmail.com]
Sent: Friday, February 13, 2015 4:24 PM
To: nethub@gmail.com
Cc: Jeroen Wunnink | Hibernia Networks; foundry-nsp@puck.nether.net
Subject: Re: [f-nsp] MLX throughput issues

Over the years we’ve seen odd issues where one of the switch-fabric links will “wig out” and some of the data moving between cards will get corrupted. When this happens we power cycle each switch fabric one at a time using this process:
1) Shutdown SFM #3
2) Wait 1 minute
3) Power SFM #3 on again
4) Verify all SFM links are up to SFM#3
5) Wait 1 minute
6) Perform steps 1-5 for SFM #2
7) Perform steps 1-5 for SFM #1

Not sure you’re seeing the same issue that we see, but the “SFM Dance” (as we call it) is a once-every-four-months thing somewhere across our 16 XMR4000 boxes. It can be done with little to no impact if you are patient and verify status before moving to the next SFM.
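
For reference, on our boxes the dance looks roughly like this (from memory, so double-check the NetIron command syntax for your 5.6 release before trying it):

  show sfm-links all      (confirm all fabric links are UP before starting)
  power-off snm 3
  (wait a minute)
  power-on snm 3
  show sfm-links all      (verify SFM 3 links are all UP again before moving on)

Then repeat for the other SFMs, one at a time.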


On Feb 13, 2015, at 11:41 AM, nethub@gmail.com wrote:

We have three switch fabrics installed; all are under 1% utilized.


From: Jeroen Wunnink | Hibernia Networks [mailto:jeroen.wunnink@atrato.com]
Sent: Friday, February 13, 2015 12:27 PM
To: nethub@gmail.com; 'Jeroen Wunnink | Hibernia Networks'
Subject: Re: [f-nsp] MLX throughput issues

How many switch fabrics do you have in that MLX, and how high is the utilization on them?

On 13/02/15 18:12, nethub@gmail.com wrote:
We also tested with a spare Quanta LB4M we have and are seeing about the same speeds as with the FLS648 (around 20MB/s or 160Mbps).

I also reduced the number of routes we are accepting down to about 189K and that did not make a difference.


From: foundry-nsp [mailto:foundry-nsp-bounces@puck.nether.net] On Behalf Of Jeroen Wunnink | Hibernia Networks
Sent: Friday, February 13, 2015 3:35 AM
To: foundry-nsp@puck.nether.net
Subject: Re: [f-nsp] MLX throughput issues

The FLS switches do something weird with packets. I've noticed they somehow interfere with changing the MSS window size dynamically, resulting in destinations further away having very poor speed results compared to destinations close by.
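
(For a sense of scale, with illustrative numbers: a window stuck at 64 KB over a 30 ms path caps a single TCP flow at roughly 64 KB / 30 ms = ~17 Mbps no matter how fast the links are, while the same window over a 2 ms path still allows around 260 Mbps.)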

We got rid of those a while ago.


_______________________________________________
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp
Re: MLX throughput issues [ In reply to ]
Wilbur,



Do you have a recommended configuration for the FLS648 buffers? The “buffer-sharing-full” command does help on some of the switches, but it does not help on others. I tried enabling flow-control but it did not help either. I’m not sure how to get acceptable performance.



We are running a single 10 Gbps uplink to each FLS648, and then have 40 servers connected by 1 Gbps ports. The average aggregate bandwidth for the whole switch is typically well under 1 Gbps, but we do burst above that occasionally which is why we have a 10 Gbps uplink.
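
As a rough illustration of why the 10G-to-1G step-down matters even at low average load (numbers are made up for the example): a 300 KB burst toward one server arrives from the 10 Gbps uplink in about 0.24 ms but drains out the 1 Gbps server port in about 2.4 ms, so roughly 270 KB has to sit in that port’s buffer in the meantime. If the per-port-group allotment is smaller than that, the tail of the burst gets dropped even though the switch as a whole is nearly idle.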



We are only using the FLS648 switches for layer 2 with VLANs. All layer 3 is handled by the upstream device (MLX).



We are running software version 7.2.02 on the FLS648 switches.



If you need any other information, please let me know.



Thanks.

_______________________________________________
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp