Mailing List Archive

FCX and target path 8.0.10m (and an aside)
I thought I was doing the right thing by upgrading a couple of my slightly
aging FCXs to target path release 8.0.10m, which tested fine on an
unstacked unit with a single OSPF peering.

The ones I am running it on are stacks of two, each with two 10Gb/s
connections to the core and one OSPF peering on each.

Since the upgrade, both stacks suffer packet loss every 2 minutes (just
about exactly) for about 5-10 seconds, demonstrated by pinging either a
host through the stack, or an interface on the stack. There are no log
messages or changes in OSPF status or spanning tree activity. When it
happens, of course a remote session to the box stalls for the same period.

Shutting down either of the OSPF links doesn't make a difference. CPU
utilisation never moves from 1%. No errors on the interfaces. I've used
dm commands to catch packets going to the CPU at about the right time and
see nothing particularly alarming, and certainly no flooding of anything.
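
A rough Python sketch like the following, run from a nearby host, is
enough to log the start, length, and spacing of the loss windows (the
target address is only a placeholder for the stack's loopback or a host
behind it, and it assumes a Linux ping with -c/-W flags):

#!/usr/bin/env python3
"""Log the start time, duration, and spacing of ping-loss windows.

Rough sketch to confirm the 'loss every ~2 minutes for 5-10 seconds'
pattern; TARGET is a placeholder, and a Linux ping(8) with -c/-W is
assumed.
"""
import subprocess
import time

TARGET = "192.0.2.1"          # placeholder: stack loopback / OSPF interface
INTERVAL = 1.0                # seconds between probes

def ping_once(host: str) -> bool:
    """Return True if a single echo request gets a reply within 1 s."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def main() -> None:
    loss_start = None         # when the current loss window began
    last_window_start = None  # start of the previous loss window
    while True:
        now = time.time()
        if ping_once(TARGET):
            if loss_start is not None:
                duration = now - loss_start
                gap = (loss_start - last_window_start
                       if last_window_start is not None else None)
                print(f"{time.strftime('%H:%M:%S', time.localtime(loss_start))} "
                      f"loss window lasted {duration:.1f}s"
                      + (f", {gap:.1f}s after the previous one"
                         if gap is not None else ""))
                last_window_start = loss_start
                loss_start = None
        else:
            if loss_start is None:
                loss_start = now
        time.sleep(INTERVAL)

if __name__ == "__main__":
    main()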

This only started after the upgrade to 8.0.10m on each of them. I have
other FCX stacks on other code versions not exhibiting this issue.

Some of the comments in this thread seem to be reflective of my issue:

https://www.reddit.com/r/networking/comments/4j47uo/brocade_is_ruining_my_week_i_need_help_to/

I'm a little dismayed to get these problems on a Target Path release,
which I assumed would be pretty sound. I've been eyeing a potential
upgrade to something in the 8.0.30 train (recommendations?), with the
usual added excitement of picking up a fresh set of bugs.

Before I consider reporting it, I wondered if anyone had any useful
observations or suggestions.

And, as an aside, I wonder how we're all getting along in our new homes
for our dissociated Brocade family now. Very sad to see the assets of a
once good company scattered to the four winds like this.

Jethro.

. . . . . . . . . . . . . . . . . . . . . . . . .
Jethro R Binks, Network Manager,
Information Services Directorate, University Of Strathclyde, Glasgow, UK

The University of Strathclyde is a charitable body, registered in
Scotland, number SC015263.
_______________________________________________
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp
Re: FCX and target path 8.0.10m (and an aside)
The silence was deafening!

So, a bit of a development with this. We had three stack failure events
which required a hard reboot to sort out. We made the decision to upgrade
to 8.0.30q (we also replaced the CX4 cable, just in case it was degraded
in some way). The upgrade itself went fine.

Initially after the reboot, we didn't see the ping loss issues. But over
the past few hours it has started to creep in again, much the same as
previously. I've not re-done all the tests, like shutting down one OSPF
interface and then the other, to see if it makes any difference to the
problem, but my gut feeling is it will be just the same.

Does anyone have any thoughts? Could there be some sort of hardware
failure in one of the units that might cause these symptoms? Perhaps
there are more diagnostic tools available to me. It might also be
interesting to try downgrading back to the 7.4 version we were running
previously, where we didn't see these issues, but that's more
service-affecting downtime.

Jethro.



Re: FCX and target path 8.0.10m (and an aside)
Personally, I never moved away from 7.4.

Best regards.



Re: FCX and target path 8.0.10m (and an aside)
Recent 8.0.30 patch releases work a lot better than the 8.0.10 or 8.0.20
code trees for us. However, we still have an 8.0.20 ICX6450 stack with
high uptime that we are going to upgrade to 8.0.30q now:

STACKID 4 system uptime is 1035 days 2 hours 55 minutes 59 seconds
STACKID 1 system uptime is 1037 days 35 minutes 54 seconds
STACKID 2 system uptime is 518 days 46 minutes 27 seconds
STACKID 3 system uptime is 1036 days 23 hours 32 minutes 13 seconds
STACKID 5 system uptime is 323 days 15 hours 13 minutes 20 seconds
STACKID 6 system uptime is 14 hours 46 minutes 27 seconds
The system started at 11:13:42 GMT+01 Wed Apr 22 2015

We see the SSH server on 8.0.20c become unresponsive periodically. We did
not run the 8.0.10 code tree for long and upgraded to 8.0.20 years ago.

If you cannot reach the host via ping from the CLI, this is a connectivity
problem, not a forwarding problem.
If you can reach the host via CLI ping, but not from other hosts, this
would be a forwarding problem, because the FastIron doesn't need to
route/switch its own locally originated connections.

If you don't see too many CPU-bound packets with dm raw, that fits the low
CPU%, as too many packets hitting the CPU would increase the CPU%. This
doesn't seem to be your problem here.

You can check with dm ipv4 hw-route / dm ipv4 hw-arp whether the hardware
entries are programmed correctly while the host is unreachable.
Do you have independent management connectivity to the FCX so you can
check its status while it stops routing?
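
From an independent vantage point, a rough Python sketch along these
lines (addresses are placeholders; a Linux ping with -c/-W is assumed)
pings the FCX's own address and a host routed through it side by side, so
each loss window tells you whether it is control-plane-only or also a
forwarding problem, and marks the moment to run the dm ipv4 hw-route /
hw-arp checks on the console:

#!/usr/bin/env python3
"""Compare reachability of the switch itself vs a host routed through it.

Sketch only: if a loss window hits SWITCH_ADDR but not THROUGH_HOST, the
data plane is still forwarding and the problem looks control-plane side;
loss on both suggests a forwarding problem.  Addresses are placeholders.
"""
import subprocess
import time

SWITCH_ADDR = "192.0.2.1"     # placeholder: FCX loopback or OSPF interface
THROUGH_HOST = "192.0.2.100"  # placeholder: host whose traffic transits the FCX

def alive(host: str) -> bool:
    """Single echo request with a 1-second timeout."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

while True:
    switch_ok = alive(SWITCH_ADDR)
    through_ok = alive(THROUGH_HOST)
    if not (switch_ok and through_ok):
        stamp = time.strftime("%H:%M:%S")
        if through_ok and not switch_ok:
            verdict = "switch unreachable, transit OK -> control plane?"
        elif switch_ok and not through_ok:
            verdict = "transit broken, switch OK -> forwarding?"
        else:
            verdict = "both down -> link or forwarding problem"
        print(f"{stamp}  {verdict}")
    time.sleep(1)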


Best regards,

Franz Georg Köhler

_______________________________________________
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp
Re: FCX and target path 8.0.10m (and an aside)
On Thu, 22 Feb 2018, Kennedy, Joseph wrote:

> What is connected to these stacks from a client perspective?

Edge stacks :) Mostly HPE/Comware stuff of various vintages.

> Are you running PIM on the interfaces? IGMP snooping? Have you checked

PIM yes, IGMP snooping usually no - the VLANs are mostly routed on this
switch so it is the IGMP querier. I think I see where you're going with
this: over many Foundry/Brocade platforms we've had issues with multicast.

> your IGMP and multicast groups and how many group members are present?

Only a couple of hundred groups usually.

> Do you have any link-local groups showing up? Assuming active IGMP
> querier configuration of some kind, does the loss line up with any of
> the IGMP interface timers reaching 0?

There are link-local groups present for sure. We have a core MLX which
sees far more groups (it is on the path to the RP) and shows related CPU
issues, though not enough to be a problem at the moment. We're gradually
rolling out filtering of groups at the routed distribution layer to cut
down on this (although the effectiveness of this seems a bit hit and miss
on some platforms). The UPnP group is particularly pernicious, with
TTLs != 1.
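
On the timer question: the common IGMP general-query default of 125
seconds is suspiciously close to the roughly two-minute loss cycle, so
one crude check is to timestamp the IGMP traffic on an affected VLAN and
see whether the queries line up with the loss windows. A rough scapy
sketch (needs root and scapy installed; the interface name is a
placeholder):

#!/usr/bin/env python3
"""Timestamp IGMP packets on a VLAN to compare with ping-loss windows.

Rough sketch: requires scapy and root privileges; IFACE is a placeholder
for an interface in one of the affected VLANs.  The idea is to see whether
general queries (commonly sent every 125 s) line up with the loss windows.
"""
import time
from scapy.all import sniff, IP

IFACE = "eth0"   # placeholder: interface in an affected VLAN
last_seen = {}   # source address -> timestamp of its previous IGMP packet

def log_igmp(pkt):
    if IP not in pkt:
        return
    src = pkt[IP].src
    now = float(pkt.time)
    gap = now - last_seen[src] if src in last_seen else None
    last_seen[src] = now
    stamp = time.strftime("%H:%M:%S", time.localtime(now))
    print(f"{stamp}  IGMP from {src}"
          + (f"  ({gap:.0f}s since previous)" if gap is not None else ""))

# BPF filter "ip proto 2" matches IGMP (queries, reports, leaves alike)
sniff(iface=IFACE, filter="ip proto 2", prn=log_igmp, store=False)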

The ping loss was much less noticeable over this weekend, when there is
less activity on campus.

> Do your OSPF adjacencies look stable throughout or do they transition
> during the events? You said you notice loss through the stack but do you
> note any loss from the stack itself to the core uplinks?

OSPF adjacencies totally stable.

When the packet loss happens, it affects pings to the loopback address and
the OSPF interface addresses, but seemingly not so much addresses reached
through the stack. So it looks more like a control-plane issue than a
data-plane one.

On a previous stack failure we flipped the stacking cable to the opposite
ports (there are two in the stack) but it failed again after that. We
upgraded from target path 8.0.10m to 8.0.30q, but it made absolutely no
difference. I'm also looking at the procedure to downgrade to 7.4 to see
if the problem persists, although my gut feeling is that it is
hardware-based. The ping loss problem occurs on another stacked pair that
was 'upgraded' to 8.0.10m at the same time, but that doesn't exhibit the
stack failure issue.

I'm now monitoring the MIB, so hopefully I will get an SMS alert on
failures, and I have rigged up remote access to the power so we can
re-power the units remotely without a visit. This will buy us some time,
but essentially we're looking at bringing forward a replacement we were
likely to carry out this summer.
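
For the MIB monitoring, a minimal sketch of the sort of poller I mean is
below: it shells out to net-snmp's snmpwalk and alerts when the number of
stack units it can see drops. The object name is only an assumption
(something like snStackingOperUnitState from the FOUNDRY-SN-STACKING-MIB;
check your own MIB files), send_alert() is a placeholder for whatever
raises the SMS, and the address and community are placeholders too.

#!/usr/bin/env python3
"""Poll stack-unit state over SNMP and alert when a unit disappears.

Sketch only: OBJECT is an assumption (verify against your MIB files),
send_alert() is a stub, and the net-snmp command-line tools are assumed
to be installed on the monitoring host.
"""
import subprocess
import time

HOST = "192.0.2.1"        # placeholder: stack management address
COMMUNITY = "public"      # placeholder: read-only community
OBJECT = "FOUNDRY-SN-STACKING-MIB::snStackingOperUnitState"  # assumed name
EXPECTED_UNITS = 2

def count_units() -> int:
    """Return the number of rows snmpwalk sees for the stacking table."""
    try:
        out = subprocess.run(
            ["snmpwalk", "-v2c", "-c", COMMUNITY, HOST, OBJECT],
            capture_output=True, text=True, timeout=10, check=True,
        ).stdout
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return 0              # treat an unreachable agent as zero units
    return sum(1 for line in out.splitlines() if "=" in line)

def send_alert(message: str) -> None:
    print("ALERT:", message)  # placeholder: hook up the SMS gateway here

while True:
    units = count_units()
    if units < EXPECTED_UNITS:
        send_alert(f"stack reports {units}/{EXPECTED_UNITS} units at "
                   + time.strftime("%H:%M:%S"))
    time.sleep(60)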

Good questions, Joseph, thanks!

Addendum, Tuesday morning:

No failures to this point since Friday for us. However, the other FCX
stack that was upgraded at the same time and exhibited the ping loss issue
has now also experienced the stack break issue. It had been running for 15
days (although I'm not sure why it rebooted then). Curiously, last week a
third FCX stack broke itself apart too, but that one is running 07400m and
has been stable for a long, long time. I had been minded to think we just
had a hardware issue on our most problematic stack, but with two others
now showing the same symptoms, I'm starting to worry about whether the
stack issue is being provoked by some traffic. Seems hard to believe, but
I've suspected it for other equipment in the past ... ho hum.

Jethro.

