Mailing List Archive

Health Monitoring: TM Egress data errors detected
Hello,


We recently saw this error message on our MLX-16; it caused packet
loss on all routed & switched connections.

The first time, we replaced the line cards (which did not change anything)
and rebooted the box afterwards (which solved the problem).
The second time, we powered one SFM down & up, which also seems to have
solved the problem so far.

System: Health Monitoring: TM Egress data errors detected on LP 4/TM 0
System: Health Monitoring: TM Egress data errors detected on LP 3/TM 0
System: Health Monitoring: TM Egress data errors detected on LP 2/TM 0
System: Health Monitoring: TM Egress data errors detected on LP 1/TM 0


Does anybody know if this is a hardware or software related problem?



Best regards,

Franz Georg Köhler
_______________________________________________
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp
Re: Health Monitoring: TM Egress data errors detected
* lists@openunix.de (Franz Georg Köhler) [Mon 07 Oct 2013, 15:42 CEST]:
>Does anybody know if this is a hardware or software related problem?

Yes.

Longer answer: It's both. It's hardware not being capable of
forwarding traffic over certain lines connecting linecards to the
switch fabrics across the backplane, and these lines not being
properly taken out of service. Or queues aren't being emptied, possibly
because you have overloaded ports, or because packets are stuck
in transit due to a software bug.

There are a bunch of counters in "show tm stat all" (the 'all' is
a hidden command) that should not go up rapidly.
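
One way to judge "rapidly" is to capture the output twice, some interval apart, and look at the per-second increase of each counter. A minimal Python sketch, assuming the counters have already been read into dictionaries; the function name, the threshold and the second snapshot's values are made up for illustration:

```python
def rapidly_rising(before, after, interval_s, max_per_s=1.0):
    """Return {counter: rate} for counters rising faster than max_per_s.

    before/after: counter name -> value from two captures of the same
    counter block. max_per_s is an arbitrary illustrative threshold.
    """
    flagged = {}
    for name, new in after.items():
        old = before.get(name)
        if old is None:
            continue  # counter missing from the first capture
        rate = (new - old) / interval_s
        if rate > max_per_s:
            flagged[name] = rate
    return flagged


# Hypothetical snapshots taken 60 seconds apart.
before = {
    "EGQ Segment Error Count": 68004,
    "Reassem Err Discard Pkt Count": 43214,
    "Discard Pkt Count": 93,
}
after = {
    "EGQ Segment Error Count": 68604,
    "Reassem Err Discard Pkt Count": 43214,
    "Discard Pkt Count": 95,
}
print(rapidly_rising(before, after, interval_s=60))
```

What counts as too fast depends on the box and its traffic, so the threshold is something to tune rather than a vendor figure.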

You don't mention what software release you're running, but I suggest
upgrading to the latest 5.4 or 5.5 if you're not on that already.


-- Niels.

Re: Health Monitoring: TM Egress data errors detected
On 07.10.13 16:52, Niels Bakker wrote:
> You don't mention what software release you're running, but I suggest
> upgrading to the latest 5.4 or 5.5 if you're not on that already.


Hello,


This box is running: 5.4.0cT163

All our NetIrons are on this version, and we have only seen this behaviour
on this single box.

We did not see any problems before the upgrade to 5.4, though.

I will upgrade this box soon and see if the problem persists. As we
have already replaced the line cards in question, only the SFM and
chassis are left as possible sources if this is hardware related.


Here are the tm counters of the first two modules:

>show tm stat all
--------- Ports 1/1 - 1/2 ---------
Ingress Counters:
Total Ingress Pkt Count: 3319716055168
EnQue Pkt Count: 3317296080787
EnQue Byte Count: 385472845703104
DeQue Pkt Count: 3317295663387
DeQue Byte Count: 385472805491152
TotalQue Discard Pkt Count: 2419974383
TotalQue Discard Byte Count: 552121774560
Oldest Discard Pkt Count: 417400
Oldest Discard Byte Count: 40211952
Flow Status Message Count: 0
Transmit Data Cell Count: 0
TDM_A Pkt Count: 0
TDM_B Pkt Count: 0
Programmable Ingress Counters:
[Queue Select: 8000, Queue Mask 0x0007]
Prg QDP EnQue Pkt Count: 0
Prg QDP EnQue Byte Count: 0
Prg QDP DeQue Pkt Count: 0
Prg QDP DeQue Byte Count: 0
Prg QDP Head Delete Pkt Count: 30
Prg QDP Head Delete Byte Count: 6704
Prg QDP Tail Delete Pkt Count: 0
Prg QDP Tail Delete Byte Count: 0
Prg Flow Status Message Count: 0
Egress Counters:
EnQue Pkt Count: 3673545381958
EnQue Byte Count: 397150033260832
Discard Pkt Count: 93
Discard Byte Count: 9312
EGQ Segment Error Count: 68004
EGQ Fragment Error Count: 65101
Port63 Error Pkt Count: 0
Pkt Header Error Pkt Count: 0
Pkt Lost Due to Buffer Full Pkt Count: 0
Reassem Err Discard Pkt Count: 43214
Reassem Err Discard Fragment(32B) Count: 3
TDM_A Lost Pkt Count: 0
TDM_B Lost Pkt Count: 0
Programmable Egress Counters:
[.Port Id for Enque: 0 (Disable), Port Id for Discard: 0 (Disable)]
Prg EGQ EnQue Pkt Count: 3672980211629
Prg EGQ EnQue Byte Count: 397142075142816
Prg EGQ Discard Pkt Count: 0
Prg EGQ Discard Byte Count: 0

--------- Ports 1/3 - 1/4 ---------
Ingress Counters:
Total Ingress Pkt Count: 771525102172
EnQue Pkt Count: 771525102172
EnQue Byte Count: 248846069503488
DeQue Pkt Count: 771525102172
DeQue Byte Count: 248846069503376
TotalQue Discard Pkt Count: 0
TotalQue Discard Byte Count: 0
Oldest Discard Pkt Count: 0
Oldest Discard Byte Count: 0
Flow Status Message Count: 0
Transmit Data Cell Count: 0
TDM_A Pkt Count: 0
TDM_B Pkt Count: 0
Programmable Ingress Counters:
[Queue Select: 8000, Queue Mask 0x0007]
Prg QDP EnQue Pkt Count: 0
Prg QDP EnQue Byte Count: 0
Prg QDP DeQue Pkt Count: 0
Prg QDP DeQue Byte Count: 0
Prg QDP Head Delete Pkt Count: 0
Prg QDP Head Delete Byte Count: 0
Prg QDP Tail Delete Pkt Count: 0
Prg QDP Tail Delete Byte Count: 0
Prg Flow Status Message Count: 0
Egress Counters:
EnQue Pkt Count: 276983915197
EnQue Byte Count: 46731993577056
Discard Pkt Count: 168090
Discard Byte Count: 42637856
EGQ Segment Error Count: 1717
EGQ Fragment Error Count: 5700
Port63 Error Pkt Count: 0
Pkt Header Error Pkt Count: 0
Pkt Lost Due to Buffer Full Pkt Count: 0
Reassem Err Discard Pkt Count: 456
Reassem Err Discard Fragment(32B) Count: 0
TDM_A Lost Pkt Count: 0
TDM_B Lost Pkt Count: 0
Programmable Egress Counters:
[.Port Id for Enque: 0 (Disable), Port Id for Discard: 0 (Disable)]
Prg EGQ EnQue Pkt Count: 276959731219
Prg EGQ EnQue Byte Count: 46729521099296
Prg EGQ Discard Pkt Count: 0
Prg EGQ Discard Byte Count: 0

--------- Ports 2/1 - 2/2 ---------
Ingress Counters:
Total Ingress Pkt Count: 3342733079381
EnQue Pkt Count: 3340087500072
EnQue Byte Count: 385116773819632
DeQue Pkt Count: 3340087083917
DeQue Byte Count: 385116733784800
TotalQue Discard Pkt Count: 2645579310
TotalQue Discard Byte Count: 480014746000
Oldest Discard Pkt Count: 416157
Oldest Discard Byte Count: 40034944
Flow Status Message Count: 0
Transmit Data Cell Count: 0
TDM_A Pkt Count: 0
TDM_B Pkt Count: 0
Programmable Ingress Counters:
[Queue Select: 8000, Queue Mask 0x0007]
Prg QDP EnQue Pkt Count: 0
Prg QDP EnQue Byte Count: 0
Prg QDP DeQue Pkt Count: 0
Prg QDP DeQue Byte Count: 0
Prg QDP Head Delete Pkt Count: 0
Prg QDP Head Delete Byte Count: 0
Prg QDP Tail Delete Pkt Count: 0
Prg QDP Tail Delete Byte Count: 0
Prg Flow Status Message Count: 0
Egress Counters:
EnQue Pkt Count: 3701200226884
EnQue Byte Count: 396575257132608
Discard Pkt Count: 34041
Discard Byte Count: 3435008
EGQ Segment Error Count: 60426
EGQ Fragment Error Count: 67579
Port63 Error Pkt Count: 0
Pkt Header Error Pkt Count: 0
Pkt Lost Due to Buffer Full Pkt Count: 0
Reassem Err Discard Pkt Count: 57834
Reassem Err Discard Fragment(32B) Count: 5
TDM_A Lost Pkt Count: 0
TDM_B Lost Pkt Count: 0
Programmable Egress Counters:
[.Port Id for Enque: 0 (Disable), Port Id for Discard: 0 (Disable)]
Prg EGQ EnQue Pkt Count: 3700644643304
Prg EGQ EnQue Byte Count: 396545353474624
Prg EGQ Discard Pkt Count: 0
Prg EGQ Discard Byte Count: 0

--------- Ports 2/3 - 2/4 ---------
Ingress Counters:
Total Ingress Pkt Count: 771840219871
EnQue Pkt Count: 771840219871
EnQue Byte Count: 248784467871456
DeQue Pkt Count: 771840219871
DeQue Byte Count: 248784467871456
TotalQue Discard Pkt Count: 0
TotalQue Discard Byte Count: 0
Oldest Discard Pkt Count: 0
Oldest Discard Byte Count: 0
Flow Status Message Count: 0
Transmit Data Cell Count: 0
TDM_A Pkt Count: 0
TDM_B Pkt Count: 0
Programmable Ingress Counters:
[Queue Select: 8000, Queue Mask 0x0007]
Prg QDP EnQue Pkt Count: 0
Prg QDP EnQue Byte Count: 0
Prg QDP DeQue Pkt Count: 0
Prg QDP DeQue Byte Count: 0
Prg QDP Head Delete Pkt Count: 0
Prg QDP Head Delete Byte Count: 0
Prg QDP Tail Delete Pkt Count: 0
Prg QDP Tail Delete Byte Count: 0
Prg Flow Status Message Count: 0
Egress Counters:
EnQue Pkt Count: 279687567020
EnQue Byte Count: 49570895027776
Discard Pkt Count: 92790
Discard Byte Count: 17163328
EGQ Segment Error Count: 1680
EGQ Fragment Error Count: 5269
Port63 Error Pkt Count: 0
Pkt Header Error Pkt Count: 0
Pkt Lost Due to Buffer Full Pkt Count: 0
Reassem Err Discard Pkt Count: 441
Reassem Err Discard Fragment(32B) Count: 0
TDM_A Lost Pkt Count: 0
TDM_B Lost Pkt Count: 0
Programmable Egress Counters:
[.Port Id for Enque: 0 (Disable), Port Id for Discard: 0 (Disable)]
Prg EGQ EnQue Pkt Count: 279637915732
Prg EGQ EnQue Byte Count: 49565697750048
Prg EGQ Discard Pkt Count: 0
Prg EGQ Discard Byte Count: 0
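
For comparing modules it can help to turn this output into a data structure. A rough Python sketch, assuming the layout shown above (a dashed "Ports" header, section lines ending in "Counters:", and "name: value" lines); `parse_tm_stats` is a hypothetical helper, not part of any Brocade tooling:

```python
import re

PORTS_RE = re.compile(r"^-+\s*Ports\s+(\S+)\s*-\s*(\S+)\s*-+$")
COUNTER_RE = re.compile(r"^(.+?):\s*(\d+)$")


def parse_tm_stats(text):
    """Parse pasted output into {ports: {section: {counter: value}}}."""
    stats = {}
    ports = section = None
    for raw in text.splitlines():
        line = raw.strip()
        m = PORTS_RE.match(line)
        if m:
            ports = f"{m.group(1)}-{m.group(2)}"
            stats[ports] = {}
            section = None
            continue
        # Skip anything before the first header, blanks and [..] settings lines.
        if ports is None or not line or line.startswith("["):
            continue
        if line.endswith("Counters:"):
            section = line[:-1]
            stats[ports][section] = {}
            continue
        m = COUNTER_RE.match(line)
        if m and section is not None:
            stats[ports][section][m.group(1)] = int(m.group(2))
    return stats


# Excerpt of the output above.
sample = """\
--------- Ports 1/1 - 1/2 ---------
Ingress Counters:
EnQue Pkt Count: 3317296080787
DeQue Pkt Count: 3317295663387
Egress Counters:
EnQue Pkt Count: 3673545381958
Reassem Err Discard Pkt Count: 43214
Reassem Err Discard Fragment(32B) Count: 3
"""
stats = parse_tm_stats(sample)
print(stats["1/1-1/2"]["Egress Counters"]["Reassem Err Discard Pkt Count"])
```

Counters are keyed by section because names such as "EnQue Pkt Count" appear under both Ingress and Egress.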




Best regards,

Franz Georg Köhler
Re: Health Monitoring: TM Egress data errors detected
This might be a hardware problem. It took me a couple of weeks to figure out what was wrong. When you see this error, try checking your switch fabrics: disable them one at a time to see if the errors go away. It is rare for all LP cards to report errors at once.

power-off sfm 1
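
The elimination logic can be written down as a small Python sketch. It assumes you note the TM error rate with all fabrics up and again with each SFM powered off in turn; the function name and the numbers are hypothetical, and nothing here talks to the router:

```python
def suspect_sfms(baseline_rate, rate_with_sfm_off, tolerance=0.0):
    """Return the SFMs whose removal made the TM errors stop.

    baseline_rate: errors per second with all SFMs powered on.
    rate_with_sfm_off: SFM number -> errors per second while only that
    SFM was powered off.
    """
    if baseline_rate <= tolerance:
        return []  # no errors to begin with, nothing to isolate
    return sorted(
        sfm for sfm, rate in rate_with_sfm_off.items() if rate <= tolerance
    )


# Made-up measurements: errors stop only while SFM 2 is off.
print(suspect_sfms(12.5, {1: 12.1, 2: 0.0, 3: 13.0}))
```

An empty result would mean that no single fabric explains the errors.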
Re: Health Monitoring: TM Egress data errors detected
Hello,

I recently saw the same error on a different machine with the same software
(5.4.0cT163), which suggests that this is software related.

I will now update to 5.4.0e.

Is anybody else on this list experiencing these TM problems with 5.4.0
software?

Reading the release notes, I found 2 related issues fixed as of 5.4.0e:

Defect ID: DEFECT000451346
Summary: Stuck queues seen on queue 8000/8003 which are not getting
flushed out.
Symptom: This can cause ingress TM Tail deletes.
Probability: Low
Feature: TM/SFM
Reported In Release: NI 05.4.00

Defect ID: DEFECT000451688
Technical Severity: Medium
Summary: After upgrading an MLXe-32 to NI 05.4.00b, SFM links go down
and TM errors are seen.
Symptom: SFM link disabled by system health monitor errors.
Probability: High
Feature: TM/SFM
Function: Health monitoring
Reported In Release: NI 05.4.00


These are the counters for the line card in question:

Egress Counters:
EnQue Pkt Count: 5846944940322
EnQue Byte Count: 292844419542304
Discard Pkt Count: 0
Discard Byte Count: 0
EGQ Segment Error Count: 197928
EGQ Fragment Error Count: 196802
Port63 Error Pkt Count: 771
Pkt Header Error Pkt Count: 232507
Pkt Lost Due to Buffer Full Pkt Count: 0
Reassem Err Discard Pkt Count: 195465
Reassem Err Discard Fragment(32B) Count: 1584
TDM_A Lost Pkt Count: 0
TDM_B Lost Pkt Count: 0
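
To put these numbers in proportion, the error counters can be related to the egress EnQue Pkt Count. A small Python sketch using the values above; taking EnQue Pkt Count as the denominator is my own assumption for a rough traffic baseline, and not all of these counters necessarily count whole packets:

```python
# Values copied from the egress counters above.
egress = {
    "EnQue Pkt Count": 5846944940322,
    "EGQ Segment Error Count": 197928,
    "EGQ Fragment Error Count": 196802,
    "Port63 Error Pkt Count": 771,
    "Pkt Header Error Pkt Count": 232507,
    "Reassem Err Discard Pkt Count": 195465,
}


def errors_per_million(counters, denominator="EnQue Pkt Count"):
    """Express every other counter per million of the denominator counter."""
    total = counters[denominator]
    return {
        name: value / total * 1_000_000
        for name, value in counters.items()
        if name != denominator
    }


for name, ppm in errors_per_million(egress).items():
    print(f"{name}: {ppm:.4f} per million enqueued packets")
```

These are cumulative counters, so the ratio says nothing about whether the errors came in one burst or trickled in over time.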



Best regards,

Franz Georg Köhler
Re: Health Monitoring: TM Egress data errors detected
Hello Folks,
The TM error counters you are seeing can be a bit cryptic. The TM (traffic
manager) is the chip that interfaces with the switch fabric (SFM) through
the SFM Links. When you see errors listed for Ingress or Egress, that is
relative to the chip, not the physical port. I think the reassembly errors
are the key in the output; the TM is having trouble putting back together
data that is originating from the backplane.

The MLX uses a CLOS fabric (every chip is exactly 2 hops away on every
possible connection), so packets are split into “cells” as they are
forwarded across the backplane. When we see these errors in the TM, it
usually means we have a hardware error between the TM and SFM. This could
be from a failing TM, a bad SERDES, or a bad SFM Link. In 5.4, the MLX
should be watching for any flapping SFM Links and attempting to
dynamically retune the SERDES to stop the error, but that won’t help if
you have a faulty TM.

Of course, an underlying code issue or bug could be causing these errors,
but I would try power cycling the module, reinserting it into the slot, or
making sure all the SFM Links and FEs (Fabric Elements in the SFM) are not
showing any problems.

If you’ve got support, I would check with Brocade TAC. They can run some
remote debug that will help isolate the problem. Most of the output is in
hex though, so it's a bit hard to cover here.

-Wilbur

Wilbur Smith
SE Ninja, Brocade
wilbur.k.smith@gmail.com
wsmith@brocade.com


Disclosure: While I am a Brocade employee, my participation
in this community is a personal choice and not directed by my employer. Any
information or recommendations I provide are my own and not an official
recommendation from Brocade. Sorry folks, just need to make sure you know
I’m doin’ this “off the clock”!


On 11/7/13, 2:11 AM, "Franz Georg Köhler" <lists@openunix.de> wrote:

>Reassem Err Discard

