Mailing List Archive

PFE forwarding bug - PR1380183
I hit PR1380183 last week on an MX960.



https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR1380183



I currently run 17.4R2-S1.2 on all my MX960's.



The PR mentions a fix in 5 different versions of Junos.



Should I stick with the current train I'm in ?




Resolved In:

Release      junos
18.4R1         x
18.4R2         x
17.4R3         x
19.1R1         x
17.4R2-S2      x



...17.4R2-S2 is closest to what I'm currently using.



- Aaron



_______________________________________________
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp
Re: PFE forwarding bug - PR1380183
Unless you need some feature/function that is only in 18.x or 19.x, and everything else in 17.4 is good for you, I would suggest using 17.4R2-S[latest], which is S6 right now. S2 and above contain the fix for this PR.

My 2 cents worth.

Rich

Richard McGovern
Sr Sales Engineer, Juniper Networks
978-618-3342

I’d rather be lucky than good, as I know I am not good
I don’t make the news, I just report it


On 8/19/19, 11:36 AM, "Aaron Gould" <aaron1@gvtc.com> wrote:

> I hit PR1380183 last week on an MX960. [...] Should I stick with the
> current train I'm in ?






Re: PFE forwarding bug - PR1380183
Thanks Rich, that's similar to the guidance from my Juniper account SE. Also, 17.4R3 is being released in September, but I understand that once you jump R releases you get into new features, with potential for new bugs, correct? In other words, am I correct that the next S (service) release is the safest option, with the fewest changes relative to the train of code you are currently running?

(I just read this as a refresher for my understanding)
https://forums.juniper.net/t5/Junos/Current-JUNOS-Release-numbers-explained/td-p/58396


-Aaron


Re: PFE forwarding bug - PR1380183
Yes, I would say that in general staying on the same R release is safest, since no new features will be introduced, and yes, it is often new feature development that creates bugs. I have found that these often affect not the new feature but other standard features. This is IMHO, not necessarily the opinion of Juniper as a company.

That release-number explanation is from 2010, and things have changed dramatically since then. Currently the numbering for XX.YR#-S# is:

XX = the year
Y = the quarter (1, 2, 3, or 4)
R = the release number. Each R release now generally introduces new features: R changes are no longer SW improvements only, but a combination of SW improvements and new features, to accelerate feature delivery. This is needed given the rate of change within the industry.
S = the SW-improvement-only vehicle. It replaced the old R, which under the prior guidelines could not include new features. An S release is therefore now, 90+% of the time, the JTAC-recommended version to use. (For example, 17.4R2-S6 is the 2017 Q4 train, second release, sixth service release.)

My suggestion: pick the XX.YR# you want, go to the SR pulldown, and ALWAYS use the latest S release for that stream.

I also suggest listening to your account SE more than anyone else.

Rich

PS - X releases are branches from the mainline, specific to a product family. They are generally done to release a new product when the XX.Y mainline is not on time to support it, or for stability work on a specific product family. X streams always eventually merge back into the mainline.

Richard McGovern
Sr Sales Engineer, Juniper Networks
978-618-3342

I’d rather be lucky than good, as I know I am not good
I don’t make the news, I just report it


On 8/20/19, 9:28 AM, "Aaron Gould" <aaron1@gvtc.com> wrote:

> Thanks Rich, similar to the guidance from my Juniper account SE. [...] am I
> correct that the next S (service) release is the safest and least changes
> as possible to the existing train of code you are currently running ?




Re: PFE forwarding bug - PR1380183
> Aaron Gould
> Sent: Monday, August 19, 2019 4:36 PM
>
> I hit PR1380183 last week on an MX960.
>
>
>
> https://prsearch.juniper.net/InfoCenter/index?page=prcontent&id=PR1380183
>
> I currently run 17.4R2-S1.2 on all my MX960's.
>
> The PR mentions a fix in 5 different versions of Junos.
>
> Should I stick with the current train I'm in ?
>
I'd ask your SE whether a JSU can be used to fix the bug; in my opinion that is the safest approach, as it limits the possibility of introducing regression bugs (fixes in an S release breaking other stuff).
Also, I'm wondering what the flap rate must be in order to trigger the bug. Would the exponential hold-down timer for interface state changes help avoid running into the conditions under which this bug is triggered?
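For reference, the exponential hold-down mentioned above is configurable per physical interface on Junos as interface damping. A minimal sketch follows; the interface name and timer values are illustrative only, not tuned recommendations:

```
interfaces {
    xe-11/0/0:2 {
        damping {
            /* exponential damping of link-state transitions:
               flaps accrue a penalty that decays with half-life;
               the interface is suppressed above the suppress
               threshold and reused below the reuse threshold */
            enable;
            half-life 5;
            max-suppress 20;
            reuse 1000;
            suppress 2000;
        }
    }
}
```

A fixed delay via the simpler "hold-time up <ms> down <ms>" interface statement is an alternative if exponential damping is more than the situation needs.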

adam


Re: PFE forwarding bug - PR1380183
Sure Adam,

Here are the logs after the interface flap that JTAC and I determined was the initial bug trigger. xe-11/0/0:2 is a member of a 2x10 gig LAG: xe-11/0/0:2 and xe-0/0/0:2 make up the 20 gig bundle. Only xe-11/0/0:2 flapped, and I never saw traffic come back to that interface even though it was up; no packets exited that interface after the bug hit. LACP errors/timeouts were seen from that time stamp on.
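(For anyone chasing similar symptoms, alongside the syslog below, commands along these lines can confirm the LACP side of the failure. This is a sketch, not the exact capture JTAC requested; ae61 is the bundle described above:)

```
show lacp interfaces ae61 extensive
show interfaces xe-11/0/0:2 extensive | match "flapped|errors"
```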

{master}
agould@my-mx-960> show log messages.0.gz | find "Aug 14 21"
Aug 14 21:07:36 my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_DOWN: ifIndex 672, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-11/0/0:2
Aug 14 21:07:37 my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_UP: ifIndex 672, ifAdminStatus up(1), ifOperStatus up(1), ifName xe-11/0/0:2
Aug 14 21:07:39 my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_DOWN: ifIndex 672, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-11/0/0:2
Aug 14 21:07:42 my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_UP: ifIndex 672, ifAdminStatus up(1), ifOperStatus up(1), ifName xe-11/0/0:2
Aug 14 21:07:50 my-mx-960 fpc11 mqss_wo_coreif_conn_cfcnt_wait_for_zero: Timeout occured while waiting for CFCNT value to become zero - conn_num 2
Aug 14 21:07:50 my-mx-960 fpc11 mqss_stream_out_disable: Waiting for CFCNT value to become zero for WO connection failed - status 29, conn_num 2
Aug 14 21:07:50 my-mx-960 fpc11 mqss_ifd_link_up_down_handler: Disabling PHY stream for egress side failed - status 29, instance 0, phy_stream 1094
Aug 14 21:07:50 my-mx-960 fpc11 pfe_ifd_link_updown: Handling IFD link DOWN failed - status 29, ifd xe-11/0/0:2
Aug 14 21:07:50 my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_DOWN: ifIndex 672, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-11/0/0:2
Aug 14 21:07:50 my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_UP: ifIndex 672, ifAdminStatus up(1), ifOperStatus up(1), ifName xe-11/0/0:2
Aug 14 21:07:53 my-mx-960 lacpd[13424]: LACPD_TIMEOUT: xe-11/0/0:2: lacp current while timer expired current Receive State: CURRENT
Aug 14 21:08:23 my-mx-960 lacpd[13424]: LACPD_TIMEOUT: xe-11/0/0:2: lacp current while timer expired current Receive State: CURRENT
Aug 14 21:10:22 my-mx-960 last message repeated 3 times
Aug 14 21:14:52 my-mx-960 last message repeated 9 times
Aug 14 21:15:22 my-mx-960 lacpd[13424]: LACPD_TIMEOUT: xe-11/0/0:2: lacp current while timer expired current Receive State: CURRENT
Aug 14 21:15:52 my-mx-960 lacpd[13424]: LACPD_TIMEOUT: xe-11/0/0:2: lacp current while timer expired current Receive State: CURRENT
Aug 14 21:17:52 my-mx-960 last message repeated 4 times
Aug 14 21:23:52 my-mx-960 last message repeated 12 times
Aug 14 21:24:07 my-mx-960 fpc11 mqss_wo_coreif_conn_credits_wait_for_init_value: Timeout occured while waiting for available credits value to become initial value - conn_num 2, credits 0, init_credits 3
Aug 14 21:24:07 my-mx-960 fpc11 mqss_stream_phy_stream_out_wanio_cr_flush_start: Waiting for available credits value to become initial value for WO connection failed - status 29, conn_num 2
Aug 14 21:24:07 my-mx-960 fpc11 mqss_stream_phy_stream_out_wanio_cr_flush: Starting traffic flush using WANIO core flush for PHY stream failed - status 29, stream_num 1094, chmac_speed 0, pr_stream 32
Aug 14 21:24:07 my-mx-960 fpc11 mqss_stream_out_disable_wanio_ea: Starting traffic flush for PHY stream using WANIO core failed - status 29, stream_num 1094
Aug 14 21:24:07 my-mx-960 fpc11 mqss_stream_out_disable_wanio: Performing egress PHY stream disable operations for WANIO failed - status 29, stream_num 1094
Aug 14 21:24:07 my-mx-960 fpc11 mqss_stream_out_disable: Performing egress PHY stream disable operations for WANIO failed - status 29, stream_num 1094
Aug 14 21:24:07 my-mx-960 fpc11 mqss_ifd_link_up_down_handler: Disabling PHY stream for egress side failed - status 29, instance 0, phy_stream 1094
Aug 14 21:24:07 my-mx-960 fpc11 pfe_ifd_link_updown: Handling IFD link DOWN failed - status 29, ifd xe-11/0/0:2
Aug 14 21:24:07 my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_DOWN: ifIndex 672, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-11/0/0:2
Aug 14 21:24:08 my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_UP: ifIndex 672, ifAdminStatus up(1), ifOperStatus up(1), ifName xe-11/0/0:2
Aug 14 21:24:23 my-mx-960 lacpd[13424]: LACPD_TIMEOUT: xe-11/0/0:2: lacp current while timer expired current Receive State: CURRENT
Aug 14 21:24:53 my-mx-960 lacpd[13424]: LACPD_TIMEOUT: xe-11/0/0:2: lacp current while timer expired current Receive State: CURRENT
Aug 14 21:26:53 my-mx-960 last message repeated 4 times
Aug 14 21:29:53 my-mx-960 last message repeated 6 times
Aug 14 21:30:23 my-mx-960 lacpd[13424]: LACPD_TIMEOUT: xe-11/0/0:2: lacp current while timer expired current Receive State: CURRENT
Aug 14 21:30:53 my-mx-960 lacpd[13424]: LACPD_TIMEOUT: xe-11/0/0:2: lacp current while timer expired current Receive State: CURRENT
Aug 14 21:32:53 my-mx-960 last message repeated 4 times
Aug 14 21:42:53 my-mx-960 last message repeated 20 times
Aug 14 21:43:53 my-mx-960 last message repeated 2 times

Tshooting...
- I bounced xe-11/0/0:2: nothing.
- I bounced the neighboring 10 gig interfaces within that 4x10 breakout (xe-11/0/0:0, xe-11/0/0:1, xe-11/0/0:3): nothing. As a matter of fact, it went from bad to worse. By bouncing those neighboring members of the 4x10 breakout, I broke all the other interfaces on that vPIC of the MPC7E-MRATE card! All of these went down (so in bouncing those neighboring 10's within the 4x10, I also broke one of my 100 gig interfaces in that PIC group, namely et-11/0/2):

agould@my-mx-960> show interfaces terse | grep ^"xe|et" | grep ^..-11/0
xe-11/0/0:0 up up
xe-11/0/0:0.0 up up aenet --> ae50.0
xe-11/0/0:1 up up
xe-11/0/0:1.0 up up aenet --> ae50.0
xe-11/0/0:2 up up
xe-11/0/0:2.0 up up aenet --> ae61.0
xe-11/0/0:3 up up
xe-11/0/0:3.0 up up aenet --> ae100.0
et-11/0/2 up up
et-11/0/2.0 up up aenet --> ae1.0
et-11/0/5 up down

here's what it showed when I tried that...
Aug 15 13:50:54 my-mx-960 rpd[14820]: RPD_LDP_NBRDOWN: LDP neighbor 111.222.129.10 (ae1.0) is down
Aug 15 13:50:54 my-mx-960 rpd[14820]: RPD_LDP_SESSIONDOWN: LDP session 111.222.128.7 is down, reason: all adjacencies down
Aug 15 13:50:54 my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_DOWN: ifIndex 733, ifAdminStatus up(1), ifOperStatus down(2), ifName ae1
Aug 15 13:50:54 my-mx-960 mib2d[15194]: SNMP_TRAP_LINK_DOWN: ifIndex 719, ifAdminStatus up(1), ifOperStatus down(2), ifName et-11/0/2
Aug 15 13:50:54 my-mx-960 rpd[14820]: RPD_OSPF_NBRDOWN: OSPF neighbor 111.222.129.10 (realm ospf-v2 ae1.0 area 0.0.0.1) state changed from Full to Down due to KillNbr (event reason: interface went down)
Aug 15 13:50:54 my-mx-960 fpc11 PFE 0: 'PFE Disable' action performed. Bringing down ifd et-11/0/5 218
Aug 15 13:50:54 my-mx-960 fpc11 PFE 0: 'PFE Disable' action performed. Bringing down ifd xe-11/0/0:0 213
Aug 15 13:50:54 my-mx-960 fpc11 PFE 0: 'PFE Disable' action performed. Bringing down ifd xe-11/0/0:1 214
Aug 15 13:50:54 my-mx-960 fpc11 PFE 0: 'PFE Disable' action performed. Bringing down ifd xe-11/0/0:2 215
Aug 15 13:50:54 my-mx-960 fpc11 PFE 0: 'PFE Disable' action performed. Bringing down ifd xe-11/0/0:3 216
Aug 15 13:50:54 my-mx-960 kernel: if_msg_ifd_cmd_tlv_decode ifd et-11/0/2 #217 down with ASIC Error
Aug 15 13:50:54 my-mx-960 kernel: if_msg_ifd_cmd_tlv_decode ifd et-11/0/5 #218 down with ASIC Error
Aug 15 13:50:54 my-mx-960 kernel: if_msg_ifd_cmd_tlv_decode ifd xe-11/0/0:0 #213 down with ASIC Error
Aug 15 13:50:54 my-mx-960 kernel: if_msg_ifd_cmd_tlv_decode ifd xe-11/0/0:1 #214 down with ASIC Error
Aug 15 13:50:54 my-mx-960 kernel: if_msg_ifd_cmd_tlv_decode ifd xe-11/0/0:2 #215 down with ASIC Error
Aug 15 13:50:54 my-mx-960 kernel: if_msg_ifd_cmd_tlv_decode ifd xe-11/0/0:3 #216 down with ASIC Error
Aug 15 13:50:54 my-mx-960 fpc11 Cmerror Op Set: MQSS(0): MQSS(0): WANIO_CR: Parity Protect: Parity error detected for Tx SRAM - detected_txlink 0x1
Aug 15 13:50:54 my-mx-960 fpc11 PFE 0: 'PFE Disable' action performed. Bringing down ifd et-11/0/2 217
Aug 15 13:50:54 my-mx-960 fpc11 mqss_wo_coreif_conn_credits_wait_for_init_value: Timeout occured while waiting for available credits value to become initial value - conn_num 2, credits 0, init_credits 3
Aug 15 13:50:54 my-mx-960 fpc11 mqss_stream_phy_stream_out_wanio_cr_flush_start: Waiting for available credits value to become initial value for WO connection failed - status 29, conn_num 2
Aug 15 13:50:54 my-mx-960 fpc11 mqss_stream_phy_stream_out_wanio_cr_flush: Starting traffic flush using WANIO core flush for PHY stream failed - status 29, stream_num 1094, chmac_speed 0, pr_stream 32
Aug 15 13:50:54 my-mx-960 fpc11 mqss_stream_out_disable_wanio_ea: Starting traffic flush for PHY stream using WANIO core failed - status 29, stream_num 1094
Aug 15 13:50:54 my-mx-960 fpc11 mqss_stream_out_disable_wanio: Performing egress PHY stream disable operations for WANIO failed - status 29, stream_num 1094
Aug 15 13:50:54 my-mx-960 fpc11 mqss_stream_out_disable: Performing egress PHY stream disable operations for WANIO failed - status 29, stream_num 1094
Aug 15 13:50:54 my-mx-960 fpc11 mqss_ifd_link_up_down_handler: Disabling PHY stream for egress side failed - status 29, instance 0, phy_stream 1094
Aug 15 13:50:54 my-mx-960 fpc11 pfe_ifd_link_updown: Handling IFD link DOWN failed - status 29, ifd xe-11/0/0:2
Aug 15 13:50:54 my-mx-960 fpc11 mqss_wo_coreif_conn_credits_wait_for_init_value: Timeout occured while waiting for available credits value to become initial value - conn_num 3, credits 10, init_credits 3
Aug 15 13:50:54 my-mx-960 fpc11 mqss_stream_phy_stream_out_wanio_cr_flush_start: Waiting for available credits value to become initial value for WO connection failed - status 29, conn_num 3
Aug 15 13:50:54 my-mx-960 fpc11 mqss_stream_phy_stream_out_wanio_cr_flush: Starting traffic flush using WANIO core flush for PHY stream failed - status 29, stream_num 1095, chmac_speed 0, pr_stream 33
Aug 15 13:50:54 my-mx-960 fpc11 mqss_stream_out_disable_wanio_ea: Starting traffic flush for PHY stream using WANIO core failed - status 29, stream_num 1095
Aug 15 13:50:54 my-mx-960 fpc11 mqss_stream_out_disable_wanio: Performing egress PHY stream disable operations for WANIO failed - status 29, stream_num 1095
Aug 15 13:50:54 my-mx-960 fpc11 mqss_stream_out_disable: Performing egress PHY stream disable operations for WANIO failed - status 29, stream_num 1095
Aug 15 13:50:54 my-mx-960 fpc11 mqss_ifd_link_up_down_handler: Disabling PHY stream for egress side failed - status 29, instance 0, phy_stream 1095
Aug 15 13:50:54 my-mx-960 fpc11 pfe_ifd_link_updown: Handling IFD link DOWN failed - status 29, ifd xe-11/0/0:3


-Aaron


_______________________________________________
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp