Mailing List Archive

OSPF and BGP flapping when enabling a certain amount of BGP neighbors
Hi,

one of our CER2024F routers started to behave strangely for no apparent
reason; we had not applied any changes beforehand:

A while ago the device showed up in our monitoring with flapping OSPF
sessions caused by malformed packets and BGP sessions with expired
hold-timers. This made the device unresponsive, so we disabled
most BGP sessions except the one to our transit partner and four iBGP
sessions. This brought the device back to an operational state.

As a replacement we received an identical new device from our vendor and
applied a configuration backup of the former device, but it behaved just
like the old one once we took all sessions back into service.

To get an idea of how many sessions are needed to cause issues, we
carefully took the sessions of small networks into service one by one
while observing CPU usage, memory usage and the number of installed
routes. No issues occurred, so we took two big sessions (the DECIX route
servers) into service; again, nothing remarkable happened.
Encouraged by that, we took 10 sessions into service simultaneously and
the OSPF flapping started, so we disabled them and the device was able to
cope with its workload again.
To make sure we were not exceeding the capabilities of the device, we took
those sessions into service one by one with a delay of 10 seconds. This
did *not* cause OSPF flaps or BGP connections to restart, so we decided
to take the last 10 remaining sessions into service at once again, which
almost immediately caused OSPF flaps and BGP session restarts.
We then disabled all the sessions we had taken into service, except the
transit partner and the four iBGP sessions, but the flapping continued;
the only way to get the CER back to an operational state was to reload it
with most of the BGP sessions disabled by default.

However, we were able to pull some information from the device during
the last flapping episode. We did not see a significant change in memory
usage, but the CPU load increased dramatically:

SSH@CER(config-bgp)#sho cpu-utilization

00:09:57 GMT+01 Fri Jun 22 2018

... Usage average for all tasks in the last 1 seconds ...
==========================================================
Name                    us/sec          %

idle                    0               0
con                     35              0
mon                     190             0
flash                   44              0
dbg                     39              0
boot                    70              0
main                    0               0
itc                     0               0
tmr                     4358            0
ip_rx                   26720           2
scp                     54              0
lpagent                 357             0
console                 324             0
vlan                    0               0
mac_mgr                 199             0
mrp                     241             0
vsrp                    0               0
erp                     239             0
mxrp                    127             0
snms                    0               0
rtm                     638             0
rtm6                    301             0
ip_tx                   11100           1
rip                     0               0
l2vpn                   0               0
mpls                    0               0
nht                     0               0
mpls_glue               0               0
pcep                    0               0
bgp                     212773          21
bgp_io                  240             0
ospf                    1005            0
ospf_r_calc             1193            0
isis                    260             0
isis_spf                0               0
mcast                   460             0
msdp                    23              0
vrrp                    0               0
ripng                   0               0
ospf6                   667             0
ospf6_rt                0               0
mcast6                  557             0
vrrp6                   0               0
bfd                     20              0
ipsec                   57              0
l4                      0               0
stp                     0               0
gvrp_mgr                0               0
snmp                    458             0
rmon                    25              0
web                     1573            0
lacp                    4199            0
dot1x                   0               0
dot1ag                  177             0
loop_detect             127             0
ccp                     12              0
cluster_mgr             131             0
hw_access               0               0
ntp                     22              0
openflow_ofm            15              0
openflow_opm            30              0
dhcp6                   0               0
sysmon                  0               0
ospf_msg_task           0               0
ssl                     0               0
http_client             0               0
lp                      723566          76
LP-I2C                  35              0
ssh_0                   84              0
ssh_1                   2140            0
ssh_2                   5072            0
ssh_3                   43              0

The documentation states the device is able to handle 1.5 million routes,
and we did not exceed this limit:

SSH@CER(config-bgp)#show ip bgp route sum
   Total number of BGP routes (NLRIs) Installed     : 1210135
   Distinct BGP destination networks                : 697652
   Filtered bgp routes for soft reconfig            : 394895
   Routes originated by this router                 : 4
   Routes selected as BEST routes                   : 410535
   BEST routes not installed in IP forwarding table : 0
   Unreachable routes (no IGP route for NEXTHOP)    : 0
   IBGP routes selected as best routes              : 79640
   EBGP routes selected as best routes              : 330891


SSH@CER(config-bgp)#show ip route sum
IP Routing Table - 410845 entries
8 connected, 11 static, 0 RIP, 294 OSPF, 410532 BGP, 0 ISIS
Number of prefixes:
/0: 1 /4: 1 /8: 16 /9: 11 /10: 36 /11: 99 /12: 291 /13: 565 /14: 1099
/15: 1924 /16: 13355 /17: 7910 /18: 13673 /19: 24926 /20: 38033 /21:
44870 /22: 86917 /23: 70274 /24: 106572 /25: 12 /26: 11 /27: 25 /28: 21
/29: 21 /30: 67 /32: 115
Nexthop Table Entry - 682 entries

Can anybody give me a hint as to what could cause the behaviour described
above, or what to investigate in order to tackle this issue?


--
Frank Menzel - menzel@sipgate.de

sipgate GmbH - Gladbacher Str. 74 - 40219 Düsseldorf
HRB Düsseldorf 39841 - Geschäftsführer: Thilo Salmon, Tim Mois
Steuernummer: 106/5724/7147, Umsatzsteuer-ID: DE219349391

http://www.sipgate.de - http://www.sipgate.co.uk
_______________________________________________
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp
Re: OSPF and BGP flapping when enabling a certain amount of BGP neighbors
Can you post your config?

The LP load looks high.
Try disabling ICMP redirects in the config:

no ip icmp redirect

It's a Brocade thing...



Kind regards/Met vriendelijke groet,

Dennis op de Weegh

 

Bitency
Willem van Oranjestraat 9
4931NJ Geertruidenberg
 
KvK (Chamber of Commerce) number: 20144338
VAT (BTW) number: NL213538519B01
 
W: www.bitency.nl
E: info@bitency.nl
T: +31 (0)162 714066


Re: OSPF and BGP flapping when enabling a certain amount of BGP neighbors
I'll second Dennis. Disabling icmp redirects is extremely important if you
have multiple addresses on a single interface.

If you have a lot of routes, you may need to change your system-max
values. Run 'show default values' and look for the ip-route and ip-cache
values (and their ipv6- equivalents). The defaults are usually quite low
(290k routes on our CER2024Fs; this needs to fit your entire FIB). Change
them with 'system-max <parameter> <value>', write mem, then reload. On the
MLX, you also have to worry about CAM partitioning profiles. The CER2024F
may be able to handle 1.5M routes in the BGP RIB, but it has a hardware
maximum of 524288 routes in the FIB.
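
Roughly, that workflow looks like the following; treat it as a sketch,
since the exact parameter names and sensible values depend on platform and
code release (check them against your own 'show default values' output
first):

  SSH@CER# show default values
  SSH@CER# configure terminal
  SSH@CER(config)# system-max ip-route 524288
  SSH@CER(config)# system-max ip-cache 524288
  SSH@CER(config)# exit
  SSH@CER# write memory
  SSH@CER# reload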

I have also seen a lot of lp-cpu usage caused by multicast traffic,
especially with older code.

If you see high lp CPU again in the future, you can run 'dm pstat' a few
times to try to get an idea of what kind of traffic you are receiving. The
first run is typically a throwaway, as it shows the counts since the last
run. It gives per-PP stats, but I think the CERs only have one PP anyway.
If you are feeling brave, you can use 'rconsole' to connect to the LP and
play with 'debug packet capture' (it captures/displays packets that are
hitting the LP CPU), but beware... I have had devices unexpectedly reboot
while playing with that. Always specify a limit.
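
For what it's worth, the sequence I have in mind looks roughly like this.
Syntax is from memory, so check the '?' help on your code release before
relying on it, and always give the capture an explicit packet limit:

  SSH@CER# dm pstat
  SSH@CER# dm pstat          (the second run shows the delta since the first)
  SSH@CER# rconsole 1
  LP-1# debug packet capture ?     (pick the filter options your release
                                    offers and set a small count limit)
  LP-1# no debug packet capture
  LP-1# exit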

--
Eldon

Re: OSPF and BGP flapping when enabling a certain amount of BGP neighbors
Disabling the ICMP redirects looks absolutely promising; we found
metrics for that, and the number of redirects during our testing window
was significant. I'll definitely give it a try. Stay tuned, I'll
report back.

Thanks for replying!


--
Frank Menzel - menzel@sipgate.de
Telefon: +49 (0)211-63 55 55-98
Telefax: +49 (0)211-63 55 55-22

sipgate GmbH - Gladbacher Str. 74 - 40219 Düsseldorf
HRB Düsseldorf 39841 - Geschäftsführer: Thilo Salmon, Tim Mois
Steuernummer: 106/5724/7147, Umsatzsteuer-ID: DE219349391

http://www.sipgate.de - http://www.sipgate.co.uk
Re: OSPF and BGP flapping when enabling a certain amount of BGP neighbors
We finally gave disabling the ICMP redirects a try, and it worked
like a charm!
I was asked about the metrics I mentioned; we are using Observium for this.
We could not set this in the global context, only on the interface:
interface ethernet 1/1
no ip redirect


Thank you so much!


--
Frank Menzel - menzel@sipgate.de
Telefon: +49 (0)211-63 55 55-98
Telefax: +49 (0)211-63 55 55-22

sipgate GmbH - Gladbacher Str. 74 - 40219 Düsseldorf
HRB Düsseldorf 39841 - Geschäftsführer: Thilo Salmon, Tim Mois
Steuernummer: 106/5724/7147, Umsatzsteuer-ID: DE219349391

http://www.sipgate.de - http://www.sipgate.co.uk
Re: OSPF and BGP flapping when enabling a certain amount of BGP neighbors
On Thu, Jun 28, 2018 at 10:15:14 +0200, Frank Menzel <menzel@sipgate.de> wrote:
>
> interface ethernet 1/1
> no ip redirect

You usually want to disable ICMP redirects at the global level.


Best regards,
Franz Georg Köhler

Re: OSPF and BGP flapping when enabling a certain amount of BGP neighbors
Based on this thread, I went to check whether we had that enabled on a few
CER2024Fs we have. It's not available at the global level.

SSH@cer2024#show version | inc IronWare
IronWare : Version 5.7.0bT183 Copyright (c) 1996-2014 Brocade
Communications Systems, Inc.


SSH@cer2024f(config)#no ip icmp ?
  burst-normal      Number of packets per second in normal burst mode
  echo              Global ICMP Echo message control
  max-err-msg-rate  ICMP max error message rate per second
  mpls-response     Global ICMP error response control on MPLS labeled packet
  unreachable       Global ICMP Unreachable message control


Yep, we're aware it's old; they're not doing much and haven't been
problematic. Curious what versions folks are running? I reached out to
Extreme in February, and they recommended 6.0.00f even though it was not
yet released at the time.
Re: OSPF and BGP flapping when enabling a certain amount of BGP neighbors
That's true, on the CER/CES the ICMP redirect setting is only available at
the interface level. The MLXe prompt indeed looks like this:

(config)#no ip icmp ?
  burst-normal      Number of packets per second in normal burst mode
  echo              Global ICMP Echo message control
  fast-echo-reply   Harware assisted ICMP response generation
  max-err-msg-rate  ICMP max error message rate per second
  mpls-response     Global ICMP error response control on MPLS labeled packet
  redirects         Global ICMP Redirect control
  unreachable       Global ICMP Unreachable message control
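
To summarize for the archive, using only the commands already shown in
this thread (repeat the interface block for every routed interface; the
port number is just an example):

  ! MLX/MLXe: global knob
  no ip icmp redirects

  ! CER/CES: per routed interface only
  interface ethernet 1/1
   no ip redirect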

On 29 Jun 2018, at 14:24, Chris Wopat wrote:

> Yep we're aware it's old- they're not doing much and haven't been
> problematic. Curious what version folks are running? I reached out to
> Extreme in February, and they recommended 6.0.00f even though it was
> not
> yet out at the time.

Currently we run 6.2.0b, 6.2.0a and 5.8.00f. Can't complain, as the changes
are minor; 6.2 introduced the BGP teardown restart-interval, and that
feature was kind of important for us.
_______________________________________________
foundry-nsp mailing list
foundry-nsp@puck.nether.net
http://puck.nether.net/mailman/listinfo/foundry-nsp