Mailing List Archive

XR6 process conflicts

Hi all,

Recently I was doing some testing on XR6 and noticed interesting behavior.

I enabled OSPF adjacency traps to see how the router handles them. I was
getting only a handful of traps on a big router with only a few OSPF
sessions.
That behavior aligns with the RFC 1850 definition, dating back to 1995:

   4.4. Throttling Traps

   The mechanism for throttling the traps is similar to the mechanism
   explained in RFC 1224 [11], section 5. The basic idea is that there
   is a sliding window in seconds and an upper bound on the number of
   traps that may be generated within this window. Unlike RFC 1224,
   traps are not sent to inform the network manager that the throttling
   mechanism has kicked in. A single window should be used to throttle
   all OSPF trap types except for the ospfLsdbOverflow and the
   ospfLsdbApproachingOverflow traps, which should not be throttled.
   For example, if the window time is 3, the upper bound is 3 and the
   events that would cause trap types 1, 3, 5 and 7 occur within a
   3 second period, the type 7 trap should not be generated.
   Appropriate values are 7 traps with a window time of 10 seconds.
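For context, the window logic the RFC describes boils down to a few lines. This is a rough sketch of my own (class and method names are mine, this is not XR's snmpd):

```python
import collections

# Sliding-window trap throttle per the RFC 1850 text above: at most
# `upper_bound` traps per `window` seconds, with the two LSDB overflow
# traps exempt from throttling.
EXEMPT = {"ospfLsdbOverflow", "ospfLsdbApproachingOverflow"}

class TrapThrottle:
    def __init__(self, window=10, upper_bound=7):
        self.window = window
        self.upper_bound = upper_bound
        self.sent = collections.deque()  # timestamps of recently sent traps

    def should_send(self, trap_type, now):
        if trap_type in EXEMPT:
            return True
        # discard timestamps that have slid out of the window
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) >= self.upper_bound:
            return False  # window is full: suppress this trap
        self.sent.append(now)
        return True
```

With the RFC's example values (window of 3, upper bound of 3), the fourth trap event inside a 3 second period is suppressed, matching the "type 7" example in the quoted text.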


TAC mentioned that this is expected behavior aimed at protecting the CPU,
and that it can be changed by tweaking how snmpd reacts to the OL
(overload) signal. While that makes some sense from an overall
perspective, it seems very strange that a lab XR router running on a
multicore Xeon CPU is unable to send out a handful of traps (I was
expecting about 50 traps over a one-minute period). At the same time, TAC
didn't mention anything about OSPF's built-in trap throttling mechanism.
Further investigation showed that snmpd was silently dropping internal
messages because of the OL condition. That is not how I would expect a
Linux-based system to behave.

To diagnose XR, I collected syslogs and wrote a quick script to compare
what XR sends out over its SNMP interface versus its syslog interface. To
my surprise, while the OSPF-related SNMP trap feed was really bad, the
syslog feed was clear and accurate. As if syslogd doesn't react to the OL
condition and relies on a different OS scheduling mechanism.
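The comparison script was nothing fancy; roughly this idea (the regexes and message formats below are illustrative only, real XR trap and syslog output differs and would need adjusting):

```python
import re

# Normalize each feed into (neighbor, new_state) events and report what
# syslog saw that the SNMP trap feed missed. Patterns are made up for
# illustration; adapt them to the actual collector output.
SYSLOG_RE = re.compile(r"Process OSPF.*Neighbor (\S+).*to (\w+)")
TRAP_RE = re.compile(r"ospfNbrStateChange.*nbr=(\S+).*state=(\w+)")

def events(lines, pattern):
    """Extract the set of (neighbor, state) events from a feed."""
    return {m.groups() for m in map(pattern.search, lines) if m}

def missing_traps(syslog_lines, trap_lines):
    """Events present in the syslog feed but absent from the trap feed."""
    return events(syslog_lines, SYSLOG_RE) - events(trap_lines, TRAP_RE)
```

The interesting output is simply the set difference: every adjacency change syslog reported that never showed up as a trap.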

To put extra stress on the router, I created a few hundred subinterfaces
with only a few OSPF neighbors and flapped an interface. Once again, the
SNMP traps sent to the collector were crippled, while syslog was clear
and reflected everything.

Going deeper, I tested BGP reconvergence and observed what happens there.
Once again, the SNMP trap feed was very bad at reflecting BGP FSM
transitions (I looked at cbgpFsmStateChange). However, this time even
outgoing syslog messages were affected as well. That is a bit surprising
too, as if BGP on XR uses the CPUs differently. I still don't get how an
idling lab router is unable to send traps and syslog messages indicating
that it just experienced a big outage.

On the one hand, some of that behavior derives from the router trying to
bring connectivity up and operational ASAP; on the other hand, these CPU
throttling rules were drafted back when routers ran on 800 MHz CPUs,
while now we're running on server-grade multi-core x86s that should not
only complete the whole SPF computation in milliseconds, but also swamp
any alarm collector at the same time. It feels like either the SNMP/syslog
reporting function in XR never caught up with the hardware improvements
that happened over time, or XR's OS process scheduler has major
deficiencies. And I suspect it is the latter.

I know SNMP and syslog are inherently unreliable, being UDP-based, but I
never expected that the router itself wouldn't even try to inform alarm
collectors about a potential large-scale outage. If it really is the OS
scheduler, a lot of existing processes, and any upcoming features like
BGP-LS or telemetry, may be affected.

Later I remembered that XR6 is now based on Wind River Linux, not QNX as
it was previously. While QNX is an RTOS, Linux is not, and its kernel
relies on a totally different scheduling principle. At the same time, it
feels as if XR's internals were bolted onto a new OS without much thought
about its architecture. Potentially that may lead to a large number of
gray outages that are neither properly detected by XR6 routers nor
reported to NOCs globally.

Am I the only one seeing this behavior in XR?
Has anyone else tested how XR routers running on multicore CPUs handle
concurrency? Has anyone compared how XR handles process concurrency for
network events?
Unfortunately, I am unable to share any data dumps, but I would be happy
to share the scripts and methods I used for the analysis.
I hope I am just overreacting.

Rgds,
Nival
_______________________________________________
cisco-nsp mailing list cisco-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/
Re: XR6 process conflicts [ In reply to ]
On Sun, 6 Sep 2020 at 10:22, nivalMcNd d <nivalmcnd@gmail.com> wrote:
>
> Hi all,

Hi Nival,

Can I ask, as you seem disappointed by what you have found: what were
you hoping to find? XR is a proprietary product, whose design has had
almost zero input from the majority of its users.

In your specific case/example; if I have a PE with a single physical
interface connected to some 3rd party wholesale Ethernet NNI, with 500
sub-interfaces, each running OSPF to a remote CPE; if the physical
interface goes down I don't need 500 SNMP traps or syslog messages to
tell me that all 500 OSPF sessions are down. There are two sides to
every coin.

Cheers,
James.
Re: XR6 process conflicts [ In reply to ]
On 13 September 2020 05:37:11 CEST, aaron1@gvtc.com wrote:
>Hi James, I'm coming into this conversation late or mid-point, but as a
>thought, if 1 of those 500 routers goes down, you need to know about
>that
>individual router's ospf state dropping. How else would you know that
>unless you sent traps on a per ospf-subinterface basis?

Hi Aaron,

Perhaps I misunderstood; my interpretation of the OP's concern was that when batches of the same event occur, only some of the traps are sent and the rest are suppressed, but all are reported via syslog. If a single OSPF session flaps, as we (all running IOS-XR and OSPF) already know, IOS-XR will send a trap.

Is my interpretation wrong? If yes, please ignore my ramblings. If no, then I'm curious to know what the OP's requirement is for receiving 500 traps (my experience is that it's too much noise to reasonably interpret and handle in a useful manner).

Cheers,
James.
Re: XR6 process conflicts [ In reply to ]
Oh ok, yeah, I didn't see the original post, disregard me then, thanks James

-Aaron


Re: XR6 process conflicts [ In reply to ]
On Sat, Sep 12, 2020, at 11:25, James Bensley wrote:

> In your specific case/example; if I have a PE with a single physical
> interface connected to some 3rd party wholesale Ethernet NNI, with 500
> sub-interfaces, each running OSPF to a remote CPE; if the physical
> interface goes down I don't need 500 SNMP traps or syslog messages to
> tell me that all 500 OSPF sessions are down. There are two sides to

OTOH, if the NNI service goes down (circuits are interrupted), but the interface stays up, you will be happy to know that ALL circuits are down (or at least which of them went down) when you open a ticket to the NNI provider.

--
R.-A. Feurdean
Re: XR6 process conflicts [ In reply to ]
On Monday, 14 September, 2020 19:38, "Radu-Adrian FEURDEAN" <cisco-nsp@radu-adrian.feurdean.net> said:

> On Sat, Sep 12, 2020, at 11:25, James Bensley wrote:
>
>> In your specific case/example; if I have a PE with a single physical
>> interface connected to some 3rd party wholesale Ethernet NNI, with 500
>> sub-interfaces, each running OSPF to a remote CPE; if the physical
>> interface goes down I don't need 500 SNMP traps or syslog messages to
>> tell me that all 500 OSPF sessions are down. There are two sides to
>
> OTOH, if the NNI service goes down (circuits are interrupted), but the interface
> stays up, you will be happy to know that ALL circuits are down (or at least which
> of them went down) when you open a ticket to the NNI provider.

And in an ideal world, of course, your monitoring platform will do intelligent root-cause analysis, suppress all the individual circuit alarms, generate a single master alarm for the NNI for the NOC to deal with, and notify all the impacted customers of the master ticket.

I'd usually want to err on the side of having more data and putting appropriate filtering between the data and the person viewing, rather than NOT having data it later turns out would be useful.

Regards,
Tim.


Re: XR6 process conflicts [ In reply to ]
On Tue, 15 Sep 2020 at 11:22, tim@pelican.org <tim@pelican.org> wrote:

> I'd usually want to err on the side of having more data and putting appropriate filtering between the data and the person viewing, rather than NOT having data it later turns out would be useful.

Yes, tons of (bad) input isn't a problem. Where we make mistakes is
generating a lot of non-actionable or redundant output for human
consumption. It is much better to omit sending alerts about real
problems to humans than to generate a lot of non-actionable alerts and
messages for human consumption.
We quickly learn to ignore input if it's rarely actionable, and
mistakes due to humans ignoring legit alerts will be far more common
than legit alerts not being generated. Of course, oftentimes this is a
game of where the blame falls: if you generate a lot of useless alerts
but never miss one, you did your job and the problem is on the
consumption side for not reacting.

So rather fix the situations where you discover legit alerts being
suppressed than spew out trash you don't know for a fact to be
actionable. You will have better overall results, but of course you
will have to carry the blame of managers asking 'why didn't we get an
alert'.

--
++ytti
Re: XR6 process conflicts [ In reply to ]
On 15 September 2020 10:17:09 CEST, "tim@pelican.org" <tim@pelican.org> wrote:
>On Monday, 14 September, 2020 19:38, "Radu-Adrian FEURDEAN"
><cisco-nsp@radu-adrian.feurdean.net> said:
>
>> On Sat, Sep 12, 2020, at 11:25, James Bensley wrote:
>>
>>> In your specific case/example; if I have a PE with a single physical
>>> interface connected to some 3rd party wholesale Ethernet NNI, with 500
>>> sub-interfaces, each running OSPF to a remote CPE; if the physical
>>> interface goes down I don't need 500 SNMP traps or syslog messages to
>>> tell me that all 500 OSPF sessions are down. There are two sides to
>>
>> OTOH, if the NNI service goes down (circuits are interrupted), but the
>> interface stays up, you will be happy to know that ALL circuits are
>> down (or at least which of them went down) when you open a ticket to
>> the NNI provider.
>
>And in an ideal world, of course, your monitoring platform will do
>intelligent root-cause analysis, suppress all the individual circuit
>alarms, generate a single master alarm for the NNI for the NOC to deal
>with, and notify all the impacted customers of the master ticket.
>
>I'd usually want to err on the side of having more data and putting
>appropriate filtering between the data and the person viewing, rather
>than NOT having data it later turns out would be useful.

Yeah, well, that's the ideal dream scenario. The reality is that many operators (especially small operators whose entire OSS stack is a single Observium/LibreNMS/Cacti/Solarwinds/whatever server) can't handle this.

I wonder if Cisco has tried to do what they did with LPTS in IOS-XR and implement what they consider a "sensible" default, i.e., the control plane is neither fully open and unlimited nor 100% closed by default, but some rate limiting is implemented out of the box as a compromise.

I guess the OP's concern is that there doesn't appear to be a command (unless I missed it?) to change the threshold above which XR starts suppressing traps?

Cheers,
James.