Mailing List Archive

Arista “IP-SLA” / Active Probing
Hello all,

We are trying to solve a requirement: we would like to test the viability
of our paths to the internet and tear down the BGP session if a path is
determined to be faulty. We recently had an issue where we did not lose
link or BGP, but the carrier lost the ability to route traffic to the
internet for us. Our existing automatic detection and remediation
strategies failed to detect this condition, and we lost customer packets.

Conceptually, we have a pair of DCS7050-QX landing a fiber each from two
ISPs with default routes on BGP at a dozen POPs around the US.

One of the ISPs is our primary transit, and one is predominantly for peered
customers, but we can use it for transit during issues with the primary
circuits.

I did some research on this, and it seems the best option available to us
may be an on-boot event handler that launches a Python daemon to actively
probe out each ISP circuit and make config changes in response to transit
failures.

However, I thought I’d reach out to the broader community to see if anyone
knows a better way to solve this, has an example script, or has
recommendations for methods of active monitoring to protect against
this sort of failure.

Thanks in advance for any insight and time.




Alex Buie
Senior Cloud Operations Engineer

450 Century Pkwy # 100 Allen, TX 75013
<https://maps.google.com/?q=450+Century+Pkwy+STE+100+%7C+Allen,+TX+%7C+75013&entry=gmail&source=g>
D: 469-884-0225 | www.cytracom.com
Re: Arista “IP-SLA” / Active Probing
Hi, Alex. If it helps, I've had a variant of this on our transit routers for enterprise purposes for a few years. We run DFZ and originate 0/0 and ::/0 internally, but because we follow them to the nearest egress (0/0 using NAT for path symmetry, ::/0 using conditional advertisement for path symmetry), we want to only originate the internal default routes if the external peering at that egress is "healthy". For IOS-XR, it effectively looks like this:

route-policy PS-TRANSIT-UP-IPV4
  if rib-has-route in TRANSIT-SUBNETS-V4 then
    pass
  endif
end-policy
!
route-policy PS-TRANSIT-UP-IPV6
  if rib-has-route in TRANSIT-SUBNETS-V6 then
    pass
  endif
end-policy

prefix-set TRANSIT-SUBNETS-V4
  212.123.212.184/30
end-set
!
prefix-set TRANSIT-SUBNETS-V6
  2001:920:3815::64/127
end-set

neighbor-group EBGP-CRT-IPV4
 ...
 address-family ipv4 unicast
  ...
  default-originate route-policy PS-TRANSIT-UP-IPV4

neighbor-group EBGP-CRT-IPV6
 ...
 address-family ipv6 unicast
  ...
  default-originate route-policy PS-TRANSIT-UP-IPV6

We keep those stanzas simple — is the direct link to the peer up, and therefore the direct route is in our RIB — but depending on the platform you're using, you may have more knobs to check things. For example, I can't directly check if a BGP peer is up/down, though I can match on routes the peer has given us (or lack thereof).

-dp

From: NANOG <nanog-bounces+dzimmerman=linkedin.com@nanog.org> on behalf of Alex Buie <abuie@cytracom.com>
Date: Wednesday, December 20, 2023 at 10:45 AM
To: nanog@nanog.org <nanog@nanog.org>
Subject: Arista “IP-SLA” / Active Probing
Re: Arista “IP-SLA” / Active Probing
On Fri, Dec 22, 2023 at 12:13 PM David Zimmerman via NANOG
<nanog@nanog.org> wrote:
> I've had a variant of this on our transit routers for enterprise purposes
> for a few years. We run DFZ and originate 0/0 and ::/0 internally, but

Hi David,

There are several variants on Alex's problem. One is that there's an
upstream failure reflected in the BGP table but Alex doesn't see it
because he's only taking a default route. Your solution, or one like
it, should work for that. In a nutshell:

1. Take a full table
2. Filter everything but a selection of representative routes
3. Set static default routes tied to addresses within the representative routes.

If the representative routes disappear from the table, the static
defaults become invalid and leave the local routing table as well.
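
The three steps above can be sketched roughly as follows. This is only a
toy model in Python of the next-hop resolution rule, not router
configuration: the prefixes are documentation ranges chosen for
illustration, and the function names are hypothetical. It assumes a
static default is usable only while its next hop is covered by some
non-default route in the table.

```python
import ipaddress


def resolves(next_hop, rib):
    """Return True if next_hop is covered by a non-default prefix in rib."""
    addr = ipaddress.ip_address(next_hop)
    for prefix in rib:
        net = ipaddress.ip_network(prefix)
        # Skip other address families, and never resolve via a default route.
        if net.version != addr.version or net.prefixlen == 0:
            continue
        if addr in net:
            return True
    return False


def active_static_defaults(static_next_hops, rib):
    """Keep only the static defaults whose next hop still resolves."""
    return [nh for nh in static_next_hops if resolves(nh, rib)]


# Hypothetical example: one representative route survived the filter,
# the other has been withdrawn by the upstream.
rib = ["192.0.2.0/24"]
print(active_static_defaults(["192.0.2.1", "198.51.100.1"], rib))
```

Excluding zero-length prefixes matters here: if the next hop were allowed
to resolve through a default route, the static default would never go
away.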

Or perhaps he has the reverse problem where he wants to advertise his
route only if the representative routes are there so that when his
anycast node has network problems it drops itself off the Internet and
allows others to take over.


Another variant is that BGP reports having the entire Internet table
but the packets don't get there. The upstream suffers from anything
from high packet loss to a misbegotten filtering rule that black-holes
all his packets. He'd like to do some active polling via static
routes to the upstream and drop both advertised and received routes
when the polling indicates a path failure.
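
One way to keep that kind of polling from flapping the session on a
single lost probe is to require several consecutive results before
changing state. The sketch below is a minimal illustration of that idea
in Python. The class name and thresholds are made up for the example, and
what actually happens on a "down" or "up" transition (dropping routes,
shutting the neighbor) is left out.

```python
class PathMonitor:
    """Track path health with separate failure and recovery thresholds."""

    def __init__(self, fail_after=3, recover_after=5):
        self.fail_after = fail_after        # consecutive failures before "down"
        self.recover_after = recover_after  # consecutive successes before "up"
        self.healthy = True
        self._streak = 0                    # results disagreeing with current state

    def record(self, probe_ok):
        """Feed one probe result; return 'down', 'up', or None if unchanged."""
        if probe_ok == self.healthy:
            # Result agrees with the current state, so reset the counter.
            self._streak = 0
            return None
        self._streak += 1
        limit = self.recover_after if probe_ok else self.fail_after
        if self._streak >= limit:
            self.healthy = probe_ok
            self._streak = 0
            return "up" if probe_ok else "down"
        return None


monitor = PathMonitor(fail_after=2, recover_after=2)
for ok in [False, True, False, False, True, True]:
    print(ok, monitor.record(ok))
```

Making recovery slower than failure is a judgment call; it trades a
longer time on the backup path for fewer session resets.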

I thought the latter was what he was asking for, but on a second
read-through I see he talked about taking a default route via BGP
rather than a full table.

Regards,
Bill Herrin


--
William Herrin
bill@herrin.us
https://bill.herrin.us/
Re: Arista “IP-SLA” / Active Probing
>
> I did some research on this and it seems like perhaps the on-boot event
> handler launching a python daemon to do this active probing out each isp
> circuit and then making config changes in response to transit failures
> might be the best option available to us.
>

Pretty much, yes. They don't have any functionality like IP-SLA in EOS, but
you can roll your own.

Ex:

https://github.com/arista-eosext/PingCheck
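
For a rough idea of the moving parts in a roll-your-own approach, here is
a minimal Python sketch. It is not taken from PingCheck. It assumes the
standard Linux ping flags (-c, -W, -I) are available in the switch's
shell, and the AS number and addresses are placeholders. How the
generated CLI lines get applied to the switch is deliberately left out.

```python
import subprocess


def ping_argv(target, source, timeout_s=1):
    """Build a Linux ping command: one echo request sourced from `source`."""
    return ["ping", "-c", "1", "-W", str(timeout_s), "-I", source, target]


def probe(target, source):
    """Return True if a single ping from `source` to `target` succeeds."""
    result = subprocess.run(
        ping_argv(target, source),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def bgp_session_commands(local_as, neighbor, shutdown):
    """Return the CLI lines that shut down or re-enable one BGP neighbor."""
    prefix = "" if shutdown else "no "
    return [
        "configure",
        f"router bgp {local_as}",
        f"{prefix}neighbor {neighbor} shutdown",
        "end",
    ]


# Placeholder values: a private AS number and documentation addresses.
print(bgp_session_commands(64512, "192.0.2.1", shutdown=True))
```

Sourcing the ping from the ISP-facing address does not by itself force
the probe out that circuit; the probe targets would also need routes
pinned to the upstream being tested.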


