Mailing List Archive

Re: Famous operational issues
Hardly famous and not service-affecting in the end, but figured I'd
share an incident from our side that occurred back in 2018.

While commissioning a new node in our Metro-E network, an IPv6
point-to-point address was mis-typed. Instead of ending in /126, it
ended in /12. This happened in Johannesburg.

We actually came across this by chance while examining the IGP table of
another router located in Slough, and found an entry for 2c00::/12
floating around. That definitely looked out of place, as we never carry
parent blocks in our IGP.
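
To see how one missing digit produces that entry, here is a minimal Python sketch using the standard `ipaddress` module. The address `2c0f:db8::1` is a hypothetical stand-in (the real link address is not given in this thread), and `connected_network` is just an illustrative helper name.

```python
import ipaddress

# Hypothetical point-to-point address; the real one is not given in the thread.
ADDRESS = "2c0f:db8::1"


def connected_network(address: str, prefix_len: int) -> str:
    """Return the connected network implied by address/prefix_len.

    ip_interface() masks off the host bits, which is roughly what a router
    does when it installs the connected route for an interface address.
    """
    iface = ipaddress.ip_interface(f"{address}/{prefix_len}")
    return str(iface.network)


intended = connected_network(ADDRESS, 126)  # what should have been typed
mistyped = connected_network(ADDRESS, 12)   # what was actually typed

print(intended)  # 2c0f:db8::/126
print(mistyped)  # 2c00::/12
```

With a /12, only the first 12 bits of the address survive the mask, so any address starting with 2c0 collapses to 2c00::/12.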

Running the trace from Slough led us back to this one Metro-E device in
Jo'burg.

It took everyone nearly an hour to spot the typo. For all the laser
focus we had on the suspect link on the suspect box, we all overlooked
the fact that the /12 configured on the point-to-point link was
supposed to have been a /126.

This never caused a service problem because we do not redistribute our
IGP into BGP (not that anyone should). And even if we did, there are a
ton of filters and BGP communities on all devices to ensure a route
such as that would never have made it out of our AS.
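
As a rough illustration of the kind of check such a filter performs, here is a Python sketch. It is not our actual policy: the allocation `2c0f:db8::/32`, the /32 to /48 length bounds, and the `may_export` name are all assumptions made up for the example.

```python
import ipaddress

# Hypothetical allocation and length bounds, chosen only for illustration.
OWN_ALLOCATION = ipaddress.ip_network("2c0f:db8::/32")
MIN_LEN = 32
MAX_LEN = 48


def may_export(prefix: str) -> bool:
    """Allow a prefix out of the AS only if it sits inside our own
    allocation and its length is within the permitted range."""
    net = ipaddress.ip_network(prefix)
    if net.version != OWN_ALLOCATION.version:
        return False  # subnet_of() raises TypeError on mixed versions
    return net.subnet_of(OWN_ALLOCATION) and MIN_LEN <= net.prefixlen <= MAX_LEN


print(may_export("2c00::/12"))          # False: covers far more than our block
print(may_export("2c0f:db8:100::/48"))  # True: inside the block, sane length
print(may_export("2c0f:db8::/126"))     # False: too specific to announce
```

A parent block like 2c00::/12 fails the containment test, so it would be dropped no matter how it got into the table.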

Also, the IGP contains the most specific paths to every node in our
network, so the presence of the 2c00::/12 was mostly cosmetic. It would
have never been used for routing decisions.
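
A small Python sketch of longest-prefix matching shows why. The route list below is hypothetical (a made-up loopback and point-to-point prefix alongside the stray /12), and `longest_prefix_match` is an illustrative helper, not anything a router actually exposes.

```python
import ipaddress

# Hypothetical IGP contents: the stray /12 plus two made-up specifics.
ROUTES = ["2c00::/12", "2c0f:db8::10/128", "2c0f:db8:0:1::/126"]


def longest_prefix_match(routes, destination):
    """Return the most specific route covering destination, or None."""
    dest = ipaddress.ip_address(destination)
    matching = [
        net
        for net in (ipaddress.ip_network(r) for r in routes)
        if dest in net
    ]
    if not matching:
        return None
    return str(max(matching, key=lambda net: net.prefixlen))


# A node that is in the IGP always wins on its own, more specific, entry.
print(longest_prefix_match(ROUTES, "2c0f:db8::10"))  # 2c0f:db8::10/128
# Only a destination with no specific entry would fall through to the /12.
print(longest_prefix_match(ROUTES, "2c0f:ffff::1"))  # 2c00::/12
```

Since every real node already has a more specific entry, the /12 only ever matches destinations the IGP was never meant to reach.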

Mark.
Re: Famous operational issues
I only just now found this thread, so I'm sorry I'm late to the party, but here, I put it on Medium.

https://gushi.medium.com/the-worst-day-ever-at-my-day-job-beff7f4170aa

> On Mar 12, 2021, at 10:07 PM, Mark Tinka <mark@tinka.africa> wrote:
>
> Hardly famous and not service-affecting in the end, but figured I'd share an incident from our side that occurred back in 2018.
>
> While commissioning a new node in our Metro-E network, an IPv6 point-to-point address was mis-typed. Instead of ending in /126, it ended in /12. This happened in Johannesburg.
>
> We actually came across this by chance while examining the IGP table of another router located in Slough, and found an entry for 2c00::/12 floating around. That definitely looked out of place, as we never carry parent blocks in our IGP.
>
> Running the trace from Slough led us back to this one Metro-E device in Jo'burg.
>
> It took everyone nearly an hour to figure out the typo, because for all the laser focus we had on the supposed link of the supposed box that was creating this problem, we all overlooked the fact that the /12 configured on the point-to-point link was actually supposed to have been a /126.
>
> The reason this never caused a service problem was because we do not redistribute our IGP into BGP (not that anyone should). And even if we did, there are a ton of filters and BGP communities on all devices to ensure a route such as that would have never made it out of our AS.
>
> Also, the IGP contains the most specific paths to every node in our network, so the presence of the 2c00::/12 was mostly cosmetic. It would have never been used for routing decisions.
>
> Mark.
Re: Famous operational issues
What a day... hope you are better now :)


On 6/12/2021 2:42 AM, Dan Mahoney wrote:
> I only just now found this thread, so I'm sorry I'm late to the party,
> but here, I put it on Medium.
>
> https://gushi.medium.com/the-worst-day-ever-at-my-day-job-beff7f4170aa
Re: Famous operational issues
Opening the link currently gives me an HTTP 500 error, very fitting :)

On 12.06.2021 at 04:42, Dan Mahoney wrote:
> I only just now found this thread, so I'm sorry I'm late to the party, but here, I put it on Medium.
>
> https://gushi.medium.com/the-worst-day-ever-at-my-day-job-beff7f4170aa
Re: [EXTERNAL] Re: Famous operational issues
On 16/02/2021 22:51, Compton, Rich A wrote:

> There was the outage in 2014 when we got to 512K routes. http://www.bgpmon.net/what-caused-todays-internet-hiccup/

There was a similar issue in 1998/9 or so when we got to 64K routes,
which broke the routing table index (which defaulted to a uint16_t) on
any FreeBSD box doing BGP.

Fortunately a quick kernel recompile with the type changed to uint32_t
fixed that.
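
For anyone who has not been bitten by this, a quick Python sketch of the wraparound. It uses `ctypes` only to mimic fixed-width C integers; it is not the FreeBSD code, and the variable names are made up for the example.

```python
import ctypes

ROUTE_COUNT = 65_536  # "64K routes": one more than a uint16_t can hold

# ctypes integer types do no overflow checking, so the value is truncated
# to the width of the type, just as it would be in C.
last_ok = ctypes.c_uint16(ROUTE_COUNT - 1).value  # 65535 still fits
narrow = ctypes.c_uint16(ROUTE_COUNT).value       # wraps around to 0
wide = ctypes.c_uint32(ROUTE_COUNT).value         # fits comfortably

print(last_ok)  # 65535
print(narrow)   # 0
print(wide)     # 65536
```

With a 16-bit index, route number 65536 lands back on slot 0, which is why widening the type was enough to fix it.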

Ray
