Mailing List Archive

Telia Not Withdrawing v6 Routes
Has anyone else experienced issues where Telia won't withdraw (though will happily accept an overriding) prefixes for
the past week, at least?

E.g. 2620:6e:a003::/48 was a test prefix and should no longer appear in any DFZ; it has not been announced for a few days at
least, but it still shows up in Telia's LG and RIPE RIS as transiting Telia. Telia's LG traceroute doesn't, of course, go
anywhere; traces die immediately after a hop or with a !N.
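
As a rough illustration of the check described above (not something posted in the thread): given the AS paths that route collectors still report for a prefix that should be gone, the sketch below flags whether every remaining path transits one ASN. AS1299 is Telia's ASN; the sample paths are made up and use documentation ASNs (RFC 5398), and `paths_via` and `is_stuck_behind` are hypothetical helper names.

```python
TELIA_ASN = 1299  # Telia's ASN


def paths_via(as_paths, asn):
    """Return the AS paths (lists of ints) that traverse `asn`."""
    return [path for path in as_paths if asn in path]


def is_stuck_behind(as_paths, asn):
    """True if the prefix is still visible and every remaining path transits `asn`."""
    return bool(as_paths) and all(asn in path for path in as_paths)


# Made-up sample: paths still seen for a prefix that was withdrawn days ago.
# 64496-64511 are documentation ASNs, used here purely as placeholders.
observed = [
    [64500, 1299, 64496],
    [64501, 64502, 1299, 64496],
]

print(is_stuck_behind(observed, TELIA_ASN))
```

If every lingering path shares one transit ASN, that network is the natural place to start asking questions.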

Wouldn't be a problem except that I needed to withdraw another route due to a separate issue which wouldn't budge out of
Telia's tables until it was replaced with something else of higher pref.

Matt
Re: Telia Not Withdrawing v6 Routes
This same issue happened in Los Angeles a number of years ago, but for both IPv4 and v6. They need to set up sane BGP timers and/or advocate the use of BFD for BGP sessions, both customer-facing and internal.

Ryan
On Nov 15 2020, at 5:58 pm, Matt Corallo <nanog@as397444.net> wrote:
> Has anyone else experienced issues where Telia won't withdraw (though will happily accept an overriding) prefixes for
> the past week, at least?
>
> eg 2620:6e:a003::/48 was a test prefix and should not now appear in any DFZ, has not been announced for a few days at
> least, but shows up in Telia's LG and RIPE RIS as transiting Telia. Telia's LG traceroute doesn't of course, go
> anywhere, traces die immediately after a hop or with a !N.
>
> Wouldn't be a problem except that I needed to withdraw another route due to a separate issue which wouldn't budge out of
> Telia's tables until it was replaced with something else of higher pref.
>
> Matt
Re: Telia Not Withdrawing v6 Routes
Probably a ghost route. Such things happen :(

https://labs.ripe.net/Members/romain_fontugne/bgp-zombies

Their (nice) LG shows that it's still advertised from a router of theirs in Frankfurt (iBGP next hop ::ffff:2.255.251.224 – so by the way they use 6PE).
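
For readers wondering why that next hop implies 6PE: with 6PE (RFC 4798) the BGP next hop for IPv6 routes is carried as an IPv4-mapped IPv6 address, i.e. the PE's IPv4 address embedded under ::ffff:0:0/96. A small sketch using Python's standard `ipaddress` module to pull out the embedded IPv4 address (the variable names are just for illustration):

```python
import ipaddress

# The iBGP next hop shown by the looking glass.
next_hop = ipaddress.IPv6Address("::ffff:2.255.251.224")

# ipv4_mapped is an IPv4Address for ::ffff:a.b.c.d addresses, otherwise None.
mapped = next_hop.ipv4_mapped
print(mapped)  # 2.255.251.224

# An ordinary global IPv6 address has no embedded IPv4 address.
print(ipaddress.IPv6Address("2620:6e:a003::1").ipv4_mapped)  # None
```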

Your best option would probably be to re-advertise the exact same prefix, then re-withdraw it, then yell at Telia's NOC if it fails...


Some years ago we experienced something similar (a TI Sparkle router was still advertising one of our prefixes in Asia to their clients, which they had previously received from our former transit GTT – while we were advertising it in Europe...).


> On 16 Nov 2020, at 02:58, Matt Corallo <nanog@as397444.net> wrote:
>
> Has anyone else experienced issues where Telia won't withdraw (though will happily accept an overriding) prefixes for the past week, at least?
>
> eg 2620:6e:a003::/48 was a test prefix and should not now appear in any DFZ, has not been announced for a few days at least, but shows up in Telia's LG and RIPE RIS as transiting Telia. Telia's LG traceroute doesn't of course, go anywhere, traces die immediately after a hop or with a !N.
>
> Wouldn't be a problem except that I needed to withdraw another route due to a separate issue which wouldn't budge out of Telia's tables until it was replaced with something else of higher pref.
>
> Matt
Re: Telia Not Withdrawing v6 Routes
Yeah, I did try that on that test prefix but it just stuck around anyway. I don't care too much, it's just some stale
test prefix.

Sadly, I now see it again with 2620:6e:a002::/48, which, somewhat more impressively, is now generating a routing loop
Ashburn <-> NYC; it has always been announced from other places and was dropped/re-announced as well.
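
To make the loop concrete, here is a toy model (an assumption-laden sketch, not Telia's topology): each router holds one next hop for the prefix, and a stale entry on one side is enough for two routers to point at each other. The router names and the `trace` helper are hypothetical.

```python
def trace(next_hops, start, max_hops=32):
    """Follow per-router next hops from `start`; return (path, outcome)."""
    path = [start]
    seen = {start}
    node = start
    while len(path) <= max_hops:
        nxt = next_hops.get(node)
        if nxt is None:
            return path, "unreachable"   # no route at this router
        if nxt == "deliver":
            return path, "delivered"     # this router hands off to the destination
        path.append(nxt)
        if nxt in seen:
            return path, "loop"          # we came back to a router already visited
        seen.add(nxt)
        node = nxt
    return path, "hop limit"


# Stale state: each router believes the other is the way to the prefix.
stale = {"ashburn": "nyc", "nyc": "ashburn"}
print(trace(stale, "ashburn"))

# Consistent state: NYC delivers towards the real origin.
healthy = {"ashburn": "nyc", "nyc": "deliver"}
print(trace(healthy, "ashburn"))
```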

Must just be something with my particular prefixes, oh well.

Matt

On 11/15/20 10:40 PM, Olivier Benghozi wrote:
> Probably a ghost route. Such things happen :(
>
> https://labs.ripe.net/Members/romain_fontugne/bgp-zombies
>
> Their (nice) LG shows that it's still advertised from a router of theirs in Frankfurt (iBGP next hop ::ffff:2.255.251.224 – so by the way they use 6PE).
>
> Your best option would probably be to re-advertise the exact same prefix, then re-withdraw it, then yell at Telia's NOC if it fails...
>
>
> Some years ago we experienced something similar (it was a router of TI Sparkle still advertising a prefix of us in Asia to their clients, that they were previously receiving from our former transit GTT – we were advertising it in Europe...).
Re: Telia Not Withdrawing v6 Routes
Maybe one of the routers on the path doesn't like the large community in those routes? :)
By the way we currently see 2620:6e:a002::/48 at LINX LON1 from Choopa and HE...
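
For context on the large-community remark: a BGP large community (RFC 8092) is three unsigned 32-bit integers, conventionally written as `ASN:LocalData1:LocalData2`. The sketch below only validates that textual form; `parse_large_community` is a hypothetical helper and the example value uses a documentation ASN, not a community from the routes discussed here.

```python
def parse_large_community(text):
    """Parse 'A:B:C' into a tuple of three unsigned 32-bit integers."""
    parts = text.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected three fields, got {len(parts)}: {text!r}")
    if not all(p.isdigit() for p in parts):
        raise ValueError(f"fields must be decimal integers: {text!r}")
    values = tuple(int(p) for p in parts)
    if any(v > 0xFFFFFFFF for v in values):
        raise ValueError(f"field exceeds 32 bits: {text!r}")
    return values


print(parse_large_community("64496:1:2"))
```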

> On 16 Nov 2020, at 04:44, Matt Corallo <nanog@as397444.net> wrote:
>
> Yeah, I did try that on that test prefix but it just stuck around anyway. I don't care too much, it's just some stale test prefix.
>
> Sadly, I now see it again with 2620:6e:a002::/48, which, somewhat more impressively, is now generating a routing loop Ashburn <-> NYC; it has always been announced from other places and was dropped/re-announced as well.
>
> Must just be something with my particular prefixes, oh well.
>
> Matt
Re: Telia Not Withdrawing v6 Routes
Maybe? Never been an issue before. In this case the route does have a depref community on Telia hence why one wouldn’t expect it via the same path, but the other ghost route in question never had anything similar.

Matt

> On Nov 15, 2020, at 23:07, Olivier Benghozi <olivier.benghozi@wifirst.fr> wrote:
>
> Maybe one of the routers on the path doesn't like the large community in those routes? :)
> By the way we currently see 2620:6e:a002::/48 at LINX LON1 from Choopa and HE...
Re: Telia Not Withdrawing v6 Routes
For those curious, Johan indicated on Twitter this was a JunOS bug.

https://twitter.com/gustawsson/status/1328298914785730561

Matt

> On Nov 15, 2020, at 23:13, Matt Corallo <nanog@as397444.net> wrote:
>
> Maybe? Never been an issue before. In this case the route does have a depref community on Telia hence why one wouldn’t expect it via the same path, but the other ghost route in question never had anything similar.
>
> Matt
Re: Telia Not Withdrawing v6 Routes
----- On Nov 15, 2020, at 5:58 PM, Matt Corallo nanog@as397444.net wrote:

> Has anyone else experienced issues where Telia won't withdraw (though will
> happily accept an overriding) prefixes for the past week, at least?

I have seen issues like this in a network that I operated. In that particular
case, it was an internal IPv4 10/8 route which was withdrawn, along with a
few hundred other routes. The withdrawal was configured on a DC exit router,
in a Clos network with leaf, spine, and superspine. On the spine layer, I
observed that BGP withdrawals, although being received, were not processed
by the control plane.

Further investigation and working with the TAC of the vendor revealed that
on that particular platform, the BGP process would stop processing withdrawals
in a very nasty race condition that was very difficult to reproduce.

This was the first (and so far only) time in my 20+ years of working with
BGP that I've observed such a weird bug. Since I operated the entire
network, it was fairly easy to find the culprit. The why took some more
time.

If I were in your shoes, I'd ping Telia's NOC to see what's going on. I
would not be surprised if they'd be hitting a similar issue.

Thanks,

Sabri
Re: Telia Not Withdrawing v6 Routes
See my latest response from this morning. Telia's "Head of Network Engineering & Architecture" confirmed on Twitter this
was due to a (now-worked-around) bug in JunOS.

https://twitter.com/gustawsson/status/1328298914785730561

Matt

On 11/16/20 2:13 PM, Sabri Berisha wrote:
> ----- On Nov 15, 2020, at 5:58 PM, Matt Corallo nanog@as397444.net wrote:
>
>> Has anyone else experienced issues where Telia won't withdraw (though will
>> happily accept an overriding) prefixes for the past week, at least?
>
> I have seen issues like this in a network that I operated. In that particular
> case, it was an internal IPv4 10/8 route which was withdrawn, along with a
> few hundred other routes. The withdrawal was configured on a DC exit router,
> in a Clos network with leaf, spine, and superspine. On the spine layer, I
> observed that BGP withdrawals, although being received, were not processed
> by the control plane.
>
> Further investigation and working with the TAC of the vendor revealed that
> on that particular platform, the BGP process would stop processing withdrawals
> in a very nasty race condition that was very difficult to reproduce.
>
> This was the first (and so far only) time in my 20+ years of working with
> BGP that I've observed such a weird bug. Since I operated the entire
> network, it was fairly easy to find the culprit. The why took some more
> time.
>
> If I were in your shoes, I'd ping Telia's NOC to see what's going on. I
> would not be surprised if they'd be hitting a similar issue.
>
> Thanks,
>
> Sabri
>
Re: Telia Not Withdrawing v6 Routes
----- On Nov 16, 2020, at 11:45 AM, Matt Corallo nanog@as397444.net wrote:

Hi,

> See my latest response from this morning. Telia's "Head of Network Engineering &
> Architecture" confirmed on Twitter this
> was due to a (now-worked-around) bug in JunOS.
>
> https://twitter.com/gustawsson/status/1328298914785730561

Interesting. A long time ago, in a galaxy far far away, where I was a JTAC engineer,
policy was that once a PR was hit in the field, it would be marked public.

Also, in the case that I described it wasn't a Junos device. Makes me wonder how bugs
like that get introduced. One would expect that after 20+ years of writing BGP code,
handling a withdrawal would be easy-peasy.

Thanks,

Sabri
Re: Telia Not Withdrawing v6 Routes
On Mon, 16 Nov 2020 17:36:58 -0800, Sabri Berisha said:

> Also, in the case that I described it wasn't a Junos device. Makes me wonder how bugs
> like that get introduced. One would expect that after 20+ years of writing BGP code,
> handling a withdrawal would be easy-peasy.

Handling a withdrawal is easy.

Handling one correctly without race conditions when you're seeing withdrawals
and additions from multiple BGP sessions concurrently, while also maintaining
RIB and FIB consistency and continuing to forward customer packets, is a little bit harder.
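
To illustrate the "easy" half of that statement, here is a deliberately single-threaded toy (a sketch under simplifying assumptions, not how any real BGP implementation is structured): per-peer routes are stored per prefix, and the forwarding entry is re-selected on every announce or withdraw. The `Rib` class, its local-pref-only best-path rule, and the peer names are all hypothetical. Everything Valdis lists as hard (concurrent sessions, races, keeping RIB and FIB in step while traffic flows) is exactly what this sketch leaves out.

```python
class Rib:
    """Toy RIB: per-peer routes in, one best route per prefix out."""

    def __init__(self):
        self.adj_in = {}  # prefix -> {peer: local_pref}
        self.fib = {}     # prefix -> peer currently used for forwarding

    def _reselect(self, prefix):
        routes = self.adj_in.get(prefix)
        if routes:
            # Highest local_pref wins; peer name breaks ties deterministically.
            self.fib[prefix] = max(routes, key=lambda peer: (routes[peer], peer))
        else:
            # Last route gone: the forwarding entry must go too.
            self.adj_in.pop(prefix, None)
            self.fib.pop(prefix, None)

    def announce(self, peer, prefix, local_pref=100):
        self.adj_in.setdefault(prefix, {})[peer] = local_pref
        self._reselect(prefix)

    def withdraw(self, peer, prefix):
        self.adj_in.get(prefix, {}).pop(peer, None)
        self._reselect(prefix)


rib = Rib()
rib.announce("peer-a", "2001:db8::/48", local_pref=100)
rib.announce("peer-b", "2001:db8::/48", local_pref=200)
rib.withdraw("peer-b", "2001:db8::/48")   # falls back to peer-a
rib.withdraw("peer-a", "2001:db8::/48")   # prefix disappears entirely
print(rib.fib)
```

A "stuck" route corresponds to the case where the `withdraw` call is received but its effect never reaches `fib`.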
Re: Telia Not Withdrawing v6 Routes
Surely they can just put them in an array.

;)


On Mon, Nov 16, 2020, 21:54 Valdis Klētnieks <valdis.kletnieks@vt.edu> wrote:

> On Mon, 16 Nov 2020 17:36:58 -0800, Sabri Berisha said:
>
> > Also, in the case that I described it wasn't a Junos device. Makes me wonder how bugs
> > like that get introduced. One would expect that after 20+ years of writing BGP code,
> > handling a withdrawal would be easy-peasy.
>
> Handling a withdrawal is easy.
>
> Handling one correctly without race conditions when you're seeing withdrawals
> and additions from multiple BGP sessions concurrently, while also maintaining
> RIB and FIB consistency and continuing to forward customer packets, is a little bit harder.
Re: Telia Not Withdrawing v6 Routes
On 17.11.2020 around 02:36 Sabri Berisha wrote:
> Interesting. A long time ago, in a galaxy far far away, where I was a
> JTAC engineer, policy was that once a PR was hit in the field, it
> would be marked public.
>
> Also, in the case that I described it wasn't a Junos device. Makes me
> wonder how bugs like that get introduced. One would expect that after
> 20+ years of writing BGP code, handling a withdrawal would be
> easy-peasy.

New code, new features, new problems. E.g. public PR1323306 describes a
BGP stuck situation. (And the fixed code should also address a hidden
PR, which causes down/stale sessions, leading to stuck routes
even without a both-side GRES event.) All very, very special cases ...
but some of us will find / get hit by them (unfortunately).

Markus
Re: Telia Not Withdrawing v6 Routes
On Tue, 17 Nov 2020 at 03:40, Sabri Berisha <sabri@cluecentral.net> wrote:

Hey Sabri,

> Also, in the case that I described it wasn't a Junos device. Makes me wonder how bugs
> like that get introduced. One would expect that after 20+ years of writing BGP code,
> handling a withdrawal would be easy-peasy.

I don't think this is related to skill, or that there was some hard
programming problem that DE couldn't solve. These are honest mistakes.
In my tenure I've not seen the frequency of these bugs change
at all; NOS bugs are as common now as they were in the 90s.

I put most of the blame on the market, we've modelled commercial
router market so that poor quality NOS is good for business and good
quality NOS is bad for business, I don't think this is in anyone's
formal business plan or that companies even realise they are not even
trying to make good NOS. I think it's emergent behaviour due to the
market and people follow that market demand unknowingly.
If we suddenly had one commercial NOS which is 100% bug free, many of
their customers would stop buying support, would rely on spare HW and
Internet forums for configuration help. Lot of us only need contracts
to deal with novel bugs all of us find on a regular basis, so good NOS
would immediately reduce revenue. For some reason Windows, macOS or
Linux almost never have novel bugs that the end user finds and when
those are found, it's big news. While we don't go a month without
hitting a novel bug in one of our NOS, and no one cares about it, it's
business as usual.

I also put a lot of blame on C. It was a terrific language when
compiling had to be fast: basically a macro assembler. Now the utility
of being 'close to HW' is gone, as the CPU does so much that the C compiler has
no control over; it's not really even executing the code
as written anymore. MSFT estimated that around 70% of their security vulnerabilities are related to
memory safety. We could accomplish significant improvements in
software quality if we'd ditch C and allow the computer to do more
formal correctness checks at compile time and design languages which
lend towards this.


We constantly misattribute problems (like in this post) to config or
HW, while the most common reasons for outages are pilot error and SW
defects, and very little engineering time is spent on those. And often
the time spent improving the first two increases the risk of the latter
two, reducing mean availability over time.


--
++ytti
Re: Telia Not Withdrawing v6 Routes
On 11/17/20 08:54, Saku Ytti wrote:

> I put most of the blame on the market, we've modelled commercial
> router market so that poor quality NOS is good for business and good
> quality NOS is bad for business, I don't think this is in anyone's
> formal business plan or that companies even realise they are not even
> trying to make good NOS. I think it's emergent behaviour due to the
> market and people follow that market demand unknowingly.
> If we suddenly had one commercial NOS which is 100% bug free, many of
> their customers would stop buying support, would rely on spare HW and
> Internet forums for configuration help.

Not to mention that many of us would not need to be around to babysit
all this dodgy software.

Definitely bad for business :-).

Mark.
RE: Telia Not Withdrawing v6 Routes
> On Behalf Of Mark Tinka
> Sent: Tuesday, November 17, 2020 4:32 PM
>
> On 11/17/20 08:54, Saku Ytti wrote:
>
> > I put most of the blame on the market, we've modelled commercial
> > router market so that poor quality NOS is good for business and good
> > quality NOS is bad for business, I don't think this is in anyone's
> > formal business plan or that companies even realise they are not even
> > trying to make good NOS. I think it's emergent behaviour due to the
> > market and people follow that market demand unknowingly.
> > If we suddenly had one commercial NOS which is 100% bug free, many of
> > their customers would stop buying support, would rely on spare HW and
> > Internet forums for configuration help.
>
> Not to mention that many of us would not need to be around to babysit all this
> dodgy software.
>
> Definitely bad for business :-).
>
Being obsoleted already by "self-driving networks", there's no limit to what one can automate...
But then one needs someone to babysit all the automation systems.

adam
RE: Telia Not Withdrawing v6 Routes
> Saku Ytti
> Sent: Tuesday, November 17, 2020 6:55 AM
>
> On Tue, 17 Nov 2020 at 03:40, Sabri Berisha <sabri@cluecentral.net> wrote:
>
> Hey Sabri,
>
> > Also, in the case that I described it wasn't a Junos device. Makes me
> > wonder how bugs like that get introduced. One would expect that after
> > 20+ years of writing BGP code, handling a withdrawal would be easy-peasy.
>
> I don't think this is related to skill, that there was some hard programming
> problem that DE couldn't solve. These are honest mistakes.
> I've not experienced in my tenure the frequency of these bugs change at all,
> NOS bugs are as common now as they were in the 90s.
>
> I put most of the blame on the market, we've modelled commercial router
> market so that poor quality NOS is good for business and good quality NOS is
> bad for business, I don't think this is in anyone's formal business plan or that
> companies even realise they are not even trying to make good NOS. I think it's
> emergent behaviour due to the market and people follow that market demand
> unknowingly.
> If we suddenly had one commercial NOS which is 100% bug free, many of their
> customers would stop buying support, would rely on spare HW and Internet
> forums for configuration help. Lot of us only need contracts to deal with novel
> bugs all of us find on a regular basis, so good NOS would immediately reduce
> revenue. For some reason Windows, macOS or Linux almost never have novel
> bugs that the end user finds and when those are found, it's big news. While we
> don't go a month without hitting a novel bug in one of our NOS, and no one
> cares about it, it's business as usual.
>
> I also put a lot of blame on C, it was a terrific language when compiling had to
> be fast. Basically macro assembler. Now the utility of being 'close to HW' is
> gone, as the CPU does so much C compiler has no control over, it's not really
> even executing the same code as-written anymore. MSFT estimated >70% of
> their bugs are related to memory safety. We could accomplish significant
> improvements in software quality if we'd ditch C and allow the computer to do
> more formal correctness checks at compile time and design languages which
> lend towards this.
>
>
> We constantly misattribute problems (like in this post) to config or HW, while
> the most common reasons for outages are pilot error and SW defects, and very little
> engineering time is spent on those. And often the time spent improving the first
> two increases the risk of the latter two, reducing mean availability over time.
>
I agree with everything but the last statement.
From my experience, most of the SPs spend a considerable time testing for SW defects on features (and combinations of features) that will be used and at scale intended, that's how you identify most of the bugs. What you're left with afterwards are special packets of death or some slow memory leaks (basically the more exotic stuff).

adam
Re: Telia Not Withdrawing v6 Routes
>
> I also put a lot of blame on C, it was a terrific language when
> compiling had to be fast. Basically macro assembler. Now the utility
> of being 'close to HW' is gone, as the CPU does so much C compiler has
> no control over, it's not really even executing the same code
> as-written anymore. MSFT estimated >70% of their bugs are related to
> memory safety. We could accomplish significant improvements in
> software quality if we'd ditch C and allow the computer to do more
> formal correctness checks at compile time and design languages which
> lend towards this.
>

Agree 1000%. I think this is greatly compounded by current generations of
programmers who come out of school without having had much experience with
low level memory management, having mostly worked in more modern languages
that handle such things in a much better way. Moving from college Python to
mature C code with a hellscape of pointers must be a pretty jarring
transition. :)

Re: Telia Not Withdrawing v6 Routes
On 11/18/20 14:58, adamv0025@netconsultings.com wrote:

> From my experience, most of the SPs spend a considerable time testing for SW defects on features (and combinations of features) that will be used and at scale intended,

I'm not so sure about that, actually.

I'd say there are some ISPs that spend some (or a considerable) amount
of time testing for software defects.

My anecdotal experience is that most ISPs have neither the time, tools
nor resources to do significant testing of software. More like, "is the
version anything after R1, has it been around long enough, has it been
recommended by TAC, are the -nsp lists raving on about it, is it a
maintenance release, is the caveat list too long, does my vendor SE
approve", type-thing.


> that's how you identify most of the bugs. What you're left with afterwards are special packets of death or some slow memory leaks (basically the more exotic stuff).

Which the majority of ISPs likely will never test for.

Mark.