Famous operational issues
Friends,

I'd like to start a thread about the most famous and widespread Internet
operational issues, outages or implementation incompatibilities you
have seen.

Which examples would make up your top three?

To get things started, I'd suggest the AS 7007 event is perhaps the
most notorious and likely to top many lists including mine. So if
that is one for you I'm asking for just two more.

I'm particularly interested in this as the first step in developing a
future NANOG session. I'd be particularly interested in any issues
that also identify key individuals that might still be around and
interested in participating in a retrospective. I already have someone
willing to talk about AS 7007, and it shouldn't be hard to guess who.

Thanks in advance for your suggestions,

John
Re: Famous operational issues [ In reply to ]
On Tue, Feb 16, 2021 at 01:37:35PM -0600, John Kristoff wrote:
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?

This was a fantastic outage, one could really feel the tremors into the
far corners of the BGP default-free zone:

https://labs.ripe.net/Members/erik/ripe-ncc-and-duke-university-bgp-experiment/

The experiment triggered a bug in some Cisco router models: affected
Ciscos would corrupt this specific BGP announcement ** ON OUTBOUND **.
Any peers of such Ciscos receiving this BGP update, would (according to
then current RFCs) consider the BGP UPDATE corrupted, and would
subsequently tear down the BGP sessions with the Ciscos. Because the
corruption was not detected by the Ciscos themselves, whenever the
sessions would come back online again they'd reannounce the corrupted
update, causing a session tear down. Bounce ... Bounce ... Bounce ... at
global scale in both IBGP and EBGP! :-)

Luckily the industry took these and many other lessons to heart: in
2015 the IETF published RFC 7606 ("Revised Error Handling for BGP UPDATE
Messages"), which specifies far more robust behaviour for BGP speakers.
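
To make the difference concrete, here is a toy sketch (not a real BGP
implementation) of the two behaviours described above: resetting the
session on a malformed UPDATE versus the "treat-as-withdraw" approach
that RFC 7606 introduces for many attribute errors. The class, the
function and the prefixes are hypothetical and only illustrate the
bounce loop.

```python
from dataclasses import dataclass, field


@dataclass
class ToySession:
    """Toy model of one BGP session; not a real BGP implementation."""
    tolerant: bool                      # True = treat-as-withdraw style handling
    up: bool = True
    resets: int = 0
    routes: set = field(default_factory=set)

    def receive(self, prefix, malformed):
        if not malformed:
            self.routes.add(prefix)
        elif self.tolerant:
            # Revised handling: withdraw only the affected route, keep the session.
            self.routes.discard(prefix)
        else:
            # Older handling: treat the error as fatal and reset the session,
            # losing every route learned over it.
            self.up = False
            self.resets += 1
            self.routes.clear()


def replay(session, updates, rounds=3):
    """Re-establish the session `rounds` times; the peer resends the same updates."""
    for _ in range(rounds):
        session.up = True
        for prefix, malformed in updates:
            if not session.up:
                break
            session.receive(prefix, malformed)
    return session


# Hypothetical updates using documentation prefixes; the middle one is corrupted.
UPDATES = [
    ("192.0.2.0/24", False),
    ("198.51.100.0/24", True),
    ("203.0.113.0/24", False),
]

strict = replay(ToySession(tolerant=False), UPDATES)
lenient = replay(ToySession(tolerant=True), UPDATES)
print(strict.up, strict.resets, sorted(strict.routes))
print(lenient.up, lenient.resets, sorted(lenient.routes))
```

In this sketch the strict session resets on every round and never keeps
a route, while the lenient one stays up and loses only the corrupted
prefix.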

Kind regards,

Job
Re: Famous operational issues [ In reply to ]
On Tue, 16 Feb 2021, John Kristoff wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?

https://blogs.oracle.com/internetintelligence/longer-is-not-always-better

--
Mikael Abrahamsson email: swmike@swm.pp.se
Re: Famous operational issues [ In reply to ]
Hi,

I don't want to classify and rate it, but would name 9/11.

You can read about the impacts in the list archives, and there is also a
presentation from NANOG 23 online.

Regards
Jörg

On 16 Feb 2021, at 20:37, John Kristoff wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread
> Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
Re: Famous operational issues [ In reply to ]
actually, the 129/8 incident was as damaging as 7007, but folk tend not
to remember it; maybe because it was a bit embarrassing

and the baltimore tunnel is a gift that gave a few times

and the quake/mudslides off taiwan

the tohoku quake was also fun, in some sense of the word

but the list of really damaging wet glass cuts is long
Re: Famous operational issues [ In reply to ]
> actually, the 129/8 incident

a friend pointed out that it was the 128/9 incident

> but folk tend not to remember it

qed, eh? :)
Re: Famous operational issues [ In reply to ]
https://en.wikipedia.org/wiki/SQL_Slammer was interesting in that it was an
application-layer issue that affected the network layer.

Damian

Re: Famous operational issues [ In reply to ]
Since you said operational issues, instead of just outage...

How about MCI WorldCom's 10-day operational disaster in 1999?


http://www.cnn.com/TECH/computing/9908/23/network.nono.idg/
How not to handle a network outage

[...]
MCI WorldCom issued an alert to its sales force, which was given the
option to deliver a notice to customers by e-mail, hand delivery or
telephone – or not at all. After a deafening silence from company
executives on the 10-day network outage, MCI WorldCom CEO Bernie Ebbers
finally took the podium to discuss the situation. How did he explain the
failure, and reassure customers that the network would not suffer such a
failure in the future? He didn't. Instead, he blamed Lucent.
[...]
Re: Famous operational issues [ In reply to ]
There are all the hilarious leaks and blocks.

Pakistan blocks YouTube and the announcement leaks internet-wide.
Turk Telekom (AS9121 IIRC) leaks a full table out to one of their providers.

So many routing level incidents they're probably not even interesting any
more, I suppose.
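
For anyone wondering why a leaked block can spread so far: routers
forward on the longest matching prefix, so a more-specific announcement
wins over the legitimate covering route wherever it is accepted. A
minimal sketch of that selection rule follows; the table entries and
origin labels are hypothetical, use documentation address space, and
ignore everything else BGP considers.

```python
import ipaddress


def best_route(table, destination):
    """Return the (prefix, origin) entry with the longest matching prefix."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, origin) for net, origin in table if addr in net]
    if not matches:
        return None
    # Longest prefix match: the most specific covering prefix wins.
    return max(matches, key=lambda entry: entry[0].prefixlen)


# Hypothetical routing table: a legitimate /24 and a leaked, more specific /25.
TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "legitimate-origin"),
    (ipaddress.ip_network("192.0.2.0/25"), "leaked-more-specific"),
]

print(best_route(TABLE, "192.0.2.10"))    # covered by both prefixes
print(best_route(TABLE, "192.0.2.200"))   # covered only by the /24
```

Traffic for addresses inside the /25 follows the leaked route even
though the legitimate /24 is still in the table.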

The huge power outages in the US northeast in 2003 (
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.183.998&rep=rep1&type=pdf)
were pretty decent.



Re: Famous operational issues [ In reply to ]
Oh well, MCI in 1999 was all about…
https://www.youtube.com/watch?v=7iM5nFNUG4U

On 16 Feb 2021, at 22:28, Sean Donelan wrote:

> Since you said operational issues, instead of just outage...
>
> How about MCI Worldcom's 10-day operational disaster in 1999.
>
>
> http://www.cnn.com/TECH/computing/9908/23/network.nono.idg/
> How not to handle a network outage
>
> [...]
> MCI WorldCom issued an alert to its sales force, which was given the
> option to deliver a notice to customers by e-mail, hand delivery or
> telephone – or not at all. After a deafening silence from company
> executives on the 10-day network outage, MCI WorldCom CEO Bernie
> Ebbers finally took the podium to discuss the situation. How did he
> explain the failure, and reassure customers that the network would not
> suffer such a failure in the future? He didn't. Instead, he blamed
> Lucent.
> [...]
Re: Famous operational issues [ In reply to ]
Would this also extend to intentional actions that may have had unintended
consequences, such as provider A intentionally de-peering provider B, or
the monopoly telco for $country cutting itself off from the rest of the
global Internet for various reasons (technical, political, or otherwise)?

That said, I'd still have to stick with AS7007, the Baltimore tunnel fire,
and 9/11 as the most prominent examples of widespread issues/outages and
how those issues were addressed.

Honorable mention: $vendor BGP bugs, either due to $vendor ignoring the
relevant RFCs, implementing them incorrectly, or an outage exposed a design
flaw that the RFCs didn't catch. Too many of those to list here :)

jms

Re: Famous operational issues [ In reply to ]
I was thinking about how we need a war stories NANOG track. My favorite was being on call when the router was stolen.

Sent from my TI-99/4a

Re: Famous operational issues [ In reply to ]
----- On Feb 16, 2021, at 2:08 PM, Jared Mauch jared@puck.nether.net wrote:

Hi,

> I was thinking about how we need a war stories nanog track. My favorite was
> being on call when the router was stolen.

Wait... what? I would love to listen to that call between you and your manager.

But, here is one for you then. I was once called to a POP where one of our main
routers was down. Due to political reasons, my access had been revoked. My
manager told me to do whatever I needed to do to fix the problem, he would cover
my behind. I did, and I "gently" removed the door. My manager kept his word.

Another interesting one: entering a pop to find it flooded. Luckily there were
raised floors with only fiber underneath the floor panels. The NOC ignored the
warnings because "it was impossible for water to enter the building as it was
not raining". Yeah, but water pipes do burst from time to time.

But my favorite was pressing an undocumented combination of keys on a fire
alarm system which set off the Inergen protection without warning, immediately.
The noise and pressure of all that air entering the datacenter space with me
still in it is something I will never forget. Similar to the response of my
manager who, instead of asking me if I was ok, decided to try and light a piece
of paper. "Oh wow, it does work, I can't set anything on fire".

All of this was, obviously, in the late 1990s and early 2000s. These days,
things are -slightly- more professional.

Thanks,

Sabri
Re: Famous operational issues [ In reply to ]
On Tue, 16 Feb 2021, Sabri Berisha wrote:

> ----- On Feb 16, 2021, at 2:08 PM, Jared Mauch jared@puck.nether.net wrote:
>
> Hi,
>
>> I was thinking about how we need a war stories nanog track. My favorite was
>> being on call when the router was stolen.
>
> Wait... what? I would love to listen to that call between you and your manager.
>
> But, here is one for you then. I was once called to a POP where one of our main
> routers was down. Due to political reasons, my access had been revoked. My
> manager told me to do whatever I needed to do to fix the problem, he would cover
> my behind. I did, and I "gently" removed the door. My manager held word.

This reminds me of one of the Sprint CO's we were colo'd in. Access to
the CLEC colo area was via a back door through the Men's room! One
weekend, I had to make the drive to that site to deal with an access
server issue, and I found they'd locked the back door to the Men's room
from the colo floor side, so no access. Using supplies I found inside the
CO, I managed to open the locked door and get to our gear. That route, being
our only access route, was probably some kind of violation. Not all of our
techs were guys.

While we never had a router stolen, we did have a flash card stolen from
one of our routers in a WCOM colo facility (most customers in open relay
racks). It was right after they'd upgraded the doors to the colo area
from simplex locks to card access. I was pissed for quite some time that
WCOM knew who was in there (due to the card access system), but refused to
tell us. I figured it was probably one of their own people.

----------------------------------------------------------------------
Jon Lewis, MCP :) | I route
StackPath, Sr. Neteng | therefore you are
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
Re: Famous operational issues [ In reply to ]
Biggest internet operational SUCCESS

1. Secure Shell (SSH) replaced TELNET. Nearly eliminated an entire class
of security problems on the Internet. But then HTTP took over everything,
so a good news/bad news.

2. Internet worms massively reduced by changed default configurations
and default firewalls (Windows XP proved defaults could be changed). Still
need to work on DDOS amplification.

3. Head of Line blocking in IX switches (although I miss Stephen Stuart
saying "I'm Sorry" at every NANOG for a decade). Was a huge problem, which
is a non-problem now.

4. Classless Inter-Domain Routing and BGP4 changed how Internet routing
worked across the entire backbone, and it worked! Vince Fuller et al
rebuilt the aircraft in flight, without crashing.

5. Y2K was a huge success because a lot of people fixed things ahead of time,
and almost nothing crashed (other than the National Security Agency's
internal systems :-). I'll be retired before Y2038, so that's someone
else's problem.
Re: [EXTERNAL] Re: Famous operational issues [ In reply to ]
There was the outage in 2014 when we got to 512K routes. http://www.bgpmon.net/what-caused-todays-internet-hiccup/


Re: Famous operational issues [ In reply to ]
On Tue, Feb 16, 2021 at 21:03, Job Snijders via NANOG
<nanog@nanog.org> wrote:
>
> https://labs.ripe.net/Members/erik/ripe-ncc-and-duke-university-bgp-experiment/
>
> The experiment triggered a bug in some Cisco router models: affected
> Ciscos would corrupt this specific BGP announcement ** ON OUTBOUND **.
> Any peers of such Ciscos receiving this BGP update, would (according to
> then current RFCs) consider the BGP UPDATE corrupted, and would
> subsequently tear down the BGP sessions with the Ciscos. Because the
> corruption was not detected by the Ciscos themselves, whenever the
> sessions would come back online again they'd reannounce the corrupted
> update, causing a session tear down. Bounce ... Bounce ... Bounce ... at
> global scale in both IBGP and EBGP! :-)

In a similar fashion, a network I know had a massive outage when a
failing linecard corrupted IS-IS LSPs, triggering a flood of purges
and taking down the whole backbone.

This was pre-RFC 6232, so you can guess that resolving the issue was a real PITA.

This kind of outage fuels my netops nightmares.
Re: Famous operational issues [ In reply to ]
On 2/16/2021 9:37 AM, John Kristoff wrote:

> I'd suggest the AS 7007 event is perhaps the most notorious and
> likely to top many lists including mine.
> --------------------------------------------------------


AS7007 is how I found NANOG. We (Digital Island; first job out
of college) were in 10-20 countries around the planet at the time.
All of them went down while we were in Cisco training. I kept
interrupting the class and telling my manager "everything's down!
We need to stop the training and get on it!" We didn't because I
was new and no one believed that much could go down all at once.
They assumed it was a monitoring glitch. So, the training
continued for a while until very senior engineers got involved.
One of the senior guys said something to the effect of "yeah, it's
all over NANOG." I said what is NANOG? I signed up for the list
and many of you have had to listen to me ever since... ;)

scott
Re: Famous operational issues [ In reply to ]
jlewis> This reminds me of one of the Sprint CO's we were colo'd in.

Ah, Sprint. Nothing like using your railroad to run phone lines...
Our routers in San Jose colo were black from the soot of the trains.

Fondly remember a major Sprint outage in the early 90s. All our data
circuits in the southeast went down at once and there were major voice
outages in the entire southeast.

Turns out a storm caused a mudslide which in turn derailed a train
carrying toxic waste, resulting in a wave of 6-10' of toxic mud taking
out the Sprint voice POP for the whole southeast, because it was
conveniently located right on said railroad tracks.

We were a big enough customer that PLSC in Atlanta gave us the real
story when we asked for an ETA on repair. They couldn't give us one
immediately until the HAZMAT crew let them in. Turned out to be a total
loss of all gear.

They yanked every tech east of the Mississippi and a 7ESS was FedEx
overnighted (stolen from some customer in the middle east?) and they had
to rebuild everything.

Was down less than 10 days. Good times.
Re: Famous operational issues [ In reply to ]
On Tue Feb 16, 2021 at 09:33:20PM +0100, Jörg Kost wrote:
> I don't want to classify and rate it, but would name 9/11.
>
> You can read about the impacts in the list archives, and there is also a
> presentation from NANOG 23 online.

For an operational perspective, I was part of the team trying to keep the
BBC website up and running through 9/11...

http://www.slimey.org/bbc_ticket_10083.txt

Simon
Re: Famous operational issues [ In reply to ]
If we're just talking about outages historically, I recall the 1996 AOL
email debacle, not really anything to do with network mishaps but more so
DNS configuration.

As well, I believe the Northeast blackout of 2003 was a great DR test that no
one was expecting.

Of course we also have the big non-events too such as Y2K....

Regards
-Joe B.


Re: Famous operational issues [ In reply to ]
> On 17 Feb 2021, at 09:51, Sean Donelan <sean@donelan.com> wrote:
>
>
> Biggest internet operational SUCCESS
>
> 1. Secure Shell (SSH) replaced TELNET. Nearly eliminated an entire class of security problems on the Internet. But then HTTP took over everything, so a good news/bad news.
>
> 2. Internet worms massively reduced by changed default configurations and default firewalls (Windows XP proved defaults could be changed). Still need to work on DDOS amplification.
>
> 3. Head of Line blocking in IX switches (although I miss Stephen Stuart saying "I'm Sorry" at every NANOG for a decade). Was a huge problem, which is a non-problem now.
>
> 4. Classless Inter-Domain Routing and BGP4 changed how Internet routing worked across the entire backbone, and it worked! Vince Fuller et al rebuilt the aircraft in flight, without crashing.
>
> 5. Y2K was a huge success because a lot of people fixed things ahead of time, and almost nothing crashed (other than the National Security Agency's internal systems :-). I'll be retired before Y2038, so that's someone else's problem.

Let's hope you aren't depending on a piece of medical equipment with a Y2038 issue to keep you alive.

Y2038 is everybody's problem!
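
For anyone who hasn't looked at the arithmetic: the problem is a signed
32-bit count of seconds since the Unix epoch running out of range. The
sketch below only illustrates that wraparound; the helper name is made
up, and real systems fail in whatever way their own time handling
dictates.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1


def to_int32(value):
    """Wrap an integer the way a signed 32-bit counter would."""
    return (value + 2**31) % 2**32 - 2**31


# Last second representable in a signed 32-bit seconds counter.
last_good = EPOCH + timedelta(seconds=INT32_MAX)
# One second later the counter wraps to the most negative value.
wrapped = EPOCH + timedelta(seconds=to_int32(INT32_MAX + 1))

print(last_good.isoformat())   # 2038-01-19T03:14:07+00:00
print(wrapped.isoformat())     # 1901-12-13T20:45:52+00:00
```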

Mark
--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
Re: Famous operational issues [ In reply to ]
That was the one with the most severe impact for my company. Seven Frame
Circuits (UUNET) and we all saw what an update can do.

On 2/16/21 3:28 PM, Sean Donelan wrote:
> Since you said operational issues, instead of just outage...
>
> How about MCI Worldcom's 10-day operational disaster in 1999.
>
>
> http://www.cnn.com/TECH/computing/9908/23/network.nono.idg/
> How not to handle a network outage
>
> [...]
> MCI WorldCom issued an alert to its sales force, which was given the
> option to deliver a notice to customers by e-mail, hand delivery or
> telephone – or not at all. After a deafening silence from company
> executives on the 10-day network outage, MCI WorldCom CEO Bernie
> Ebbers finally took the podium to discuss the situation. How did he
> explain the failure, and reassure customers that the network would not
> suffer such a failure in the future? He didn't. Instead, he blamed
> Lucent.
> [...]
Re: Famous operational issues [ In reply to ]
On Tue, Feb 16, 2021 at 01:37:35PM -0600, John Kristoff wrote:
> Which examples would make up your top three?

Morris worm, November 1988. Much confusion and eventually the realization
that John Brunner had called it from 13 years out ("The Shockwave Rider", 1975).
But sloppy coding meant it could be defeated with one line of /bin/sh.

---rsk
Re: Famous operational issues [ In reply to ]
> On Tue, 16 Feb 2021, John Kristoff wrote:
>
> > Friends,
> >
> > I'd like to start a thread about the most famous and widespread Internet
> > operational issues, outages or implementation incompatibilities you
> > have seen.
> >

When Boston University joined the internet proper ca 1984 I was in
charge of that group.

We accidentally* submitted an initial HOSTS.TXT file which included
some internally used one-character host names (A, B, C) and one which
began with a digit (3B, an AT&T 3B5), both illegal for HOSTS.TXT back
then.

This put the BSD Unix program which converted from HOSTS.TXT to Unix'
/etc/hosts format into an infinite loop filling /tmp which in those
days crashed Unix and it often couldn't reboot successfully without
manual intervention.
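
In hindsight, a converter that checked names before processing them
would have rejected the bad entries instead of looping. Below is a
rough sketch of such a check. The two rules that matter here (no
single-character names, no leading digit) come from the story above;
the 24-character limit and the allowed character set are assumptions
modeled on RFC 952-style host table rules and may not match exactly
what was in force at the time. The function names and the sample names
other than A and 3B are hypothetical.

```python
import re

# First character a letter, last character a letter or digit, 2 to 24
# characters in total, drawn from letters, digits, "-" and ".".
# (Length limit and character set are assumptions; see the note above.)
_HOST_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]{0,22}[A-Za-z0-9]")


def is_legal_host_name(name):
    """Return True if `name` passes the assumed host table rules."""
    return _HOST_NAME.fullmatch(name) is not None


def split_names(names):
    """Partition names into (legal, rejected) without ever looping on bad input."""
    legal, rejected = [], []
    for name in names:
        (legal if is_legal_host_name(name) else rejected).append(name)
    return legal, rejected


legal, rejected = split_names(["A", "3B", "BU-CS", "BUCS-"])
print(legal)      # names that would be converted
print(rejected)   # names that would be skipped and reported
```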

On many, many hosts across the internet.

I hesitate to guess a number since scale has changed so much but some
of the more heated email claimed it brought down at least half the
internet by some count.

It was worsened by the fact that many hosts pulled and processed a new
HOSTS.TXT file via cron (time-based job scheduler) at midnight so no
one was around to fix and reboot systems.

The thread on the TCP-IP mailing list was: BU JOINS THE INTERNET!

It was a little embarrassing.

Today it probably would have landed me in Gitmo.

* There were two versions, the one we used internally, and the one to
be submitted which removed those host names. The wrong one got
submitted.

--
-Barry Shein

Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com
Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD
The World: Since 1989 | A Public Information Utility | *oo*
