Mailing List Archive

Famous operational issues
Friends,

I'd like to start a thread about the most famous and widespread Internet
operational issues, outages or implementation incompatibilities you
have seen.

Which examples would make up your top three?

To get things started, I'd suggest the AS 7007 event is perhaps the
most notorious and likely to top many lists including mine. So if
that is one for you I'm asking for just two more.

I'm particularly interested in this as the first step in developing a
future NANOG session. I'd be particularly interested in any issues
that also identify key individuals that might still be around and
interested in participating in a retrospective. I already have someone
that is willing to talk about AS 7007, which shouldn't be hard to guess
who.

Thanks in advance for your suggestions,

John
Re: Famous operational issues [ In reply to ]
On Tue, Feb 16, 2021 at 01:37:35PM -0600, John Kristoff wrote:
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?

This was a fantastic outage, one could really feel the tremors into the
far corners of the BGP default-free zone:

https://labs.ripe.net/Members/erik/ripe-ncc-and-duke-university-bgp-experiment/

The experiment triggered a bug in some Cisco router models: affected
Ciscos would corrupt this specific BGP announcement ** ON OUTBOUND **.
Any peers of such Ciscos receiving this BGP update, would (according to
then current RFCs) consider the BGP UPDATE corrupted, and would
subsequently tear down the BGP sessions with the Ciscos. Because the
corruption was not detected by the Ciscos themselves, whenever the
sessions would come back online again they'd reannounce the corrupted
update, causing a session tear down. Bounce ... Bounce ... Bounce ... at
global scale in both IBGP and EBGP! :-)

Luckily the industry took these, and many other lessons to heart: in
2015 the IETF published RFC 7606 ("Revised Error Handling for BGP UPDATE
Messages") which specifices far more robust behaviour for BGP speakers.

Kind regards,

Job
Re: Famous operational issues [ In reply to ]
On Tue, 16 Feb 2021, John Kristoff wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?

https://blogs.oracle.com/internetintelligence/longer-is-not-always-better

--
Mikael Abrahamsson email: swmike@swm.pp.se
Re: Famous operational issues [ In reply to ]
Hi,

I don't want to classify and rate it, but would name 9/11.

You can read about the impacts on the list archives and there is also a
presentation from NANOG '23 online.

Regards
Jörg

On 16 Feb 2021, at 20:37, John Kristoff wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread
> Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
Re: Famous operational issues [ In reply to ]
actually, the 129/8 incident was as damaging as 7007, but folk tend not
to remember it; maybe because it was a bit embarrassing

and the baltimore tunnel is a gift that gave a few times

and the quake/mudslides off taiwan

the tohoku quake was also fun, in some sense of the word

but the list of really damaging wet glass cuts is long
Re: Famous operational issues [ In reply to ]
> actually, the 129/8 incident

a friend pointed out that it was the 128/9 incident

> but folk tend not to remember it

qed, eh? :)
Re: Famous operational issues [ In reply to ]
https://en.wikipedia.org/wiki/SQL_Slammer was interesting in that it was an
application-layer issue that affected the network layer.

Damian

On Tue, Feb 16, 2021 at 11:37 AM John Kristoff <jtk@dataplane.org> wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps the
> most notorious and likely to top many lists including mine. So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session. I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective. I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>
Re: Famous operational issues [ In reply to ]
Since you said operational issues, instead of just outage...

How about MCI Worldcom's 10-day operational disaster in 1999.


http://www.cnn.com/TECH/computing/9908/23/network.nono.idg/
How not to handle a network outage

[...]
MCI WorldCom issued an alert to its sales force, which was given the
option to deliver a notice to customers by e-mail, hand delivery or
telephone – or not at all. After a deafening silence from company
executives on the 10-day network outage, MCI WorldCom CEO Bernie Ebbers
finally took the podium to discuss the situation. How did he explain the
failure, and reassure customers that the network would not suffer such a
failure in the future? He didn't. Instead, he blamed Lucent.
[...]
Re: Famous operational issues [ In reply to ]
There are all the hilarious leaks and blocks.

Pakistan blocks youtube and the announcement leaks internet-wide.
Turk telecom (AS9121 IIRC) leaks a full table out one of their providers.

So many routing level incidents they're probably not even interesting any
more, I suppose.

The huge power outages in the US northeast in 2003 (
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.183.998&rep=rep1&type=pdf)
were pretty decent.



On Tue, Feb 16, 2021 at 4:02 PM Damian Menscher via NANOG <nanog@nanog.org>
wrote:

> https://en.wikipedia.org/wiki/SQL_Slammer was interesting in that it was
> an application-layer issue that affected the network layer.
>
> Damian
>
> On Tue, Feb 16, 2021 at 11:37 AM John Kristoff <jtk@dataplane.org> wrote:
>
>> Friends,
>>
>> I'd like to start a thread about the most famous and widespread Internet
>> operational issues, outages or implementation incompatibilities you
>> have seen.
>>
>> Which examples would make up your top three?
>>
>> To get things started, I'd suggest the AS 7007 event is perhaps the
>> most notorious and likely to top many lists including mine. So if
>> that is one for you I'm asking for just two more.
>>
>> I'm particularly interested in this as the first step in developing a
>> future NANOG session. I'd be particularly interested in any issues
>> that also identify key individuals that might still be around and
>> interested in participating in a retrospective. I already have someone
>> that is willing to talk about AS 7007, which shouldn't be hard to guess
>> who.
>>
>> Thanks in advance for your suggestions,
>>
>> John
>>
>
Re: Famous operational issues [ In reply to ]
Oh well, MCI in 1999 was all about…
https://www.youtube.com/watch?v=7iM5nFNUG4U

On 16 Feb 2021, at 22:28, Sean Donelan wrote:

> Since you said operational issues, instead of just outage...
>
> How about MCI Worldcom's 10-day operational disaster in 1999.
>
>
> http://www.cnn.com/TECH/computing/9908/23/network.nono.idg/
> How not to handle a network outage
>
> [...]
> MCI WorldCom issued an alert to its sales force, which was given the
> option to deliver a notice to customers by e-mail, hand delivery or
> telephone – or not at all. After a deafening silence from company
> executives on the 10-day network outage, MCI WorldCom CEO Bernie
> Ebbers finally took the podium to discuss the situation. How did he
> explain the failure, and reassure customers that the network would not
> suffer such a failure in the future? He didn't. Instead, he blamed
> Lucent.
> [...]
Re: Famous operational issues [ In reply to ]
Would this also extend to intentional actions that may have had unintended
consequences, such as provider A intentionally de-peering provider B, or
the monopoly telco for $country cutting itself off from the rest of the
global Internet for various reasons (technical, political, or otherwise)?

That said, I'd still have to stick with AS7007, the Baltimore tunnel fire,
and 9/11 as the most prominent examples of widespread issues/outages and
how those issues were addressed.

Honorable mention: $vendor BGP bugs, either due to $vendor ignoring the
relevant RFCs, implementing them incorrectly, or an outage exposed a design
flaw that the RFCs didn't catch. Too many of those to list here :)

jms

On Tue, Feb 16, 2021 at 2:37 PM John Kristoff <jtk@dataplane.org> wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps the
> most notorious and likely to top many lists including mine. So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session. I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective. I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>
Re: Famous operational issues [ In reply to ]
I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen.

Sent from my TI-99/4a

> On Feb 16, 2021, at 2:40 PM, John Kristoff <jtk@dataplane.org> wrote:
>
> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps the
> most notorious and likely to top many lists including mine. So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session. I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective. I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
Re: Famous operational issues [ In reply to ]
----- On Feb 16, 2021, at 2:08 PM, Jared Mauch jared@puck.nether.net wrote:

Hi,

> I was thinking about how we need a war stories nanog track. My favorite was
> being on call when the router was stolen.

Wait... what? I would love to listen to that call between you and your manager.

But, here is one for you then. I was once called to a POP where one of our main
routers was down. Due to political reasons, my access had been revoked. My
manager told me to do whatever I needed to do to fix the problem; he would cover
my behind. I did, and I "gently" removed the door. My manager kept his word.

Another interesting one: entering a POP to find it flooded. Luckily there were
raised floors with only fiber underneath the floor panels. The NOC ignored the
warnings because "it was impossible for water to enter the building as it was
not raining". Yeah, but water pipes do burst from time to time.

But my favorite was pressing an undocumented combination of keys on a fire
alarm system which set off the Inergen protection without warning, immediately.
The noise and pressure of all that air entering the datacenter space with me
still in it is something I will never forget. Similar to the response of my
manager who, instead of asking me if I was ok, decided to try and light a piece
of paper. "Oh wow, it does work, I can't set anything on fire".

All of this was, obviously, in the late 1990s and early 2000s. These days,
things are -slightly- more professional.

Thanks,

Sabri
Re: Famous operational issues [ In reply to ]
On Tue, 16 Feb 2021, Sabri Berisha wrote:

> ----- On Feb 16, 2021, at 2:08 PM, Jared Mauch jared@puck.nether.net wrote:
>
> Hi,
>
>> I was thinking about how we need a war stories nanog track. My favorite was
>> being on call when the router was stolen.
>
> Wait... what? I would love to listen to that call between you and your manager.
>
> But, here is one for you then. I was once called to a POP where one of our main
> routers was down. Due to political reasons, my access had been revoked. My
> manager told me to do whatever I needed to do to fix the problem, he would cover
> my behind. I did, and I "gently" removed the door. My manager held word.

This reminds me of one of the Sprint CO's we were colo'd in. Access to
the CLEC colo area was via a back door through the Men's room! One
weekend, I had to make the drive to that site to deal with an access
server issue, and I found they'd locked the back door to the Men's room
from the colo floor side, so no access. Using supplies I found inside the
CO, I managed to open the locked door and get to our gear. That route, being
our only access route, was probably some kind of violation. Not all of our
techs were guys.

While we never had a router stolen, we did have a flash card stolen from
one of our routers in a WCOM colo facility (most customers in open relay
racks). It was right after they'd upgraded the doors to the colo area
from simplex locks to card access. I was pissed for quite some time that
WCOM knew who was in there (due to the card access system), but refused to
tell us. I figured it was probably one of their own people.

----------------------------------------------------------------------
Jon Lewis, MCP :) | I route
StackPath, Sr. Neteng | therefore you are
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
Re: Famous operational issues [ In reply to ]
Biggest internet operational SUCCESS

1. Secure Shell (SSH) replaced TELNET. Nearly eliminated an entire class
of security problems on the Internet. But then HTTP took over everything,
so a good news/bad news.

2. Internet worms massively reduced by changed default configurations
and default firewalls (Windows XP proved defaults could be changed). Still
need to work on DDOS amplification.

3. Head of Line blocking in IX switches (although I miss Stephen Stuart
saying "I'm Sorry" at every NANOG for a decade). Was a huge problem, which
is a non-problem now.

4. Classless Inter-Domain Routing and BGP4 changed how Internet routing
worked across the entire backbone, and it worked! Vince Fuller et al
rebuilt the aircraft in flight, without crashing.

5. Y2K was a huge success because a lot of people fixed things ahead of time,
and almost nothing crashed (other than the National Security Agency's
internal systems :-). I'll be retired before Y2038, so that's someone
else's problem.
Re: [EXTERNAL] Re: Famous operational issues [ In reply to ]
There was the outage in 2014 when we got to 512K routes. http://www.bgpmon.net/what-caused-todays-internet-hiccup/


On 2/16/21, 1:04 PM, "NANOG on behalf of Job Snijders via NANOG" <nanog-bounces+rich.compton=charter.com@nanog.org on behalf of nanog@nanog.org> wrote:


On Tue, Feb 16, 2021 at 01:37:35PM -0600, John Kristoff wrote:
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?

This was a fantastic outage, one could really feel the tremors into the
far corners of the BGP default-free zone:

https://labs.ripe.net/Members/erik/ripe-ncc-and-duke-university-bgp-experiment/

The experiment triggered a bug in some Cisco router models: affected
Ciscos would corrupt this specific BGP announcement ** ON OUTBOUND **.
Any peers of such Ciscos receiving this BGP update, would (according to
then current RFCs) consider the BGP UPDATE corrupted, and would
subsequently tear down the BGP sessions with the Ciscos. Because the
corruption was not detected by the Ciscos themselves, whenever the
sessions would come back online again they'd reannounce the corrupted
update, causing a session tear down. Bounce ... Bounce ... Bounce ... at
global scale in both IBGP and EBGP! :-)

Luckily the industry took these, and many other lessons to heart: in
2015 the IETF published RFC 7606 ("Revised Error Handling for BGP UPDATE
Messages") which specifices far more robust behaviour for BGP speakers.

Kind regards,

Job


Re: Famous operational issues [ In reply to ]
Le mar. 16 févr. 2021 à 21:03, Job Snijders via NANOG
<nanog@nanog.org> a écrit :
>
> https://labs.ripe.net/Members/erik/ripe-ncc-and-duke-university-bgp-experiment/
>
> The experiment triggered a bug in some Cisco router models: affected
> Ciscos would corrupt this specific BGP announcement ** ON OUTBOUND **.
> Any peers of such Ciscos receiving this BGP update, would (according to
> then current RFCs) consider the BGP UPDATE corrupted, and would
> subsequently tear down the BGP sessions with the Ciscos. Because the
> corruption was not detected by the Ciscos themselves, whenever the
> sessions would come back online again they'd reannounce the corrupted
> update, causing a session tear down. Bounce ... Bounce ... Bounce ... at
> global scale in both IBGP and EBGP! :-)

In a similar fashion, a network I know had a massive outage when a
failing linecard corrupted IS-IS LSPs, triggering a flood of purges
and taking down the whole backbone.

This was pre-RFC 6232, so you can guess that resolving the issue was a real PITA.

This kind of outage fuels my netops nightmares.
Re: Famous operational issues [ In reply to ]
On 2/16/2021 9:37 AM, John Kristoff wrote:

> I'd suggest the AS 7007 event is perhaps the most notorious and
> likely to top many lists including mine.
> --------------------------------------------------------


AS7007 is how I found NANOG.  We (Digital Island; first job out
of college) were in 10-20 countries around the planet at the time.
All of them went down while we were in Cisco training.  I kept
interrupting the class and telling my manager "everything's down!
We need to stop the training and get on it!"  We didn't because I
was new and no one believed that much could go down all at once.
They assumed it was a monitoring glitch. So, the training
continued for a while until very senior engineers got involved.
One of the senior guys said something to the effect of "yeah, it's
all over NANOG."  I said what is NANOG?  I signed up for the list
and many of you have had to listen to me ever since... ;)

scott
Re: Famous operational issues [ In reply to ]
jlewis> This reminds me of one of the Sprint CO's we were colo'd in.

Ah, Sprint. Nothing like using your railroad to run phone lines...
Our routers in San Jose colo were black from the soot of the trains.

Fondly remember a major Sprint outage in the early 90s. All our data
circuits in the southeast went down at once and there were major voice
outages in the entire southeast.

Turns out a storm caused a mudslide which in turn derailed a train
carrying toxic waste, resulting in a wave of 6-10' of toxic mud taking
out the Sprint voice POP for the whole southeast, because it was
conveniently located right on said railroad tracks.

We were a big enough customer that PLSC in Atlanta gave us the real
story when we asked for an ETA on repair. They couldn't give us one
immediately until the HAZMAT crew let them in. Turned out to be a total
loss of all gear.

They yanked every tech east of the Mississippi and a 7ESS was FedEx
overnighted (stolen from some customer in the Middle East?) and they had
to rebuild everything.

Was down less than 10 days. Good times.
Re: Famous operational issues [ In reply to ]
On Tue Feb 16, 2021 at 09:33:20PM +0100, Jörg Kost wrote:
> I don't want to classify and rate it, but would name 9/11.
>
> You can read about the impacts on the list archives and there is also a
> presentation from NANOG '23 online.

For an operational perspective, I was part of the team trying to keep the
BBC website up and running through 9/11...

http://www.slimey.org/bbc_ticket_10083.txt

Simon
Re: Famous operational issues [ In reply to ]
If we're just talking about outages historically, I recall the 1996 AOL
email debacle, not really anything to do with network mishaps but more so
DNS configuration.

As well, I believe the Northeast 2003 blackout was a great DR test that no
one was expecting.

Of course we also have the big non-events too such as Y2K....

Regards
-Joe B.


On Tue, Feb 16, 2021 at 1:38 PM John Kristoff <jtk@dataplane.org> wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps the
> most notorious and likely to top many lists including mine. So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session. I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective. I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>
Re: Famous operational issues [ In reply to ]
> On 17 Feb 2021, at 09:51, Sean Donelan <sean@donelan.com> wrote:
>
>
> Biggest internet operational SUCCESS
>
> 1. Secure Shell (SSH) replaced TELNET. Nearly eliminated an entire class of security problems on the Internet. But then HTTP took over everything, so a good news/bad news.
>
> 2. Internet worms massively reduced by changed default configurations and default firewalls (Windows XP proved defaults could be changed). Still need to work on DDOS amplification.
>
> 3. Head of Line blocking in IX switches (although I miss Stephen Stuart saying "I'm Sorry" at every NANOG for a decade). Was a huge problem, which is a non-problem now.
>
> 4. Classless Inter-Domain Routing and BGP4 changed how Internet routing worked across the entire backbone, and it worked! Vince Fuller et al rebuilt the aircraft in flight, without crashing.
>
> 5. Y2K was a huge suggess because a lot of people fixed things ahead time, and almost nothing crashed (other than the National Security Agency's internal systems :-). I'll be retired before Y2038, so that's someone else's problem.

Let's hope you aren't depending on a piece of medical equipment with a Y2038 issue to keep you alive.

Y2038 is everybody's problem!
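
For anyone who has not looked closely at what Y2038 actually is: it is the
point at which a signed 32-bit time_t runs out of seconds. A minimal sketch
of the arithmetic, using only the Python standard library:

    from datetime import datetime, timezone

    # Largest value a signed 32-bit time_t can hold.
    MAX_32BIT_TIME_T = 2**31 - 1

    print(datetime.fromtimestamp(MAX_32BIT_TIME_T, tz=timezone.utc))
    # -> 2038-01-19 03:14:07+00:00. One second later, a 32-bit time_t wraps
    #    negative, i.e. back to a date in December 1901.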

Mark
--
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742 INTERNET: marka@isc.org
Re: Famous operational issues [ In reply to ]
That was the one with the most severe impact for my company. Seven Frame
Relay circuits (UUNET), and we all saw what an update can do.

On 2/16/21 3:28 PM, Sean Donelan wrote:
> Since you said operational issues, instead of just outage...
>
> How about MCI Worldcom's 10-day operational disaster in 1999.
>
>
> http://www.cnn.com/TECH/computing/9908/23/network.nono.idg/
> How not to handle a network outage
>
> [...]
> MCI WorldCom issued an alert to its sales force, which was given the
> option to deliver a notice to customers by e-mail, hand delivery or
> telephone – or not at all. After a deafening silence from company
> executives on the 10-day network outage, MCI WorldCom CEO Bernie
> Ebbers finally took the podium to discuss the situation. How did he
> explain the failure, and reassure customers that the network would not
> suffer such a failure in the future? He didn't. Instead, he blamed
> Lucent.
> [...]
Re: Famous operational issues [ In reply to ]
On Tue, Feb 16, 2021 at 01:37:35PM -0600, John Kristoff wrote:
> Which examples would make up your top three?

Morris worm, November 1988. Much confusion and eventually the realization
that John Brunner had called it from 13 years out ("The Shockwave Rider", 1975).
But sloppy coding meant it could be defeated with one line of /bin/sh.

---rsk
Re: Famous operational issues [ In reply to ]
> On Tue, 16 Feb 2021, John Kristoff wrote:
>
> > Friends,
> >
> > I'd like to start a thread about the most famous and widespread Internet
> > operational issues, outages or implementation incompatibilities you
> > have seen.
> >

When Boston University joined the internet proper ca 1984 I was in
charge of that group.

We accidentally* submitted an initial HOSTS.TXT file which included
some internally used one-character host names (A, B, C) and one which
began with a digit (3B, an AT&T 3B5), both illegal for HOSTS.TXT back
then.

This put the BSD Unix program which converted from HOSTS.TXT to Unix'
/etc/hosts format into an infinite loop, filling /tmp, which in those
days crashed Unix, and it often couldn't reboot successfully without
manual intervention.

On many, many hosts across the internet.

I hesitate to guess a number since scale has changed so much but some
of the more heated email claimed it brought down at least half the
internet by some count.

It was worsened by the fact that many hosts pulled and processed a new
HOSTS.TXT file via cron (time-based job scheduler) at midnight so no
one was around to fix and reboot systems.
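
For the curious, the failure mode is the classic "scanner that never
advances" bug. The sketch below is purely illustrative Python, not the
actual BSD conversion code (whose details I am only guessing at): a loop
that only consumes characters it considers legal in a host name spins
forever on a name it did not expect, such as one beginning with a digit.

    def read_hostname(line, i=0):
        """Read a host name starting at position i; return (name, new_i)."""
        name = ""
        while i < len(line) and not line[i].isspace():
            if line[i].isalpha() or (name and line[i] in "-0123456789"):
                name += line[i]
                i += 1
            # BUG: any other character -- e.g. the leading digit of "3B" --
            # is neither consumed nor rejected, so i never advances and the
            # loop never ends (and in the real incident the looping converter
            # filled /tmp along the way, which crashed the machine).
        return name, i

    # read_hostname("GATEWAY 10.0.0.1") returns ("GATEWAY", 7);
    # read_hostname("3B 10.0.0.2") never returns.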

The thread on the TCP-IP mailing list was: BU JOINS THE INTERNET!

It was a little embarrassing.

Today it probably would have landed me in Gitmo.

* There were two versions, the one we used internally, and the one to
be submitted which removed those host names. The wrong one got
submitted.

--
-Barry Shein

Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com
Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD
The World: Since 1989 | A Public Information Utility | *oo*
Re: Famous operational issues [ In reply to ]
Ahh, war stories. I like the one where I got a wake-up call that our IRC
server was on fire, together with the rest of the DC.


Not that widespread, but we reached Slashdot. :)

November 2002, University of Twente, The Netherlands. Some idiot wanted
to be a hero. He deflated people's tires so he could then help inflate them. One
morning he thought it would be a good idea to start a small fire and
then extinguish it, so he would be the hero that stopped a fire. He
failed and the building burned down. He got caught a few days later when
he tried the same thing in a different building.

Almost all of the IT was in that building, including core network,
uplinks to SURFNet (Dutch Educational Network) and to the 2000 students
living on the campus. Ironically a new DC was already being built, so
that was ready for use a few weeks later.

As we had quite a network for 2002 we hosted for instance
security.debian.org. The students all had 100Mbit in their room, so some
of them also hosted some popular websites. One I can remember was an
image sharing site.

Some students immediately created a backup network: a DHCP server, a DNS
server with a catch-all, a website explaining what was going on, an IRC
server, etc.

A local ISP offered to sponsor 50Mbit for the residents, which was
connected via a microwave relay and a temporary fiber was run through a
ditch to connect two parts of the campus residencies. At the end of the
day all 2000 students had their internet connection back, although all
behind a single 50Mbit link.


Syslog message from the local SURFNet router:

lo0.ar5.enschede1.surf.net 3613: Nov 20 07:20:50.927 UTC:
%ENV_MON-2-TEMP: Hotpoint temp sensor(slot 18) temperature has reached
WARNING level at 61(C)


(Disclaimer: Where I say we, I mean we as University. I wasn't working
for the university, but was part of the students working on the backup
network. There are probably some other people on list with some more
details and I've probably missed some details, but this is the summary.)


On 16-02-2021 23:08, Jared Mauch wrote:
> I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen.
>
> Sent from my TI-99/4a
>
>> On Feb 16, 2021, at 2:40 PM, John Kristoff <jtk@dataplane.org> wrote:
>>
>> Friends,
>>
>> I'd like to start a thread about the most famous and widespread Internet
>> operational issues, outages or implementation incompatibilities you
>> have seen.
>>
>> Which examples would make up your top three?
>>
>> To get things started, I'd suggest the AS 7007 event is perhaps the
>> most notorious and likely to top many lists including mine. So if
>> that is one for you I'm asking for just two more.
>>
>> I'm particularly interested in this as the first step in developing a
>> future NANOG session. I'd be particularly interested in any issues
>> that also identify key individuals that might still be around and
>> interested in participating in a retrospective. I already have someone
>> that is willing to talk about AS 7007, which shouldn't be hard to guess
>> who.
>>
>> Thanks in advance for your suggestions,
>>
>> John
Re: Famous operational issues [ In reply to ]
John Kristoff wrote:
> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
Well... pre-Internet, but the great Northeast fiber cut comes to mind
(backhoe vs. fiber, backhoe won).

Miles Fidelman

--
In theory, there is no difference between theory and practice.
In practice, there is. .... Yogi Berra

Theory is when you know everything but nothing works.
Practice is when everything works but no one knows why.
In our lab, theory and practice are combined:
nothing works and no one knows why. ... unknown
Re: Famous operational issues [ In reply to ]
I remember when the big carriers de-peered with Cogent in the early 2000s. They underestimated the number of websites being hosted by people using Cogent exclusively.


Justin Wilson
j2sw@j2sw.com


https://j2sw.com - All things jsw (AS209109)
https://blog.j2sw.com - Podcast and Blog

> On Feb 17, 2021, at 10:29 AM, Miles Fidelman <mfidelman@meetinghouse.net> wrote:
>
> John Kristoff wrote:
>> Friends,
>>
>> I'd like to start a thread about the most famous and widespread Internet
>> operational issues, outages or implementation incompatibilities you
>> have seen.
>>
> Well... pre-Internet, but the great Northeast fiber cut comes to mind (backhoe vs. fiber, backhoe won).
>
> Miles Fidelman
>
> --
> In theory, there is no difference between theory and practice.
> In practice, there is. .... Yogi Berra
>
> Theory is when you know everything but nothing works.
> Practice is when everything works but no one knows why.
> In our lab, theory and practice are combined:
> nothing works and no one knows why. ... unknown
Re: Famous operational issues [ In reply to ]
Cogentco still doesn't peer with Google and HE over IPv6, I guess.

________________________________
From: NANOG <nanog-bounces+david=xtom.com@nanog.org> on behalf of Justin Wilson (Lists) <lists@mtin.net>
Sent: Thursday, February 18, 2021 00:53
To: Miles Fidelman
Cc: nanog@nanog.org
Subject: Re: Famous operational issues

I remember when the big carriers de-peered with Cogent in the early 2000s. The underestimated the amount of web-sites being hosted by people using cogent exclusively.


Justin Wilson
j2sw@j2sw.com

--
https://j2sw.com - All things jsw (AS209109)
https://blog.j2sw.com - Podcast and Blog

> On Feb 17, 2021, at 10:29 AM, Miles Fidelman <mfidelman@meetinghouse.net> wrote:
>
> John Kristoff wrote:
>> Friends,
>>
>> I'd like to start a thread about the most famous and widespread Internet
>> operational issues, outages or implementation incompatibilities you
>> have seen.
>>
> Well... pre-Internet, but the great Northeast fiber cut comes to mind (backhoe vs. fiber, backhoe won).
>
> Miles Fidelman
>
> --
> In theory, there is no difference between theory and practice.
> In practice, there is. .... Yogi Berra
>
> Theory is when you know everything but nothing works.
> Practice is when everything works but no one knows why.
> In our lab, theory and practice are combined:
> nothing works and no one knows why. ... unknown
Re: Famous operational issues [ In reply to ]
The he.net side is interesting as you can see who their v4 transits are but they suppress their routes via v6, but (last I knew) lacked community support for their customers to do similar route suppression.

I’m not a fan of it, but it makes the commercial discussions much easier each time those networks come by to shop services to me in a personal or professional capacity. “No, I need all the internet”.

- Jared

> On Feb 17, 2021, at 12:07 PM, David Guo via NANOG <nanog@nanog.org> wrote:
>
> Cogentco still did not peer with Google and HE over IPv6 I guess.
>
> From: NANOG <nanog-bounces+david=xtom.com@nanog.org> on behalf of Justin Wilson (Lists) <lists@mtin.net>
> Sent: Thursday, February 18, 2021 00:53
> To: Miles Fidelman
> Cc: nanog@nanog.org
> Subject: Re: Famous operational issues
>
> I remember when the big carriers de-peered with Cogent in the early 2000s. The underestimated the amount of web-sites being hosted by people using cogent exclusively.
>
>
> Justin Wilson
> j2sw@j2sw.com
>
> —
> https://j2sw.com - All things jsw (AS209109)
> https://blog.j2sw.com - Podcast and Blog
>
> > On Feb 17, 2021, at 10:29 AM, Miles Fidelman <mfidelman@meetinghouse.net> wrote:
> >
> > John Kristoff wrote:
> >> Friends,
> >>
> >> I'd like to start a thread about the most famous and widespread Internet
> >> operational issues, outages or implementation incompatibilities you
> >> have seen.
> >>
> > Well... pre-Internet, but the great Northeast fiber cut comes to mind (backhoe vs. fiber, backhoe won).
> >
> > Miles Fidelman
> >
> > --
> > In theory, there is no difference between theory and practice.
> > In practice, there is. .... Yogi Berra
> >
> > Theory is when you know everything but nothing works.
> > Practice is when everything works but no one knows why.
> > In our lab, theory and practice are combined:
> > nothing works and no one knows why. ... unknown
Re: Famous operational issues [ In reply to ]
On Wed, 17 Feb 2021 14:07:54 -0500
John Curran <jcurrran@istaff.org> wrote:

> I have no idea what outages were most memorable for others, but the
> Stanford transfer switch explosion in October 1996 resulted in a much
> of the Internet in the Bay Area simply not being reachable for
> several days.

Thanks John.

This reminds me of two I've not seen anyone mention yet. Both
coincidentally in the Chicago area that I learned before my entry
into netops full time. One was a flood:

<https://en.wikipedia.org/wiki/Chicago_flood>

The other, at the dawn of an earlier era:

<http://telecom-digest.org/telecom-archives/TELECOM_Digest_Online/1309.html>

I wouldn't necessarily put those two in the top 3, but by some standard
for many they were certainly very significant and noteworthy.

John
Re: Famous operational issues [ In reply to ]
(resent - to list this time)
On 16 Feb 2021, at 2:37 PM, John Kristoff <jtk@dataplane.org <mailto:jtk@dataplane.org>> wrote:
>
> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?

John -

I have no idea what outages were most memorable for others, but the Stanford transfer switch explosion in October 1996 resulted in much of the Internet in the Bay Area simply not being reachable for several days.

At the time there were three main power grids feeding Stanford – two from PG&E and one from Stanford’s own CoGen plant – and somehow a rat crawling into one of the two 12KVA transfer switches resulted in the switch disappearing in an epic explosion that even took out a portion of the exterior wall of the building.

The ensuing restoration involved lots of industry folks, GE power-on-wheel generating stations, anaconda-sized power cables, and all in all was quite the adventure.

FYI,
/John
Re: Famous operational issues [ In reply to ]
Stolen isn’t nearly as exciting as what happens when your (used) 6509 arrives and
gets installed and operational before anyone realizes that the conductive packing
peanuts that it was packed in have managed to work their way into various midplane
connectors. Several hours later someone notices that the box is quite literally
smoldering in the colo, and the resulting combination of panic, fire drill, and
management antics ensues.

Owen


> On Feb 16, 2021, at 2:08 PM, Jared Mauch <jared@puck.nether.net> wrote:
>
> I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen.
>
> Sent from my TI-99/4a
>
>> On Feb 16, 2021, at 2:40 PM, John Kristoff <jtk@dataplane.org> wrote:
>>
>> Friends,
>>
>> I'd like to start a thread about the most famous and widespread Internet
>> operational issues, outages or implementation incompatibilities you
>> have seen.
>>
>> Which examples would make up your top three?
>>
>> To get things started, I'd suggest the AS 7007 event is perhaps the
>> most notorious and likely to top many lists including mine. So if
>> that is one for you I'm asking for just two more.
>>
>> I'm particularly interested in this as the first step in developing a
>> future NANOG session. I'd be particularly interested in any issues
>> that also identify key individuals that might still be around and
>> interested in participating in a retrospective. I already have someone
>> that is willing to talk about AS 7007, which shouldn't be hard to guess
>> who.
>>
>> Thanks in advance for your suggestions,
>>
>> John
Re: Famous operational issues [ In reply to ]
On that note, I'd be very interested in hearing stories of actual incidents
that are the cause of why cardboard boxes are banned in many facilities,
due to loose particulate matter getting into the air and setting off very
sensitive fire detection systems.

Or maybe it's more mundane and 99% of the reason is people unpack stuff and
don't always clean up properly after themselves.

On Wed, Feb 17, 2021, 6:21 PM Owen DeLong <owen@delong.com> wrote:

> Stolen isn’t nearly as exciting as what happens when your (used) 6509
> arrives and
> gets installed and operational before anyone realizes that the conductive
> packing
> peanuts that it was packed in have managed to work their way into various
> midplane
> connectors. Several hours later someone notices that the box is quite
> literally
> smoldering in the colo and the resulting combination of panic, fire drill,
> and
> management antics that ensue.
>
> Owen
>
>
> > On Feb 16, 2021, at 2:08 PM, Jared Mauch <jared@puck.nether.net> wrote:
> >
> > I was thinking about how we need a war stories nanog track. My favorite
> was being on call when the router was stolen.
> >
> > Sent from my TI-99/4a
> >
> >> On Feb 16, 2021, at 2:40 PM, John Kristoff <jtk@dataplane.org> wrote:
> >>
> >> Friends,
> >>
> >> I'd like to start a thread about the most famous and widespread Internet
> >> operational issues, outages or implementation incompatibilities you
> >> have seen.
> >>
> >> Which examples would make up your top three?
> >>
> >> To get things started, I'd suggest the AS 7007 event is perhaps the
> >> most notorious and likely to top many lists including mine. So if
> >> that is one for you I'm asking for just two more.
> >>
> >> I'm particularly interested in this as the first step in developing a
> >> future NANOG session. I'd be particularly interested in any issues
> >> that also identify key individuals that might still be around and
> >> interested in participating in a retrospective. I already have someone
> >> that is willing to talk about AS 7007, which shouldn't be hard to guess
> >> who.
> >>
> >> Thanks in advance for your suggestions,
> >>
> >> John
>
>
Re: Famous operational issues [ In reply to ]
On Thu, Feb 18, 2021 at 01:07:01AM -0800, Eric Kuhnke wrote:
> On that note, I'd be very interested in hearing stories of actual incidents
> that are the cause of why cardboard boxes are banned in many facilities,
> due to loose particulate matter getting into the air and setting off very
> sensitive fire detection systems.
>
> Or maybe it's more mundane and 99% of the reason is people unpack stuff and
> don't always clean up properly after themselves.

We had a plastic bag sucked into the intake of a router in a
datacenter once that caused it to overheat and take the site down. We
had cameras in our cage and I remember seeing the photo from the site of
the colo (I'll protect their name just because) taken as the tech was on
the phone and pulled the bag out of the router.

The time from the thermal warning syslog that it's getting warm
to overheat and shutdown is short enough you can't really get a tech to
the cage in time to prevent it.

I assume also the latter above, which is that people have varying
definitions of clean.

- Jared

--
Jared Mauch | pgp key available via finger from jared@puck.nether.net
clue++; | http://puck.nether.net/~jared/ My statements are only mine.
Re: Famous operational issues [ In reply to ]
On 2/18/21 1:07 AM, Eric Kuhnke wrote:
> On that note, I'd be very interested in hearing stories of actual
> incidents that are the cause of why cardboard boxes are banned in many
> facilities, due to loose particulate matter getting into the air and
> setting off very sensitive fire detection systems.
>


I had a customer that tried to stack their servers - no rails except the
bottom most one - using 2x4's between each server. Up until then I
hadn't imagined anyone would want to fill their cabinet with wood, so I
made a rule to ban wood and anything tangentially related (cardboard,
paper, plastic, etc.). Easier to just ban all things. Fire reasons too
but mainly I thought a cabinet full of wood was too stupid to allow.

The "no wood" rule has become a fun story to tell everyone who asks how
that ended up being a rule. The wood customer turned out to be a
complete a-hole anyway, wood was just the tip of the iceberg.
Re: Famous operational issues [ In reply to ]
Worked a chronic support call where their internet would bounce at noon every workday. The Cisco 1601 or 1700 router that had their T1 in it ended up being on top of a microwave. Weeks of troubleshooting and shipping new routers on this one.

Also had another one where the router was plugged into an outlet that was controlled by a light switch; discovered this after shipping them two new routers.

Customer had their building remodeled and the techs couldn't find the T1 smartjack for the building. The contractor who did the remodel job decided it would be a good idea to cut out the section of wall where the telco equipment was and mount it to the ceiling. Its new location was in the ladies' bathroom, above the drop ceiling, mounted to the building's rafters 10' in the air.

Customer needed a new router because the first one died. It was a machine shop, and they had mounted the router to the wall next to a lathe or drill press that used oil to cool the bit while it was cutting. It looked like someone had dumped the router in a bucket of oil when we got it back.

Arriving at another large colo for a buildout, only to find that our ASR9K that had arrived 2 weeks earlier was stored outside on the loading dock, which had no roof or locked gate. I guess that's why Cisco puts the plastic bag over the chassis when they're shipped.

Colo techs at another large colo decided to unpack our router, which was a fully loaded half-rack chassis. Since they couldn't lift it, they tipped the router on its side and walked it back by shifting the weight from one corner of the chassis to another, bending the chassis. I could see the scrape marks in the floor from it.

We had colo space on the top floor of an AT&T CO where we put a Cisco 7513 to terminate about a dozen channelized DS3s. The roof was leaking, and instead of fixing the roof, the fix was to put a sheet of plastic over our cabinet. It was more like a tent over the cabinet. A pool of water formed in a divot at the top, and it was 120+ degrees under the plastic tarp.

Our office was in a work loft of an older building, and they had the AC units mounted to the ceiling with drip pans underneath them. Well, the AC on the 2nd floor had the pump for its drip pan die. Whoever installed the drip pan didn't secure it or center it under the AC unit. It filled up with water, and since it was not secured and was off center, the drip pan came crashing down with a few gallons of water. The water worked its way over to the wall and traveled down one story in the building. The floor below had all the telco equipment mounted to that same wall, and the water flowed down right through a couple of AT&T's Cienas mounted to the wall, shorting them out. I was at the Chicago NANOG Hackathon on Sunday and was called out to work that one.

Was working in the back of a cabinet that had -48 VDC power for a Cisco router when a screw fell and shorted out the power. My co-worker standing in front of the rack wasn't happy, because the ADC PowerWorx fuse panel was about 6" from his face where he was working. It had those little black alarm fuses with the spring-loaded arm. When it tripped, a nice shower of sparks flew right at his face. Luckily he wore glasses.

I was 18 at my first IT job and it was a brand-new building. I was plugging in a 208VAC 30A APC UPS in the server room; the electrician had just energized and checked the circuit. I plugged in the APC UPS and gave it a good turn for the twist-lock plug to catch and KA-BAM!!! Sparks came shooting out of the outlet at me. I think I pooped myself that day. Turns out the electricians decided that a single-gang electrical box was good enough for a 208 VAC 30A outlet that barely fit in the box. They didn't put any tape around the wire terminals. When they energized the circuit there was enough of an air gap that the hot screw didn't ground out. When I gave it that good old twist while plugging in the APC, I grounded the hot screw to the side of the electrical box.






________________________________
From: NANOG <nanog-bounces+esundberg=nitelusa.com@nanog.org> on behalf of Seth Mattinen <sethm@rollernet.us>
Sent: Thursday, February 18, 2021 10:23 AM
To: nanog@nanog.org <nanog@nanog.org>
Subject: Re: Famous operational issues

On 2/18/21 1:07 AM, Eric Kuhnke wrote:
> On that note, I'd be very interested in hearing stories of actual
> incidents that are the cause of why cardboard boxes are banned in many
> facilities, due to loose particulate matter getting into the air and
> setting off very sensitive fire detection systems.
>


I had a customer that tried to stack their servers - no rails except the
bottom most one - using 2x4's between each server. Up until then I
hadn't imagined anyone would want to fill their cabinet with wood, so I
made a rule to ban wood and anything tangentially related (cardboard,
paper, plastic, etc.). Easier to just ban all things. Fire reasons too
but mainly I thought a cabinet full of wood was too stupid to allow.

The "no wood" rule has become a fun story to tell everyone who asks how
that ended up being a rule. The wood customer turned out to be a
complete a-hole anyway, wood was just the tip of the iceberg.

Re: Famous operational issues [ In reply to ]
On Thursday, 18 February, 2021 16:23, "Seth Mattinen" <sethm@rollernet.us> said:

> I had a customer that tried to stack their servers - no rails except the
> bottom most one - using 2x4's between each server. Up until then I
> hadn't imagined anyone would want to fill their cabinet with wood, so I
> made a rule to ban wood and anything tangentially related (cardboard,
> paper, plastic, etc.). Easier to just ban all things. Fire reasons too
> but mainly I thought a cabinet full of wood was too stupid to allow.

On the "stupid racking" front, I give you most of a rack dedicated to a single server. Not all that high a server, maybe 2U or so, but *way* too deep for the rack, so it had been installed vertically. By looping some fairly hefty chain through the handles on either side of the front of the chassis, and then bolting the four chain ends to the four rack posts. I wish I'd kept pictures of that one. Not flammable, but a serious WTF moment.

Cheers,
Tim.
Re: Famous operational issues [ In reply to ]
Normally I reference this as an example of terrible government
bureaucracy, but in this case it's also how said bureaucracy can delay
operational changes.

I was a contractor for one of the many branches of the DoD in charge
of the network at a moderate-sized site. I'd been there about 4
months, and it was my first job with FedGov. I was sent a pair of
Cisco 6509-E routers, with all supervisors and blades needed, along
with a small mountain of SFPs, to replace the non-E 6509s we had
installed that were still using GBICs for their downlinks. These were
the distro switches for approximately half the site.

Problem was, we needed 84 new SC-LC fiber jumpers to replace the SC-SC
we had in place for the existing switch - GBICs to SFPs remember. We
hadn't received any with the shipment. So I reached out to the project
manager to ask about getting the fiber jumpers. "Oh, that should be
coming from the server farm folks, since it's being installed in a
server farm." Okay, that seems stupid to me, but $FedGov, who knows. I
tell him we're stalled out until we get those cables - we have the
routers configured and ready to go, just need the jumpers, can he get
them from the server farm folks? He'll do that.

It took FIFTEEN MONTHS to hash out who was going to pay for and order
the fiber jumpers. Any number of times as the months dragged on, I
seriously considered ordering them on Amazon Prime using my corporate
card. We had them installed a week and a half after we got them. Why
that long? Because we had to completely reconfigure them, and after 15
months, the urgency just wasn't there.

By the way, the project ended up buying them, not the server farm team.

On Tue, Feb 16, 2021 at 2:38 PM John Kristoff <jtk@dataplane.org> wrote:
>
> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps the
> most notorious and likely to top many lists including mine. So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session. I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective. I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
Re: Famous operational issues [ In reply to ]
A few I remember:

    . Some monitoring server SCSI drive failed (we're talking
State/Province level govt)...  Got a return back stating it would take a 6
month delay to get a replacement...

        Ended up choosing to use my own drive instead of leaving
something that could have been deadly, unmonitored.

    . Metro interruption during rush hour (for a pop of 4M) due to an
overloaded power bar in an MMR (Meet Me Room) during an unplanned deployment;

    . Cherry red and very angry looking 520-600V bus bar =D;

    . Firefighters hitting the building generator emergency STOP
button because some neighbor reported smoke on top of the building
during a blackout...
    ( not their fault, local gov failure as usual )

    . Some idiots poured gasoline into a large pipe under a bridge... 
ended up demonstrating the lack of diversity to the DCs on that urban
island;

    . An underground transformer blew up in downtown Montreal and took out the
entire fiber bundle, demonstrating to those customers that their
diversity was actually real =D.

        (took them a year to get that fixed)

and

    . Obviously: Any rack cabling I do...

-----
Alain Hebert ahebert@pubnix.net
PubNIX Inc.
50 boul. St-Charles
P.O. Box 26770 Beaconsfield, Quebec H9W 6G7
Tel: 514-990-5911 http://www.pubnix.net Fax: 514-990-9443

On 2/18/21 2:37 PM, tim@pelican.org wrote:
> On Thursday, 18 February, 2021 16:23, "Seth Mattinen" <sethm@rollernet.us> said:
>
>> I had a customer that tried to stack their servers - no rails except the
>> bottom most one - using 2x4's between each server. Up until then I
>> hadn't imagined anyone would want to fill their cabinet with wood, so I
>> made a rule to ban wood and anything tangentially related (cardboard,
>> paper, plastic, etc.). Easier to just ban all things. Fire reasons too
>> but mainly I thought a cabinet full of wood was too stupid to allow.
> On the "stupid racking" front, I give you most of a rack dedicated to a single server. Not all that high a server, maybe 2U or so, but *way* too deep for the rack, so it had been installed vertically. By looping some fairly hefty chain through the handles on either side of the front of the chassis, and then bolting the four chain ends to the four rack posts. I wish I'd kept pictures of that one. Not flammable, but a serious WTF moment.
>
> Cheers,
> Tim.
>
>
Re: Famous operational issues [ In reply to ]
On Thu, Feb 18, 2021 at 01:07:01AM -0800, Eric Kuhnke wrote:
> On that note, I'd be very interested in hearing stories of actual incidents
> that are the cause of why cardboard boxes are banned in many facilities,

the datacenter manager's daughter's cat.

--
Henry Yen Aegis Information Systems, Inc.
Senior Systems Programmer Hicksville, New York
Re: Famous operational issues [ In reply to ]
On Thu, Feb 18, 2021 at 8:31 AM Jared Mauch <jared@puck.nether.net> wrote:

> On Thu, Feb 18, 2021 at 01:07:01AM -0800, Eric Kuhnke wrote:
> > On that note, I'd be very interested in hearing stories of actual
> incidents
> > that are the cause of why cardboard boxes are banned in many facilities,
> > due to loose particulate matter getting into the air and setting off very
> > sensitive fire detection systems.
> >
> > Or maybe it's more mundane and 99% of the reason is people unpack stuff
> and
> > don't always clean up properly after themselves.
>
> We had a plastic bag sucked into the intake of a router in a
> datacenter once that caused it to overheat and take the site down. We
> had cameras in our cage and I remember seeing the photo from the site of
> the colo (I'll protect their name just because) taken as the tech was on
> the phone and pulled the bag out of the router.
>
> The time from the thermal warning syslog that it's getting warm
> to overheat and shutdown is short enough you can't really get a tech to
> the cage in time to prevent it.
>


1: A previous employer was a large customer of a (now defunct) L3 switch
vendor. The AC power inputs were along the bottom of the power supply, and
the big aluminium heatsinks in the power supplies were just above the AC
socket.
Anyway, the subcontractor who made the power supplies for the vendor
realized that they could save a few cents by not installing the little
metal clip that held the heatsink to the MOSFET, and instead relying on the
thermal adhesive to hold it...
This worked fine, until a certain number of hours had passed, at which
point the goop would dry out and the heatsink would fall down, directly
across the AC socket.... This would A: trip the circuit that this was on,
but, more excitingly, set the aluminum on fire, which would then ignite the
other heatsinks in the PSU, leading to much fire...

2: A somewhat similar thing would happen with the Ascend TNT Max, which had
side-to-side airflow. These were dial termination boxes, and so people
would install racks and racks of them. The first one would draw in cool air
on the left, heat it up and ship it out the right. The next one over would
draw in warm air on the left, heat it up further, and ship it out the
right... Somewhere there is a fairly famous photo of a rack of TNT Maxes,
with the final one literally on fire, and still passing packets.
There is a related (and probably apocryphal) story regarding the launch of the
TNT. It was being shipped for a major trade-show, but got stuck in customs.
After many bizarre calls with the customs folk, someone goes to the customs
office to try and sort it out, and gets greeted by customs agents with guns.
They all walk into the warehouse, and discover that there is a large empty
area around the crate, which is a wooden cube, with "TNT" stencilled in big
red letters...

3: I used to work for a small ISP in Yonkers, NY. We had a customer in
Florida, and on a Friday morning their site goes down. We (of course) have
not paid for Cisco 4 hour support (or, honestly, any support) and they have
a strict SLA, so we are a little stuck.
We end up driving to JFK, and lugging a fully loaded Cisco 7507 to the
check in counter. It was just before the last flight of the day, so we
shrugged and said it was my checked bag. The excess baggage charges were
eye-watering, but it rode the conveyor belt with the rest of the luggage
onto the plane. It arrived with just a bent ejector handle, and the rest
was fine.

4: Not too long after I started doing networking (and for the same small
ISP in Yonkers), I'm flying off to install a new customer. I (of course)
think that I'm hot stuff because I'm going to do the install, configure the
router, whee, look at me! Anyway, I don't want to check a bag, and so I
stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all
pre-9/11!). I'm going through security and the TSA[0] person opens my bag
and pulls the router out. "What's this?!" he asks. I politely tell him that
it's a router. He says it's not. I'm still thinking that I'm the new
hotness, and so I tell him in a somewhat condescending way that it is, and
I know what I'm talking about. He tells me that it's not a router, and is
starting to get annoyed. I explain using my "talking to a 5 year old" voice
that it most certainly is a router. He tells me that lying to airport
security is a federal offense, and starts looming at me. I adjust my
attitude and start explaining that it's like a computer and makes the
Internet work. He gruffly hands me back the router, I put it in my bag and
scurry away. As I do so, I hear him telling his colleague that it wasn't a
router, and that he certainly knows what a router is, because he does
woodwork...

5: Another one. In the early 2000s I was working for a dot-com boom
company. We are building out our first datacenter, and I'm installing a
pair of Cisco 7206s in 811 10th Ave. These will run basically the entire
company, we have some transit, we have some peering to configure, we have
an AS, etc. I'm going to be configuring all of this; clearly I'm a
router-god...
Anyway, while I'm getting things configured, this janitor comes past,
wheeling a garbage bin. He stops outside the cage and says "Whatcha
doin'?". I go into this long explanation of how these "routers" <point>
will connect to "the Internet" <wave hands in a big circle> to allow my
"servers" <gesture at big black boxes with blinking lights> to talk to
other "computers" <typing motion> on "the Internet" <again with the waving
of the hands>. He pauses for a second, and says "'K. So, you doing a full
iBGP mesh, or confeds?". I really hadn't intended to be a condescending
ass, but I think of that every time I realize I might be assuming something
about someone based on their attire/job/etc.





W
[0]: Well, technically pre-TSA, but I cannot remember what we used to call
airport security pre-TSA...



>
> I assume also the latter above, which is people have varying
> definitons of clean.
>
> - Jared
>
> --
> Jared Mauch | pgp key available via finger from jared@puck.nether.net
> clue++; | http://puck.nether.net/~jared/ My statements are only
> mine.
>


--
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
-- E. W. Dijkstra
Re: Famous operational issues [ In reply to ]
On Thu, 2021-02-18 at 17:37 -0500, Warren Kumari wrote:
> Anyway, the subcontractor who made the power supplies for the vendor
> realized that they could save a few cents by not installing the
> little metal clip that held the heatsink to the MOSFET

I think it was Machiavelli who said that one should not ascribe to
malice anything adequately explained by incompetence...

> 3: I used to work for a small ISP in Yonkers, NY.

There is actually a place called "Yonkers"?!? I always thought it was a
joke placename. We don't really need joke placenames in Oz, since we
have real ones like Woolloomooloo, Burpengary and Humpty Doo. My
favourite is Numbugga (closely followed by Wonglepong).

> I cannot remember what we used to call airport security pre-TSA...

"Useful"?

Regards, K.

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Karl Auer (kauer@biplane.com.au)
http://www.biplane.com.au/kauer

GPG fingerprint: 2561 E9EC D868 E73C 8AF1 49CF EE50 4B1D CCA1 5170
Old fingerprint: 8D08 9CAA 649A AFEF E862 062A 2E97 42D4 A2A0 616D
Re: Famous operational issues [ In reply to ]
warren> 2: A somewhat similar thing would happen with the Ascend TNT
warren> Max, which had side-to-side airflow. These were dial termination
warren> boxes, and so people would install racks and racks of them. The
warren> first one would draw in cool air on the left, heat it up and
warren> ship it out the right. The next one over would draw in warm air
warren> on the left, heat it up further, and ship it out the
warren> right... Somewhere there is a fairly famous photo of a rack of
warren> TNT Maxes, with the final one literally on fire, and still
warren> passing packets.

The Ascend MAX (TNT was the T3 version, max took 2 T1s) was originally
an ISDN device. We got the first v.34 rockwell modem version for
testing. An individual card had 4 daughter boards. They were burned in
for 24 hours at Ascend, then shipped to us. We were doing stress testing
in Fairfax VA. Turns out that the boards started to overheat at about 30
hours and caught fire a few hours after that... Completely melted the
daughterboards. They did fix that issue and upped the burn-in test period
to 48 hours.

And yeah, they vented side to side. They were designed for enclosed
racks where air flow was forced up. We were colocating at telco POPs so
we had to use center mount open relay racks. The air flow was as you
describe. Good time. Had by all...

Both we (UUNET, for MSN and Earthlink) and AOL were using these for
dialup access. 80k ports before we switched to the TNTs, 3+ million
ports on TNTs by the time I stopped paying attention.
Re: Famous operational issues [ In reply to ]
On 2021-02-17 13:28, John Kristoff wrote:
> On Wed, 17 Feb 2021 14:07:54 -0500
> John Curran <jcurrran@istaff.org> wrote:
>
>> I have no idea what outages were most memorable for others, but the
>> Stanford transfer switch explosion in October 1996 resulted in much
>> of the Internet in the Bay Area simply not being reachable for
>> several days.
>
> Thanks John.
>
> This reminds me of two I've not seen anyone mention yet. Both
> coincidentally in the Chicago area that I learned before my entry
> into netops full time. One was a flood:
>
> <https://en.wikipedia.org/wiki/Chicago_flood>
>
> The other, at the dawn of an earlier era:
>
>
> <http://telecom-digest.org/telecom-archives/TELECOM_Digest_Online/1309.html>
>
> I wouldn't necessarily put those two in the top 3, but by some standard
> for many they were certainly very significant and noteworthy.
>
> John

Thanks for sharing these links John. I was personally affected by the
Hinsdale CO fire when I was a kid. At the time, my family lived on the
southern border of Hinsdale in the adjacent town of Burr Ridge. It was
weird like a power outage: you're reminded of the loss of service every
time you perform the simple act of requesting service, picking up the
phone or toggling a light switch. But it lasted a lot longer than any
loss of power: It was six or seven weeks that, to this day, felt a lot
longer.

Anytime we needed to talk to someone long-distance, we had to drive to a
cousin's house to make the call. To talk to anyone local, you'd have to
physically go and show up unannounced. At 11 years old, I was the
bicycle messenger between our house and my great-grandmother, who lived
about two blocks away. My mother and father kept the cars gassed up and
extra fuel on hand in case there was an emergency.

Dad ran a home improvement business out of the house, so new business
ground to a halt. Mom worked for a publishing company, so their release
dates were impacted. The local grocery store's scanners wouldn't work,
so they had to punch the orders into the register by hand, using the
paper sticker prices on the items.

I clearly remember from the local papers that they had to special-order
the replacement 5ESS at enormous cost. I saw the big brick building
after the fire with the burn marks around the front door. In late May
and early June, the Greyhound buses with the workers were parked around
the block, power plants outside with huge cables snaking in right
through the wide open front door.

When we heard that dial tone at last, everyone was happier than an
iPhone with full bars. Lol

We're spoiled for choice in telecom networks these days. Also,
facilities management have learned plenty of lessons since then. Like,
install and maintain an FM-200 fire suppression system. But
nevertheless, sometimes when I step into a colo, I think of that outage
and the impact it had.

-Brian
Re: Famous operational issues [ In reply to ]
On Feb 18, 2021, at 6:10 PM, Karl Auer <kauer@biplane.com.au> wrote:
>
> I think it was Macchiavelli who said that one should not ascribe to
> malice anything adequately explained by incompetence…

https://en.wikipedia.org/wiki/Hanlon%27s_razor
Never attribute to malice that which is adequately explained by stupidity.

I personally prefer this version from Robert A. Heinlein:
Never underestimate the power of human stupidity.

And to put it on topic, cover your EPOs

In 1994, there was a major earthquake near the city of Los Angeles. City hall had to be evacuated and it would take over a year to reinforce the building to make it habitable again. My company moved all the systems in the basement of city hall to a new datacenter a mile or so away. After the install, we spent more than a week coaxing their ancient (even for 1994) machines back online, such as a Prime Computer and an AS400 with tons of DASD. Well, tons of cabinets, certainly less storage than my watch has now.

I was in the DC going over something with the lady in charge when someone walked in to ask her something. She said “just a second”. That person took one step to the side of the door and leaned against the wall - right on an EPO which had no cover.

Have you ever heard an entire row of DASD spin down instantly? Or taken 40 minutes to IPL an AS400? In the middle of the business day? For the second most populous city in the country?

Me: Maybe you should get a cover for that?
Her: Good idea.

Couple weeks later, in the same DC, going over final checklist. A fedex guy walks in. (To this day, no idea how he got in a supposedly locked DC.) She says “just a second”, and I get a very strong deja vu feeling. He takes one step to the side and leans against the wall.

Me: Did you order that EPO cover?
Her: Nope.

--
TTFN,
patrick
Re: Famous operational issues [ In reply to ]
when employer had shipped 2xJ to london, had the circuits up, ...
the local office sat on their hands. for weeks. i finally was
pissed enough to throw my toolbag over my shoulder, get on a
plane, and fly over. i walked into the fancy office and said
"hi, i am randy, vp eng, here to help you turn up the routers."
they managed to turn them up pretty quickly.
Re: Famous operational issues [ In reply to ]
> On Feb 18, 2021, at 4:37 PM, Warren Kumari <warren@kumari.net> wrote:
>
> 4: Not too long after I started doing networking (and for the same small ISP in Yonkers), I'm flying off to install a new customer. I (of course) think that I'm hot stuff because I'm going to do the install, configure the router, whee, look at me! Anyway, I don't want to check a bag, and so I stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all pre-9/11!). I'm going through security and the TSA[0] person opens my bag and pulls the router out. "What's this?!" he asks. I politely tell him that it's a router. He says it's not. I'm still thinking that I'm the new hotness, and so I tell him in a somewhat condescending way that it is, and I know what I'm talking about. He tells me that it's not a router, and is starting to get annoyed. I explain using my "talking to a 5 year old" voice that it most certainly is a router. He tells me that lying to airport security is a federal offense, and starts looming at me. I adjust my attitude and start explaining that it's like a computer and makes the Internet work. He gruffly hands me back the router, I put it in my bag and scurry away. As I do so, I hear him telling his colleague that it wasn't a router, and that he certainly knows what a router is, because he does woodwork…

Well, in his defense, he wasn’t wrong… :-)



----
Andy Ringsmuth
5609 Harding Drive
Lincoln, NE 68521-5831
(402) 304-0083
andy@andyring.com

“Better even die free, than to live slaves.” - Frederick Douglas, 1863
Re: Famous operational issues [ In reply to ]
On 2/19/21 00:37, Warren Kumari wrote:

>
> 5: Another one. In the early 2000s I was working for a dot-com boom
> company. We are building out our first datacenter, and I'm installing
> a pair of Cisco 7206s in 811 10th Ave. These will run basically the
> entire company, we have some transit, we have some peering to
> configure, we have an AS, etc. I'm going to be configuring all of
> this; clearly I'm a router-god...
> Anyway, while I'm getting things configured, this janitor comes past,
> wheeling a garbage bin. He stops outside the cage and says "Whatcha
> doin'?". I go into this long explanation of how these "routers"
> <point> will connect to "the Internet" <wave hands in a big circle> to
> allow my "servers" <gesture at big black boxes with blinking lights>
> to talk to other "computers" <typing motion> on "the Internet" <again
> with the waving of the hands>. He pauses for a second, and says "'K.
> So, you doing a full iBGP mesh, or confeds?". I really hadn't intended
> to be a condescending ass, but I think of that every time I realize I
> might be assuming something about someone based on thier attire/job/etc.

:-), cute.

Mark.
Re: Famous operational issues [ In reply to ]
On Fri, Feb 19, 2021 at 9:40 AM Warren Kumari <warren@kumari.net> wrote:
> 4: Not too long after I started doing networking (and for the same small ISP in Yonkers), I'm flying off to install a new customer. I (of course) think that I'm hot stuff because I'm going to do the install, configure the router, whee, look at me! Anyway, I don't want to check a bag, and so I stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all pre-9/11!). I'm going through security and the TSA[0] person opens my bag and pulls the router out. "What's this?!" he asks. I politely tell him that it's a router. He says it's not. I'm still thinking that I'm the new hotness, and so I tell him in a somewhat condescending way that it is, and I know what I'm talking about. He tells me that it's not a router, and is starting to get annoyed. I explain using my "talking to a 5 year old" voice that it most certainly is a router. He tells me that lying to airport security is a federal offense, and starts looming at me. I adjust my attitude and start explaining that it's like a computer and makes the Internet work. He gruffly hands me back the router, I put it in my bag and scurry away. As I do so, I hear him telling his colleague that it wasn't a router, and that he certainly knows what a router is, because he does woodwork...

OK, Warren, achievement unlocked. You've just made a network engineer
google 'router'....

P.S. I guess I'm obliged to tell a story if I respond to this thread...so...
"Servers and the ice cream factory".
Late spring/early summer in Moscow. The temperature was above 30C (86°F).
I worked for a local content provider.
Aircons in our server room died, the technician ETA was 2 days ( I
guess we were not the only ones with aircon problems).
So we drove to the nearby ice cream factory and got *a lot* of dry
ice. Then we had a roster: every few hours one person took a deep
breath, grabbed a box of dry ice, ran into the server room and emptied
the box on top of the racks. The backup person was watching through
the glass door - just in case, you know, ready to start the rescue
operation.
We (and the servers) survived till the technician arrived. And we had
a lot of dry ice to cool the beer..

--
SY, Jen Linkova aka Furry
Re: Famous operational issues [ In reply to ]
One day I got called into the office supplies area because there was a
smell of something burning. Uh-oh.

To make a long story short there was a stainless steel bowl which was
focusing the sun from a window such that it was igniting a cardboard
box.

Talk about SMH and random bad luck which could have been a lot worse,
nothing really happened other than some smoke and char.

On February 18, 2021 at 01:07 eric.kuhnke@gmail.com (Eric Kuhnke) wrote:
> On that note, I'd be very interested in hearing stories of actual incidents
> that are the cause of why cardboard boxes are banned in many facilities, due to
> loose particulate matter getting into the air and setting off very sensitive
> fire detection systems.
>
> Or maybe it's more mundane and 99% of the reason is people unpack stuff and
> don't always clean up properly after themselves.
>
> On Wed, Feb 17, 2021, 6:21 PM Owen DeLong <owen@delong.com> wrote:
>
> Stolen isn’t nearly as exciting as what happens when your (used) 6509
> arrives and
> gets installed and operational before anyone realizes that the conductive
> packing
> peanuts that it was packed in have managed to work their way into various
> midplane
> connectors. Several hours later someone notices that the box is quite
> literally
> smoldering in the colo and the resulting combination of panic, fire drill,
> and
> management antics that ensue.
>
> Owen
>
>
> > On Feb 16, 2021, at 2:08 PM, Jared Mauch <jared@puck.nether.net> wrote:
> >
> > I was thinking about how we need a war stories nanog track. My favorite
> was being on call when the router was stolen.
> >
> > Sent from my TI-99/4a
> >
> >> On Feb 16, 2021, at 2:40 PM, John Kristoff <jtk@dataplane.org> wrote:
> >>
> >> Friends,
> >>
> >> I'd like to start a thread about the most famous and widespread Internet
> >> operational issues, outages or implementation incompatibilities you
> >> have seen.
> >>
> >> Which examples would make up your top three?
> >>
> >> To get things started, I'd suggest the AS 7007 event is perhaps  the
> >> most notorious and likely to top many lists including mine.  So if
> >> that is one for you I'm asking for just two more.
> >>
> >> I'm particularly interested in this as the first step in developing a
> >> future NANOG session.  I'd be particularly interested in any issues
> >> that also identify key individuals that might still be around and
> >> interested in participating in a retrospective.  I already have someone
> >> that is willing to talk about AS 7007, which shouldn't be hard to guess
> >> who.
> >>
> >> Thanks in advance for your suggestions,
> >>
> >> John
>
>

--
-Barry Shein

Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com
Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD
The World: Since 1989 | A Public Information Utility | *oo*
Re: Famous operational issues [ In reply to ]
Northridge quake. I was #2 and on call at CRL. That One Guy on dialup in Atlanta who played MUDs 23x7 paged that things were down. I wandered out to my computer to dial in and see what was up, turned on the TV walking past it, sat down and turned the computer on, and as it was booting on came a live helicopter shot over Northridge showing the 1.5 remaining floors of the 3-story Cable and Wireless building our east coast connector went through.

Took a second to listen and make sure I understood what was happening, changed channels to verify it wasn’t a stunt, logged on and pinged our router there to confirm nothing there, then called & woke up Jim: “East coast’s down because earthquake in Northridge and the C&W center fell down.”

“....oh.”

And then there was the Sidekick outage...


-George

Sent from my iPhone

> On Feb 18, 2021, at 4:37 PM, Patrick W. Gilmore <patrick@ianai.net> wrote:
>
> On Feb 18, 2021, at 6:10 PM, Karl Auer <kauer@biplane.com.au> wrote:
>>
>> I think it was Macchiavelli who said that one should not ascribe to
>> malice anything adequately explained by incompetence…
>
> https://en.wikipedia.org/wiki/Hanlon%27s_razor
> Never attribute to malice that which is adequately explained by stupidity.
>
> I personally prefer this version from Robert A. Heinlein:
> Never underestimate the power of human stupidity.
>
> And to put it on topic, cover your EPOs
>
> In 1994, there was a major earthquake near the city of Los Angeles. City hall had to be evacuated and it would take over a year to reinforce the building to make it habitable again. My company moved all the systems in the basement of city hall to a new datacenter a mile or so away. After the install, we spent more than a week coaxing their ancient (even for 1994) machines back online, such as a Prime Computer and an AS400 with tons of DASD. Well, tons of cabinets, certainly less storage than my watch has now.
>
> I was in the DC going over something with the lady in charge when someone walked in to ask her something. She said “just a second”. That person took one step to the side of the door and leaned against the wall - right on an EPO which had no cover.
>
> Have you ever heard an entire row of DASD spin down instantly? Or taken 40 minutes to IPL an AS400? In the middle of the business day? For the second most populous city in the country?
>
> Me: Maybe you should get a cover for that?
> Her: Good idea.
>
> Couple weeks later, in the same DC, going over final checklist. A fedex guy walks in. (To this day, no idea how he got in a supposedly locked DC.) She says “just a second”, and I get a very strong deja vu feeling. He takes one step to the side and leans against the wall.
>
> Me: Did you order that EPO cover?
> Her: Nope.
>
> --
> TTFN,
> patrick
>
Re: Famous operational issues [ In reply to ]
Did you at least hire the janitor?

From: NANOG <nanog-bounces+ops.lists=gmail.com@nanog.org> on behalf of Mark Tinka <mark@tinka.africa>
Date: Friday, 19 February 2021 at 10:20 AM
To: nanog@nanog.org <nanog@nanog.org>
Subject: Re: Famous operational issues

On 2/19/21 00:37, Warren Kumari wrote:

5: Another one. In the early 2000s I was working for a dot-com boom company. We are building out our first datacenter, and I'm installing a pair of Cisco 7206s in 811 10th Ave. These will run basically the entire company, we have some transit, we have some peering to configure, we have an AS, etc. I'm going to be configuring all of this; clearly I'm a router-god...
Anyway, while I'm getting things configured, this janitor comes past, wheeling a garbage bin. He stops outside the cage and says "Whatcha doin'?". I go into this long explanation of how these "routers" <point> will connect to "the Internet" <wave hands in a big circle> to allow my "servers" <gesture at big black boxes with blinking lights> to talk to other "computers" <typing motion> on "the Internet" <again with the waving of the hands>. He pauses for a second, and says "'K. So, you doing a full iBGP mesh, or confeds?". I really hadn't intended to be a condescending ass, but I think of that every time I realize I might be assuming something about someone based on thier attire/job/etc.

:-), cute.

Mark.
Re: Famous operational issues [ In reply to ]
On Feb 18, 2021, at 11:51 PM, Suresh Ramasubramanian <ops.lists@gmail.com> wrote:

>> On 2/19/21 00:37, Warren Kumari wrote:

>> and says "'K. So, you doing a full iBGP mesh, or confeds?". I really hadn't
>> intended to be a condescending ass, but I think of that every time I realize I
>> might be assuming something about someone based on thier attire/job/etc.

> Did you at least hire the janitor?

Well, it's funny that you mention that because I worked at a place where the
company ended up hiring a young lady who worked in the cafeteria. When she
graduated she was offered a job in HR, and turned out to be absolutely awesome.

At some point in my life, I was carrying 50lbs bags of potato starch. Now I have
two graduate degrees and am working on a third. That janitor may be awesome, too!

Thanks,

Sabri
Re: Famous operational issues [ In reply to ]
He is. He asked a perfectly relevant question based on what he saw of the physical setup in front of him.

And he kept his cool when being talked down to.

I’d hire him the next minute, personally speaking.

From: Sabri Berisha <sabri@cluecentral.net>
Date: Friday, 19 February 2021 at 2:02 PM
To: Suresh Ramasubramanian <ops.lists@gmail.com>
Cc: nanog <nanog@nanog.org>
Subject: Re: Famous operational issues
On Feb 18, 2021, at 11:51 PM, Suresh Ramasubramanian <ops.lists@gmail.com> wrote:

>> On 2/19/21 00:37, Warren Kumari wrote:

>> and says "'K. So, you doing a full iBGP mesh, or confeds?". I really hadn't
>> intended to be a condescending ass, but I think of that every time I realize I
>> might be assuming something about someone based on thier attire/job/etc.

> Did you at least hire the janitor?

Well, it's funny that you mention that because I worked at a place where the
company ended up hiring a young lady who worked in the cafeteria. When she
graduated she was offered a job in HR, and turned out to be absolutely awesome.

At some point in my life, I was carrying 50lbs bags of potato starch. Now I have
two graduate degrees and am working on a third. That janitor may be awesome, too!

Thanks,

Sabri
Re: Famous operational issues [ In reply to ]
On 2/19/21 10:40, Suresh Ramasubramanian wrote:

> He is. He asked a perfectly relevant question based on what he saw of
> the physical setup in front of him.
>
> And he kept his cool when being talked down to.
>
> I’d hire him the next minute, personally speaking.
>

In the early 2000's, with that level of deduction, I'd have been
surprised if he wasn't snatched up quickly. Unless, of course, it
ultimately wasn't his passion.

Mark.
Re: Famous operational issues [ In reply to ]
Do you remember the Cisco HDCI connectors?
https://en.wikipedia.org/wiki/HDCI

I once shipped a Cisco 4500 plus some cables to a remote data center and asked the local guys to cable them for me.
With Cisco you could check the cable type and if they were properly attached. They were not.

I asked for a check and the local guy confirmed to me three times that the cables were properly plugged in.
In the end I gave up, and took the 3 hour drive to the datacenter to check myself.

Problem was that, while the casing of the connector is asymmetrical, the pins inside are symmetrical.
And the local guy was quite strong.

Yes, he managed to plug in the cables 180° flipped, bending the case, but he got them in.
He was quite embarrassed when I fixed the cabling problem in 10 seconds.

That must have been 1995 or so....

Wolfgang



> On 16. Feb 2021, at 20:37, John Kristoff <jtk@dataplane.org> wrote:
>
> Which examples would make up your top three?

--
Wolfgang Tremmel

Phone +49 69 1730902 0 | wolfgang.tremmel@de-cix.net
Executive Directors: Harald A. Summa and Sebastian Seifert | Trade Registry: AG Cologne, HRB 51135
DE-CIX Management GmbH | Lindleystrasse 12 | 60314 Frankfurt am Main | Germany | www.de-cix.net
Re: Famous operational issues [ In reply to ]
On 16 Feb 2021, at 20:37, John Kristoff wrote:

> I'd like to start a thread about the most famous and widespread
> Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?


My absolute top one happened in 1995. Traffic engineering was not a widely
used term then. A bright colleague who will remain unnamed decided that
he could make AS paths longer by repeating the same AS number more than
once. Unfortunately the prevalent software on Cisco routers was not
resilient to such trickery and reacted with a reboot. This caused an
avalanche of yo-yo-ing routers. Think it through!

It took some time before that offending path could be purged from the
whole Internet; yes we all roughly knew the topology and the players of
the BGP speaking parts of it at that time. Luckily this happened
during the set-up for the Danvers IETF and co-ordination between major
operators was quick because most of their routing geeks happened to be
in the same room, the ‘terminal room’; remember those?

Since at the time I personally had no responsibility for operations any
more I went back to pulling cables and crimping RJ45s.

Lessons: HW/SW mono-cultures are dangerous. Input testing is good
practice at all levels of software. Operational co-ordination is key in
times of crisis.
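
(A toy illustration of that input-testing lesson -- hypothetical Python, not
any router's actual code: sanity-check a received AS path and reject the
pathological case instead of falling over on it.)

# Hypothetical sketch: bound-check an incoming AS path rather than assume it
# is well-formed; absurdly long or heavily prepended paths are rejected
# instead of crashing the process.
MAX_PATH_LEN = 255   # arbitrary sanity bounds for this sketch
MAX_REPEATS = 16

def as_path_acceptable(as_path):
    if not as_path or len(as_path) > MAX_PATH_LEN:
        return False
    repeats = 1
    for prev, cur in zip(as_path, as_path[1:]):
        repeats = repeats + 1 if cur == prev else 1
        if repeats > MAX_REPEATS:
            return False
    return True

# e.g. as_path_acceptable([64500] + [64501] * 300) returns False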

Daniel
Re: Famous operational issues [ In reply to ]
In the case of Exodus when I was working there, it was literally dictated to us by
the fire marshal of the city of Santa Clara (and enough other cities where we had
datacenters to make a universal policy the only sensible choice).

Owen

> On Feb 18, 2021, at 1:07 AM, Eric Kuhnke <eric.kuhnke@gmail.com> wrote:
>
> On that note, I'd be very interested in hearing stories of actual incidents that are the cause of why cardboard boxes are banned in many facilities, due to loose particulate matter getting into the air and setting off very sensitive fire detection systems.
>
> Or maybe it's more mundane and 99% of the reason is people unpack stuff and don't always clean up properly after themselves.
>
> On Wed, Feb 17, 2021, 6:21 PM Owen DeLong <owen@delong.com <mailto:owen@delong.com>> wrote:
> Stolen isn’t nearly as exciting as what happens when your (used) 6509 arrives and
> gets installed and operational before anyone realizes that the conductive packing
> peanuts that it was packed in have managed to work their way into various midplane
> connectors. Several hours later someone notices that the box is quite literally
> smoldering in the colo and the resulting combination of panic, fire drill, and
> management antics that ensue.
>
> Owen
>
>
> > On Feb 16, 2021, at 2:08 PM, Jared Mauch <jared@puck.nether.net <mailto:jared@puck.nether.net>> wrote:
> >
> > I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen.
> >
> > Sent from my TI-99/4a
> >
> >> On Feb 16, 2021, at 2:40 PM, John Kristoff <jtk@dataplane.org <mailto:jtk@dataplane.org>> wrote:
> >>
> >> Friends,
> >>
> >> I'd like to start a thread about the most famous and widespread Internet
> >> operational issues, outages or implementation incompatibilities you
> >> have seen.
> >>
> >> Which examples would make up your top three?
> >>
> >> To get things started, I'd suggest the AS 7007 event is perhaps the
> >> most notorious and likely to top many lists including mine. So if
> >> that is one for you I'm asking for just two more.
> >>
> >> I'm particularly interested in this as the first step in developing a
> >> future NANOG session. I'd be particularly interested in any issues
> >> that also identify key individuals that might still be around and
> >> interested in participating in a retrospective. I already have someone
> >> that is willing to talk about AS 7007, which shouldn't be hard to guess
> >> who.
> >>
> >> Thanks in advance for your suggestions,
> >>
> >> John
>
Re: Famous operational issues [ In reply to ]
All these stories remind me of two of my own from back in the late 90s.
I worked for a regional ISP doing some network stuff (under the real
engineer), and some software development.

Like a lot of ISPs in the 90s, this one started out in a rental house.
Over the months and years rooms were slowly converted to host more and more
equipment as we expanded our customer base and presence in the region.
If we needed a "rack", someone would go to the store and buy a 4-post metal
shelf [1] or... in some cases, go to the dump to see what they had.

We had one that looked like an oversized filing cabinet with some sort of
rails on the sides. I don't recall how the equipment was mounted, but I
think it was by drilling holes into the front lip and tapping the screws
in. This was the big super-important rack. It had the main router that
connected lines between 5 POPs around the region, and also several
connections to Portland Oregon about 60 miles away. Since we were
making tons of money, we decided we should update our image and install
real racks in the "bedroom server room". It was decided we were going to
do it with no downtime.

I was on the 2-man team that stood behind and in front of the rack with
2x4s dead-lifting them as equipment was unscrewed and lowered onto the
boards. I was on the back side of the rack. After all the equipment was
unscrewed, someone came in with a sawzall and cut the filing cabinet thing
apart. The top half was removed and taken away, then we lifted up on the
boards and the bottom half was slid out of the way. The new rack was
brought in, bolted to the floor, and then one by one equipment was taken
off the pile we were holding up with 2x4s, brought through the back of the
new rack, and then mounted.

I was pleasantly surprised and very relieved when we finished moving the
big router, several switches, a few servers, and a UPS unit over to the new
rack with zero downtime. The entire team cheered and cracked beers. I
stepped out from behind the rack...
...and snagged the power cable to the main router with my foot. I don't
recall the Cisco model number after all this time...but I do remember the
excruciating 6-8 minutes it took for the damn thing to reboot, and the
sight of the 7 PRI cards in our phone system almost immediately jumping
from 5 channels in-use to being 100% full.

It's been 20 years, but I swear my arms are still sore from holding all
that equipment up for ~20 minutes, and I always pick my feet up very slowly
when I'm near a rack. ;)

The second story is a short one from the same time period. Our POPs
consisted of the afore-mentioned 4-post metal shelves stacked with piles of
US Robotics 56k modems [2] stacked on top of each other. They were wired
back to some sort of serial box that was in-turn connected to an ISA card
stuck in a Windows NT 4 server that used RADIUS to authenticate sessions
with an NT4 server back at the main office that had user accounts for all
our customers. Every single modem had a wall-wart power brick for power,
an RJ11 phone line, and a big old serial cable. It was an absolute rats
nest of cables. The small POP (which I think was a TuffShed in someone's
yard about 50 feet from the telco building) was always 100 degrees--even in
the dead of winter.

One year we made the decision to switch to 3Com Total Control Chassis with
PRI cards. The cut-over was pretty seamless and immediately made shelves
stacked full of hundreds of modems completely useless. As we started
disconnecting modems with the intent of selling them for a few bucks to
existing customers who wanted to upgrade or giving them to new customers to
get them signed up, we found a bunch of the stacks of modems had actually
melted together due to the temps. That explained the handful of numbers in
the hunt group that would just ring and ring with no answer. In the end we
went from a completely packed 10x20 shed to two small 3Com TCH boxes packed
with PRI cards and a handful of PRI cables with much more normal
temperatures.

I thoroughly enjoyed the "wild west" days of the internet.

If Eric and Dan are reading this, thanks for everything you taught me about
networking, business, hard work, and generally being a good person.

-A

[1] -
https://www.amazon.com/dp/B01D54TICS/ref=redir_mobile_desktop?_encoding=UTF8&aaxitk=Pe4xuew1D1PkrRA9cq8Cdg&hsa_cr_id=5048111780901&pd_rd_plhdr=t&pd_rd_r=4d9e3b6b-3360-41e8-9901-d079ac063f03&pd_rd_w=uRxXq&pd_rd_wg=CDibq&ref_=sbx_be_s_sparkle_td_asin_0_img

[2] - https://www.usr.com/products/56k-dialup-modem/usr5686g/



On Tue, Feb 16, 2021 at 11:39 AM John Kristoff <jtk@dataplane.org> wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps the
> most notorious and likely to top many lists including mine. So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session. I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective. I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>
Re: Famous operational issues [ In reply to ]
Jen Linkova wrote on 2021-02-19 00:04:

> OK, Warren, achievement unlocked. You've just made a network engineer
> to google 'router'....

He meant what we call a "frezer" machine... (in our language ;)

I heard a similar story from my colleague who was working at that time
for Huawei as a DWDM engineer and had to fly frequently with testing
devices.
One time he tried to explain at airport security control what a DWDM
spectrum analyser is for; the officer called another over for help, who
said something like this: "DWDM spectrum analyser? Pass it, usual
thing..."

--
Kind regards,
Andrey Kostin
Re: Famous operational issues [ In reply to ]
On 2/16/2021 2:37 PM, John Kristoff wrote:
> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?


I don't believe I've seen this in any of the replies, but the AT&T
cascading switch crashes of 1990 are a good one. This link even has some
pseudocode
https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse
Re: Famous operational issues [ In reply to ]
At a previous company we had a large number of Foundry Networks layer-3
switches. They participated in our OSPF network and had a *really* annoying
bug. Every now and then one of them would get somewhat confused and would
corrupt its OSPF database (there seemed to be some pointer that would end
up off by one).

It would then cleverly realize that its LSDB was different to everyone
else's and so would flood this corrupt database to all other OSPF speakers.
Some vendors would do a better job of sanity checking the LSAs and would
ignore the bad LSAs, other vendors would install them... and now you have
different link state databases on different devices and OSPF becomes
unhappy.

Nov 24 22:23:53.633 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.9.32.5
  Mask 10.160.8.0 from 10.178.255.252
  NOTE: This route will not be installed in the routing table.
Nov 26 11:01:32.997 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3
  Mask 10.2.153.0 from 10.178.255.252
  NOTE: This route will not be installed in the routing table.
Nov 27 23:14:00.660 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3
  Mask 10.2.153.0 from 10.178.255.252
  NOTE: This route will not be installed in the routing table.

If you look at the output, you can see that there is some garbage in the
LSID field and the bit that should be there is now in the Mask section. I
also saw some more extreme version of the same bug, in my favorite example
the mask was 115.104.111.119 and further down there was 105.110.116.114 --
if you take these as decimal number and look up their ASCII values we get
"show" and "inte" -- I wrote a tool to scrape bits from these errors and
ended up with a large amount of the CLI help text.
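
(For illustration only -- a from-memory sketch in Python of that decoding
trick, not the actual scraper: each dotted-decimal octet in the corrupted
field is simply an ASCII byte.)

# Hypothetical sketch: interpret a dotted-quad field from a corrupted LSA log
# entry as ASCII, one character per decimal octet.
def quad_to_ascii(quad):
    return "".join(chr(int(octet)) for octet in quad.split("."))

# e.g. quad_to_ascii("115.104.111.119") returns "show"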




Many years ago I worked for a small Mom-and-Pop type ISP in New York state
(I was the only network / technical person there) -- it was a very free
wheeling place and I built the network by doing whatever made sense at the
time.

One of my "favorite" customers (Joe somebody) was somehow related to the
owner of the ISP and was a gamer. This was back in the day when the gaming
magazines would give you useful tips like "Type 'tracert $gameserver' and
make sure that there are less than N hops". Joe would call up tech
support, me, the owner, etc and complain that there were N+3 hops and most
of them were in our network. I spent much time explaining things about
packet-loss, latency, etc but couldn't shake his belief that hop count was
the only metric that mattered.

Finally, one night he called me at home well after midnight (no, I didn't
give him my home phone number, he looked me up in the phonebook!) to
complain that his gaming was suffering because it was "too many hops to get
out of your network". I finally snapped and built a static GRE tunnel from
the RAS box that he connected to all over the network -- it was a thing of
beauty, it went through almost every device that we owned and took the most
convoluted path I could come up with. "Yay!", I figured, "now I can
demonstrate that latency is more important than hop count" and I went to
bed.

The next morning I get a call from him. He is ecstatic and wildly impressed
by how well the network is working for him now and how great his gaming
performance is. "Oh well", I think, "at least he is happy and will leave me
alone now". I don't document the purpose of this GRE anywhere and after
some time forget about it.

A few months later I am doing some routine cleanup work and stumble across
a weird looking tunnel -- it's bizarre, it goes all over the place and is
all kinds of crufty -- there are static routes and policy routing and
bizarre things being done on the RADIUS server to make sure some user
always gets a certain IP... I look in my pile of notes and old configs and
then decide to just yank it out.

That night I get an enraged call (at home again) from Joe *screaming* that
the network is all broken again because it is now way too many hops to get
out of the network and that people keep shooting him...

*What I learnt from this:*
1: Make sure you document everything (and no, the network isn't
documentation)
2: Gamers are weird.
3: Making changes to your network in anger provides short term pleasure but
long term pain.



On Fri, Feb 19, 2021 at 1:10 PM Andrew Gallo <akg1330@gmail.com> wrote:

>
>
> On 2/16/2021 2:37 PM, John Kristoff wrote:
> > Friends,
> >
> > I'd like to start a thread about the most famous and widespread Internet
> > operational issues, outages or implementation incompatibilities you
> > have seen.
> >
> > Which examples would make up your top three?
>
>
> I don't believe I've seen this in any of the replies, but the AT&T
> cascading switch crashes of 1990 is a good one. This link even has some
> pseudocode
> https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse
>
>

--
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
-- E. W. Dijkstra
Re: Famous operational issues [ In reply to ]
----- On Feb 19, 2021, at 3:07 AM, Daniel Karrenberg dfk@ripe.net wrote:

Hi,

> Lessons: HW/SW mono-cultures are dangerous. Input testing is good
> practice at all levels software. Operational co-ordination is key in
> times of crisis.

Well... Here is a very similar, fairly recent one. Albeit in this case, the
opposite is true: running one software train would have prevented an outage.
Some members on this list (hi, Brian!) will recognize the story.

Group XX within $company decided to deploy EVPN. All of backbone was running
single $vendor, but different software trains. Turns out that between an
early draft, implemented in version X, and the RFC, implemented in version Y,
a change was made in NLRI formats which were not backwards compatible.

Version X was in use on virtually all DC egress boxes, version Y was in use
on route reflectors. The moment the first EVPN NLRI was advertised, the
entire backbone melted down. Dept-wide alert issued (at night), people trying
to log on to the VPN. Oh wait, the VPN requires yubikey, which requires the
corp network to access the interwebs, which is not accessible due to said
issue.

And, despite me complaining since the day of hire, no out of band network.

I didn't stay much longer after that.

Thanks,

Sabri
Re: Famous operational issues [ In reply to ]
On 16/02/2021 22:08, Jared Mauch wrote:
> I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen.

Enough time has (probably) elapsed since my escapades in a small data
centre in Manchester. The RFO was ten pages long, and I don't want to
spoil the ending, but ... I later discovered that Cumulus' then VP of
Engineering had elevated me to a veritable 'Hall of Infamy' for the
support ticket attached to that particular tale.

One day I'll be able to buy the guy that handled it a *lot* of whisky.
He deserved it.

--
Tom
Re: Famous operational issues [ In reply to ]
On Thu, Feb 18, 2021 at 07:34:39AM -0500, Patrick W. Gilmore wrote:
> In 1994, there was a major earthquake near the city of Los Angeles. City hall had to be evacuated and it would take over a year to reinforce the building to make it habitable again. My company moved all the systems in the basement of city hall to a new datacenter a mile or so away. After the install, we spent more than a week coaxing their ancient (even for 1994) machines back online, such as a Prime Computer and an AS400 with tons of DASD. Well, tons of cabinets, certainly less storage than my watch has now.
>
> I was in the DC going over something with the lady in charge when someone walked in to ask her something. She said “just a second”. That person took one step to the side of the door and leaned against the wall - right on an EPO which had no cover.
>
> Have you ever heard an entire row of DASD spin down instantly? Or taken 40 minutes to IPL an AS400? In the middle of the business day? For the second most populous city in the country?
>
> Me: Maybe you should get a cover for that?
> Her: Good idea.
>
> Couple weeks later, in the same DC, going over final checklist. A fedex guy walks in. (To this day, no idea how he got in a supposedly locked DC.) She says “just a second”, and I get a very strong deja vu feeling. He takes one step to the side and leans against the wall.
>
> Me: Did you order that EPO cover?
> Her: Nope.

some of the ibm 4300 series mini-mainframes came with a console terminal
that had a very large, raised (completely not flush), alternate power
button on the upper panel of the keyboard, facing the operator. in later
versions, the button was inset in a little open box with high sides. in
earlier versions, there was just a pair of raised ribs on either side of the
button. in the earliest version, if that panel needed to be replaced, the
replacement part didn't even have those protective ribs, this huge button
was just sitting there. on our 4341, someone had dropped the keyboard during
installation and the damaged panel was replaced with the
no-protection-whatsoever part.

i had an operator who, working a double shift into the overnight run,
fell asleep and managed to bang his head square on the button.
the overnight jobs running were left in various states of ruin.

third party manufacturers had an easy sell for lucite power/EPO button covers.

--
Henry Yen Aegis Information Systems, Inc.
Senior Systems Programmer Hicksville, New York
Re: Famous operational issues [ In reply to ]
From a datacenter ROI and economics, cooling, and HVAC perspective, that might
just be the best colo customer ever. As long as they're paying full price
for the cabinet and nothing is *dangerous* about how they've hung the 2U
server vertically, using up all that space for just one thing has to be a
lot better than a customer that makes full and efficient use of space and
all the amperage allotted to them.


On Thu, Feb 18, 2021 at 11:38 AM tim@pelican.org <tim@pelican.org> wrote:

> On Thursday, 18 February, 2021 16:23, "Seth Mattinen" <sethm@rollernet.us>
> said:
>
> > I had a customer that tried to stack their servers - no rails except the
> > bottom most one - using 2x4's between each server. Up until then I
> > hadn't imagined anyone would want to fill their cabinet with wood, so I
> > made a rule to ban wood and anything tangentially related (cardboard,
> > paper, plastic, etc.). Easier to just ban all things. Fire reasons too
> > but mainly I thought a cabinet full of wood was too stupid to allow.
>
> On the "stupid racking" front, I give you most of a rack dedicated to a
> single server. Not all that high a server, maybe 2U or so, but *way* too
> deep for the rack, so it had been installed vertically. By looping some
> fairly hefty chain through the handles on either side of the front of the
> chassis, and then bolting the four chain ends to the four rack posts. I
> wish I'd kept pictures of that one. Not flammable, but a serious WTF
> moment.
>
> Cheers,
> Tim.
>
>
>
Re: Famous operational issues [ In reply to ]
Not a famous operational issue, but in 2000, we had a major outage of
our dialup modem pool.

The owner of the building was re-skinning the outside using Styrofoam
and stucco. A bunch of the Styrofoam
had blocked the roof drains on the podium section of the building,
immediately above our equipment room.

A flash rainstorm filled the entire flat roof, and water came back in
over the flashings, and poured directly in
to our dialup modem pool through the hole in the concrete roof deck
where the drain pipe protruded through.

In retrospect, it was a monumentally stupid place to put our main
modem pool, but we didn't realize what was
above the drop ceiling - and that it was the roof, not the other 11
floors of the building.

1 bay of 6 shelves of USR TC 1000 HiperDSPs were now very wet and
blinking funny patterns on their LEDs.

Fortunately, our vendor in Toronto (4 hour drive away) had stock of
equipment that another customer kept
delaying shipment on. They got their staff in, started un-boxing
and slotting cards. We spent a few hours
tearing out the old gear and getting ready for replacements.

We left Windsor, Ontario at around 12:00am - same time they left
Toronto, heading towards us. We coordinated
a meet at one of the rural exits along Highway 401 at a closed gas
station at around 2am.

Everything was going so well until a cop pulled up, and asked us what
we were doing, as we were slinging
modem chassis between the back of the vendor's SUV and our van... We
calmly explained
what happened. He looked between us a couple of times, shook his
head and said "well, good luck with that",
got back in his car and drove away.

We had everything back online within 14 hours of the initial outage.

At 02:37 PM 16/02/2021, John Kristoff wrote:
>Friends,
>
>I'd like to start a thread about the most famous and widespread Internet
>operational issues, outages or implementation incompatibilities you
>have seen.
>
>Which examples would make up your top three?
>
>To get things started, I'd suggest the AS 7007 event is perhaps the
>most notorious and likely to top many lists including mine. So if
>that is one for you I'm asking for just two more.
>
>I'm particularly interested in this as the first step in developing a
>future NANOG session. I'd be particularly interested in any issues
>that also identify key individuals that might still be around and
>interested in participating in a retrospective. I already have someone
>that is willing to talk about AS 7007, which shouldn't be hard to guess
>who.
>
>Thanks in advance for your suggestions,
>
>John

--

Clayton Zekelman
Managed Network Systems Inc. (MNSi)
3363 Tecumseh Rd. E
Windsor, Ontario
N8W 1H4

tel. 519-985-8410
fax. 519-985-8409
Re: Famous operational issues [ In reply to ]
On Thu, Feb 18, 2021 at 07:34:39PM -0500, Patrick W. Gilmore wrote:
> And to put it on topic, cover your EPOs

I worked somewhere with an uncovered EPO, which was okay until we had a
telco tech in who was used to a different data center where a similar
looking button controlled the door access, so he reflexively hit it
on his way out to unlock the door. Oops.

Also, consider what's on generator and what's not. I worked in a corporate
data center where we lost power. The backup system kept all the machines
running, but the ventilation system was still down, so it was very warm very
fast as everyone went around trying to shut servers down gracefully while
other folks propped the doors open to get some cooler air in.

--r
Re: Famous operational issues [ In reply to ]
Oh,

I actually wanted to keep this for my memoirs, but if we are naming
dangerous datacenter operational issues … sometime in the 2000s:

Somebody ran their own datacenter, which
- once had an active ant colony living under the raised floor and in the
climate system,
- for a while had several electrical grounding defects, leading to the
work instruction of “don’t touch any metallic or conducting
materials”,
- for a minute, had a “look what we have bought on Ebay” UPS
system, until it started to roast after being turned on,
- from time to time had climate issues, with room temperature peaks
around 68 centigrade, and yes, some equipment
survived and even continued to work.

Decided not to go back there, after “look what we have bought on Ebay,
an argon fire extinguisher, we just need to mount it”.

On 20 Feb 2021, at 10:15, Eric Kuhnke wrote:

> From a datacenter ROI and economics, cooling, HVAC perspective that
> might
> just be the best colo customer ever. As long as they're paying full
> price
> for the cabinet and nothing is *dangerous* about how they've hung the
> 2U
> server vertically, using up all that space for just one thing has to
> be a
> lot better than a customer that makes full and efficient use of space
> and
> all the amperage allotted to them.
>
>
Re: Famous operational issues [ In reply to ]
I’m embarrassed to say, I’ve done this.

Ms. Lady Benjamin PD Cannon, ASCE
6x7 Networks & 6x7 Telecom, LLC
CEO
ben@6by7.net
"The only fully end-to-end encrypted global telecommunications company in the world.”

FCC License KJ6FJJ

Sent from my iPhone via RFC1149.

> On Feb 19, 2021, at 12:55 AM, Wolfgang Tremmel <wolfgang.tremmel@de-cix.net> wrote:
>
> Do you remember the Cisco HDCI connectors?
> https://en.wikipedia.org/wiki/HDCI
>
> I once shipped a Cisco 4500 plus some cables to a remote data center and asked the local guys to cable them for me.
> With Cisco you could check the cable type and if they were properly attached. They were not.
>
> I asked for a check and the local guy confirmed me three times that the cables were properly plugged.
> At the end I gave up, and took the 3 hour drive to the datacenter to check myself.
>
> Problem was that, while the casing of the connector is asymmetrical, the pins inside are symmetrical.
> And the local guy was quite strong.
>
> Yes, he managed to plug in the cables 180° flipped, bending the case, but he got them in.
> He was quite embarrassed when I fixed the cabling problem in 10 seconds.
>
> That must have been 1995 or so....
>
> Wolfgang
>
>
>
>> On 16. Feb 2021, at 20:37, John Kristoff <jtk@dataplane.org> wrote:
>>
>> Which examples would make up your top three?
>
> --
> Wolfgang Tremmel
>
> Phone +49 69 1730902 0 | wolfgang.tremmel@de-cix.net
> Executive Directors: Harald A. Summa and Sebastian Seifert | Trade Registry: AG Cologne, HRB 51135
> DE-CIX Management GmbH | Lindleystrasse 12 | 60314 Frankfurt am Main | Germany | www.de-cix.net
>
Re: Famous operational issues [ In reply to ]
> On Feb 18, 2021, at 9:04 PM, Jen Linkova <furry13@gmail.com> wrote:
>
> On Fri, Feb 19, 2021 at 9:40 AM Warren Kumari <warren@kumari.net> wrote:
>> 4: Not too long after I started doing networking (and for the same small ISP in Yonkers), I'm flying off to install a new customer. I (of course) think that I'm hot stuff because I'm going to do the install, configure the router, whee, look at me! Anyway, I don't want to check a bag, and so I stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all pre-9/11!). I'm going through security and the TSA[0] person opens my bag and pulls the router out. "What's this?!" he asks. I politely tell him that it's a router. He says it's not. I'm still thinking that I'm the new hotness, and so I tell him in a somewhat condescending way that it is, and I know what I'm talking about. He tells me that it's not a router, and is starting to get annoyed. I explain using my "talking to a 5 year old" voice that it most certainly is a router. He tells me that lying to airport security is a federal offense, and starts looming at me. I adjust my attitude and start explaining that it's like a computer and makes the Internet work. He gruffly hands me back the router, I put it in my bag and scurry away. As I do so, I hear him telling his colleague that it wasn't a router, and that he certainly knows what a router is, because he does woodwork...
>
> OK, Warren, achievement unlocked. You've just made a network engineer
> google 'router'....
>
> P.S. I guess I'm obliged to tell a story if I respond to this thread...so...
> "Servers and the ice cream factory".
> Late spring/early summer in Moscow. The temperature above 30C (86°F).
> I worked for a local content provider.
> Aircons in our server room died, the technician ETA was 2 days ( I
> guess we were not the only ones with aircon problems).
> So we drove to the nearby ice cream factory and got *a lot* of dry
> ice. Then we had a roster: every few hours one person took a deep
> breath, grabbed a box of dry ice, ran into the server room and emptied
> the box on top of the racks. The backup person was watching through
> the glass door - just in case, you know, ready to start the rescue
> operation.
> We (and the servers) survived till the technician arrived. And we had
> a lot of dry ice to cool the beer..
>
> --
> SY, Jen Linkova aka Furry

During a wood-working project for the Southern California Linux Expo (the tech team that
(among other things) runs the network for the show was building new equipment carts), I
came up with the following meme:



[I don’t know if NANOG will pass the image despite its small size, so textual description:
A bandaged hand with the index finger amputated at the second knuckle with overlaid red
text stating “Careless Routing May Lead to Urgent Test of Self Healing Network”]

Fortunately, we didn’t have any such issues with the router, though we did have one person
suffer a crushed toe from a cabinet tip-over. Fortunately, the person made a full recovery.

Owen
Re: Famous operational issues [ In reply to ]
On Thursday, 18 February, 2021 22:37, "Warren Kumari" <warren@kumari.net> said:

> 4: Not too long after I started doing networking (and for the same small
> ISP in Yonkers), I'm flying off to install a new customer. I (of course)
> think that I'm hot stuff because I'm going to do the install, configure the
> router, whee, look at me! Anyway, I don't want to check a bag, and so I
> stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all
> pre-9/11!). I'm going through security and the TSA[0] person opens my bag
> and pulls the router out. "What's this?!" he asks. I politely tell him that
> it's a router. He says it's not. I'm still thinking that I'm the new
> hotness, and so I tell him in a somewhat condescending way that it is, and
> I know what I'm talking about. He tells me that it's not a router, and is
> starting to get annoyed. I explain using my "talking to a 5 year old" voice
> that it most certainly is a router. He tells me that lying to airport
> security is a federal offense, and starts looming at me. I adjust my
> attitude and start explaining that it's like a computer and makes the
> Internet work. He gruffly hands me back the router, I put it in my bag and
> scurry away. As I do so, I hear him telling his colleague that it wasn't a
> router, and that he certainly knows what a router is, because he does
> woodwork...

Here in the UK we avoid that issue by pronouncing the packet-shifter as "rooter", and only the wood-working tool as "rowter" :)

Of course, it raises a different set of problems when talking to the Australians...

Cheers,
Tim.
Re: Famous operational issues [ In reply to ]
    Well...

    During my younger days, that button was used a few times by the
operator of a VM/370 to regain control from someone with a "curious
mind" *cough* *cough*...

-----
Alain Hebert ahebert@pubnix.net
PubNIX Inc.
50 boul. St-Charles
P.O. Box 26770 Beaconsfield, Quebec H9W 6G7
Tel: 514-990-5911 http://www.pubnix.net Fax: 514-990-9443

On 2/20/21 4:07 AM, Henry Yen wrote:
> On Thu, Feb 18, 2021 at 07:34:39AM -0500, Patrick W. Gilmore wrote:
>> In 1994, there was a major earthquake near the city of Los Angeles. City hall had to be evacuated and it would take over a year to reinforce the building to make it habitable again. My company moved all the systems in the basement of city hall to a new datacenter a mile or so away. After the install, we spent more than a week coaxing their ancient (even for 1994) machines back online, such as a Prime Computer and an AS400 with tons of DASD. Well, tons of cabinets, certainly less storage than my watch has now.
>>
>> I was in the DC going over something with the lady in charge when someone walked in to ask her something. She said “just a second”. That person took one step to the side of the door and leaned against the wall - right on an EPO which had no cover.
>>
>> Have you ever heard an entire row of DASD spin down instantly? Or taken 40 minutes to IPL an AS400? In the middle of the business day? For the second most populous city in the country?
>>
>> Me: Maybe you should get a cover for that?
>> Her: Good idea.
>>
>> Couple weeks later, in the same DC, going over final checklist. A fedex guy walks in. (To this day, no idea how he got in a supposedly locked DC.) She says “just a second”, and I get a very strong deja vu feeling. He takes one step to the side and leans against the wall.
>>
>> Me: Did you order that EPO cover?
>> Her: Nope.
> some of the ibm 4300 series mini-mainframes came with a console terminal
> that had a very large, raised (completely not flush), alternate power
> button on the upper panel of the keyboard, facing the operator. in later
> versions, the button was inset in a little open box with high sides. in
> earlier versions, there was just a pair of raised ribs on either side of the
> button. in the earliest version, if that panel needed to be replaced, the
> replacement part didn't even have those protective ribs, this huge button
> was just sitting there. on our 4341, someone had dropped the keyboard during
> installation and the damaged panel was replaced with the
> no-protection-whatsoever part.
>
> i had an operator who, working a double shift into the overnight run,
> fell asleep and managed to bang his head square on the button.
> the overnight jobs running were left in various states of ruin.
>
> third party manufacturers had an easy sell for lucite power/EPO button covers.
>
> --
> Henry Yen Aegis Information Systems, Inc.
> Senior Systems Programmer Hicksville, New York
Re: Famous operational issues [ In reply to ]
On 2/22/21 9:14 AM, Alain Hebert wrote:
>
>     Well...
>
>     During my younger days, that button was used a few time by the
> operator of a VM/370 to regain control from someone with a "curious
> mind" *cought* *cought*...
>
Two horror stories I remember from long ago when I was a console jockey
for a federal space agency that will remain nameless :P

1. A coworker brought her daughter to work with her on a Saturday
overtime shift because she couldn't get a babysitter. She parked the kid
with a coloring book and a pile of crayons at the only table in the
console room with some space, right next to the master console for our
3081. I asked her to make sure she was well away from the console, and as
she reached over to scoot the girl and her coloring books further away
she slipped, and reached out to steady herself. Yep, planted her finger
right down on the IML button (plexi covers? We don' need no STEENKIN'
plexi covers!). MVS and VM vanished, two dozen tape drives rewound and
several hours' worth of data merge jobs went blooey.

2. The 3081 was water cooled via a heat exchanger. The building chilled
water feed had a very old, very clogged filter that was bypassed until
it could be replaced. One day a new maintenance foreman came through the
building doing his "clipboard and harried expression" thing, and spotted
the filter in bypass (NO, I don't know WHY it hadn't been red-tagged.
Someone clearly dropped that ball.) He thought, "Well that's not right"
and reset all the valves to put it back inline, which of course, pretty
much killed the chilled water flow through the heat exchanger. First
thing we knew about it in Operations was when the 3081 started throwing
thermal alarms and MVS crashed hard. IBM had to replace several modules
in the CPUs.

--
--------------------------------------------
Bruce H. McIntosh
Network Engineer II
University of Florida Information Technology
bhm@ufl.edu
352-273-1066
Re: Famous operational issues [ In reply to ]
Patrick W. Gilmore <patrick@ianai.net> wrote:
>
> Me: Did you order that EPO cover?
> Her: Nope.

There are apparently two kinds of EPO cover:

- the kind that stops you from pressing the button by mistake;

- and the kind that doesn't, and instead locks the button down to make
sure it isn't un-pressed until everything is safe.

We had a series of incidents similar to yours, so an EPO cover was
belatedly installed. We learned about the second kind of EPO cover when a
colleague proudly demonstrated that the EPO button should no longer be
pressed by accident, or so he thought.

Tony.
--
f.anthony.n.finch <dot@dotat.at> http://dotat.at/
the quest for freedom and justice can never end
Re: Famous operational issues [ In reply to ]
On Mon, Feb 22, 2021 at 7:09 AM tim@pelican.org <tim@pelican.org> wrote:

> On Thursday, 18 February, 2021 22:37, "Warren Kumari" <warren@kumari.net>
> said:
>
> > 4: Not too long after I started doing networking (and for the same small
> > ISP in Yonkers), I'm flying off to install a new customer. I (of course)
> > think that I'm hot stuff because I'm going to do the install, configure
> the
> > router, whee, look at me! Anyway, I don't want to check a bag, and so I
> > stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was
> all
> > pre-9/11!). I'm going through security and the TSA[0] person opens my bag
> > and pulls the router out. "What's this?!" he asks. I politely tell him
> that
> > it's a router. He says it's not. I'm still thinking that I'm the new
> > hotness, and so I tell him in a somewhat condescending way that it is,
> and
> > I know what I'm talking about. He tells me that it's not a router, and is
> > starting to get annoyed. I explain using my "talking to a 5 year old"
> voice
> > that it most certainly is a router. He tells me that lying to airport
> > security is a federal offense, and starts looming at me. I adjust my
> > attitude and start explaining that it's like a computer and makes the
> > Internet work. He gruffly hands me back the router, I put it in my bag
> and
> > scurry away. As I do so, I hear him telling his colleague that it wasn't
> a
> > router, and that he certainly knows what a router is, because he does
> > woodwork...
>
> Here in the UK we avoid that issue by pronouncing the packet-shifter as
> "rooter", and only the wood-working tool as "rowter" :)
>
> Of course, it raises a different set of problems when talking to the
> Australians...
>

Yes. I discovered this while walking around Sydney wearing my "I have root
@ Google" t-shirt.... got some odd looks/snickers...

W




>
> Cheers,
> Tim.
>
>
>

--
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
-- E. W. Dijkstra
Re: Famous operational issues [ In reply to ]
Long ago, in a galaxy far away I worked for a gov't contractor on site
at a gov't site...

We had our own cute little datacenter, and our 4 building complex had
a central power distribution setup from utility -> buildings.
It was really quite nice :) (the job, the buildings, the power and
cute little datacenter)

One fine Tues afternoon ~2pm local time, the building engineers
decided they would make a copy of the key used to turn the main /
utility power off...
Of course they also needed to make sure their copy worked, so... they
put the key in and turned it.

Shockingly, the key worked! and no power was provided to the buildings :(
It was very suddenly very dark and very quiet... (then the yelling started)

Ok, fast forward 7 days... rerun the movie... Yes, the same building
engineers made a new copy, and .. tested that new copy in the same
manner.

For neither of these events did someone tell the rest of us (and our
customers): "Hey, we MAY interrupt power to the buildings... FYI, BTW,
make sure your backups are current..." I recall we got the name of the
engineer the 1st time around, but not the second.

On Mon, Feb 22, 2021 at 12:26 PM Tony Finch <dot@dotat.at> wrote:
>
> Patrick W. Gilmore <patrick@ianai.net> wrote:
> >
> > Me: Did you order that EPO cover?
> > Her: Nope.
>
> There are apparently two kinds of EPO cover:
>
> - the kind that stops you from pressing the button by mistake;
>
> - and the kind that doesn't, and instead locks the button down to make
> sure it isn't un-pressed until everything is safe.
>
> We had a series of incidents similar to yours, so an EPO cover was
> belatedly installed. We learned about the second kind of EPO cover when a
> colleague proudly demonstrated that the EPO button should no longer be
> pressed by accident, or so he thought.
>
> Tony.
> --
> f.anthony.n.finch <dot@dotat.at> http://dotat.at/
> the quest for freedom and justice can never end
Re: Famous operational issues [ In reply to ]
On Mon, Feb 22, 2021 at 12:50 PM Regis M. Donovan <regis-nanog@offhand.org>
wrote:

> On Thu, Feb 18, 2021 at 07:34:39PM -0500, Patrick W. Gilmore wrote:
> > And to put it on topic, cover your EPOs
>
> I worked somewhere with an uncovered EPO, which was okay until we had a
> telco tech in who was used to a different data center where a similar
> looking button controlled the door access, so he reflexively hit it
> on his way out to unlock the door. Oops.
>
> Also, consider what's on generator and what's not. I worked in a corporate
> data center where we lost power. The backup system kept all the machines
> running, but the ventilation system was still down, so it was very warm
> very
> fast as everyone went around trying to shut servers down gracefully while
> other folks propped the doors open to get some cooler air in.
>

That reminds me of another one...

In parts of NYC, there are noise abatement requirements, and so many places
have their generators mounted on the roof -- it's cheap real-estate, the
exhaust is easier, the noise issues are less, etc.

The generators usually have a smallish diesel tank, and then a much larger
one in the basement (diesel is heavy)...

So, one of the buildings that I was in was really good about testing their
gensets - they'd do weekly tests (usually at night), and the generators
always worked perfectly -- right up until the time that it was actually
needed.
The generator fired up, the lights kept blinking, the disks kept spinning -
but the transfer pump that pumped diesel from the basement to the roof was
one of the few things that was not on the generator....

W



>
> --r
>
>

--
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
-- E. W. Dijkstra
Re: Famous operational issues [ In reply to ]
On Fri, 19 Feb 2021, Andy Ringsmuth wrote:

> > I explain using my "talking to a 5 year old" voice that it
> > most certainly is a router. He tells me that lying to airport security
> > is a federal offense, and starts looming at me. I adjust my attitude
> > and start explaining that it's like a computer and makes the Internet
> > work. He gruffly hands me back the router, I put it in my bag and
> > scurry away. As I do so, I hear him telling his colleague that it
> > wasn't a router, and that he certainly knows what a router is, because
> > he does woodwork…
>
> Well, in his defense, he wasn’t wrong… :-)

This is why, in the UK, we tend to pronounce "router" as "router", and
"router" as "router", so there's no confusion.

You're welcome.

Jethro.

. . . . . . . . . . . . . . . . . . . . . . . . .
Jethro R Binks, Network Manager,
Information Services Directorate, University Of Strathclyde, Glasgow, UK

The University of Strathclyde is a charitable body, registered in
Scotland, number SC015263.
Re: Famous operational issues [ In reply to ]
On Mon, Feb 22, 2021 at 2:05 PM Warren Kumari <warren@kumari.net> wrote:

>
>
> On Mon, Feb 22, 2021 at 12:50 PM Regis M. Donovan <regis-nanog@offhand.org>
> wrote:
>
>> On Thu, Feb 18, 2021 at 07:34:39PM -0500, Patrick W. Gilmore wrote:
>> > And to put it on topic, cover your EPOs
>>
>> I worked somewhere with an uncovered EPO, which was okay until we had a
>> telco tech in who was used to a different data center where a similar
>> looking button controlled the door access, so he reflexively hit it
>> on his way out to unlock the door. Oops.
>>
>> Also, consider what's on generator and what's not. I worked in a
>> corporate
>> data center where we lost power. The backup system kept all the machines
>> running, but the ventilation system was still down, so it was very warm
>> very
>> fast as everyone went around trying to shut servers down gracefully while
>> other folks propped the doors open to get some cooler air in.
>>
>
> That reminds me of another one...
>
> In parts of NYC, there are noise abatement requirements, and so many
> places have their generators mounted on the roof -- it's cheap real-estate,
> the exhaust is easier, the noise issues are less, etc.
>
> The generators usually have a smallish diesel tank, and then a much larger
> one in the basement (diesel is heavy)...
>
> So, one of the buildings that I was in was really good about testing thier
> gensets - they'd do weekly tests (usually at night), and the generators
> always worked perfectly -- right up until the time that it was actually
> needed.
> The generator fired up, the lights kept blinking, the disks kept spinning
> - but the transfer pump that pumped diesel from the basement to the roof
> was one of the few things that was not on the generator....
>
>
When we were looking at one of the big carrier hotels in NYC they said
that they had the same issue (it could have been the same one). The elevators
were also out. They resorted to having techs climb up and down 9
flights of stairs all day long with 5 gallon buckets of diesel and pouring
it into the generator.

>
RE: Famous operational issues [ In reply to ]
Many years ago I experienced a very similar thing. The DC/Integrator I worked for outsourced the co-location and operation of mainframe services for several banks and government organisations. One of these banks had a significant investment in AS/400's, and they decided that it was so much hassle and expense using our datacentres that they would start putting those nice small AS/400's in computer rooms in their office buildings instead. One particular computer room contained large line printers that the developers would use to print out whatever it is such people print out. One Saturday morning I received a frantic call from the customer to say that all their primary production AS/400's had gone offline. After a short investigation I realised that all the offline devices were in this particular computer room. It turns out that one of the developers had brought his six-year-old son to work that Saturday, and upon retrieval of a printout said son had dutifully followed dad into the computer room and was unable to resist the big red button sitting exposed on the wall by the door. Shortly thereafter the embarrassed customer decided that perhaps it was worth relocating their AS/400's to our expensive datacentres.



>
> During my younger days, that button was used a few time by the
> operator of a VM/370 to regain control from someone with a "curious
> mind" *cought* *cought*...
>
Two horror stories I remember from long ago when I was a console jockey for a federal space agency that will remain nameless :P

1. A coworker brought her daughter to work with her on a Saturday overtime shift because she couldn't get a babysitter. She parked the kid with a coloring book and a pile of crayons at the only table in the console room with some space, right next to the master console for our 3081. I asked her to make sure sh was well away from the console, and as she reached over to scoot the girl and her coloring books further away she slipped, and reached out to steady herself. Yep, planted her finger right down on the IML button (plexi covers? We don' need no STEENKIN'
plexi covers!). MVS and VM vanished, two dozen tape drives rewound and several hours' worth of data merge jobs went blooey.
Re: Famous operational issues [ In reply to ]
On Feb 22, 2021, at 7:02 AM, tim@pelican.org wrote:
> On Thursday, 18 February, 2021 22:37, "Warren Kumari" <warren@kumari.net> said:
>
>> 4: Not too long after I started doing networking (and for the same small
>> ISP in Yonkers), I'm flying off to install a new customer. I (of course)
>> think that I'm hot stuff because I'm going to do the install, configure the
>> router, whee, look at me! Anyway, I don't want to check a bag, and so I
>> stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all
>> pre-9/11!). I'm going through security and the TSA[0] person opens my bag
>> and pulls the router out. "What's this?!" he asks. I politely tell him that
>> it's a router. He says it's not. I'm still thinking that I'm the new
>> hotness, and so I tell him in a somewhat condescending way that it is, and
>> I know what I'm talking about. He tells me that it's not a router, and is
>> starting to get annoyed. I explain using my "talking to a 5 year old" voice
>> that it most certainly is a router. He tells me that lying to airport
>> security is a federal offense, and starts looming at me. I adjust my
>> attitude and start explaining that it's like a computer and makes the
>> Internet work. He gruffly hands me back the router, I put it in my bag and
>> scurry away. As I do so, I hear him telling his colleague that it wasn't a
>> router, and that he certainly knows what a router is, because he does
>> woodwork...
>
> Here in the UK we avoid that issue by pronouncing the packet-shifter as "rooter", and only the wood-working tool as "rowter" :)

So wrong.

A “root” server is part of the DNS. A “route” server is part of BGP.


> Of course, it raises a different set of problems when talking to the Australians…

Everything is weird down under. But I still like them. :-)

--
TTFN,
patrick
Re: Famous operational issues [ In reply to ]
At Boston Univ we discovered the hard way that a security guard's
walkie-talkie could cause a $5,000 (or $10K for the big machine room)
Halon dump.

It took a couple of times before we figured out the connection, though once
someone made it to the hold button before it actually dumped.

Speaking of halon one very hot day I'm goofing off drinking coffee at
a nearby sub shop when the owner tells me someone from the computing
center was on the phone, that never happened before.

Some poor operator was holding the halon shot, which is a deadman's switch
(well, button), and the building was doing its 110dB thing; could I come
help? The building was being evac'd.

So my boss who wasn't the sharpest knife in the drawer follows me down
as I enter and I'm sweating like a pig with a floor panel sucker
trying to figure out which zone tripped.

And he shouts at me over the alarms: WHY TF DOES IT DO THIS?! Angrily.

I answered: well, maybe THERE'S A FIRE!!!

At which point I notice the back of my shoulder is really bothering
me, which I say to him, and he says hmmm there's a big bee on your
back maybe he's stinging you?

Fun day.

--
-Barry Shein

Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com
Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD
The World: Since 1989 | A Public Information Utility | *oo*
Re: Famous operational issues [ In reply to ]
Let me tell you about my personal favorite.

It’s 2002 and I am working as an engineer for an electronic stock trading platform (ECN), this platform happened to be the biggest platform for trading stocks electronically, on some days bigger than NASDAQ itself. This platform also happened to be run on DOS, FoxPro and a Novell file share, on a cluster of roughly 1,000 computers, two of which were the “engine” that matched all of the trades.

Well, FoxPro has this “feature” where the ESC key halts the running program. We had the ability to remote control these DOS/FoxPro machines via some program we had written. Someone asked me to check the status of the process running on the primary matching engine, and when I was done, out of habit, I hit ESC. Trade processing ground to a halt (phone calls had to be made to the SEC). I immediately called the NOC and told them it was me. Next thing I know, someone from the NOC is at my desk with a screwdriver prying the ESC key off my keyboard. I remained ESC-keyless for the next several years until I left the company. I was hazed pretty good over it, but was essentially given a one-time pass.
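(In hindsight, and if memory serves, FoxPro had a SET ESCAPE OFF setting that would have disabled exactly that behavior, which we clearly weren't using. Purely as an illustration, and nothing like our actual platform code, here is the same defensive idea sketched in Python: mask keyboard interrupts around the critical work so a stray keypress can't halt the job.)

    import signal
    import time

    def process_pending_trades():
        # Hypothetical stand-in for the real matching work.
        time.sleep(1)

    def run_critical_section():
        # Ignore SIGINT (Ctrl-C) while the critical work runs, so an
        # accidental keypress cannot halt the process, then restore
        # the previous handler afterwards.
        previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
        try:
            process_pending_trades()
        finally:
            signal.signal(signal.SIGINT, previous)

    if __name__ == "__main__":
        run_critical_section()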



> On Feb 22, 2021, at 7:30 PM, bzs@theworld.com wrote:
>
>
> At Boston Univ we discovered the hard way that a security guard's
> walkie-talkie could cause a $5,000 (or $10K for the big machine room)
> Halon dump.
>
> Took a couple of times before we figured out the connection tho once
> someone made it to the hold button before it actually dumped.
>
> Speaking of halon one very hot day I'm goofing off drinking coffee at
> a nearby sub shop when the owner tells me someone from the computing
> center was on the phone, that never happened before.
>
> Some poor operator was holding the halon shot, it's a deadman's switch
> (well, button) and the building was doing its 110db thing could I come
> help? The building is being evac'd.
>
> So my boss who wasn't the sharpest knife in the drawer follows me down
> as I enter and I'm sweating like a pig with a floor panel sucker
> trying to figure out which zone tripped.
>
> And he shouts at me over the alarms: WHY TF DOES IT DO THIS?! Angrily.
>
> I answered: well, maybe THERE'S A FIRE!!!
>
> At which point I notice the back of my shoulder is really bothering
> me, which I say to him, and he says hmmm there's a big bee on your
> back maybe he's stinging you?
>
> Fun day.
>
> --
> -Barry Shein
>
> Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com
> Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD
> The World: Since 1989 | A Public Information Utility | *oo*
Re: Famous operational issues [ In reply to ]
On Mon, Feb 22, 2021 at 7:31 PM <bzs@theworld.com> wrote:

>
> At Boston Univ we discovered the hard way that a security guard's
> walkie-talkie could cause a $5,000 (or $10K for the big machine room)
> Halon dump.
>

At one of the AOL datacenters there was some convoluted fire marshal reason
why a specific door could not be locked "during business hours" (?!), and
so there was a guard permanently stationed outside. The door was all the
way around the back of the building, and so basically never used - and so
the guard would fall asleep outside it with a piece of cardboard saying
"Please wake me before entering". He was a nice guy (and it was less faff
than the main entrance), and so we'd either sneak in and just not tell
anyone, or talk loudly while going round the corner so he could pretend to
have been awake the whole time...

W




>
> Took a couple of times before we figured out the connection tho once
> someone made it to the hold button before it actually dumped.
>
> Speaking of halon one very hot day I'm goofing off drinking coffee at
> a nearby sub shop when the owner tells me someone from the computing
> center was on the phone, that never happened before.
>
> Some poor operator was holding the halon shot, it's a deadman's switch
> (well, button) and the building was doing its 110db thing could I come
> help? The building is being evac'd.
>
> So my boss who wasn't the sharpest knife in the drawer follows me down
> as I enter and I'm sweating like a pig with a floor panel sucker
> trying to figure out which zone tripped.
>
> And he shouts at me over the alarms: WHY TF DOES IT DO THIS?! Angrily.
>
> I answered: well, maybe THERE'S A FIRE!!!
>
> At which point I notice the back of my shoulder is really bothering
> me, which I say to him, and he says hmmm there's a big bee on your
> back maybe he's stinging you?
>
> Fun day.
>
> --
> -Barry Shein
>
> Software Tool & Die | bzs@TheWorld.com |
> http://www.TheWorld.com
> Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD
> The World: Since 1989 | A Public Information Utility | *oo*
>


--
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
-- E. W. Dijkstra
Re: Famous operational issues [ In reply to ]
On Thu, Feb 18, 2021 at 5:38 PM Warren Kumari <warren@kumari.net> wrote:

>
> 2: A somewhat similar thing would happen with the Ascend TNT Max, which
> had side-to-side airflow. These were dial termination boxes, and so people
> would install racks and racks of them. The first one would draw in cool air
> on the left, heat it up and ship it out the right. The next one over would
> draw in warm air on the left, heat it up further, and ship it out the
> right... Somewhere there is a fairly famous photo of a rack of TNT Maxes,
> with the final one literally on fire, and still passing packets.
>

We had several racks of TNTs at the peak of our dial POP phase, and I
believe we ended up designing baffles for the sides of those racks to pull
in cool air from the front of the rack to the left side of the chassis and
exhaust it out the back from the right side. It wasn't perfect, but it did
the job.

The TNTs with channelized T3 interfaces were a great way to terminate lots
of modems in a reasonable amount of rack space with minimal cabling.

Thank you
jms
Re: Famous operational issues [ In reply to ]
Beyond the widespread outages, I have so many personal war stories that
it's hard to pick a favorite.

My first job out of college in the mid-late 90s was at an ISP in Pittsburgh
that I joined pretty early in its existence, and everyone did a bit of
everything. I was hired to do sysadmin stuff, networking, pretty much
whatever was needed. About a year after I started, we brought up a new mail
system with an external RAID enclosure for the mail store itself. One day,
we saw indications that one of the disks in the RAID enclosure was starting
to fail, so I scheduled a maintenance window to replace the disk and let
the controller rebuild the data and integrate it back into the RAID set.
No big worries, right?

It's Tuesday at about 2 AM.

Well, the kernel on the RAID controller itself decided that the moment I
pulled the failing drive would be a fine time to panic, and more or less turn
itself into a bit-blender, and take all the mailstore down with it. After
a few hours of watching fsck make no progress on anything, in terms of
trying to un-fsck the mailstore, we made the decision in consultation with
the CEO to pull the plug on trying to bring the old RAID enclosure back to
life, and focus on finding suitable replacement hardware and rebuild from
scratch. We also discovered that the most recent backups of the mailstore
were over a month old :(

I think our CEO ended up driving several hours to procure a suitable
enclosure. By the time we got the enclosure installed, filesystems built,
and got whatever tape backups we had restored, and tested the integrity of
the system, it was now Thursday around 8 AM. Coincidentally, that was the
same day the company hosted a big VIP gathering (the mayor was there, along
with lots of investors and other bigwigs), so I had to come back and put on
a suit to hobnob with the VIPs after getting a total of 6 hours of sleep in
about the previous 3 days. I still don't know how I got home that night
without wrapping my vehicle around a utility pole (due to being over-tired,
not due to alcohol).

Many painful lessons learned over that stretch of days, as often the case
as a company grows from startup mode and builds more robust technology and
business processes as a consequence of growth.

jms

On Tue, Feb 16, 2021 at 2:37 PM John Kristoff <jtk@dataplane.org> wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps the
> most notorious and likely to top many lists including mine. So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session. I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective. I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>
Re: Famous operational issues [ In reply to ]
An interesting sub-thread to this could be:

Have you ever unintentionally crashed a device by running a perfectly
innocuous command?
1. Crashed a 6500/Sup2 by typing "show ip dhcp binding".
2. "clear interface XXX" on a Nexus 7K triggered a cascading/undocument
Sev1 bug that caused two linecards to crash and reload, and take down about
two dozen buildings on campus at the .edu where I used to work.
3. For those that ever had the misfortune of using early versions of the
"bcc" command shell* on Bay Networks routers, which was intended to make
the CLI make look and feel more like a Cisco router, you have my
condolences. One would reasonably expect "delete ?" to respond with a list
of valid arguments for that command. Instead, it deleted, well...
everything, and prompted an on-site restore/reboot.

BCC originally stood for "Bay Command Console", but we joked that it really
stood for "Blatant Cisco Clone".

On Tue, Feb 16, 2021 at 2:37 PM John Kristoff <jtk@dataplane.org> wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps the
> most notorious and likely to top many lists including mine. So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session. I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective. I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>
Re: Famous operational issues [ In reply to ]
That brings back memories....I had a similar experience. First month on the job, a large Sun RAID array storing ~5k mailboxes dies in the middle of the afternoon. So, I start troubleshooting and determine it's most likely a bad disk. The CEO walked into the server room right about the time I had 20 disks laid out on a table. He had a fit and called the desktop support guy to come and 'show me how to fix a pc'.

Never mind the fact that we had a 90% ready-to-go replacement box sitting at another site, and just needed to either go get it, or bring the disks to it..... So we sat there until the desktop guy, who was 30 minutes away, got there. He took one look at it and said 'never touched that thing before, looks like he knows what he's doing' and pointed to me. 4 hours later we were driving the new server to the data center strapped down in the back of a pickup. Fun times.


-----Original Message-----
From: "Justin Streiner" <streinerj@gmail.com>
Sent: Tuesday, February 23, 2021 5:11pm
To: "John Kristoff" <jtk@dataplane.org>
Cc: "NANOG" <nanog@nanog.org>
Subject: Re: Famous operational issues



Beyond the widespread outages, I have so many personal war stories that it's hard to pick a favorite.
My first job out of college in the mid-late 90s was at an ISP in Pittsburgh that I joined pretty early in its existence, and everyone did a bit of everything. I was hired to do sysadmin stuff, networking, pretty much whatever was needed. About a year after I started, we brought up a new mail system with an external RAID enclosure for the mail store itself. One day, we saw indications that one of the disks in the RAID enclosure was starting to fail, so I scheduled a maintenance window to replace the disk and let the controller rebuild the data and integrate it back into the RAID set. No big worries, right?
It's Tuesday at about 2 AM.
Well, the kernel on the RAID controller itself decided that when I pulled the failing drive would be a fine time to panic, and more or less turn itself into a bit-blender, and take all the mailstore down with it. After a few hours of watching fsck make no progress on anything, in terms of trying to un-fsck the mailstore, we made the decision in consultation with the CEO to pull the plug on trying to bring the old RAID enclosure back to life, and focus on finding suitable replacement hardware and rebuild from scratch. We also discovered that the most recent backups of the mailstore were over a month old :(
I think our CEO ended up driving several hours to procure a suitable enclosure. By the time we got the enclosure installed, filesystems built, and got whatever tape backups we had restored, and tested the integrity of the system, it was now Thursday around 8 AM. Coincidentally, that was the same day the company hosted a big VIP gathering (the mayor was there, along with lots of investors and other bigwigs), so I had to come back and put on a suit to hobnob with the VIPs after getting a total of 6 hours of sleep in about the previous 3 days. I still don't know how I got home that night without wrapping my vehicle around a utility pole (due to being over-tired, not due to alcohol).
Many painful lessons learned over that stretch of days, as often the case as a company grows from startup mode and builds more robust technology and business processes as a consequence of growth.
jms


On Tue, Feb 16, 2021 at 2:37 PM John Kristoff <[ jtk@dataplane.org ]( mailto:jtk@dataplane.org )> wrote:Friends,

I'd like to start a thread about the most famous and widespread Internet
operational issues, outages or implementation incompatibilities you
have seen.

Which examples would make up your top three?

To get things started, I'd suggest the AS 7007 event is perhaps the
most notorious and likely to top many lists including mine. So if
that is one for you I'm asking for just two more.

I'm particularly interested in this as the first step in developing a
future NANOG session. I'd be particularly interested in any issues
that also identify key individuals that might still be around and
interested in participating in a retrospective. I already have someone
that is willing to talk about AS 7007, which shouldn't be hard to guess
who.

Thanks in advance for your suggestions,

John
Re: Famous operational issues [ In reply to ]
I would be more interested in seeing someone who HASN'T crashed a Cisco
6500/7600, particularly one with a long uptime, by typing in a supposedly
harmless 'show' command.


On Tue, Feb 23, 2021 at 2:26 PM Justin Streiner <streinerj@gmail.com> wrote:

> An interesting sub-thread to this could be:
>
> Have you ever unintentionally crashed a device by running a perfectly
> innocuous command?
> 1. Crashed a 6500/Sup2 by typing "show ip dhcp binding".
> 2. "clear interface XXX" on a Nexus 7K triggered a cascading/undocument
> Sev1 bug that caused two linecards to crash and reload, and take down about
> two dozen buildings on campus at the .edu where I used to work.
> 3. For those that ever had the misfortune of using early versions of the
> "bcc" command shell* on Bay Networks routers, which was intended to make
> the CLI make look and feel more like a Cisco router, you have my
> condolences. One would reasonably expect "delete ?" to respond with a list
> of valid arguments for that command. Instead, it deleted, well...
> everything, and prompted an on-site restore/reboot.
>
> BCC originally stood for "Bay Command Console", but we joked that it
> really stood for "Blatant Cisco Clone".
>
> On Tue, Feb 16, 2021 at 2:37 PM John Kristoff <jtk@dataplane.org> wrote:
>
>> Friends,
>>
>> I'd like to start a thread about the most famous and widespread Internet
>> operational issues, outages or implementation incompatibilities you
>> have seen.
>>
>> Which examples would make up your top three?
>>
>> To get things started, I'd suggest the AS 7007 event is perhaps the
>> most notorious and likely to top many lists including mine. So if
>> that is one for you I'm asking for just two more.
>>
>> I'm particularly interested in this as the first step in developing a
>> future NANOG session. I'd be particularly interested in any issues
>> that also identify key individuals that might still be around and
>> interested in participating in a retrospective. I already have someone
>> that is willing to talk about AS 7007, which shouldn't be hard to guess
>> who.
>>
>> Thanks in advance for your suggestions,
>>
>> John
>>
>
Re: Famous operational issues [ In reply to ]
On Tue, Feb 23, 2021 at 5:14 PM Justin Streiner <streinerj@gmail.com> wrote:

> Beyond the widespread outages, I have so many personal war stories that
> it's hard to pick a favorite.
>
> My first job out of college in the mid-late 90s was at an ISP in
> Pittsburgh that I joined pretty early in its existence, and everyone did a
> bit of everything. I was hired to do sysadmin stuff, networking, pretty
> much whatever was needed. About a year after I started, we brought up a new
> mail system with an external RAID enclosure for the mail store itself. One
> day, we saw indications that one of the disks in the RAID enclosure was
> starting to fail, so I scheduled a maintenance window to replace the disk
> and let the controller rebuild the data and integrate it back into the RAID
> set. No big worries, right?
>
> It's Tuesday at about 2 AM.
>
> Well, the kernel on the RAID controller itself decided that when I pulled
> the failing drive would be a fine time to panic, and more or less turn
> itself into a bit-blender, and take all the mailstore down with it. After
> a few hours of watching fsck make no progress on anything, in terms of
> trying to un-fsck the mailstore, we made the decision in consultation with
> the CEO to pull the plug on trying to bring the old RAID enclosure back to
> life, and focus on finding suitable replacement hardware and rebuild from
> scratch. We also discovered that the most recent backups of the mailstore
> were over a month old :(
>
> I think our CEO ended up driving several hours to procure a suitable
> enclosure. By the time we got the enclosure installed, filesystems built,
> and got whatever tape backups we had restored, and tested the integrity of
> the system, it was now Thursday around 8 AM. Coincidentally, that was the
> same day the company hosted a big VIP gathering (the mayor was there, along
> with lots of investors and other bigwigs), so I had to come back and put on
> a suit to hobnob with the VIPs after getting a total of 6 hours of sleep in
> about the previous 3 days. I still don't know how I got home that night
> without wrapping my vehicle around a utility pole (due to being over-tired,
> not due to alcohol).
>
> Many painful lessons learned over that stretch of days, as often the case
> as a company grows from startup mode and builds more robust technology and
> business processes as a consequence of growth.
>

Oh, dear. RAID.... that triggered 2 stories.
1: I worked at a small ISP in Westchester, NY. One day I'm doing stuff, and
want to kill process 1742, so I type 'kill -9 1' ... and then, before
pressing enter, I get distracted by our "Cisco AGS+ monitor" (a separate
story). After I get back to my desk I unlock my terminal, and call over a
friend to show just how close I'd gotten to making something go Boom. He
says "Nah, BSD is cleverer than that. I'm sure the kill command has some
check in to stop you killing init.". I disagree. He disagrees. I disagree
again. He calls me stupid. I bet him a soda.
He proves his point by typing 'su; kill -9 1' in the window he's logged
into -- and our primary NFS server (with all of the user sites)
obediently kills off init, and all of the child processes.... we run over
to the front of the box and hit the power switch, while desperately looking
for a monitor and keyboard to watch it boot.
It does the BIOS checks, and then stops on the RAID controller, complaining
about the fact that there are *2* dead drives, and that the array is now
sad.....
This makes no sense. I can understand one drive not recovering from a power
outage, but 2 seems a bit unlikely, especially because the machine hadn't
been beeping or anything like that.... we try turning it off and on again a
few times, no change... We pull the machine out of the rack and rip the
cover off.
Sure enough, there is a RAID card - but the piezo-buzzer on it is, for some
reason, wrapped in a bunch of napkins, held in place with electrical tape.
I pull that off, and there is also some paper towel jammed into the hole
in the buzzer, and bits of a broken pencil....

After replacing the drives, starting an rsync restore from a backup server
we investigate more....
...
it turns out that a few months ago(!) the machine had started beeping. The
night crew naturally found this annoying, and so they'd gone investigating
and discovered that it was this machine, and lifted the lid while still in
the rack. They traced the annoying noise to this small black thingie, and
poked it until it stopped, thus solving the problem once and for
all.... yay!





2: I used to work at a company which was in one of the buildings next to
the twin-towers. For various clever reasons, they had their "datacenter" in
a corner of the office space... anyway, the planes hit, power goes out and
the building is evacuated - luckily no one is injured, but the entire
company/site is down. After a few weeks, my friend Joe is able to arrange
with a fire marshal to get access to the building so he can go and grab the
disks with all the data. The fire marshal and Joe trudge up the 15 flights
of stairs.... When they reach the suite, Joe discovers that the windows
where his desk was are blown in, there is debris everywhere, etc. He's
somewhat shaken by all this, but goes over to the datacenter area, pulls
the drives out of the Sun storage arrays, and puts them in his backpack.
They then trudge down the 15 flights of stairs, and Joe takes them home.
We've managed to scrounge up 3 identical (empty) arrays, and some servers,
and the plan is to temporarily run the service from his basement...

Anyway, I get a panicked call from Joe. He's got the empty RAID arrays.
He's got the servers. He's got a pile of 42 drives (3 enclosures, 14 drives
per enclosure). Unfortunately he completely didn't think to mark the order
of the drives, and now we have *no* idea which drives goes in which array,
nor in which slot in the array....

We spent some time trying to figure out how many ways you can arrange 42
things into 3 piles, and how long it would take to try all combinations....
I cannot remember the actual number, but it approached the lifetime of the
universe....
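(For the curious: assuming any of the 42 drives could go into any of the 42 slots across the three enclosures, that's 42! orderings, roughly 1.4 x 10^51. A quick back-of-the-envelope in Python, with a generously assumed million attempts per second:)

    import math

    arrangements = math.factorial(42)           # 42 drives into 42 distinct slots
    print(f"{arrangements:.3e} arrangements")   # ~1.405e+51
    seconds = arrangements / 1_000_000          # assume 1e6 attempts per second
    years = seconds / (3600 * 24 * 365.25)
    print(f"{years:.3e} years to try them all") # ~4.5e+37 years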
After much time and poking, we eventually worked out that the RAID
controller wrote a slot number at sector 0 on each physical drive, and it
became a solvable problem, but...
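(The details of that on-disk label are long gone from my memory, so the sketch below is purely illustrative of the approach: read sector 0 from each drive and pull the controller's slot metadata out of it. The byte offset and record format here are invented, not the real controller's layout.)

    import struct
    import sys

    SECTOR = 512
    LABEL_OFFSET = 64      # invented offset for the controller metadata
    LABEL_FORMAT = "<HH"   # invented layout: enclosure number, slot number

    def read_slot_label(path):
        # Read the first sector of the drive and unpack the (hypothetical)
        # enclosure/slot pair the RAID controller wrote there.
        with open(path, "rb") as dev:
            sector0 = dev.read(SECTOR)
        return struct.unpack_from(LABEL_FORMAT, sector0, LABEL_OFFSET)

    if __name__ == "__main__":
        for drive in sys.argv[1:]:   # e.g. /dev/sdb /dev/sdc ...
            enclosure, slot = read_slot_label(drive)
            print(f"{drive}: enclosure {enclosure}, slot {slot}")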


W


> jms
>
> On Tue, Feb 16, 2021 at 2:37 PM John Kristoff <jtk@dataplane.org> wrote:
>
>> Friends,
>>
>> I'd like to start a thread about the most famous and widespread Internet
>> operational issues, outages or implementation incompatibilities you
>> have seen.
>>
>> Which examples would make up your top three?
>>
>> To get things started, I'd suggest the AS 7007 event is perhaps the
>> most notorious and likely to top many lists including mine. So if
>> that is one for you I'm asking for just two more.
>>
>> I'm particularly interested in this as the first step in developing a
>> future NANOG session. I'd be particularly interested in any issues
>> that also identify key individuals that might still be around and
>> interested in participating in a retrospective. I already have someone
>> that is willing to talk about AS 7007, which shouldn't be hard to guess
>> who.
>>
>> Thanks in advance for your suggestions,
>>
>> John
>>
>

--
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
-- E. W. Dijkstra
Re: Famous operational issues [ In reply to ]
On 2/23/2021 12:22 PM, Justin Streiner wrote:
> An interesting sub-thread to this could be:
> Have you ever unintentionally crashed a device by running a perfectly
> innocuous command?
> ---------------------------------------------------------------

There was that time in the late 1990s when I took most of a global
network down several times by typing "show ip bgp regexp <regex here>"
on almost all of the core routers. It turned out to be a Cisco bug. I
looked for a reference, but cannot find one. Ahh, the earlier days of
the commercial internet... gotta love 'em.

scott
Re: Famous operational issues [ In reply to ]
Anyone remember when DEC delivered a new VMS version (V5 I think)
whose backups didn't work, couldn't be restored?

BU did, the hard way, when the engineering dept's faculty and student
disk failed.

DEC actually paid thousands of dollars for typist services to come and
re-enter whatever was on paper and could be re-entered.

I think that was the day I won the Unix vs VMS wars at BU anyhow.

--
-Barry Shein

Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com
Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD
The World: Since 1989 | A Public Information Utility | *oo*
Re: Famous operational issues [ In reply to ]
My war story.

At one of our major POPs in DC we had a row of 7513's, and one of them
had intermittent problems. I had replaced every piece of removable
card/part in it over time, and it kept failing. Even the vendor flew in
a team to the site to try to figure out what was wrong. It was finally
decided to replace the whole router (about 200lbs?). Being the local
field tech, that was my Job. On the night of the maintenance at 3am, the
work started. I switched off the rack power, which included a 2511
terminal server that was connected to half the routers in the row and
started to remove the router. A few minutes later I got a text, "You're
taking out the wrong router!" You can imagine the "Damn it, what have I
done?" feeling that runs through your mind and the way your heart stops
for a moment.

Okay, I wasn't taking out the wrong router. But unknown at the time,
terminal servers when turned off, had a nasty habit of sending a break
to all the routers it was connected to, and all those routers
effectively stopped. The remote engineer that was in charge saw the
whole POP go red and assumed I was the cause. I was, but not because of
anything I could have known about. I had to power cycle the downed
routers to bring them back on-line, and then continue with the
maintenance. A disaster to all involved, but the router got replaced.

I gave a very detailed account of my actions in the postmortem. It was
clear they believed I had turned off the wrong rack/router and wasn't being
honest about it. I was adamant that I had done exactly what I said, and even
swore I would fess up if I had erred, and always would, even if it
cost me the job. I rarely made mistakes, if any, so it was an easy thing
for me to say. For the next two weeks everyone that was aware of the work
gave me the side eye.

About a week after that, the same thing happened to another field tech
in another state. That helped my case. They used my account to figure
out it was the TS that caused the problem. A few of them that had
questioned me harshly admitted to me my account helped them figure out
the cause.

And the worst part of this story? That router, completely replaced,
still had the same intermittent problem as before. It was a DC powered
POP, so they were all wired with the same clean DC power. In the end
they chalked it up to cosmic rays and gave up on it. I believe this
break issue was unique to the DC powered 2511's, and that we were the
first to use them, but I might be wrong on that.


On 2/16/21 2:37 PM, John Kristoff wrote:
> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps the
> most notorious and likely to top many lists including mine. So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session. I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective. I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
Re: Famous operational issues [ In reply to ]
While we're talking about raid types...

A few acquisitions ago, between 2006-2010, I worked at a Wireless ISP in
Northern Indiana. Our CEO decided to sell Internet service to school
systems because the e-rate funding was too much to resist. He had the idea
to install towers on the schools and sell service off that while paying the
school for roof rights. About two years into the endeavor, I wake up one
morning and walk to my car. Two FBI agents get out of an unmarked towncar.
About an hour later, they let me go to the office where I found an entire
barrage of FBI agents. It was a full raid and not the kind you want to see.
Hard drives were involved and being made redundant, but the redundant
copies were labeled and placed into boxes that were carried out to SUVs
that were as dark as the morning coffee these guys drank. There were a lot
of drives, all of our servers were in our server room at the office. There
were roughly five or six racks of varying amounts of equipment in each.

After some questioning and assisting them in their cataloging adventure,
the agents left us with a ton of questions and just enough equipment to
keep the customers connected. CEO became extremely paranoid at this point.
He told us to prepare to move servers to a different building. He went into
a tailspin trying to figure out where he could hide the servers to keep
things going without the bank or FBI seizing the assets. He was extremely
worried the bank would close the office down. We started moving all network
routing around to avoid using the office as our primary DIA.

One morning I get into the office and we hear the words we've been
dreading: "We're moving the servers". The plan was to move them to a tower
site that had a decent-sized shack on site. Connectivity was decent, we had
a licensed 11GHz microwave backhaul capable of about 155mbps. The site was
part of the old MCI microwave long-distance network in the 80s and 90s. It
had redundant air conditioners, a large propane tank, and a generator
capable of keeping the site alive for about three days. We were told not to
notify any customers, which became problematic because two customers had
servers colocated in our building. We consolidated the servers into three
racks and managed to get things prepared with a decent UPS in each rack.
CEO decided to move the servers at nightfall to "avoid suspicion". Our
office was in an unsavory part of town; moving anything at night was
suspicious. So, under the cover of half-ass darkness, we loaded the racks
onto a flatbed truck and drove them 20 minutes to the tower. While we
unloaded the racks, an electrician we knew was wiring up the L5-20 outlets
for the UPS in each rack. We got the racks plugged in, servers powered up,
and then the two customers who had colocated equipment came by. They got
their equipment powered up and all seemed ok.

Back at the office the next day we were told to gather our workstations and
start working from home. I've been working from home ever since and quite
enjoy it, but that's beside the point.

Summer starts and I tell the CEO we need to repair the AC units because
they are failing. He ignores it, claiming he doesn't want to lose money the
bank could take at any minute. About a month later, a nice hot summer day
rolls in and the AC units both die. I stumble upon an old portable AC unit
and put that at the site. Temperatures rise to 140F ambient. Server
overheat alarms start going off, things start failing. Our colocation
customers are extremely upset. They pull their servers and drop service.
The heat subsides, and the CEO finally pays to repair one of the AC units.

Eventually, the company declares bankruptcy and goes into liquidation.
Luckily another WISP catches wind of it, buys the customers and assets, and
hires me. My happiest day that year was moving all the servers into a
better-suited home, a real data center. I don't know what happened to the
CEO, but I know that I'll never trust anything he has his hands in ever
again.

Adam Kennedy
Systems Engineer
adamkennedy@watchcomm.net | 800-589-3837 x120
Watch Communications | www.watchcomm.net
3225 W Elm St, Suite A
Lima, OH 45805


On Tue, Feb 23, 2021 at 8:55 PM brutal8z via NANOG <nanog@nanog.org> wrote:

> My war story.
>
> At one of our major POPs in DC we had a row of 7513's, and one of them had
> intermittent problems. I had replaced every piece of removable card/part in
> it over time, and it kept failing. Even the vendor flew in a team to the
> site to try to figure out what was wrong. It was finally decided to replace
> the whole router (about 200lbs?). Being the local field tech, that was my
> job. On the night of the maintenance at 3am, the work started. I switched
> off the rack power, which included a 2511 terminal server that was
> connected to half the routers in the row and started to remove the router.
> A few minutes later I got a text, "You're taking out the wrong router!" You
> can imagine the "Damn it, what have I done?" feeling that runs through your
> mind and the way your heart stops for a moment.
>
> Okay, I wasn't taking out the wrong router. But unknown at the time,
> terminal servers, when turned off, had a nasty habit of sending a break to
> all the routers they were connected to, and all those routers effectively
> stopped. The remote engineer who was in charge saw the whole POP go red
> and assumed I was the cause. I was, but not because of anything I could
> have known about. I had to power cycle the downed routers to bring them
> back on-line, and then continue with the maintenance. A disaster to all
> involved, but the router got replaced.
>
> I gave a very detailed account of my actions in the postmortem. It was
> clear they thought I had turned off the wrong rack/router and wasn't being
> honest about it. I was adamant I had done exactly what I said, and even
> swore I would fess up if I had erred, and always would, even if it cost
> me the job. I rarely made mistakes, if any, so it was an easy thing for me
> to say. For the next two weeks everyone who was aware of the work gave me the
> side eye.
>
> About a week after that, the same thing happened to another field tech in
> another state. That helped my case. They used my account to figure out it
> was the TS that caused the problem. A few of them who had questioned me
> harshly admitted that my account helped them figure out the cause.
>
> And the worst part of this story? That router, completely replaced, still
> had the same intermittent problem as before. It was a DC-powered POP, so
> they were all wired with the same clean DC power. In the end they chalked
> it up to cosmic rays and gave up on it. I believe this break issue was
> unique to the DC-powered 2511s, and that we were the first to use them,
> but I might be wrong on that.
>
>
> On 2/16/21 2:37 PM, John Kristoff wrote:
>
> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps the
> most notorious and likely to top many lists including mine. So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session. I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective. I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>
>
>
Re: Famous operational issues [ In reply to ]
maybe late '60s or so, we had a few 2314 dasd monsters[0]. think maybe
4m x 2m with 9 drives with removable disk packs.

a grave shift operator gets errors on a drive and wonders if maybe they
swap it into another spindle. no luck, so swapped those two drives with
two others. one more iteration, and they had wiped out the entire
array. at that point they called me; so i missed the really creative
part.

[0] https://www.ibm.com/ibm/history/exhibits/storage/storage_2314.html

randy

---
randy@psg.com
`gpg --locate-external-keys --auto-key-locate wkd randy@psg.com`
signatures are back, thanks to dmarc header mangling
Re: Famous operational issues [ In reply to ]
On Tue, 23 Feb 2021 20:46:38 -0800, Randy Bush said:
> maybe late '60s or so, we had a few 2314 dasd monsters[0]. think maybe
> 4m x 2m with 9 drives with removable disk packs.
>
> a grave shift operator gets errors on a drive and wonders if maybe they
> swap it into another spindle. no luck, so swapped those two drives with
> two others. one more iteration, and they had wiped out the entire
> array. at that point they called me; so i missed the really creative
> part.

I suspect every S/360 site that had 2314s had an operator who did that, as I
was witness to the same thing. For at least a decade after that debacle, the
Manager of Operations was awarding Gold, Silver, and Bronze Danny awards for
operational screw-ups. (The 2314 event was the sole Platinum Danny :)

And yes, on IBM 4341 consoles it was all too easy to hit the EPO button on the
keyboard; we got guards for the consoles after one of our operators nailed the
button a second time in a month.

And to tie the S/360 and 4341 together - we were one of the last sites that was
still running an S/360 Mod 65J. And plans came through for a new server room
on the top floor of a new building. Architect comes through, measures the S/360
and all the peripherals for floorspace and power/cooling - and the CPU, plus
*4* meg of memory, and 3 strings of 2314 drives chewed a lot of both.

Construction starts. Meanwhile, IBM announces the 4341, and offers us a real
sweetheart deal because even at the high maintenance charges we were paying,
IBM was losing money. Something insane like the system and peripherals and
first 3 years of maintenance, for less than the old system per-year
maintenance. Oh, and the power requirements are like 10% of the 360s.

So we take delivery of the new system and it's looking pitiful, just one box
and 2 small strings of disk in 10K square feet. Lots of empty space. Do all
the migrations to the new system over the summer, and life is good. Until
fall and winter arrive, and we discover there is zero heat in the room, and the
ceiling is uninsulated, and it's below zero outside because this is way upstate
NY. And if there was a 360 in the room, it would *still* be needing cooling
rather than heating. But it's a 4341 that's shedding only 10% of the heat...

Finally, one February morning, the 4341 throws a thermal check. Air was too
cold at the intakes. Our IBM CE did a double-take because he'd been doing IBM
mainframes for 3 decades and had never seen a thermal check for too cold
before.

Lots of legal action threatened against the architect, who simply said "If you
had *told* me that the system was being replaced, I'd have put heat in the
room". A settlement was reached, revised plans were drawn up, there was a whole
mess of construction to get ductwork and insulation and other stuff into place,
and life was good for the decade or so before I left for a better gig....
Re: Famous operational issues [ In reply to ]
    I personally did "disable vlan Xyz" instead of "delete vlan Xyz" on an
Extreme Networks box... which proceeded to disable all the ports where the
VLAN was present...

    Good thing it was a (local) remote pop and not on the core.

-----
Alain Hebert ahebert@pubnix.net
PubNIX Inc.
50 boul. St-Charles
P.O. Box 26770 Beaconsfield, Quebec H9W 6G7
Tel: 514-990-5911 http://www.pubnix.net Fax: 514-990-9443

On 2/23/21 5:22 PM, Justin Streiner wrote:
> An interesting sub-thread to this could be:
>
> Have you ever unintentionally crashed a device by running a perfectly
> innocuous command?
> 1. Crashed a 6500/Sup2 by typing "show ip dhcp binding".
> 2. "clear interface XXX" on a Nexus 7K triggered a
> cascading/undocumented Sev1 bug that caused two linecards to crash and
> reload, and take down about two dozen buildings on campus at the .edu
> where I used to work.
> 3. For those that ever had the misfortune of using early versions of
> the "bcc" command shell* on Bay Networks routers, which was intended
> to make the CLI look and feel more like a Cisco router, you have
> my condolences.  One would reasonably expect "delete ?" to respond
> with a list of valid arguments for that command.  Instead, it deleted,
> well... everything, and prompted an on-site restore/reboot.
>
> BCC originally stood for "Bay Command Console", but we joked that it
> really stood for "Blatant Cisco Clone".
>
> On Tue, Feb 16, 2021 at 2:37 PM John Kristoff <jtk@dataplane.org> wrote:
>
> Friends,
>
> I'd like to start a thread about the most famous and widespread
> Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps  the
> most notorious and likely to top many lists including mine. So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session.  I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective.  I already have
> someone
> that is willing to talk about AS 7007, which shouldn't be hard to
> guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>
Re: Famous operational issues [ In reply to ]
anyone else have the privilege of running 2321 data cells? had a bunch.
unreliable as hell. there was a job running continuously recovering
transactions off of log tapes. one night at 3am, head of apps program
(i was systems) got a call that a tran tape was unmounted with a console
message that recovery was complete. ops did not know what it meant or
what to do. was the first time in over five years the data were stable.

wife of same head of apps grew more and more tired of 2am calls.
finally she answered one "david? he said he was going in to work."
ops never called in the night again.

randy

---
randy@psg.com
`gpg --locate-external-keys --auto-key-locate wkd randy@psg.com`
signatures are back, thanks to dmarc header mangling
Re: Famous operational issues [ In reply to ]
Hardly famous and not service-affecting in the end, but figured I'd
share an incident from our side that occurred back in 2018.

While commissioning a new node in our Metro-E network, an IPv6
point-to-point address was mis-typed. Instead of ending in /126, it
ended in /12. This happened in Johannesburg.

We actually came across this by chance while examining the IGP table of
another router located in Slough, and found an entry for 2c00::/12
floating around. That definitely looked out of place, as we never carry
parent blocks in our IGP.

Running the trace from Slough led us back to this one Metro-E device in
Jo'burg.

It took everyone nearly an hour to figure out the typo, because for all
the laser focus we had on the supposed link of the supposed box that was
creating this problem, we all overlooked the fact that the /12
configured on the point-to-point link was actually supposed to have been
a /126.

The reason this never caused a service problem was because we do not
redistribute our IGP into BGP (not that anyone should). And even if we
did, there are a ton of filters and BGP communities on all devices to
ensure a route such as that would have never made it out of our AS.

Also, the IGP contains the most specific paths to every node in our
network, so the presence of the 2c00::/12 was mostly cosmetic. It would
have never been used for routing decisions.
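
Purely as an illustration (the prefixes below are made up, and this is just
a quick sketch rather than any real tooling), Python's ipaddress module
shows how a single dropped digit collapses a point-to-point /126 into the
parent /12, and how easily such a typo can be flagged:

    import ipaddress

    # Hypothetical addresses for illustration -- not the real link prefix.
    intended = ipaddress.ip_network("2c0f:fe00:10::/126", strict=False)
    fat_finger = ipaddress.ip_network("2c0f:fe00:10::/12", strict=False)

    print(intended)    # 2c0f:fe00:10::/126 -- a 4-address point-to-point block
    print(fat_finger)  # 2c00::/12 -- the typo collapses to the parent block

    # A trivial sanity check: flag any "point-to-point" prefix that is
    # suspiciously short.
    for net in (intended, fat_finger):
        if net.prefixlen < 64:
            print(f"WARNING: {net} looks too short for a point-to-point link")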

Mark.
Re: Famous operational issues [ In reply to ]
I only just now found this thread, so I'm sorry I'm late to the party, but here, I put it on Medium.

https://gushi.medium.com/the-worst-day-ever-at-my-day-job-beff7f4170aa

> On Mar 12, 2021, at 10:07 PM, Mark Tinka <mark@tinka.africa> wrote:
>
> Hardly famous and not service-affecting in the end, but figured I'd share an incident from our side that occurred back in 2018.
>
> While commissioning a new node in our Metro-E network, an IPv6 point-to-point address was mis-typed. Instead of ending in /126, it ended in /12. This happened in Johannesburg.
>
> We actually came across this by chance while examining the IGP table of another router located in Slough, and found an entry for 2c00::/12 floating around. That definitely looked out of place, as we never carry parent blocks in our IGP.
>
> Running the trace from Slough led us back to this one Metro-E device in Jo'burg.
>
> It took everyone nearly an hour to figure out the typo, because for all the laser focus we had on the supposed link of the supposed box that was creating this problem, we all overlooked the fact that the /12 configured on the point-to-point link was actually supposed to have been a /126.
>
> The reason this never caused a service problem was because we do not redistribute our IGP into BGP (not that anyone should). And even if we did, there are a ton of filters and BGP communities on all devices to ensure a route such as that would have never made it out of our AS.
>
> Also, the IGP contains the most specific paths to every node in our network, so the presence of the 2c00::/12 was mostly cosmetic. It would have never been used for routing decisions.
>
> Mark.
Re: Famous operational issues [ In reply to ]
What a day.. hope you are better now :)


On 6/12/2021 2:42 AM, Dan Mahoney wrote:
> I only just now found this thread, so I'm sorry I'm late to the party,
> but here, I put it on Medium.
>
> https://gushi.medium.com/the-worst-day-ever-at-my-day-job-beff7f4170aa
>
>> On Mar 12, 2021, at 10:07 PM, Mark Tinka <mark@tinka.africa> wrote:
>>
>> Hardly famous and not service-affecting in the end, but figured I'd
>> share an incident from our side that occurred back in 2018.
>>
>> While commissioning a new node in our Metro-E network, an IPv6
>> point-to-point address was mis-typed. Instead of ending in /126, it
>> ended in /12. This happened in Johannesburg.
>>
>> We actually came across this by chance while examining the IGP table
>> of another router located in Slough, and found an entry for 2c00::/12
>> floating around. That definitely looked out of place, as we never
>> carry parent blocks in our IGP.
>>
>> Running the trace from Slough led us back to this one Metro-E device
>> in Jo'burg.
>>
>> It took everyone nearly an hour to figure out the typo, because for
>> all the laser focus we had on the supposed link of the supposed box
>> that was creating this problem, we all overlooked the fact that the
>> /12 configured on the point-to-point link was actually supposed to
>> have been a /126.
>>
>> The reason this never caused a service problem was because we do not
>> redistribute our IGP into BGP (not that anyone should). And even if
>> we did, there are a ton of filters and BGP communities on all devices
>> to ensure a route such as that would have never made it out of our AS.
>>
>> Also, the IGP contains the most specific paths to every node in our
>> network, so the presence of the 2c00::/12 was mostly cosmetic. It
>> would have never been used for routing decisions.
>>
>> Mark.
>
Re: Famous operational issues [ In reply to ]
opening the link currently gives me an HTTP 500 error, very fitting :)

On 12.06.2021 at 04:42, Dan Mahoney wrote:
> I only just now found this thread, so I'm sorry I'm late to the party, but here, I put it on Medium.
>
> https://gushi.medium.com/the-worst-day-ever-at-my-day-job-beff7f4170aa
>
>> On Mar 12, 2021, at 10:07 PM, Mark Tinka <mark@tinka.africa> wrote:
>>
>> Hardly famous and not service-affecting in the end, but figured I'd share an incident from our side that occurred back in 2018.
>>
>> While commissioning a new node in our Metro-E network, an IPv6 point-to-point address was mis-typed. Instead of ending in /126, it ended in /12. This happened in Johannesburg.
>>
>> We actually came across this by chance while examining the IGP table of another router located in Slough, and found an entry for 2c00::/12 floating around. That definitely looked out of place, as we never carry parent blocks in our IGP.
>>
>> Running the trace from Slough led us back to this one Metro-E device in Jo'burg.
>>
>> It took everyone nearly an hour to figure out the typo, because for all the laser focus we had on the supposed link of the supposed box that was creating this problem, we all overlooked the fact that the /12 configured on the point-to-point link was
>> actually supposed to have been a /126.
>>
>> The reason this never caused a service problem was because we do not redistribute our IGP into BGP (not that anyone should). And even if we did, there are a ton of filters and BGP communities on all devices to ensure a route such as that would have
>> never made it out of our AS.
>>
>> Also, the IGP contains the most specific paths to every node in our network, so the presence of the 2c00::/12 was mostly cosmetic. It would have never been used for routing decisions.
>>
>> Mark.
>
Re: [EXTERNAL] Re: Famous operational issues [ In reply to ]
On 16/02/2021 22:51, Compton, Rich A wrote:

> There was the outage in 2014 when we got to 512K routes. http://www.bgpmon.net/what-caused-todays-internet-hiccup/

There was a similar issue in 1998/9 or so when we got to 64K routes,
which broke the routing table index (which defaulted to a uint16_t) on
any FreeBSD box doing BGP.

Fortunately a quick kernel recompile with the type changed to uint32_t
fixed that.
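
Purely as an illustration of the failure mode (a made-up Python sketch, not
the actual FreeBSD kernel code), a 16-bit index simply wraps once the table
crosses 64K entries, so new routes start aliasing earlier slots:

    # Emulate a uint16_t route-table index as routes are installed.
    index = 0
    for _ in range(70000):            # pretend the table grows to 70,000 routes
        index = (index + 1) & 0xFFFF  # wraps at 65536, like a uint16_t
    print(index)                      # 4464, not 70000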

Ray