Mailing List Archive

Re: Famous operational issues [ In reply to ]
Patrick W. Gilmore <patrick@ianai.net> wrote:
>
> Me: Did you order that EPO cover?
> Her: Nope.

There are apparently two kinds of EPO cover:

- the kind that stops you from pressing the button by mistake;

- and the kind that doesn't, and instead locks the button down to make
sure it isn't un-pressed until everything is safe.

We had a series of incidents similar to yours, so an EPO cover was
belatedly installed. We learned about the second kind of EPO cover when a
colleague proudly demonstrated that the EPO button should no longer be
pressed by accident, or so he thought.

Tony.
--
f.anthony.n.finch <dot@dotat.at> http://dotat.at/
the quest for freedom and justice can never end
Re: Famous operational issues [ In reply to ]
On Mon, Feb 22, 2021 at 7:09 AM tim@pelican.org <tim@pelican.org> wrote:

> On Thursday, 18 February, 2021 22:37, "Warren Kumari" <warren@kumari.net>
> said:
>
> > 4: Not too long after I started doing networking (and for the same small
> > ISP in Yonkers), I'm flying off to install a new customer. I (of course)
> > think that I'm hot stuff because I'm going to do the install, configure
> the
> > router, whee, look at me! Anyway, I don't want to check a bag, and so I
> > stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was
> all
> > pre-9/11!). I'm going through security and the TSA[0] person opens my bag
> > and pulls the router out. "What's this?!" he asks. I politely tell him
> that
> > it's a router. He says it's not. I'm still thinking that I'm the new
> > hotness, and so I tell him in a somewhat condescending way that it is,
> and
> > I know what I'm talking about. He tells me that it's not a router, and is
> > starting to get annoyed. I explain using my "talking to a 5 year old"
> voice
> > that it most certainly is a router. He tells me that lying to airport
> > security is a federal offense, and starts looming at me. I adjust my
> > attitude and start explaining that it's like a computer and makes the
> > Internet work. He gruffly hands me back the router, I put it in my bag
> and
> > scurry away. As I do so, I hear him telling his colleague that it wasn't
> a
> > router, and that he certainly knows what a router is, because he does
> > woodwork...
>
> Here in the UK we avoid that issue by pronouncing the packet-shifter as
> "rooter", and only the wood-working tool as "rowter" :)
>
> Of course, it raises a different set of problems when talking to the
> Australians...
>

Yes. I discovered this while walking around Sydney wearing my "I have root
@ Google" t-shirt.... got some odd looks/snickers...

W




>
> Cheers,
> Tim.
>
>
>

--
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
-- E. W. Dijkstra
Re: Famous operational issues [ In reply to ]
Long ago, in a galaxy far away I worked for a gov't contractor on site
at a gov't site...

We had our own cute little datacenter, and our 4 building complex had
a central power distribution setup from utility -> buildings.
It was really quite nice :) (the job, the buildings, the power and
cute little datacenter)

One fine Tues afternoon ~2pm local time, the building engineers
decided they would make a copy of the key used to turn the main /
utility power off...
Of course they also needed to make sure their copy worked, so... they
put the key in and turned it.

Shockingly, the key worked! And no power was provided to the buildings :(
It was very suddenly very dark and very quiet... (then the yelling started)

Ok, fast forward 7 days... rerun the movie... Yes, the same building
engineers made a new copy, and .. tested that new copy in the same
manner.

For neither of these events did anyone tell the rest of us (and our
customers): "Hey, we MAY interrupt power to the buildings... FYI, BTW,
make sure your backups are current..." I recall we got the name of the
engineer the 1st time around, but not the second.

On Mon, Feb 22, 2021 at 12:26 PM Tony Finch <dot@dotat.at> wrote:
>
> Patrick W. Gilmore <patrick@ianai.net> wrote:
> >
> > Me: Did you order that EPO cover?
> > Her: Nope.
>
> There are apparently two kinds of EPO cover:
>
> - the kind that stops you from pressing the button by mistake;
>
> - and the kind that doesn't, and instead locks the button down to make
> sure it isn't un-pressed until everything is safe.
>
> We had a series of incidents similar to yours, so an EPO cover was
> belatedly installed. We learned about the second kind of EPO cover when a
> colleague proudly demonstrated that the EPO button should no longer be
> pressed by accident, or so he thought.
>
> Tony.
> --
> f.anthony.n.finch <dot@dotat.at> http://dotat.at/
> the quest for freedom and justice can never end
Re: Famous operational issues [ In reply to ]
On Mon, Feb 22, 2021 at 12:50 PM Regis M. Donovan <regis-nanog@offhand.org>
wrote:

> On Thu, Feb 18, 2021 at 07:34:39PM -0500, Patrick W. Gilmore wrote:
> > And to put it on topic, cover your EPOs
>
> I worked somewhere with an uncovered EPO, which was okay until we had a
> telco tech in who was used to a different data center where a similar
> looking button controlled the door access, so he reflexively hit it
> on his way out to unlock the door. Oops.
>
> Also, consider what's on generator and what's not. I worked in a corporate
> data center where we lost power. The backup system kept all the machines
> running, but the ventilation system was still down, so it was very warm
> very
> fast as everyone went around trying to shut servers down gracefully while
> other folks propped the doors open to get some cooler air in.
>

That reminds me of another one...

In parts of NYC, there are noise abatement requirements, and so many places
have their generators mounted on the roof -- it's cheap real-estate, the
exhaust is easier, the noise issues are less, etc.

The generators usually have a smallish diesel tank, and then a much larger
one in the basement (diesel is heavy)...

So, one of the buildings that I was in was really good about testing their
gensets - they'd do weekly tests (usually at night), and the generators
always worked perfectly -- right up until the time that it was actually
needed.
The generator fired up, the lights kept blinking, the disks kept spinning -
but the transfer pump that pumped diesel from the basement to the roof was
one of the few things that was not on the generator....

W



>
> --r
>
>

--
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
-- E. W. Dijkstra
Re: Famous operational issues [ In reply to ]
On Fri, 19 Feb 2021, Andy Ringsmuth wrote:

> > I explain using my "talking to a 5 year old" voice that it
> > most certainly is a router. He tells me that lying to airport security
> > is a federal offense, and starts looming at me. I adjust my attitude
> > and start explaining that it's like a computer and makes the Internet
> > work. He gruffly hands me back the router, I put it in my bag and
> > scurry away. As I do so, I hear him telling his colleague that it
> > wasn't a router, and that he certainly knows what a router is, because
> > he does woodwork…
>
> Well, in his defense, he wasn’t wrong… :-)

This is why, in the UK, we tend to pronounce "router" as "router", and
"router" as "router", so there's no confusion.

You're welcome.

Jethro.

. . . . . . . . . . . . . . . . . . . . . . . . .
Jethro R Binks, Network Manager,
Information Services Directorate, University Of Strathclyde, Glasgow, UK

The University of Strathclyde is a charitable body, registered in
Scotland, number SC015263.
Re: Famous operational issues [ In reply to ]
On Mon, Feb 22, 2021 at 2:05 PM Warren Kumari <warren@kumari.net> wrote:

>
>
> On Mon, Feb 22, 2021 at 12:50 PM Regis M. Donovan <regis-nanog@offhand.org>
> wrote:
>
>> On Thu, Feb 18, 2021 at 07:34:39PM -0500, Patrick W. Gilmore wrote:
>> > And to put it on topic, cover your EPOs
>>
>> I worked somewhere with an uncovered EPO, which was okay until we had a
>> telco tech in who was used to a different data center where a similar
>> looking button controlled the door access, so he reflexively hit it
>> on his way out to unlock the door. Oops.
>>
>> Also, consider what's on generator and what's not. I worked in a
>> corporate
>> data center where we lost power. The backup system kept all the machines
>> running, but the ventilation system was still down, so it was very warm
>> very
>> fast as everyone went around trying to shut servers down gracefully while
>> other folks propped the doors open to get some cooler air in.
>>
>
> That reminds me of another one...
>
> In parts of NYC, there are noise abatement requirements, and so many
> places have their generators mounted on the roof -- it's cheap real-estate,
> the exhaust is easier, the noise issues are less, etc.
>
> The generators usually have a smallish diesel tank, and then a much larger
> one in the basement (diesel is heavy)...
>
> So, one of the buildings that I was in was really good about testing their
> gensets - they'd do weekly tests (usually at night), and the generators
> always worked perfectly -- right up until the time that it was actually
> needed.
> The generator fired up, the lights kept blinking, the disks kept spinning
> - but the transfer pump that pumped diesel from the basement to the roof
> was one of the few things that was not on the generator....
>
>
When we were looking at one of the big carrier hotels in NYC they said
that they had the same issue (it could even have been the same one). The
elevators were out as well. They resorted to having techs climb up and down 9
flights of stairs all day long with 5-gallon buckets of diesel and pouring
it into the generator.

>
RE: Famous operational issues [ In reply to ]
Many years ago I experienced a very similar thing. The DC/Integrator I worked for outsourced the co-location and operation of mainframe services for several banks and government organisations. One of these banks had a significant investment in AS/400s, and they decided it was so much hassle and expense using our datacentres that they would start putting those nice small AS/400s in computer rooms in their office buildings instead. One particular computer room contained large line printers that the developers would use to print out whatever it is such people print out. One Saturday morning I received a frantic call from the customer to say that all their primary production AS/400s had gone offline. After a short investigation I realised that all the offline devices were in this particular computer room. It turns out that one of the developers had brought his six-year-old son to work that Saturday, and upon retrieval of a printout said son had dutifully followed dad into the computer room and was unable to resist the big red button sitting exposed on the wall by the door. Shortly thereafter the embarrassed customer decided that perhaps it was worth relocating their AS/400s to our expensive datacentres.



>
> During my younger days, that button was used a few time by the
> operator of a VM/370 to regain control from someone with a "curious
> mind" *cought* *cought*...
>
Two horror stories I remember from long ago when I was a console jockey for a federal space agency that will remain nameless :P

1. A coworker brought her daughter to work with her on a Saturday overtime shift because she couldn't get a babysitter. She parked the kid with a coloring book and a pile of crayons at the only table in the console room with some space, right next to the master console for our 3081. I asked her to make sure she was well away from the console, and as she reached over to scoot the girl and her coloring books further away she slipped, and reached out to steady herself. Yep, planted her finger right down on the IML button (plexi covers? We don' need no STEENKIN' plexi covers!). MVS and VM vanished, two dozen tape drives rewound, and several hours' worth of data merge jobs went blooey.
Re: Famous operational issues [ In reply to ]
On Feb 22, 2021, at 7:02 AM, tim@pelican.org wrote:
> On Thursday, 18 February, 2021 22:37, "Warren Kumari" <warren@kumari.net> said:
>
>> 4: Not too long after I started doing networking (and for the same small
>> ISP in Yonkers), I'm flying off to install a new customer. I (of course)
>> think that I'm hot stuff because I'm going to do the install, configure the
>> router, whee, look at me! Anyway, I don't want to check a bag, and so I
>> stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all
>> pre-9/11!). I'm going through security and the TSA[0] person opens my bag
>> and pulls the router out. "What's this?!" he asks. I politely tell him that
>> it's a router. He says it's not. I'm still thinking that I'm the new
>> hotness, and so I tell him in a somewhat condescending way that it is, and
>> I know what I'm talking about. He tells me that it's not a router, and is
>> starting to get annoyed. I explain using my "talking to a 5 year old" voice
>> that it most certainly is a router. He tells me that lying to airport
>> security is a federal offense, and starts looming at me. I adjust my
>> attitude and start explaining that it's like a computer and makes the
>> Internet work. He gruffly hands me back the router, I put it in my bag and
>> scurry away. As I do so, I hear him telling his colleague that it wasn't a
>> router, and that he certainly knows what a router is, because he does
>> woodwork...
>
> Here in the UK we avoid that issue by pronouncing the packet-shifter as "rooter", and only the wood-working tool as "rowter" :)

So wrong.

A “root” server is part of the DNS. A “route” server is part of BGP.


> Of course, it raises a different set of problems when talking to the Australians…

Everything is weird down under. But I still like them. :-)

--
TTFN,
patrick
Re: Famous operational issues [ In reply to ]
At Boston Univ we discovered the hard way that a security guard's
walkie-talkie could cause a $5,000 (or $10K for the big machine room)
Halon dump.

It took a couple of times before we figured out the connection, though once
someone made it to the hold button before it actually dumped.

Speaking of halon one very hot day I'm goofing off drinking coffee at
a nearby sub shop when the owner tells me someone from the computing
center was on the phone, that never happened before.

Some poor operator was holding the halon shot -- it's a deadman's switch
(well, button) -- and the building was doing its 110dB thing; could I come
help? The building is being evac'd.

So my boss who wasn't the sharpest knife in the drawer follows me down
as I enter and I'm sweating like a pig with a floor panel sucker
trying to figure out which zone tripped.

And he shouts at me over the alarms: WHY TF DOES IT DO THIS?! Angrily.

I answered: well, maybe THERE'S A FIRE!!!

At which point I notice the back of my shoulder is really bothering
me, which I say to him, and he says hmmm there's a big bee on your
back maybe he's stinging you?

Fun day.

--
-Barry Shein

Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com
Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD
The World: Since 1989 | A Public Information Utility | *oo*
Re: Famous operational issues [ In reply to ]
Let me tell you about my personal favorite.

It’s 2002 and I am working as an engineer for an electronic stock trading platform (ECN). This platform happened to be the biggest platform for trading stocks electronically, on some days bigger than NASDAQ itself. It also happened to run on DOS, FoxPro and a Novell file share, on a cluster of roughly 1,000 computers, two of which were the “engine” that matched all of the trades.

Well, FoxPro has this “feature” where the ESC key halts the running program. We had the ability to remote control these DOS/FoxPro machines via some program we had written. Someone asked me to check the status of the process running on the primary matching engine, and when I was done, out of habit, I hit ESC. Trade processing grinds to a halt (phone calls have to be made to the SEC). I immediately called the NOC and told them it was me. Next thing I know, someone from the NOC is at my desk with a screwdriver, prying the ESC key off my keyboard. I remained ESC-keyless for the next several years until I left the company. I was hazed pretty good over it, but was essentially given a one-time pass.



> On Feb 22, 2021, at 7:30 PM, bzs@theworld.com wrote:
>
> ?
> At Boston Univ we discovered the hard way that a security guard's
> walkie-talkie could cause a $5,000 (or $10K for the big machine room)
> Halon dump.
>
> Took a couple of times before we figured out the connection tho once
> someone made it to the hold button before it actually dumped.
>
> Speaking of halon one very hot day I'm goofing off drinking coffee at
> a nearby sub shop when the owner tells me someone from the computing
> center was on the phone, that never happened before.
>
> Some poor operator was holding the halon shot, it's a deadman's switch
> (well, button) and the building was doing its 110db thing could I come
> help? The building is being evac'd.
>
> So my boss who wasn't the sharpest knife in the drawer follows me down
> as I enter and I'm sweating like a pig with a floor panel sucker
> trying to figure out which zone tripped.
>
> And he shouts at me over the alarms: WHY TF DOES IT DO THIS?! Angrily.
>
> I answered: well, maybe THERE'S A FIRE!!!
>
> At which point I notice the back of my shoulder is really bothering
> me, which I say to him, and he says hmmm there's a big bee on your
> back maybe he's stinging you?
>
> Fun day.
>
> --
> -Barry Shein
>
> Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com
> Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD
> The World: Since 1989 | A Public Information Utility | *oo*
Re: Famous operational issues [ In reply to ]
On Mon, Feb 22, 2021 at 7:31 PM <bzs@theworld.com> wrote:

>
> At Boston Univ we discovered the hard way that a security guard's
> walkie-talkie could cause a $5,000 (or $10K for the big machine room)
> Halon dump.
>

At one of the AOL datacenters there was some convoluted fire marshal reason
why a specific door could not be locked "during business hours" (?!), and
so there was a guard permanently stationed outside. The door was all the
way around the back of the building, and so basically never used - and so
the guard would fall asleep outside it with a piece of cardboard saying
"Please wake me before entering". He was a nice guy (and it was less faff
than the main entrance), and so we'd either sneak in and just not tell
anyone, or talk loudly while going round the corner so he could pretend to
have been awake the whole time...

W




>
> Took a couple of times before we figured out the connection tho once
> someone made it to the hold button before it actually dumped.
>
> Speaking of halon one very hot day I'm goofing off drinking coffee at
> a nearby sub shop when the owner tells me someone from the computing
> center was on the phone, that never happened before.
>
> Some poor operator was holding the halon shot, it's a deadman's switch
> (well, button) and the building was doing its 110db thing could I come
> help? The building is being evac'd.
>
> So my boss who wasn't the sharpest knife in the drawer follows me down
> as I enter and I'm sweating like a pig with a floor panel sucker
> trying to figure out which zone tripped.
>
> And he shouts at me over the alarms: WHY TF DOES IT DO THIS?! Angrily.
>
> I answered: well, maybe THERE'S A FIRE!!!
>
> At which point I notice the back of my shoulder is really bothering
> me, which I say to him, and he says hmmm there's a big bee on your
> back maybe he's stinging you?
>
> Fun day.
>
> --
> -Barry Shein
>
> Software Tool & Die | bzs@TheWorld.com |
> http://www.TheWorld.com
> Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD
> The World: Since 1989 | A Public Information Utility | *oo*
>


--
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
-- E. W. Dijkstra
Re: Famous operational issues [ In reply to ]
On Thu, Feb 18, 2021 at 5:38 PM Warren Kumari <warren@kumari.net> wrote:

>
> 2: A somewhat similar thing would happen with the Ascend TNT Max, which
> had side-to-side airflow. These were dial termination boxes, and so people
> would install racks and racks of them. The first one would draw in cool air
> on the left, heat it up and ship it out the right. The next one over would
> draw in warm air on the left, heat it up further, and ship it out the
> right... Somewhere there is a fairly famous photo of a rack of TNT Maxes,
> with the final one literally on fire, and still passing packets.
>

We had several racks of TNTs at the peak of our dial POP phase, and I
believe we ended up designing baffles for the sides of those racks to pull
in cool air from the front of the rack to the left side of the chassis and
exhaust it out the back from the right side. It wasn't perfect, but it did
the job.

The TNTs with channelized T3 interfaces were a great way to terminate lots
of modems in a reasonable amount of rack space with minimal cabling.

Thank you
jms
Re: Famous operational issues [ In reply to ]
Beyond the widespread outages, I have so many personal war stories that
it's hard to pick a favorite.

My first job out of college in the mid-late 90s was at an ISP in Pittsburgh
that I joined pretty early in its existence, and everyone did a bit of
everything. I was hired to do sysadmin stuff, networking, pretty much
whatever was needed. About a year after I started, we brought up a new mail
system with an external RAID enclosure for the mail store itself. One day,
we saw indications that one of the disks in the RAID enclosure was starting
to fail, so I scheduled a maintenance window to replace the disk and let
the controller rebuild the data and integrate it back into the RAID set.
No big worries, right?

It's Tuesday at about 2 AM.

Well, the kernel on the RAID controller itself decided that the moment I pulled
the failing drive would be a fine time to panic, and more or less turned
itself into a bit-blender, taking the whole mailstore down with it. After
a few hours of watching fsck make no progress on anything, in terms of
trying to un-fsck the mailstore, we made the decision in consultation with
the CEO to pull the plug on trying to bring the old RAID enclosure back to
life, and focus on finding suitable replacement hardware and rebuild from
scratch. We also discovered that the most recent backups of the mailstore
were over a month old :(

I think our CEO ended up driving several hours to procure a suitable
enclosure. By the time we got the enclosure installed, filesystems built,
and got whatever tape backups we had restored, and tested the integrity of
the system, it was now Thursday around 8 AM. Coincidentally, that was the
same day the company hosted a big VIP gathering (the mayor was there, along
with lots of investors and other bigwigs), so I had to come back and put on
a suit to hobnob with the VIPs after getting a total of 6 hours of sleep in
about the previous 3 days. I still don't know how I got home that night
without wrapping my vehicle around a utility pole (due to being over-tired,
not due to alcohol).

Many painful lessons learned over that stretch of days, as often the case
as a company grows from startup mode and builds more robust technology and
business processes as a consequence of growth.

jms

On Tue, Feb 16, 2021 at 2:37 PM John Kristoff <jtk@dataplane.org> wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps the
> most notorious and likely to top many lists including mine. So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session. I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective. I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>
Re: Famous operational issues [ In reply to ]
An interesting sub-thread to this could be:

Have you ever unintentionally crashed a device by running a perfectly
innocuous command?
1. Crashed a 6500/Sup2 by typing "show ip dhcp binding".
2. "clear interface XXX" on a Nexus 7K triggered a cascading/undocument
Sev1 bug that caused two linecards to crash and reload, and take down about
two dozen buildings on campus at the .edu where I used to work.
3. For those that ever had the misfortune of using early versions of the
"bcc" command shell* on Bay Networks routers, which was intended to make
the CLI make look and feel more like a Cisco router, you have my
condolences. One would reasonably expect "delete ?" to respond with a list
of valid arguments for that command. Instead, it deleted, well...
everything, and prompted an on-site restore/reboot.

BCC originally stood for "Bay Command Console", but we joked that it really
stood for "Blatant Cisco Clone".

On Tue, Feb 16, 2021 at 2:37 PM John Kristoff <jtk@dataplane.org> wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps the
> most notorious and likely to top many lists including mine. So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session. I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective. I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>
Re: Famous operational issues [ In reply to ]
That brings back memories.... I had a similar experience. First month on the job, a large Sun RAID array storing ~5k mailboxes dies in the middle of the afternoon. So I start troubleshooting and determine it's most likely a bad disk. The CEO walked into the server room right about the time I had 20 disks laid out on a table. He had a fit and called the desktop support guy to come and 'show me how to fix a PC'.

Never mind the fact that we had a 90%-ready-to-go replacement box sitting at another site, and just needed to either go get it or bring the disks to it..... So we sat there until the desktop guy, who was 30 minutes away, got there. He took one look at it and said 'never touched that thing before, looks like he knows what he's doing' and pointed to me. Four hours later we were driving the new server to the data center strapped down in the back of a pickup. Fun times.


-----Original Message-----
From: "Justin Streiner" <streinerj@gmail.com>
Sent: Tuesday, February 23, 2021 5:11pm
To: "John Kristoff" <jtk@dataplane.org>
Cc: "NANOG" <nanog@nanog.org>
Subject: Re: Famous operational issues



Beyond the widespread outages, I have so many personal war stories that it's hard to pick a favorite.
My first job out of college in the mid-late 90s was at an ISP in Pittsburgh that I joined pretty early in its existence, and everyone did a bit of everything. I was hired to do sysadmin stuff, networking, pretty much whatever was needed. About a year after I started, we brought up a new mail system with an external RAID enclosure for the mail store itself. One day, we saw indications that one of the disks in the RAID enclosure was starting to fail, so I scheduled a maintenance window to replace the disk and let the controller rebuild the data and integrate it back into the RAID set. No big worries, right?
It's Tuesday at about 2 AM.
Well, the kernel on the RAID controller itself decided that when I pulled the failing drive would be a fine time to panic, and more or less turn itself into a bit-blender, and take all the mailstore down with it. After a few hours of watching fsck make no progress on anything, in terms of trying to un-fsck the mailstore, we made the decision in consultation with the CEO to pull the plug on trying to bring the old RAID enclosure back to life, and focus on finding suitable replacement hardware and rebuild from scratch. We also discovered that the most recent backups of the mailstore were over a month old :(
I think our CEO ended up driving several hours to procure a suitable enclosure. By the time we got the enclosure installed, filesystems built, and got whatever tape backups we had restored, and tested the integrity of the system, it was now Thursday around 8 AM. Coincidentally, that was the same day the company hosted a big VIP gathering (the mayor was there, along with lots of investors and other bigwigs), so I had to come back and put on a suit to hobnob with the VIPs after getting a total of 6 hours of sleep in about the previous 3 days. I still don't know how I got home that night without wrapping my vehicle around a utility pole (due to being over-tired, not due to alcohol).
Many painful lessons learned over that stretch of days, as often the case as a company grows from startup mode and builds more robust technology and business processes as a consequence of growth.
jms


On Tue, Feb 16, 2021 at 2:37 PM John Kristoff <[ jtk@dataplane.org ]( mailto:jtk@dataplane.org )> wrote:Friends,

I'd like to start a thread about the most famous and widespread Internet
operational issues, outages or implementation incompatibilities you
have seen.

Which examples would make up your top three?

To get things started, I'd suggest the AS 7007 event is perhaps the
most notorious and likely to top many lists including mine. So if
that is one for you I'm asking for just two more.

I'm particularly interested in this as the first step in developing a
future NANOG session. I'd be particularly interested in any issues
that also identify key individuals that might still be around and
interested in participating in a retrospective. I already have someone
that is willing to talk about AS 7007, which shouldn't be hard to guess
who.

Thanks in advance for your suggestions,

John
Re: Famous operational issues [ In reply to ]
I would be more interested in seeing someone who HASN'T crashed a Cisco
6500/7600, particularly one with a long uptime, by typing in a supposedly
harmless 'show' command.


On Tue, Feb 23, 2021 at 2:26 PM Justin Streiner <streinerj@gmail.com> wrote:

> An interesting sub-thread to this could be:
>
> Have you ever unintentionally crashed a device by running a perfectly
> innocuous command?
> 1. Crashed a 6500/Sup2 by typing "show ip dhcp binding".
> 2. "clear interface XXX" on a Nexus 7K triggered a cascading/undocument
> Sev1 bug that caused two linecards to crash and reload, and take down about
> two dozen buildings on campus at the .edu where I used to work.
> 3. For those that ever had the misfortune of using early versions of the
> "bcc" command shell* on Bay Networks routers, which was intended to make
> the CLI make look and feel more like a Cisco router, you have my
> condolences. One would reasonably expect "delete ?" to respond with a list
> of valid arguments for that command. Instead, it deleted, well...
> everything, and prompted an on-site restore/reboot.
>
> BCC originally stood for "Bay Command Console", but we joked that it
> really stood for "Blatant Cisco Clone".
>
> On Tue, Feb 16, 2021 at 2:37 PM John Kristoff <jtk@dataplane.org> wrote:
>
>> Friends,
>>
>> I'd like to start a thread about the most famous and widespread Internet
>> operational issues, outages or implementation incompatibilities you
>> have seen.
>>
>> Which examples would make up your top three?
>>
>> To get things started, I'd suggest the AS 7007 event is perhaps the
>> most notorious and likely to top many lists including mine. So if
>> that is one for you I'm asking for just two more.
>>
>> I'm particularly interested in this as the first step in developing a
>> future NANOG session. I'd be particularly interested in any issues
>> that also identify key individuals that might still be around and
>> interested in participating in a retrospective. I already have someone
>> that is willing to talk about AS 7007, which shouldn't be hard to guess
>> who.
>>
>> Thanks in advance for your suggestions,
>>
>> John
>>
>
Re: Famous operational issues [ In reply to ]
On Tue, Feb 23, 2021 at 5:14 PM Justin Streiner <streinerj@gmail.com> wrote:

> Beyond the widespread outages, I have so many personal war stories that
> it's hard to pick a favorite.
>
> My first job out of college in the mid-late 90s was at an ISP in
> Pittsburgh that I joined pretty early in its existence, and everyone did a
> bit of everything. I was hired to do sysadmin stuff, networking, pretty
> much whatever was needed. About a year after I started, we brought up a new
> mail system with an external RAID enclosure for the mail store itself. One
> day, we saw indications that one of the disks in the RAID enclosure was
> starting to fail, so I scheduled a maintenance window to replace the disk
> and let the controller rebuild the data and integrate it back into the RAID
> set. No big worries, right?
>
> It's Tuesday at about 2 AM.
>
> Well, the kernel on the RAID controller itself decided that when I pulled
> the failing drive would be a fine time to panic, and more or less turn
> itself into a bit-blender, and take all the mailstore down with it. After
> a few hours of watching fsck make no progress on anything, in terms of
> trying to un-fsck the mailstore, we made the decision in consultation with
> the CEO to pull the plug on trying to bring the old RAID enclosure back to
> life, and focus on finding suitable replacement hardware and rebuild from
> scratch. We also discovered that the most recent backups of the mailstore
> were over a month old :(
>
> I think our CEO ended up driving several hours to procure a suitable
> enclosure. By the time we got the enclosure installed, filesystems built,
> and got whatever tape backups we had restored, and tested the integrity of
> the system, it was now Thursday around 8 AM. Coincidentally, that was the
> same day the company hosted a big VIP gathering (the mayor was there, along
> with lots of investors and other bigwigs), so I had to come back and put on
> a suit to hobnob with the VIPs after getting a total of 6 hours of sleep in
> about the previous 3 days. I still don't know how I got home that night
> without wrapping my vehicle around a utility pole (due to being over-tired,
> not due to alcohol).
>
> Many painful lessons learned over that stretch of days, as often the case
> as a company grows from startup mode and builds more robust technology and
> business processes as a consequence of growth.
>

Oh, dear. RAID.... that triggered 2 stories.
1: I worked at a small ISP in Westchester, NY. One day I'm doing stuff, and
want to kill process 1742, so I type 'kill -9 1' ... and then, before
pressing enter, I get distracted by our "Cisco AGS+ monitor" (a separate
story). After I get back to my desk I unlock my terminal, and call over a
friend to show just how close I'd gotten to making something go Boom. He
says "Nah, BSD is cleverer than that. I'm sure the kill command has some
check in to stop you killing init.". I disagree. He disagrees. I disagree
again. He calls me stupid. I bet him a soda.
He proves his point by typing 'su; kill -9 1' in the window he's logged
into -- and our primary NFS server (with all of the user sites)
obediently kills off init, and all of the child processes.... we run over
to the front of the box and hit the power switch, while desperately looking
for a monitor and keyboard to watch it boot.
It does the BIOS checks, and then stops on the RAID controller, complaining
about the fact that there are *2* dead drives, and that the array is now
sad.....
This makes no sense. I can understand one drive not recovering from a power
outage, but 2 seems a bit unlikely, especially because the machine hadn't
been beeping or anything like that.... we try turning it off and on again a
few times, no change... We pull the machine out of the rack and rip the
cover off.
Sure enough, there is a RAID card - but the piezo-buzzer on it is, for some
reason, wrapped in a bunch of napkins, held in place with electrical tape.
I pull that off, and there is also some paper towel jammed into the hole
in the buzzer, and bits of a broken pencil....

After replacing the drives, starting an rsync restore from a backup server
we investigate more....
...
it turns out that a few months ago(!) the machine had started beeping. The
night crew naturally found this annoying, and so they'd gone investigating
and discovered that it was this machine, and lifted the lid while still in
the rack. They traced the annoying noise to this small black thingie, and
poked at it until it stopped, thus solving the problem once and for
all.... yay!





2: I used to work at a company which was in one of the buildings next to
the twin-towers. For various clever reasons, they had their "datacenter" in
a corner of the office space... anyway, the planes hit, power goes out and
the building is evacuated - luckily no one is injured, but the entire
company/site is down. After a few weeks, my friend Joe is able to arrange
with a fire marshal to get access to the building so he can go and grab the
disks with all the data. The fire marshal and Joe trudge up the 15 flights
of stairs.... When they reach the suite, Joe discovers that the windows
where his desk was are blown in, there is debris everywhere, etc. He's
somewhat shaken by all this, but goes over to the datacenter area, pulls
the drives out of the Sun storage arrays, and puts them in his backpack.
They then trudge down the 15 flights of stairs, and Joe takes them home.
We've managed to scrounge up 3 identical (empty) arrays, and some servers,
and the plan is to temporarily run the service from his basement...

Anyway, I get a panicked call from Joe. He's got the empty RAID arrays.
He's got the servers. He's got a pile of 42 drives (3 enclosures, 14 drives
per enclosure). Unfortunately he hadn't thought to mark the order
of the drives, and now we have *no* idea which drive goes in which array,
nor in which slot in the array....

We spent some time trying to figure out how many ways you can arrange 42
things into 3 piles, and how long it would take to try all combinations....
I cannot remember the actual number, but it approached the lifetime of the
universe....
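
For scale, here is a quick back-of-the-envelope sketch in Python; the
billion-guesses-per-second rate below is an illustrative assumption, not a
figure from the thread. With 3 enclosures of 14 distinguishable slots each,
every ordering of the 42 drives is a distinct candidate, i.e. 42! arrangements:

    import math

    # 42 distinct drives into 42 distinguishable slots (3 enclosures x 14 slots each)
    arrangements = math.factorial(42)
    print(f"{arrangements:.3e} possible arrangements")        # ~1.4e+51

    # Assume (wildly optimistically) a billion rebuild attempts per second.
    trials_per_second = 1e9
    seconds_per_year = 60 * 60 * 24 * 365
    years = arrangements / trials_per_second / seconds_per_year
    print(f"{years:.3e} years to try them all")               # ~4.5e+34 years
    # ...versus roughly 1.4e+10 years since the Big Bang.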
After much time and poking, we eventually worked out that the RAID
controller wrote a slot number at sector 0 on each physical drive, and it
became a solvable problem, but...
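
A minimal sketch of that last step, assuming the goal is just to eyeball the
controller metadata at the start of each disk; the device names and the idea
that the slot byte is visible in the first sector are assumptions for
illustration -- the real on-disk layout was vendor-specific:

    #!/usr/bin/env python3
    # Hypothetical helper: hex-dump the first sector of each member drive so any
    # per-slot metadata written by the RAID controller can be compared by eye.
    import sys

    SECTOR = 512

    for dev in sys.argv[1:]:              # e.g. /dev/sda /dev/sdb ... (assumed names)
        with open(dev, "rb") as f:
            header = f.read(SECTOR)
        # Show the first 64 bytes; bytes that differ drive-to-drive are candidates
        # for the enclosure/slot number.
        print(dev, header[:64].hex(" "))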


W


> jms
>
> On Tue, Feb 16, 2021 at 2:37 PM John Kristoff <jtk@dataplane.org> wrote:
>
>> Friends,
>>
>> I'd like to start a thread about the most famous and widespread Internet
>> operational issues, outages or implementation incompatibilities you
>> have seen.
>>
>> Which examples would make up your top three?
>>
>> To get things started, I'd suggest the AS 7007 event is perhaps the
>> most notorious and likely to top many lists including mine. So if
>> that is one for you I'm asking for just two more.
>>
>> I'm particularly interested in this as the first step in developing a
>> future NANOG session. I'd be particularly interested in any issues
>> that also identify key individuals that might still be around and
>> interested in participating in a retrospective. I already have someone
>> that is willing to talk about AS 7007, which shouldn't be hard to guess
>> who.
>>
>> Thanks in advance for your suggestions,
>>
>> John
>>
>

--
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
-- E. W. Dijkstra
Re: Famous operational issues [ In reply to ]
On 2/23/2021 12:22 PM, Justin Streiner wrote:
> An interesting sub-thread to this could be:
> Have you ever unintentionally crashed a device by running a perfectly
> innocuous command?
> ---------------------------------------------------------------

There was that time in the late 1990s when I took most of a global
network down several times by typing "show ip bgp regexp <regex here>" on
most all of the core routers.  It turned out to be a cisco bug.  I looked
for a reference, but cannot find one.  Ahh, the earlier days of the
commercial internet... gotta love 'em.

scott
Re: Famous operational issues [ In reply to ]
Anyone remember when DEC delivered a new VMS version (V5 I think)
whose backups didn't work, couldn't be restored?

BU did, the hard way, when the engineering dept's faculty and student
disk failed.

DEC actually paid thousands of dollars for typist services to come and
re-enter whatever was on paper and could be re-entered.

I think that was the day I won the Unix vs VMS wars at BU anyhow.

--
-Barry Shein

Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com
Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD
The World: Since 1989 | A Public Information Utility | *oo*
Re: Famous operational issues [ In reply to ]
My war story.

At one of our major POPs in DC we had a row of 7513's, and one of them
had intermittent problems. I had replaced every piece of removable
card/part in it over time, and it kept failing. Even the vendor flew in
a team to the site to try to figure out what was wrong. It was finally
decided to replace the whole router (about 200lbs?). Being the local
field tech, that was my Job. On the night of the maintenance at 3am, the
work started. I switched off the rack power, which included a 2511
terminal server that was connected to half the routers in the row and
started to remove the router. A few minutes later I got a text, "You're
taking out the wrong router!" You can imagine the "Damn it, what have I
done?" feeling that runs through your mind and the way your heart stops
for a moment.

Okay, I wasn't taking out the wrong router. But unknown at the time,
terminal servers, when turned off, had a nasty habit of sending a break
to all the routers they were connected to, and all those routers
effectively stopped.
whole POP go red and assumed I was the cause. I was, but not because of
anything I could have known about. I had to power cycle the downed
routers to bring them back on-line, and then continue with the
maintenance. A disaster to all involved, but the router got replaced.

I gave a very detailed account of my actions in the postmortem. It was
clear they believed I had turned off the wrong rack/router and wasn't being
honest about it. I was adamant I had done exactly what I said, and even
swore I would fess up if I had erred, and always would, even if it
cost me the job. I rarely made mistakes, if any, so it was an easy thing
for me to say. For the next two weeks everyone that was aware of the work
gave me the side eye.

About a week after that, the same thing happened to another field tech
in another state. That helped my case. They used my account to figure
out it was the TS that caused the problem. A few of them that had
questioned me harshly admitted to me my account helped them figure out
the cause.

And the worst part of this story? That router, completely replaced,
still had the same intermittent problem as before. It was a DC powered
POP, so they were all wired with the same clean DC power. In the end
they chalked it up to cosmic rays and gave up on it. I believe this
break issue was unique to the DC powered 2511's, and that we were the
first to use them, but I might be wrong on that.


On 2/16/21 2:37 PM, John Kristoff wrote:
> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps the
> most notorious and likely to top many lists including mine. So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session. I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective. I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
Re: Famous operational issues [ In reply to ]
While we're talking about raid types...

A few acquisitions ago, between 2006-2010, I worked at a Wireless ISP in
Northern Indiana. Our CEO decided to sell Internet service to school
systems because the e-rate funding was too much to resist. He had the idea
to install towers on the schools and sell service off that while paying the
school for roof rights. About two years into the endeavor, I wake up one
morning and walk to my car. Two FBI agents get out of an unmarked towncar.
About an hour later, they let me go to the office where I found an entire
barrage of FBI agents. It was a full raid and not the kind you want to see.
Hard drives were involved and being made redundant, but the redundant
copies were labeled and placed into boxes that were carried out to SUVs
that were as dark as the morning coffee these guys drank. There were a lot
of drives; all of our servers were in our server room at the office. There
were roughly five or six racks, with varying amounts of equipment in each.

After some questioning and assisting them in their cataloging adventure,
the agents left us with a ton of questions and just enough equipment to
keep the customers connected. CEO became extremely paranoid at this point.
He told us to prepare to move servers to a different building. He went into
a tailspin trying to figure out where he could hide the servers to keep
things going without the bank or FBI seizing the assets. He was extremely
worried the bank would close the office down. We started moving all network
routing around to avoid using the office as our primary DIA.

One morning I get into the office and we hear the words we've been
dreading: "We're moving the servers". The plan was to move them to a tower
site that had a decent-sized shack on site. Connectivity was decent, we had
a licensed 11GHz microwave backhaul capable of about 155mbps. The site was
part of the old MCI microwave long-distance network in the 80s and 90s. It
had redundant air conditioners, a large propane tank, and a generator
capable of keeping the site alive for about three days. We were told not to
notify any customers, which became problematic because two customers had
servers colocated in our building. We consolidated the servers into three
racks and managed to get things prepared with a decent UPS in each rack.
CEO decided to move the servers at nightfall to "avoid suspicion". Our
office was in an unsavory part of town; moving anything at night was
suspicious. So, under the cover of half-ass darkness, we loaded the racks
onto a flatbed truck and drove them 20 minutes to the tower. While we
unloaded the racks, an electrician we knew was wiring up the L5-20 outlets
for the UPS in each rack. We got the racks plugged in, servers powered up,
and then the two customers came that had colocated equipment. They got
their equipment powered up and all seemed ok.

Back at the office the next day we were told to gather our workstations and
start working from home. I've been working from home ever since and quite
enjoy it, but that's beside the point.

Summer starts and I tell the CEO we need to repair the AC units because
they are failing. He ignores it, claiming he doesn't want to lose money the
bank could take at any minute. About a month later, a nice hot summer day
rolls in and the AC units both die. I stumble upon an old portable AC unit
and put that at the site. Temperatures rise to 140F ambient. Server
overheat alarms start going off, things start failing. Our colocation
customers are extremely upset. They pull their servers and drop service.
The heat subsides, CEO finally pays to repair one of the AC units.

Eventually, the company declares bankruptcy and goes into liquidation.
Luckily another WISP catches wind of it, buys the customers and assets, and
hires me. My happiest day that year was moving all the servers into a
better-suited home, a real data center. I don't know what happened to the
CEO, but I know that I'll never trust anything he has his hands in ever
again.

Adam Kennedy
Systems Engineer
adamkennedy@watchcomm.net | 800-589-3837 x120
Watch Communications | www.watchcomm.net
3225 W Elm St, Suite A
Lima, OH 45805


On Tue, Feb 23, 2021 at 8:55 PM brutal8z via NANOG <nanog@nanog.org> wrote:

> My war story.
>
> At one of our major POPs in DC we had a row of 7513's, and one of them had
> intermittent problems. I had replaced every piece of removable card/part in
> it over time, and it kept failing. Even the vendor flew in a team to the
> site to try to figure out what was wrong. It was finally decided to replace
> the whole router (about 200lbs?). Being the local field tech, that was my
> Job. On the night of the maintenance at 3am, the work started. I switched
> off the rack power, which included a 2511 terminal server that was
> connected to half the routers in the row and started to remove the router.
> A few minutes later I got a text, "You're taking out the wrong router!" You
> can imagine the "Damn it, what have I done?" feeling that runs through your
> mind and the way your heart stops for a moment.
>
> Okay, I wasn't taking out the wrong router. But unknown at the time,
> terminal servers when turned off, had a nasty habit of sending a break to
> all the routers it was connected to, and all those routers effectively
> stopped. The remote engineer that was in charge saw the whole POP go red
> and assumed I was the cause. I was, but not because of anything I could
> have known about. I had to power cycle the downed routers to bring them
> back on-line, and then continue with the maintenance. A disaster to all
> involved, but the router got replaced.
>
> I gave a very detailed account of my actions in the postmortem. It was
> clear they knew I had turned off the wrong rack/router, and wasn't being
> honest about it. I was adamant I had done exactly what I said, and even
> swore I would fess up if I had error-ed, and always would, even if it cost
> me the job. I rarely made mistakes, if any, so it was an easy thing for me
> to say. For the next two weeks everyone that aware of the work gave me the
> side eye.
>
> About a week after that, the same thing happened to another field tech in
> another state. That helped my case. They used my account to figure out it
> was the TS that caused the problem. A few of them that had questioned me
> harshly admitted to me my account helped them figure out the cause.
>
> And the worst part of this story? That router, completely replaced, still
> had the same intermittent problem as before. It was a DC powered POP, so
> they were all wired with the same clean DC power. In the end they chalked
> it up to cosmic rays and gave up on it. I believe this break issue was
> unique to the DC powered 2511's, and that we were the first to use them,
> but I might be wrong on that.
>
>
> On 2/16/21 2:37 PM, John Kristoff wrote:
>
> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps the
> most notorious and likely to top many lists including mine. So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session. I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective. I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>
>
>
Re: Famous operational issues [ In reply to ]
maybe late '60s or so, we had a few 2314 dasd monsters[0]. think maybe
4m x 2m with 9 drives with removable disk packs.

a graveyard shift operator gets errors on a drive and wonders if maybe they
should swap it into another spindle. no luck, so they swapped those two
drives with two others. one more iteration, and they had wiped out the
entire array. at that point they called me; so i missed the really creative
part.

[0] https://www.ibm.com/ibm/history/exhibits/storage/storage_2314.html

randy

---
randy@psg.com
`gpg --locate-external-keys --auto-key-locate wkd randy@psg.com`
signatures are back, thanks to dmarc header mangling
Re: Famous operational issues [ In reply to ]
On Tue, 23 Feb 2021 20:46:38 -0800, Randy Bush said:
> maybe late '60s or so, we had a few 2314 dasd monsters[0]. think maybe
> 4m x 2m with 9 drives with removable disk packs.
>
> a grave shift operator gets errors on a drive and wonders if maybe they
> swap it into another spindle. no luck, so swapped those two drives with
> two others. one more iteration, and they had wiped out the entire
> array. at that point they called me; so i missed the really creative
> part.

I suspect every S/360 site that had 2314's had an operator who did that, as I
was witness to the same thing. For at least a decade after that debacle, the
Manager of Operations was awarding Gold, Silver, and Bronze Danny awards for
operational screw-ups. (The 2314 event was the sole Platinum Danny :)

And yes, on IBM 4341 consoles it was all too easy to hit the EPO button on the
keyboard; we got guards for the consoles after one of our operators nailed the
button a second time in a month.

And to tie the S/360 and 4341 together - we were one of the last sites that was
still running an S/360 Mod 65J. And plans came through for a new server room
on the top floor of a new building. Architect comes through, measures the S/360
and all the peripherals for floorspace and power/cooling - and the CPU, plus
*4* meg of memory, and 3 strings of 2314 drives chewed a lot of both.

Construction starts. Meanwhile, IBM announces the 4341, and offers us a real
sweetheart deal because even at the high maintenance charges we were paying,
IBM was losing money. Something insane like the system and peripherals and
first 3 years of maintenance, for less than the old system per-year
maintenance. Oh, and the power requirements are like 10% of the 360s.

So we take delivery of the new system and it's looking pitiful, just one box
and 2 small strings of disk in 10K square feet. Lots of empty space. Do all
the migrations to the new system over the summer, and life is good. Until
fall and winter arrive, and we discover there is zero heat in the room, and the
ceiling is uninsulated, and it's below zero outside because this is way upstate
NY. And if there was a 360 in the room, it would *still* be needing cooling
rather than heating. But it's a 4341 that's shedding only 10% of the heat...

Finally, one February morning, the 4341 throws a thermal check. Air was too
cold at the intakes. Our IBM CE did a double-take because he'd been doing IBM
mainframes for 3 decades and had never seen a thermal check for too cold
before.

Lots of legal action threatened against the architect, who simply said "If you
had *told* me that the system was being replaced, I'd have put heat in the
room". A settlement was reached, revised plans were drawn up, there was a whole
mess of construction to get ductwork and insulation and other stuff into place,
and life was good for the decade or so before I left for a better gig....
Re: Famous operational issues [ In reply to ]
    I personally did "disable vlan Xyz" instead of "delete vlan Xyz" on
Extreme Network... which proceeded to disable all the ports where the
VLAN was present...

    Good thing it was a (local) remote pop and not on the core.

-----
Alain Hebert ahebert@pubnix.net
PubNIX Inc.
50 boul. St-Charles
P.O. Box 26770 Beaconsfield, Quebec H9W 6G7
Tel: 514-990-5911 http://www.pubnix.net Fax: 514-990-9443

On 2/23/21 5:22 PM, Justin Streiner wrote:
> An interesting sub-thread to this could be:
>
> Have you ever unintentionally crashed a device by running a perfectly
> innocuous command?
> 1. Crashed a 6500/Sup2 by typing "show ip dhcp binding".
> 2. "clear interface XXX" on a Nexus 7K triggered a
> cascading/undocument Sev1 bug that caused two linecards to crash and
> reload, and take down about two dozen buildings on campus at the .edu
> where I used to work.
> 3. For those that ever had the misfortune of using early versions of
> the "bcc" command shell* on Bay Networks routers, which was intended
> to make the CLI make look and feel more like a Cisco router, you have
> my condolences.  One would reasonably expect "delete ?" to respond
> with a list of valid arguments for that command.  Instead, it deleted,
> well... everything, and prompted an on-site restore/reboot.
>
> BCC originally stood for "Bay Command Console", but we joked that it
> really stood for "Blatant Cisco Clone".
>
> On Tue, Feb 16, 2021 at 2:37 PM John Kristoff <jtk@dataplane.org
> <mailto:jtk@dataplane.org>> wrote:
>
> Friends,
>
> I'd like to start a thread about the most famous and widespread
> Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps  the
> most notorious and likely to top many lists including mine. So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session.  I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective.  I already have
> someone
> that is willing to talk about AS 7007, which shouldn't be hard to
> guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>
Re: Famous operational issues [ In reply to ]
anyone else have the privilege of running 2321 data cells? had a bunch.
unreliable as hell. there was a job running continuously recovering
transactions off of log tapes. one night at 3am, the head of apps programming
(i was systems) got a call that a tran tape was unmounted with a console
message that recovery was complete. ops did not know what it meant or
what to do. it was the first time in over five years the data were stable.

wife of same head of apps grew more and more tired of 2am calls.
finally she answered one "david? he said he was going in to work."
ops never called in the night again.

randy

---
randy@psg.com
`gpg --locate-external-keys --auto-key-locate wkd randy@psg.com`
signatures are back, thanks to dmarc header mangling
