Mailing List Archive

Re: Famous operational issues [ In reply to ]
One day I got called into the office supplies area because there was a
smell of something burning. Uh-oh.

To make a long story short, a stainless-steel bowl was focusing sunlight
from a window onto a cardboard box and igniting it.

Talk about SMH and random bad luck. It could have been a lot worse;
nothing really happened other than some smoke and char.

On February 18, 2021 at 01:07 eric.kuhnke@gmail.com (Eric Kuhnke) wrote:
> On that note, I'd be very interested in hearing stories of actual incidents
> that are the cause of why cardboard boxes are banned in many facilities, due to
> loose particulate matter getting into the air and setting off very sensitive
> fire detection systems.
>
> Or maybe it's more mundane and 99% of the reason is people unpack stuff and
> don't always clean up properly after themselves.
>
> On Wed, Feb 17, 2021, 6:21 PM Owen DeLong <owen@delong.com> wrote:
>
> Stolen isn’t nearly as exciting as what happens when your (used) 6509
> arrives and
> gets installed and operational before anyone realizes that the conductive
> packing
> peanuts that it was packed in have managed to work their way into various
> midplane
> connectors. Several hours later someone notices that the box is quite
> literally
> smoldering in the colo and the resulting combination of panic, fire drill,
> and
> management antics that ensue.
>
> Owen
>
>
> > On Feb 16, 2021, at 2:08 PM, Jared Mauch <jared@puck.nether.net> wrote:
> >
> > I was thinking about how we need a war stories nanog track. My favorite
> was being on call when the router was stolen.
> >
> > Sent from my TI-99/4a
> >
> >> On Feb 16, 2021, at 2:40 PM, John Kristoff <jtk@dataplane.org> wrote:
> >>
> >> Friends,
> >>
> >> I'd like to start a thread about the most famous and widespread Internet
> >> operational issues, outages or implementation incompatibilities you
> >> have seen.
> >>
> >> Which examples would make up your top three?
> >>
> >> To get things started, I'd suggest the AS 7007 event is perhaps  the
> >> most notorious and likely to top many lists including mine.  So if
> >> that is one for you I'm asking for just two more.
> >>
> >> I'm particularly interested in this as the first step in developing a
> >> future NANOG session.  I'd be particularly interested in any issues
> >> that also identify key individuals that might still be around and
> >> interested in participating in a retrospective.  I already have someone
> >> that is willing to talk about AS 7007, which shouldn't be hard to guess
> >> who.
> >>
> >> Thanks in advance for your suggestions,
> >>
> >> John
>
>

--
-Barry Shein

Software Tool & Die | bzs@TheWorld.com | http://www.TheWorld.com
Purveyors to the Trade | Voice: +1 617-STD-WRLD | 800-THE-WRLD
The World: Since 1989 | A Public Information Utility | *oo*
Re: Famous operational issues [ In reply to ]
Northridge quake. I was #2 and on call at CRL. That One Guy on dialup in Atlanta, playing MUDs 23x7, pages that things are down. I wander out to my computer to dial in and see what’s up, turned on the TV walking past it, sat down and turned the computer on, and as it was booting, on comes a live helicopter shot over Northridge showing the 1.5 remaining floors of the 3-story Cable and Wireless building our east coast connector went through.

Took a second to listen and make sure I understood what was happening, changed channels to verify it wasn’t a stunt, logged on and pinged our router there to confirm nothing there, call & wake up Jim: “East coast’s down because earthquake in Northridge and the C&W center fell down.”

“....oh.”

And then there was the Sidekick outage...


-George

Sent from my iPhone

> On Feb 18, 2021, at 4:37 PM, Patrick W. Gilmore <patrick@ianai.net> wrote:
>
> On Feb 18, 2021, at 6:10 PM, Karl Auer <kauer@biplane.com.au> wrote:
>>
>> I think it was Machiavelli who said that one should not ascribe to
>> malice anything adequately explained by incompetence…
>
> https://en.wikipedia.org/wiki/Hanlon%27s_razor
> Never attribute to malice that which is adequately explained by stupidity.
>
> I personally prefer this version from Robert A. Heinlein:
> Never underestimate the power of human stupidity.
>
> And to put it on topic, cover your EPOs
>
> In 1994, there was a major earthquake near the city of Los Angeles. City hall had to be evacuated and it would take over a year to reinforce the building to make it habitable again. My company moved all the systems in the basement of city hall to a new datacenter a mile or so away. After the install, we spent more than a week coaxing their ancient (even for 1994) machines back online, such as a Prime Computer and an AS400 with tons of DASD. Well, tons of cabinets, certainly less storage than my watch has now.
>
> I was in the DC going over something with the lady in charge when someone walked in to ask her something. She said “just a second”. That person took one step to the side of the door and leaned against the wall - right on an EPO which had no cover.
>
> Have you ever heard an entire row of DASD spin down instantly? Or taken 40 minutes to IPL an AS400? In the middle of the business day? For the second most populous city in the country?
>
> Me: Maybe you should get a cover for that?
> Her: Good idea.
>
> Couple weeks later, in the same DC, going over final checklist. A fedex guy walks in. (To this day, no idea how he got in a supposedly locked DC.) She says “just a second”, and I get a very strong deja vu feeling. He takes one step to the side and leans against the wall.
>
> Me: Did you order that EPO cover?
> Her: Nope.
>
> --
> TTFN,
> patrick
>
Re: Famous operational issues [ In reply to ]
Did you at least hire the janitor?

From: NANOG <nanog-bounces+ops.lists=gmail.com@nanog.org> on behalf of Mark Tinka <mark@tinka.africa>
Date: Friday, 19 February 2021 at 10:20 AM
To: nanog@nanog.org <nanog@nanog.org>
Subject: Re: Famous operational issues

On 2/19/21 00:37, Warren Kumari wrote:

5: Another one. In the early 2000s I was working for a dot-com boom company. We are building out our first datacenter, and I'm installing a pair of Cisco 7206s in 811 10th Ave. These will run basically the entire company, we have some transit, we have some peering to configure, we have an AS, etc. I'm going to be configuring all of this; clearly I'm a router-god...
Anyway, while I'm getting things configured, this janitor comes past, wheeling a garbage bin. He stops outside the cage and says "Whatcha doin'?". I go into this long explanation of how these "routers" <point> will connect to "the Internet" <wave hands in a big circle> to allow my "servers" <gesture at big black boxes with blinking lights> to talk to other "computers" <typing motion> on "the Internet" <again with the waving of the hands>. He pauses for a second, and says "'K. So, you doing a full iBGP mesh, or confeds?". I really hadn't intended to be a condescending ass, but I think of that every time I realize I might be assuming something about someone based on their attire/job/etc.

:-), cute.

Mark.
Re: Famous operational issues [ In reply to ]
On Feb 18, 2021, at 11:51 PM, Suresh Ramasubramanian <ops.lists@gmail.com> wrote:

>> On 2/19/21 00:37, Warren Kumari wrote:

>> and says "'K. So, you doing a full iBGP mesh, or confeds?". I really hadn't
>> intended to be a condescending ass, but I think of that every time I realize I
>> might be assuming something about someone based on their attire/job/etc.

> Did you at least hire the janitor?

Well, it's funny that you mention that because I worked at a place where the
company ended up hiring a young lady who worked in the cafeteria. When she
graduated she was offered a job in HR, and turned out to be absolutely awesome.

At some point in my life, I was carrying 50 lb bags of potato starch. Now I have
two graduate degrees and am working on a third. That janitor may be awesome, too!

Thanks,

Sabri
Re: Famous operational issues [ In reply to ]
He is. He asked a perfectly relevant question based on what he saw of the physical setup in front of him.

And he kept his cool when being talked down to.

I'd hire him the next minute, personally speaking.

Re: Famous operational issues [ In reply to ]
On 2/19/21 10:40, Suresh Ramasubramanian wrote:

> He is. He asked a perfectly relevant question based on what he saw of
> the physical setup in front of him.
>
> And he kept his cool when being talked down to.
>
> I’d hire him the next minute, personally speaking.
>

In the early 2000s, with that level of deduction, I'd have been
surprised if he wasn't snatched up quickly. Unless, of course, it
ultimately wasn't his passion.

Mark.
Re: Famous operational issues [ In reply to ]
Do you remember the Cisco HDCI connectors?
https://en.wikipedia.org/wiki/HDCI

I once shipped a Cisco 4500 plus some cables to a remote data center and asked the local guys to cable them for me.
With Cisco you could check the cable type and if they were properly attached. They were not.

I asked for a check and the local guy confirmed to me three times that the cables were properly plugged in.
At the end I gave up, and took the 3 hour drive to the datacenter to check myself.

Problem was that, while the casing of the connector is asymmetrical, the pins inside are symmetrical.
And the local guy was quite strong.

Yes, he managed to plug in the cables 180° flipped, bending the case, but he got them in.
He was quite embarrassed when I fixed the cabling problem in 10 seconds.

That must have been 1995 or so....

Wolfgang



> On 16. Feb 2021, at 20:37, John Kristoff <jtk@dataplane.org> wrote:
>
> Which examples would make up your top three?

--
Wolfgang Tremmel

Phone +49 69 1730902 0 | wolfgang.tremmel@de-cix.net
Executive Directors: Harald A. Summa and Sebastian Seifert | Trade Registry: AG Cologne, HRB 51135
DE-CIX Management GmbH | Lindleystrasse 12 | 60314 Frankfurt am Main | Germany | www.de-cix.net
Re: Famous operational issues [ In reply to ]
On 16 Feb 2021, at 20:37, John Kristoff wrote:

> I'd like to start a thread about the most famous and widespread
> Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?


My absolute top one happened in 1995. Traffic engineering was not a widely
used term then. A bright colleague who will remain un-named decided that
he could make AS paths longer by repeating the same AS number more than
once. Unfortunately the prevalent software on Cisco routers was not
resilient to such trickery and reacted with a reboot. This caused an
avalanche of yo-yo-ing routers. Think it through!
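A rough Python sketch of the trick and of the defense that was missing (the function names and the length limit are illustrative, not from any vendor's code):

```python
# Illustrative sketch (not router code): "prepending" repeats an ASN in the
# AS_PATH so the route looks longer and therefore less attractive to BGP.
def prepend(as_path, asn, times):
    """Return a new AS_PATH with `asn` repeated `times` times in front."""
    return [asn] * times + list(as_path)

# A defensive implementation validates the attribute instead of trusting it;
# the 1995-era software described above reportedly rebooted instead.
MAX_PATH_LEN = 64  # arbitrary illustrative limit, not a standard value

def accept_path(as_path):
    """Reject empty or absurdly long AS_PATHs rather than crashing on them."""
    return 0 < len(as_path) <= MAX_PATH_LEN

path = prepend([64512, 64513], 64500, 3)
print(path)               # [64500, 64500, 64500, 64512, 64513]
print(accept_path(path))  # True
```

The point of the lesson below is the `accept_path` step: input that is merely unusual, not even malicious, must not take the process down.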

It took some time before that offending path could be purged from the
whole Internet; yes we all roughly knew the topology and the players of
the BGP speaking parts of it at that time. Luckily this happened
during the set-up for the Danvers IETF and co-ordination between major
operators was quick because most of their routing geeks happened to be
in the same room, the ‘terminal room’; remember those?

Since at the time I personally had no responsibility for operations any
more I went back to pulling cables and crimping RJ45s.

Lessons: HW/SW mono-cultures are dangerous. Input testing is good
practice at all levels of software. Operational co-ordination is key in
times of crisis.

Daniel
Re: Famous operational issues [ In reply to ]
In the case of Exodus when I was working there, it was literally dictated to us by
the fire marshal of the city of Santa Clara (and enough other cities where we had
datacenters to make a universal policy the only sensible choice).

Owen

> On Feb 18, 2021, at 1:07 AM, Eric Kuhnke <eric.kuhnke@gmail.com> wrote:
>
> On that note, I'd be very interested in hearing stories of actual incidents that are the cause of why cardboard boxes are banned in many facilities, due to loose particulate matter getting into the air and setting off very sensitive fire detection systems.
>
> Or maybe it's more mundane and 99% of the reason is people unpack stuff and don't always clean up properly after themselves.
>
> On Wed, Feb 17, 2021, 6:21 PM Owen DeLong <owen@delong.com <mailto:owen@delong.com>> wrote:
> Stolen isn’t nearly as exciting as what happens when your (used) 6509 arrives and
> gets installed and operational before anyone realizes that the conductive packing
> peanuts that it was packed in have managed to work their way into various midplane
> connectors. Several hours later someone notices that the box is quite literally
> smoldering in the colo and the resulting combination of panic, fire drill, and
> management antics that ensue.
>
> Owen
>
>
> > On Feb 16, 2021, at 2:08 PM, Jared Mauch <jared@puck.nether.net <mailto:jared@puck.nether.net>> wrote:
> >
> > I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen.
> >
> > Sent from my TI-99/4a
> >
> >> On Feb 16, 2021, at 2:40 PM, John Kristoff <jtk@dataplane.org <mailto:jtk@dataplane.org>> wrote:
> >>
> >> Friends,
> >>
> >> I'd like to start a thread about the most famous and widespread Internet
> >> operational issues, outages or implementation incompatibilities you
> >> have seen.
> >>
> >> Which examples would make up your top three?
> >>
> >> To get things started, I'd suggest the AS 7007 event is perhaps the
> >> most notorious and likely to top many lists including mine. So if
> >> that is one for you I'm asking for just two more.
> >>
> >> I'm particularly interested in this as the first step in developing a
> >> future NANOG session. I'd be particularly interested in any issues
> >> that also identify key individuals that might still be around and
> >> interested in participating in a retrospective. I already have someone
> >> that is willing to talk about AS 7007, which shouldn't be hard to guess
> >> who.
> >>
> >> Thanks in advance for your suggestions,
> >>
> >> John
>
Re: Famous operational issues [ In reply to ]
All these stories remind me of two of my own from back in the late 90s.
I worked for a regional ISP doing some network stuff (under the real
engineer), and some software development.

Like a lot of ISPs in the 90s, this one started out in a rental house.
Over the months and years rooms were slowly converted to host more and more
equipment as we expanded our customer base and presence in the region.
If we needed a "rack", someone would go to the store and buy a 4-post metal
shelf [1] or... in some cases, go to the dump to see what they had.

We had one that looked like an oversized filing cabinet with some sort of
rails on the sides. I don't recall how the equipment was mounted, but I
think it was by drilling holes into the front lip and tapping the screws
in. This was the big super-important rack. It had the main router that
connected lines between 5 POPs around the region, and also several
connections to Portland Oregon about 60 miles away. Since we were
making tons of money, we decided we should update our image and install
real racks in the "bedroom server room". It was decided we were going to
do it with no downtime.

I was on the 2-man team that stood behind and in front of the rack with
2x4s dead-lifting them as equipment was unscrewed and lowered onto the
boards. I was on the back side of the rack. After all the equipment was
unscrewed, someone came in with a sawzall and cut the filing cabinet thing
apart. The top half was removed and taken away, then we lifted up on the
boards and the bottom half was slid out of the way. The new rack was
brought in, bolted to the floor, and then one by one equipment was taken
off the pile we were holding up with 2x4s, brought through the back of the
new rack, and then mounted.

I was pleasantly surprised and very relieved when we finished moving the
big router, several switches, a few servers, and a UPS unit over to the new
rack with zero downtime. The entire team cheered and cracked beers. I
stepped out from behind the rack...
...and snagged the power cable to the main router with my foot. I don't
recall the Cisco model number after all this time...but I do remember the
excruciating 6-8 minutes it took for the damn thing to reboot, and the
sight of the 7 PRI cards in our phone system almost immediately jumping
from 5 channels in-use to being 100% full.

It's been 20 years, but I swear my arms are still sore from holding all
that equipment up for ~20 minutes, and I always pick my feet up very slowly
when I'm near a rack. ;)

The second story is a short one from the same time period. Our POPs
consisted of the aforementioned 4-post metal shelves stacked with piles of
US Robotics 56k modems [2] stacked on top of each other. They were wired
back to some sort of serial box that was in-turn connected to an ISA card
stuck in a Windows NT 4 server that used RADIUS to authenticate sessions
with an NT4 server back at the main office that had user accounts for all
our customers. Every single modem had a wall-wart power brick for power,
an RJ11 phone line, and a big old serial cable. It was an absolute rat's
nest of cables. The small POP (which I think was a TuffShed in someone's
yard about 50 feet from the telco building) was always 100 degrees--even in
the dead of winter.

One year we made the decision to switch to 3Com Total Control Chassis with
PRI cards. The cut-over was pretty seamless and immediately made shelves
stacked full of hundreds of modems completely useless. As we started
disconnecting modems with the intent of selling them for a few bucks to
existing customers who wanted to upgrade or giving them to new customers to
get them signed up, we found a bunch of the stacks of modems had actually
melted together due to the temps. That explained the handful of numbers in
the hunt group that would just ring and ring with no answer. In the end we
went from a completely packed 10x20 shed to two small 3Com TCH boxes packed
with PRI cards and a handful of PRI cables with much more normal
temperatures.

I thoroughly enjoyed the "wild west" days of the internet.

If Eric and Dan are reading this, thanks for everything you taught me about
networking, business, hard work, and generally being a good person.

-A

[1] -
https://www.amazon.com/dp/B01D54TICS

[2] - https://www.usr.com/products/56k-dialup-modem/usr5686g/



On Tue, Feb 16, 2021 at 11:39 AM John Kristoff <jtk@dataplane.org> wrote:

> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps the
> most notorious and likely to top many lists including mine. So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session. I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective. I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
>
Re: Famous operational issues [ In reply to ]
Jen Linkova wrote on 2021-02-19 00:04:

> OK, Warren, achievement unlocked. You've just made a network engineer
> to google 'router'....

He meant the machine we call a "frezer"... (in our language ;)

I heard a similar story from my colleague who was working at that time
for Huawei as a DWDM engineer and had to fly frequently with testing
devices.
One time he tried to explain at airport security control what a DWDM
spectrum analyser is for; the officer called another one over for help,
who said something like this: "DWDM spectrum analyser? Pass it, usual
thing..."

--
Kind regards,
Andrey Kostin
Re: Famous operational issues [ In reply to ]
On 2/16/2021 2:37 PM, John Kristoff wrote:
> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?


I don't believe I've seen this in any of the replies, but the AT&T
cascading switch crashes of 1990 are a good one. This link even has some
pseudocode:
https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse
Re: Famous operational issues [ In reply to ]
At a previous company we had a large number of Foundry Networks layer-3
switches. They participated in our OSPF network and had a *really* annoying
bug. Every now and then one of them would get somewhat confused and would
corrupt its OSPF database (there seemed to be some pointer that would end
up off by one).

It would then cleverly realize that its LSDB was different to everyone
else's and so would flood this corrupt database to all other OSPF speakers.
Some vendors would do a better job of sanity checking the LSAs and would
ignore the bad LSAs, other vendors would install them... and now you have
different link state databases on different devices and OSPF becomes
unhappy.

Nov 24 22:23:53.633 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.9.32.5
Mask 10.160.8.0 from 10.178.255.252
NOTE: This route will not be installed in the routing table.
Nov 26 11:01:32.997 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3
Mask 10.2.153.0 from 10.178.255.252
NOTE: This route will not be installed in the routing table.
Nov 27 23:14:00.660 EST: %OSPF-4-BADLSAMASK: Bad LSA mask: Type 5, LSID 0.4.2.3
Mask 10.2.153.0 from 10.178.255.252
NOTE: This route will not be installed in the routing table.

If you look at the output, you can see that there is some garbage in the
LSID field and the bits that should be there are now in the Mask section. I
also saw some more extreme versions of the same bug; in my favorite example
the mask was 115.104.111.119 and further down there was 105.110.116.114 --
if you take these as decimal numbers and look up their ASCII values we get
"show" and "intr" -- I wrote a tool to scrape bits from these errors and
ended up with a large amount of the CLI help text.
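The decoding step described above can be reproduced in a few lines of Python (a sketch of the idea, not the author's actual scraping tool):

```python
def dotted_to_ascii(dotted: str) -> str:
    """Interpret each octet of a dotted-decimal value as an ASCII code point."""
    return "".join(chr(int(octet)) for octet in dotted.split("."))

# The leaked "masks" from the corrupt LSAs turn out to be CLI text fragments.
print(dotted_to_ascii("115.104.111.119"))  # -> show
print(dotted_to_ascii("105.110.116.114"))  # -> intr
```

Each bogus 4-octet field leaks 4 bytes of whatever memory the off-by-one pointer landed on, which is how enough fragments add up to recognizable help text.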




Many years ago I worked for a small Mom-and-Pop type ISP in New York state
(I was the only network / technical person there) -- it was a very free
wheeling place and I built the network by doing whatever made sense at the
time.

One of my "favorite" customers (Joe somebody) was somehow related to the
owner of the ISP and was a gamer. This was back in the day when the gaming
magazines would give you useful tips like "Type 'tracert $gameserver' and
make sure that there are less than N hops". Joe would call up tech
support, me, the owner, etc., and complain that there were N+3 hops and most
of them were in our network. I spent much time explaining things about
packet loss, latency, etc., but couldn't shake his belief that hop count was
the only metric that mattered.
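The argument the author kept making can be shown with toy numbers (all latencies here are hypothetical):

```python
# Toy illustration with hypothetical per-hop latencies in ms: hop count
# alone says nothing about performance.
path_a = [2.0, 250.0, 3.0]          # 3 hops, one very slow link
path_b = [1.0, 2.0, 2.0, 3.0, 1.5]  # 5 hops, all fast links

rtt_a = sum(path_a)  # 255.0 ms total
rtt_b = sum(path_b)  # 9.5 ms total

# Fewer hops, far worse latency: the "longer" path wins for gaming.
print(f"A: {len(path_a)} hops, {rtt_a} ms; B: {len(path_b)} hops, {rtt_b} ms")
```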

Finally, one night he called me at home well after midnight (no, I didn't
give him my home phone number, he looked me up in the phonebook!) to
complain that his gaming was suffering because it was "too many hops to get
out of your network". I finally snapped and built a static GRE tunnel from
the RAS box that he connected to all over the network -- it was a thing of
beauty, it went through almost every device that we owned and took the most
convoluted path I could come up with. "Yay!", I figured, "now I can
demonstrate that latency is more important than hop count" and I went to
bed.

The next morning I get a call from him. He is ecstatic and wildly impressed
by how well the network is working for him now and how great his gaming
performance is. "Oh well", I think, "at least he is happy and will leave me
alone now". I don't document the purpose of this GRE anywhere and after
some time forget about it.

A few months later I am doing some routine cleanup work and stumble across
a weird looking tunnel -- it's bizarre, it goes all over the place and is
all kinds of crufty -- there are static routes and policy routing and
bizarre things being done on the RADIUS server to make sure some user
always gets a certain IP... I look in my pile of notes and old configs and
then decide to just yank it out.

That night I get an enraged call (at home again) from Joe *screaming* that
the network is all broken again because it is now way too many hops to get
out of the network and that people keep shooting him...

*What I learnt from this:*
1: Make sure you document everything (and no, the network isn't
documentation)
2: Gamers are weird.
3: Making changes to your network in anger provides short term pleasure but
long term pain.
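For context, the GRE encapsulation doing the work in that story is tiny. A minimal sketch of a basic RFC 2784 GRE header in Python (the inner payload bytes are hypothetical):

```python
import struct

# Basic GRE (RFC 2784): a 4-byte header in front of the inner packet.
# First 2 bytes are flags/version (all zero when no checksum/key/sequence
# is present), next 2 bytes are the EtherType of the payload.
GRE_PROTO_IPV4 = 0x0800

def gre_encapsulate(inner_packet: bytes) -> bytes:
    """Prepend a minimal GRE header (no optional fields) to a payload."""
    header = struct.pack("!HH", 0, GRE_PROTO_IPV4)
    return header + inner_packet

wrapped = gre_encapsulate(b"\x45\x00")  # hypothetical start of an inner IPv4 packet
print(wrapped.hex())  # 000008004500
```

Every extra device the tunnel traverses adds real latency even though the tunnel hides the hops from traceroute, which is exactly why the hop count dropped while the convoluted path stayed slow.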



On Fri, Feb 19, 2021 at 1:10 PM Andrew Gallo <akg1330@gmail.com> wrote:

>
>
> On 2/16/2021 2:37 PM, John Kristoff wrote:
> > Friends,
> >
> > I'd like to start a thread about the most famous and widespread Internet
> > operational issues, outages or implementation incompatibilities you
> > have seen.
> >
> > Which examples would make up your top three?
>
>
> I don't believe I've seen this in any of the replies, but the AT&T
> cascading switch crashes of 1990 is a good one. This link even has some
> pseudocode
> https://users.csc.calpoly.edu/~jdalbey/SWE/Papers/att_collapse
>
>

--
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
-- E. W. Dijkstra
Re: Famous operational issues [ In reply to ]
----- On Feb 19, 2021, at 3:07 AM, Daniel Karrenberg dfk@ripe.net wrote:

Hi,

> Lessons: HW/SW mono-cultures are dangerous. Input testing is good
> practice at all levels software. Operational co-ordination is key in
> times of crisis.

Well... Here is a very similar, fairly recent one. Albeit in this case, the
opposite is true: running one software train would have prevented an outage.
Some members on this list (hi, Brian!) will recognize the story.

Group XX within $company decided to deploy EVPN. All of backbone was running
single $vendor, but different software trains. Turns out that between an
early draft, implemented in version X, and the RFC, implemented in version Y,
a change was made to the NLRI format that was not backwards compatible.
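A generic sketch of that failure mode (the field layouts below are invented for illustration; they are not the real EVPN NLRI formats):

```python
import struct

# Hypothetical draft-era layout: 2-byte length, then a 4-byte route id.
def encode_nlri_draft(route_id: int) -> bytes:
    return struct.pack("!HI", 4, route_id)

# Hypothetical RFC-era layout: 2-byte length, 1-byte flags, 4-byte route id.
def encode_nlri_rfc(route_id: int) -> bytes:
    return struct.pack("!HBI", 5, 0, route_id)

def decode_nlri_rfc(buf: bytes) -> int:
    """A receiver that only understands the RFC-era layout."""
    (length,) = struct.unpack_from("!H", buf)
    if length != 5:
        # Draft-era senders produce a length the RFC-era parser rejects.
        raise ValueError(f"unexpected NLRI length {length}")
    _flags, route_id = struct.unpack_from("!BI", buf, 2)
    return route_id

decode_nlri_rfc(encode_nlri_rfc(42))    # parses fine
# decode_nlri_rfc(encode_nlri_draft(42))  # raises ValueError
```

Raising an error is the graceful outcome; the meltdown described here suggests the mismatch was handled far less gracefully once the first incompatible NLRI hit the route reflectors.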

Version X was in use on virtually all DC egress boxes, version Y was in use
on route reflectors. The moment the first EVPN NLRI was advertised, the
entire backbone melted down. Dept-wide alert issued (at night), people trying
to log on to the VPN. Oh wait, the VPN requires yubikey, which requires the
corp network to access the interwebs, which is not accessible due to said
issue.

And, despite me complaining since the day of hire, no out of band network.

I didn't stay much longer after that.

Thanks,

Sabri
Re: Famous operational issues [ In reply to ]
On 16/02/2021 22:08, Jared Mauch wrote:
> I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen.

Enough time has (probably) elapsed since my escapades in a small data
centre in Manchester. The RFO was ten pages long, and I don't want to
spoil the ending, but ... I later discovered that Cumulus' then VP of
Engineering had elevated me to a veritable 'Hall of Infamy' for the
support ticket attached to that particular tale.

One day I'll be able to buy the guy that handled it a *lot* of whisky.
He deserved it.

--
Tom
Re: Famous operational issues [ In reply to ]
On Thu, Feb 18, 2021 at 07:34:39AM -0500, Patrick W. Gilmore wrote:
> In 1994, there was a major earthquake near the city of Los Angeles. City hall had to be evacuated and it would take over a year to reinforce the building to make it habitable again. My company moved all the systems in the basement of city hall to a new datacenter a mile or so away. After the install, we spent more than a week coaxing their ancient (even for 1994) machines back online, such as a Prime Computer and an AS400 with tons of DASD. Well, tons of cabinets, certainly less storage than my watch has now.
>
> I was in the DC going over something with the lady in charge when someone walked in to ask her something. She said “just a second”. That person took one step to the side of the door and leaned against the wall - right on an EPO which had no cover.
>
> Have you ever heard an entire row of DASD spin down instantly? Or taken 40 minutes to IPL an AS400? In the middle of the business day? For the second most populous city in the country?
>
> Me: Maybe you should get a cover for that?
> Her: Good idea.
>
> Couple weeks later, in the same DC, going over final checklist. A fedex guy walks in. (To this day, no idea how he got in a supposedly locked DC.) She says “just a second”, and I get a very strong deja vu feeling. He takes one step to the side and leans against the wall.
>
> Me: Did you order that EPO cover?
> Her: Nope.

some of the ibm 4300 series mini-mainframes came with a console terminal
that had a very large, raised (completely not flush), alternate power
button on the upper panel of the keyboard, facing the operator. in later
versions, the button was inset in a little open box with high sides. in
earlier versions, there was just a pair of raised ribs on either side of the
button. in the earliest version, if that panel needed to be replaced, the
replacement part didn't even have those protective ribs, this huge button
was just sitting there. on our 4341, someone had dropped the keyboard during
installation and the damaged panel was replaced with the
no-protection-whatsoever part.

i had an operator who, working a double shift into the overnight run,
fell asleep and managed to bang his head square on the button.
the overnight jobs running were left in various states of ruin.

third party manufacturers had an easy sell for lucite power/EPO button covers.

--
Henry Yen Aegis Information Systems, Inc.
Senior Systems Programmer Hicksville, New York
Re: Famous operational issues [ In reply to ]
From a datacenter ROI, economics, cooling, and HVAC perspective, that might
just be the best colo customer ever. As long as they're paying full price
for the cabinet and nothing is *dangerous* about how they've hung the 2U
server vertically, using up all that space for just one thing has to be a
lot better than a customer that makes full and efficient use of space and
all the amperage allotted to them.


On Thu, Feb 18, 2021 at 11:38 AM tim@pelican.org <tim@pelican.org> wrote:

> On Thursday, 18 February, 2021 16:23, "Seth Mattinen" <sethm@rollernet.us>
> said:
>
> > I had a customer that tried to stack their servers - no rails except the
> > bottom most one - using 2x4's between each server. Up until then I
> > hadn't imagined anyone would want to fill their cabinet with wood, so I
> > made a rule to ban wood and anything tangentially related (cardboard,
> > paper, plastic, etc.). Easier to just ban all things. Fire reasons too
> > but mainly I thought a cabinet full of wood was too stupid to allow.
>
> On the "stupid racking" front, I give you most of a rack dedicated to a
> single server. Not all that high a server, maybe 2U or so, but *way* too
> deep for the rack, so it had been installed vertically. By looping some
> fairly hefty chain through the handles on either side of the front of the
> chassis, and then bolting the four chain ends to the four rack posts. I
> wish I'd kept pictures of that one. Not flammable, but a serious WTF
> moment.
>
> Cheers,
> Tim.
>
>
>
Re: Famous operational issues [ In reply to ]
Not a famous operational issue, but in 2000, we had a major outage of
our dialup modem pool.

The owner of the building was re-skinning the outside using Styrofoam
and stucco. A bunch of the Styrofoam
had blocked the roof drains on the podium section of the building,
immediately above our equipment room.

A flash rainstorm filled the entire flat roof, and water came back in
over the flashings, and poured directly in
to our dialup modem pool through the hole in the concrete roof deck
where the drain pipe protruded through.

In retrospect, it was a monumentally stupid place to put our main
modem pool, but we didn't realize what was
above the drop ceiling - and that it was roof, not the other 11
floors of the building.

1 bay of 6 shelves of USR TC 1000 HiperDSPs were now very wet and
blinking funny patterns on their LEDs.

Fortunately, our vendor in Toronto (4 hour drive away) had stock of
equipment that another customer kept
delaying shipment on. They got their staff in and started unboxing
and slotting cards. We spent a few hours
tearing out the old gear and getting ready for replacements.

We left Windsor, Ontario at around 12:00am - same time they left
Toronto, heading towards us. We coordinated
a meet at one of the rural exits along Highway 401 at a closed gas
station at around 2am.

Everything was going so well until a cop pulled up, and asked us what
we were doing, as we were slinging
modem chassis between the back of the vendor's SUV and our van... We
calmly explained
what happened. He looked between us a couple of times, shook his
head and said "well, good luck with that",
got back in his car and drove away.

We had everything back online within 14 hours of the initial outage.

At 02:37 PM 16/02/2021, John Kristoff wrote:
>Friends,
>
>I'd like to start a thread about the most famous and widespread Internet
>operational issues, outages or implementation incompatibilities you
>have seen.
>
>Which examples would make up your top three?
>
>To get things started, I'd suggest the AS 7007 event is perhaps the
>most notorious and likely to top many lists including mine. So if
>that is one for you I'm asking for just two more.
>
>I'm particularly interested in this as the first step in developing a
>future NANOG session. I'd be particularly interested in any issues
>that also identify key individuals that might still be around and
>interested in participating in a retrospective. I already have someone
>that is willing to talk about AS 7007, which shouldn't be hard to guess
>who.
>
>Thanks in advance for your suggestions,
>
>John

--

Clayton Zekelman
Managed Network Systems Inc. (MNSi)
3363 Tecumseh Rd. E
Windsor, Ontario
N8W 1H4

tel. 519-985-8410
fax. 519-985-8409
Re: Famous operational issues [ In reply to ]
On Thu, Feb 18, 2021 at 07:34:39PM -0500, Patrick W. Gilmore wrote:
> And to put it on topic, cover your EPOs

I worked somewhere with an uncovered EPO, which was okay until we had a
telco tech in who was used to a different data center where a similar
looking button controlled the door access, so he reflexively hit it
on his way out to unlock the door. Oops.

Also, consider what's on generator and what's not. I worked in a corporate
data center where we lost power. The backup system kept all the machines
running, but the ventilation system was still down, so it got very warm very
fast as everyone went around trying to shut servers down gracefully while
other folks propped the doors open to get some cooler air in.

--r
Re: Famous operational issues [ In reply to ]
Oh,

I actually wanted to keep this for my memoirs, but if we can name dangerous
datacenter operational issues … sometime in the 2000s:

Somebody ran their own datacenter, which
- once had an active ant colony living under the raised floor and in the
climate system,
- for a while had several electrical grounding defects, leading to the
work instruction “don’t touch any metallic or conductive
materials”,
- for a minute, had a “look what we bought on eBay” UPS
system, until it started to roast after being turned on,
- from time to time had climate issues, with room temperatures
peaking around 68 centigrade, and yes, some equipment
survived and even continued to work.

I decided not to go back there after “look what we bought on eBay,
an argon fire extinguisher, we just need to mount it”.

On 20 Feb 2021, at 10:15, Eric Kuhnke wrote:

> From a datacenter ROI and economics, cooling, HVAC perspective that
> might
> just be the best colo customer ever. As long as they're paying full
> price
> for the cabinet and nothing is *dangerous* about how they've hung the
> 2U
> server vertically, using up all that space for just one thing has to
> be a
> lot better than a customer that makes full and efficient use of space
> and
> all the amperage allotted to them.
>
>
Re: Famous operational issues [ In reply to ]
I’m embarrassed to say, I’ve done this.

Ms. Lady Benjamin PD Cannon, ASCE
6x7 Networks & 6x7 Telecom, LLC
CEO
ben@6by7.net
"The only fully end-to-end encrypted global telecommunications company in the world.”

FCC License KJ6FJJ

Sent from my iPhone via RFC1149.

> On Feb 19, 2021, at 12:55 AM, Wolfgang Tremmel <wolfgang.tremmel@de-cix.net> wrote:
>
> ?Do you remember the Cisco HDCI connectors?
> https://en.wikipedia.org/wiki/HDCI
>
> I once shipped a Cisco 4500 plus some cables to a remote data center and asked the local guys to cable them for me.
> With Cisco you could check the cable type and if they were properly attached. They were not.
>
> I asked for a check and the local guy confirmed to me three times that the cables were properly plugged.
> At the end I gave up, and took the 3 hour drive to the datacenter to check myself.
>
> Problem was that, while the casing of the connector is asymmetrical, the pins inside are symmetrical.
> And the local guy was quite strong.
>
> Yes, he managed to plug in the cables 180° flipped, bending the case, but he got them in.
> He was quite embarrassed when I fixed the cabling problem in 10 seconds.
>
> That must have been 1995 or so....
>
> Wolfgang
>
>
>
>> On 16. Feb 2021, at 20:37, John Kristoff <jtk@dataplane.org> wrote:
>>
>> Which examples would make up your top three?
>
> --
> Wolfgang Tremmel
>
> Phone +49 69 1730902 0 | wolfgang.tremmel@de-cix.net
> Executive Directors: Harald A. Summa and Sebastian Seifert | Trade Registry: AG Cologne, HRB 51135
> DE-CIX Management GmbH | Lindleystrasse 12 | 60314 Frankfurt am Main | Germany | www.de-cix.net
>
Re: Famous operational issues [ In reply to ]
> On Feb 18, 2021, at 9:04 PM, Jen Linkova <furry13@gmail.com> wrote:
>
> On Fri, Feb 19, 2021 at 9:40 AM Warren Kumari <warren@kumari.net> wrote:
>> 4: Not too long after I started doing networking (and for the same small ISP in Yonkers), I'm flying off to install a new customer. I (of course) think that I'm hot stuff because I'm going to do the install, configure the router, whee, look at me! Anyway, I don't want to check a bag, and so I stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all pre-9/11!). I'm going through security and the TSA[0] person opens my bag and pulls the router out. "What's this?!" he asks. I politely tell him that it's a router. He says it's not. I'm still thinking that I'm the new hotness, and so I tell him in a somewhat condescending way that it is, and I know what I'm talking about. He tells me that it's not a router, and is starting to get annoyed. I explain using my "talking to a 5 year old" voice that it most certainly is a router. He tells me that lying to airport security is a federal offense, and starts looming at me. I adjust my attitude and start explaining that it's like a computer and makes the Internet work. He gruffly hands me back the router, I put it in my bag and scurry away. As I do so, I hear him telling his colleague that it wasn't a router, and that he certainly knows what a router is, because he does woodwork...
>
> OK, Warren, achievement unlocked. You've just made a network engineer
> to google 'router'....
>
> P.S. I guess I'm obliged to tell a story if I respond to this thread...so...
> "Servers and the ice cream factory".
> Late spring/early summer in Moscow. The temperature above 30C (86°F).
> I worked for a local content provider.
> Aircons in our server room died, the technician ETA was 2 days ( I
> guess we were not the only ones with aircon problems).
> So we drove to the nearby ice cream factory and got *a lot* of dry
> ice. Then we had a roster: every few hours one person took a deep
> breath, grabbed a box of dry ice, ran into the server room and emptied
> the box on top of the racks. The backup person was watching through
> the glass door - just in case, you know, ready to start the rescue
> operation.
> We (and the servers) survived till the technician arrived. And we had
> a lot of dry ice to cool the beer..
>
> --
> SY, Jen Linkova aka Furry

During a wood-working project for the Southern California Linux Expo (the tech team that
(among other things) runs the network for the show was building new equipment carts), I
came up with the following meme:



[I don’t know if NANOG will pass the image despite its small size, so a textual description:
A bandaged hand with the index finger amputated at the second knuckle, with overlaid red
text stating “Careless Routing May Lead to Urgent Test of Self Healing Network”]

Fortunately, we didn’t have any such issues with the router, though we did have one person
suffer a crushed toe from a cabinet tip-over. Happily, they made a full recovery.

Owen
Re: Famous operational issues [ In reply to ]
On Thursday, 18 February, 2021 22:37, "Warren Kumari" <warren@kumari.net> said:

> 4: Not too long after I started doing networking (and for the same small
> ISP in Yonkers), I'm flying off to install a new customer. I (of course)
> think that I'm hot stuff because I'm going to do the install, configure the
> router, whee, look at me! Anyway, I don't want to check a bag, and so I
> stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all
> pre-9/11!). I'm going through security and the TSA[0] person opens my bag
> and pulls the router out. "What's this?!" he asks. I politely tell him that
> it's a router. He says it's not. I'm still thinking that I'm the new
> hotness, and so I tell him in a somewhat condescending way that it is, and
> I know what I'm talking about. He tells me that it's not a router, and is
> starting to get annoyed. I explain using my "talking to a 5 year old" voice
> that it most certainly is a router. He tells me that lying to airport
> security is a federal offense, and starts looming at me. I adjust my
> attitude and start explaining that it's like a computer and makes the
> Internet work. He gruffly hands me back the router, I put it in my bag and
> scurry away. As I do so, I hear him telling his colleague that it wasn't a
> router, and that he certainly knows what a router is, because he does
> woodwork...

Here in the UK we avoid that issue by pronouncing the packet-shifter as "rooter", and only the wood-working tool as "rowter" :)

Of course, it raises a different set of problems when talking to the Australians...

Cheers,
Tim.
Re: Famous operational issues [ In reply to ]
    Well...

    During my younger days, that button was used a few times by the
operator of a VM/370 to regain control from someone with a "curious
mind" *cough* *cough*...

-----
Alain Hebert ahebert@pubnix.net
PubNIX Inc.
50 boul. St-Charles
P.O. Box 26770 Beaconsfield, Quebec H9W 6G7
Tel: 514-990-5911 http://www.pubnix.net Fax: 514-990-9443

On 2/20/21 4:07 AM, Henry Yen wrote:
> On Thu, Feb 18, 2021 at 07:34:39AM -0500, Patrick W. Gilmore wrote:
>> In 1994, there was a major earthquake near the city of Los Angeles. City hall had to be evacuated and it would take over a year to reinforce the building to make it habitable again. My company moved all the systems in the basement of city hall to a new datacenter a mile or so away. After the install, we spent more than a week coaxing their ancient (even for 1994) machines back online, such as a Prime Computer and an AS400 with tons of DASD. Well, tons of cabinets, certainly less storage than my watch has now.
>>
>> I was in the DC going over something with the lady in charge when someone walked in to ask her something. She said “just a second”. That person took one step to the side of the door and leaned against the wall - right on an EPO which had no cover.
>>
>> Have you ever heard an entire row of DASD spin down instantly? Or taken 40 minutes to IPL an AS400? In the middle of the business day? For the second most populous city in the country?
>>
>> Me: Maybe you should get a cover for that?
>> Her: Good idea.
>>
>> Couple weeks later, in the same DC, going over final checklist. A fedex guy walks in. (To this day, no idea how he got in a supposedly locked DC.) She says “just a second”, and I get a very strong deja vu feeling. He takes one step to the side and leans against the wall.
>>
>> Me: Did you order that EPO cover?
>> Her: Nope.
> some of the ibm 4300 series mini-mainframes came with a console terminal
> that had a very large, raised (completely not flush), alternate power
> button on the upper panel of the keyboard, facing the operator. in later
> versions, the button was inset in a little open box with high sides. in
> earlier versions, there was just a pair of raised ribs on either side of the
> button. in the earliest version, if that panel needed to be replaced, the
> replacement part didn't even have those protective ribs, this huge button
> was just sitting there. on our 4341, someone had dropped the keyboard during
> installation and the damaged panel was replaced with the
> no-protection-whatsoever part.
>
> i had an operator who, working a double shift into the overnight run,
> fell asleep and managed to bang his head square on the button.
> the overnight jobs running were left in various states of ruin.
>
> third party manufacturers had an easy sell for lucite power/EPO button covers.
>
> --
> Henry Yen Aegis Information Systems, Inc.
> Senior Systems Programmer Hicksville, New York
Re: Famous operational issues [ In reply to ]
On 2/22/21 9:14 AM, Alain Hebert wrote:
> *[External Email]*
>
>     Well...
>
>     During my younger days, that button was used a few time by the
> operator of a VM/370 to regain control from someone with a "curious
> mind" *cought* *cought*...
>
Two horror stories I remember from long ago when I was a console jockey
for a federal space agency that will remain nameless :P

1. A coworker brought her daughter to work with her on a Saturday
overtime shift because she couldn't get a babysitter. She parked the kid
with a coloring book and a pile of crayons at the only table in the
console room with some space, right next to the master console for our
3081. I asked her to make sure she was well away from the console, and as
she reached over to scoot the girl and her coloring books further away
she slipped, and reached out to steady herself. Yep, planted her finger
right down on the IML button (plexi covers? We don' need no STEENKIN'
plexi covers!). MVS and VM vanished, two dozen tape drives rewound and
several hours' worth of data merge jobs went blooey.

2. The 3081 was water cooled via a heat exchanger. The building chilled
water feed had a very old, very clogged filter that was bypassed until
it could be replaced. One day a new maintenance foreman came through the
building doing his "clipboard and harried expression" thing, and spotted
the filter in bypass (NO, I don't know WHY it hadn't been red-tagged.
Someone clearly dropped that ball.) He thought, "Well that's not right"
and reset all the valves to put it back inline, which of course, pretty
much killed the chilled water flow through the heat exchanger. First
thing we knew about it in Operations was when the 3081 started throwing
thermal alarms and MVS crashed hard. IBM had to replace several modules
in the CPUs.

--
--------------------------------------------
Bruce H. McIntosh
Network Engineer II
University of Florida Information Technology
bhm@ufl.edu
352-273-1066
