Mailing List Archive

Re: Famous operational issues [ In reply to ]
Ahh, war stories. I like the one where I got a wake up call that our IRC
server was on fire, together with the rest of the DC.


Not that widespread, but we reached Slashdot. :)

November 2002, University of Twente, The Netherlands. Some idiot wanted
to be a hero. He deflated people's tires, to help inflate them. One
morning he thought it would be a good idea to start a small fire and
then extinguish it, so he would be the hero that stopped a fire. He
failed and the building burned down. He got caught a few days later when
he tried the same thing in a different building.

Almost all of the IT was in that building, including core network,
uplinks to SURFNet (Dutch Educational Network) and to the 2000 students
living on the campus. Ironically a new DC was already being built, so
that was ready for use a few weeks later.

As we had quite a network for 2002 we hosted for instance
security.debian.org. The students all had 100Mbit in their room, so some
of them also hosted some popular websites. One I can remember was an
image sharing site.

Some students immediately created a backup network; dhcp server, dns
server with a catch all, website explaining what was going on, IRC
server, etc.

A local ISP offered to sponsor 50Mbit for the residents, which was
connected via a microwave relay and a temporary fiber was run through a
ditch to connect two parts of the campus residences. At the end of the
day all 2000 students had their internet connection back, although all
behind a single 50Mbit link.


Syslog message from the local SURFNet router:

lo0.ar5.enschede1.surf.net 3613: Nov 20 07:20:50.927 UTC:
%ENV_MON-2-TEMP: Hotpoint temp sensor(slot 18) temperature has reached
WARNING level at 61(C)
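
For anyone who wants to alert on a message like the one above, here is a minimal sketch of pulling the fields out of it. The regular expression is an assumption fitted to this single sample (rejoined onto one line, since the archive wrapped it), not a general grammar for Cisco syslog, and the name `parse_env_mon` is hypothetical.

```python
import re

# The message from the post, rejoined onto one line (the archive wrapped it).
RAW = (
    "lo0.ar5.enschede1.surf.net 3613: Nov 20 07:20:50.927 UTC: "
    "%ENV_MON-2-TEMP: Hotpoint temp sensor(slot 18) temperature has reached "
    "WARNING level at 61(C)"
)

# Assumption: this pattern is fitted to the single sample above only.
PATTERN = re.compile(
    r"^(?P<host>\S+) (?P<seq>\d+): (?P<timestamp>.+?): "
    r"%(?P<facility>[A-Z_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z_]+): "
    r".*slot (?P<slot>\d+)\).*?(?P<level>[A-Z]+) level at (?P<temp_c>\d+)\(C\)$"
)


def parse_env_mon(line):
    """Return a dict of fields from a temperature message, or None if no match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared against thresholds.
    for key in ("seq", "severity", "slot", "temp_c"):
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    print(parse_env_mon(RAW))
```

Calling `parse_env_mon(RAW)` yields the host, slot, alarm level and temperature as separate fields; any line that does not fit the pattern returns `None` rather than raising.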


(Disclaimer: Where I say we, I mean we as University. I wasn't working
for the university, but was part of the students working on the backup
network. There are probably some other people on list with some more
details and I've probably missed some details, but this is the summary.)


On 16-02-2021 23:08, Jared Mauch wrote:
> I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen.
>
> Sent from my TI-99/4a
>
>> On Feb 16, 2021, at 2:40 PM, John Kristoff <jtk@dataplane.org> wrote:
>>
>> Friends,
>>
>> I'd like to start a thread about the most famous and widespread Internet
>> operational issues, outages or implementation incompatibilities you
>> have seen.
>>
>> Which examples would make up your top three?
>>
>> To get things started, I'd suggest the AS 7007 event is perhaps the
>> most notorious and likely to top many lists including mine. So if
>> that is one for you I'm asking for just two more.
>>
>> I'm particularly interested in this as the first step in developing a
>> future NANOG session. I'd be particularly interested in any issues
>> that also identify key individuals that might still be around and
>> interested in participating in a retrospective. I already have someone
>> that is willing to talk about AS 7007, which shouldn't be hard to guess
>> who.
>>
>> Thanks in advance for your suggestions,
>>
>> John
Re: Famous operational issues [ In reply to ]
John Kristoff wrote:
> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
Well... pre-Internet, but the great Northeast fiber cut comes to mind
(backhoe vs. fiber, backhoe won).

Miles Fidelman

--
In theory, there is no difference between theory and practice.
In practice, there is. .... Yogi Berra

Theory is when you know everything but nothing works.
Practice is when everything works but no one knows why.
In our lab, theory and practice are combined:
nothing works and no one knows why. ... unknown
Re: Famous operational issues [ In reply to ]
I remember when the big carriers de-peered with Cogent in the early 2000s. They underestimated the number of websites being hosted by people using Cogent exclusively.


Justin Wilson
j2sw@j2sw.com


https://j2sw.com - All things jsw (AS209109)
https://blog.j2sw.com - Podcast and Blog

> On Feb 17, 2021, at 10:29 AM, Miles Fidelman <mfidelman@meetinghouse.net> wrote:
>
> John Kristoff wrote:
>> Friends,
>>
>> I'd like to start a thread about the most famous and widespread Internet
>> operational issues, outages or implementation incompatibilities you
>> have seen.
>>
> Well... pre-Internet, but the great Northeast fiber cut comes to mind (backhoe vs. fiber, backhoe won).
>
> Miles Fidelman
>
> --
> In theory, there is no difference between theory and practice.
> In practice, there is. .... Yogi Berra
>
> Theory is when you know everything but nothing works.
> Practice is when everything works but no one knows why.
> In our lab, theory and practice are combined:
> nothing works and no one knows why. ... unknown
Re: Famous operational issues [ In reply to ]
Cogentco still does not peer with Google and HE over IPv6, I guess.

________________________________
From: NANOG <nanog-bounces+david=xtom.com@nanog.org> on behalf of Justin Wilson (Lists) <lists@mtin.net>
Sent: Thursday, February 18, 2021 00:53
To: Miles Fidelman
Cc: nanog@nanog.org
Subject: Re: Famous operational issues

I remember when the big carriers de-peered with Cogent in the early 2000s. They underestimated the number of websites being hosted by people using Cogent exclusively.


Justin Wilson
j2sw@j2sw.com

https://j2sw.com - All things jsw (AS209109)
https://blog.j2sw.com - Podcast and Blog

> On Feb 17, 2021, at 10:29 AM, Miles Fidelman <mfidelman@meetinghouse.net> wrote:
>
> John Kristoff wrote:
>> Friends,
>>
>> I'd like to start a thread about the most famous and widespread Internet
>> operational issues, outages or implementation incompatibilities you
>> have seen.
>>
> Well... pre-Internet, but the great Northeast fiber cut comes to mind (backhoe vs. fiber, backhoe won).
>
> Miles Fidelman
>
> --
> In theory, there is no difference between theory and practice.
> In practice, there is. .... Yogi Berra
>
> Theory is when you know everything but nothing works.
> Practice is when everything works but no one knows why.
> In our lab, theory and practice are combined:
> nothing works and no one knows why. ... unknown
Re: Famous operational issues [ In reply to ]
The he.net side is interesting as you can see who their v4 transits are but they suppress their routes via v6, but (last I knew) lacked community support for their customers to do similar route suppression.

I’m not a fan of it, but it makes the commercial discussions much easier each time those networks come by to shop services to me in a personal or professional capacity. “No, I need all the internet”.

- Jared

> On Feb 17, 2021, at 12:07 PM, David Guo via NANOG <nanog@nanog.org> wrote:
>
> Cogentco still did not peer with Google and HE over IPv6 I guess.
>
> From: NANOG <nanog-bounces+david=xtom.com@nanog.org> on behalf of Justin Wilson (Lists) <lists@mtin.net>
> Sent: Thursday, February 18, 2021 00:53
> To: Miles Fidelman
> Cc: nanog@nanog.org
> Subject: Re: Famous operational issues
>
> I remember when the big carriers de-peered with Cogent in the early 2000s. They underestimated the number of websites being hosted by people using Cogent exclusively.
>
>
> Justin Wilson
> j2sw@j2sw.com
>
> —
> https://j2sw.com - All things jsw (AS209109)
> https://blog.j2sw.com - Podcast and Blog
>
> > On Feb 17, 2021, at 10:29 AM, Miles Fidelman <mfidelman@meetinghouse.net> wrote:
> >
> > John Kristoff wrote:
> >> Friends,
> >>
> >> I'd like to start a thread about the most famous and widespread Internet
> >> operational issues, outages or implementation incompatibilities you
> >> have seen.
> >>
> > Well... pre-Internet, but the great Northeast fiber cut comes to mind (backhoe vs. fiber, backhoe won).
> >
> > Miles Fidelman
> >
> > --
> > In theory, there is no difference between theory and practice.
> > In practice, there is. .... Yogi Berra
> >
> > Theory is when you know everything but nothing works.
> > Practice is when everything works but no one knows why.
> > In our lab, theory and practice are combined:
> > nothing works and no one knows why. ... unknown
Re: Famous operational issues [ In reply to ]
On Wed, 17 Feb 2021 14:07:54 -0500
John Curran <jcurrran@istaff.org> wrote:

> I have no idea what outages were most memorable for others, but the
> Stanford transfer switch explosion in October 1996 resulted in much
> of the Internet in the Bay Area simply not being reachable for
> several days.

Thanks John.

This reminds me of two I've not seen anyone mention yet. Both
coincidentally in the Chicago area, that I learned of before my entry
into netops full time. One was a flood:

<https://en.wikipedia.org/wiki/Chicago_flood>

The other, at the dawn of an earlier era:

<http://telecom-digest.org/telecom-archives/TELECOM_Digest_Online/1309.html>

I wouldn't necessarily put those two in the top 3, but by some standard
for many they were certainly very significant and noteworthy.

John
Re: Famous operational issues [ In reply to ]
(resent - to list this time)
On 16 Feb 2021, at 2:37 PM, John Kristoff <jtk@dataplane.org> wrote:
>
> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?

John -

I have no idea what outages were most memorable for others, but the Stanford transfer switch explosion in October 1996 resulted in much of the Internet in the Bay Area simply not being reachable for several days.

At the time there were three main power grids feeding Stanford – two from PG&E and one from Stanford’s own CoGen plant – and somehow a rat crawling into one of the two 12KVA transfer switches resulted in the switch disappearing in an epic explosion that even took out a portion of the exterior wall of the building.

The ensuing restoration involved lots of industry folks, GE power-on-wheel generating stations, anaconda-sized power cables, and all in all was quite the adventure.

FYI,
/John
Re: Famous operational issues [ In reply to ]
Stolen isn’t nearly as exciting as what happens when your (used) 6509 arrives and
gets installed and operational before anyone realizes that the conductive packing
peanuts that it was packed in have managed to work their way into various midplane
connectors. Several hours later someone notices that the box is quite literally
smoldering in the colo and the resulting combination of panic, fire drill, and
management antics that ensue.

Owen


> On Feb 16, 2021, at 2:08 PM, Jared Mauch <jared@puck.nether.net> wrote:
>
> I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen.
>
> Sent from my TI-99/4a
>
>> On Feb 16, 2021, at 2:40 PM, John Kristoff <jtk@dataplane.org> wrote:
>>
>> Friends,
>>
>> I'd like to start a thread about the most famous and widespread Internet
>> operational issues, outages or implementation incompatibilities you
>> have seen.
>>
>> Which examples would make up your top three?
>>
>> To get things started, I'd suggest the AS 7007 event is perhaps the
>> most notorious and likely to top many lists including mine. So if
>> that is one for you I'm asking for just two more.
>>
>> I'm particularly interested in this as the first step in developing a
>> future NANOG session. I'd be particularly interested in any issues
>> that also identify key individuals that might still be around and
>> interested in participating in a retrospective. I already have someone
>> that is willing to talk about AS 7007, which shouldn't be hard to guess
>> who.
>>
>> Thanks in advance for your suggestions,
>>
>> John
Re: Famous operational issues [ In reply to ]
On that note, I'd be very interested in hearing stories of actual incidents
that are the cause of why cardboard boxes are banned in many facilities,
due to loose particulate matter getting into the air and setting off very
sensitive fire detection systems.

Or maybe it's more mundane and 99% of the reason is people unpack stuff and
don't always clean up properly after themselves.

On Wed, Feb 17, 2021, 6:21 PM Owen DeLong <owen@delong.com> wrote:

> Stolen isn’t nearly as exciting as what happens when your (used) 6509
> arrives and
> gets installed and operational before anyone realizes that the conductive
> packing
> peanuts that it was packed in have managed to work their way into various
> midplane
> connectors. Several hours later someone notices that the box is quite
> literally
> smoldering in the colo and the resulting combination of panic, fire drill,
> and
> management antics that ensue.
>
> Owen
>
>
> > On Feb 16, 2021, at 2:08 PM, Jared Mauch <jared@puck.nether.net> wrote:
> >
> > I was thinking about how we need a war stories nanog track. My favorite
> was being on call when the router was stolen.
> >
> > Sent from my TI-99/4a
> >
> >> On Feb 16, 2021, at 2:40 PM, John Kristoff <jtk@dataplane.org> wrote:
> >>
> >> Friends,
> >>
> >> I'd like to start a thread about the most famous and widespread Internet
> >> operational issues, outages or implementation incompatibilities you
> >> have seen.
> >>
> >> Which examples would make up your top three?
> >>
> >> To get things started, I'd suggest the AS 7007 event is perhaps the
> >> most notorious and likely to top many lists including mine. So if
> >> that is one for you I'm asking for just two more.
> >>
> >> I'm particularly interested in this as the first step in developing a
> >> future NANOG session. I'd be particularly interested in any issues
> >> that also identify key individuals that might still be around and
> >> interested in participating in a retrospective. I already have someone
> >> that is willing to talk about AS 7007, which shouldn't be hard to guess
> >> who.
> >>
> >> Thanks in advance for your suggestions,
> >>
> >> John
>
>
Re: Famous operational issues [ In reply to ]
On Thu, Feb 18, 2021 at 01:07:01AM -0800, Eric Kuhnke wrote:
> On that note, I'd be very interested in hearing stories of actual incidents
> that are the cause of why cardboard boxes are banned in many facilities,
> due to loose particulate matter getting into the air and setting off very
> sensitive fire detection systems.
>
> Or maybe it's more mundane and 99% of the reason is people unpack stuff and
> don't always clean up properly after themselves.

We had a plastic bag sucked into the intake of a router in a
datacenter once that caused it to overheat and take the site down. We
had cameras in our cage and I remember seeing the photo from the site of
the colo (I'll protect their name just because) taken as the tech was on
the phone and pulled the bag out of the router.

The time from the thermal warning syslog that it's getting warm
to overheat and shutdown is short enough you can't really get a tech to
the cage in time to prevent it.

I assume also the latter above, which is people have varying
definitions of clean.

- Jared

--
Jared Mauch | pgp key available via finger from jared@puck.nether.net
clue++; | http://puck.nether.net/~jared/ My statements are only mine.
Re: Famous operational issues [ In reply to ]
On 2/18/21 1:07 AM, Eric Kuhnke wrote:
> On that note, I'd be very interested in hearing stories of actual
> incidents that are the cause of why cardboard boxes are banned in many
> facilities, due to loose particulate matter getting into the air and
> setting off very sensitive fire detection systems.
>


I had a customer that tried to stack their servers - no rails except the
bottom most one - using 2x4's between each server. Up until then I
hadn't imagined anyone would want to fill their cabinet with wood, so I
made a rule to ban wood and anything tangentially related (cardboard,
paper, plastic, etc.). Easier to just ban all things. Fire reasons too
but mainly I thought a cabinet full of wood was too stupid to allow.

The "no wood" rule has become a fun story to tell everyone who asks how
that ended up being a rule. The wood customer turned out to be a
complete a-hole anyway, wood was just the tip of the iceberg.
Re: Famous operational issues [ In reply to ]
Worked a chronic support call where their internet would bounce at noon every workday. The Cisco 1601 or 1700 router that had their T1 in it ended up being on top of a microwave. Weeks of troubleshooting and shipping new routers on this one.

Also had another one where the router was plugged into an outlet that was controlled by a light switch; discovered this after shipping them two new routers.

Customer had their building remodeled and the techs couldn't find the T1 Smartjack for the building. The contractor who did the remodel job decided it would be a good idea to cut out the section of wall where the telco equipment was and mount it to the ceiling. Its new location was in the ladies' bathroom, above the drop ceiling, mounted to the building's rafters 10' in the air.

Customer needed a new router because the first one died. It was a machine shop and they mounted the router to the wall next to a lathe or drill press that used oil to cool the bit while it was cutting. It looked like someone had dumped the router in a bucket of oil when we got it back.

Arrived at another large colo for a buildout, only to find that our ASR9K that had arrived 2 weeks earlier was stored outside on the loading dock, which has no roof or locked gate. I guess that's why Cisco puts the plastic bag over the chassis when they're shipped.

Colo techs at another larger colo decided to unpack our router, which was a fully loaded 1/2 rack chassis. Since they couldn't lift it, they tipped the router on its side and walked it back by shifting the weight from one corner of the chassis to another, bending the chassis. I could see the scrape marks in the floor from it.

We had colo space on the top floor of an ATT CO where we put a Cisco 7513 to terminate about a dozen CHDS3's. The roof was leaking, and instead of fixing the roof, the fix was to put a sheet of plastic over our cabinet. It was more like a tent over the cabinet. A pool of water formed in a divot at the top and it was 120+ degrees under the plastic tarp.

Our office was in a work loft of an older building, and they had the AC units mounted to the ceiling with drip pans underneath them. Well, the pump for the drip pan of an AC unit on the 2nd floor died. Whoever installed the drip pan didn't secure it or center it under the AC unit. It filled up with water and, since it was unsecured and off-center, the drip pan came crashing down with a few gallons of water. The water worked its way over to the wall and traveled down one story in the building. The floor below had all the telco equipment mounted to that same wall, and the water flowed right through a couple of ATT's Cienas mounted to the wall, shorting them out. I was at the Chicago NANOG Hackathon on Sunday and was called out to work that one.

Was working in the back of a cabinet that had -48 VDC power for a Cisco router when a screw fell and shorted out the power. My co-worker, who was standing in front of the rack, wasn't happy because the ADC PowerWorx fuse panel was about 6" from his face where he was working. It had those little black alarm fuses with the spring-loaded arm. When it tripped, a nice shower of sparks flew right at his face. Luckily he wore glasses.

I was 18 at my first IT job and it was a brand-new building. I was plugging in a 208VAC 30A APC UPS in the server room after the electrician had just energized and checked the circuit. I plugged in the APC UPS and gave it a good turn for the twist-lock plug to catch and KA BAMB!!! Sparks came shooting out of the outlet at me. I think I pooped myself that day. Turns out the electricians decided that a single-gang electrical box was good enough for a 208 VAC 30A outlet that barely fit in the box, and didn't put any tape around the wire terminals. When they energized the circuit there was enough of an air gap that the hot screw didn't ground out. When I gave it that good old twist while plugging in the APC, I grounded the hot screw to the side of the electrical box.






________________________________
From: NANOG <nanog-bounces+esundberg=nitelusa.com@nanog.org> on behalf of Seth Mattinen <sethm@rollernet.us>
Sent: Thursday, February 18, 2021 10:23 AM
To: nanog@nanog.org <nanog@nanog.org>
Subject: Re: Famous operational issues

On 2/18/21 1:07 AM, Eric Kuhnke wrote:
> On that note, I'd be very interested in hearing stories of actual
> incidents that are the cause of why cardboard boxes are banned in many
> facilities, due to loose particulate matter getting into the air and
> setting off very sensitive fire detection systems.
>


I had a customer that tried to stack their servers - no rails except the
bottom most one - using 2x4's between each server. Up until then I
hadn't imagined anyone would want to fill their cabinet with wood, so I
made a rule to ban wood and anything tangentially related (cardboard,
paper, plastic, etc.). Easier to just ban all things. Fire reasons too
but mainly I thought a cabinet full of wood was too stupid to allow.

The "no wood" rule has become a fun story to tell everyone who asks how
that ended up being a rule. The wood customer turned out to be a
complete a-hole anyway, wood was just the tip of the iceberg.

Re: Famous operational issues [ In reply to ]
On Thursday, 18 February, 2021 16:23, "Seth Mattinen" <sethm@rollernet.us> said:

> I had a customer that tried to stack their servers - no rails except the
> bottom most one - using 2x4's between each server. Up until then I
> hadn't imagined anyone would want to fill their cabinet with wood, so I
> made a rule to ban wood and anything tangentially related (cardboard,
> paper, plastic, etc.). Easier to just ban all things. Fire reasons too
> but mainly I thought a cabinet full of wood was too stupid to allow.

On the "stupid racking" front, I give you most of a rack dedicated to a single server. Not all that high a server, maybe 2U or so, but *way* too deep for the rack, so it had been installed vertically. By looping some fairly hefty chain through the handles on either side of the front of the chassis, and then bolting the four chain ends to the four rack posts. I wish I'd kept pictures of that one. Not flammable, but a serious WTF moment.

Cheers,
Tim.
Re: Famous operational issues [ In reply to ]
Normally I reference this as an example of terrible government
bureaucracy, but in this case it's also how said bureaucracy can delay
operational changes.

I was a contractor for one of the many branches of the DoD in charge
of the network at a moderate-sized site. I'd been there about 4
months, and it was my first job with FedGov. I was sent a pair of
Cisco 6509-E routers, with all supervisors and blades needed, along
with a small mountain of SFPs, to replace the non-E 6509s we had
installed that were still using GBICs for their downlinks. These were
the distro switches for approximately half the site.

Problem was, we needed 84 new SC-LC fiber jumpers to replace the SC-SC
we had in place for the existing switch - GBICs to SFPs remember. We
hadn't received any with the shipment. So I reached out to the project
manager to ask about getting the fiber jumpers. "Oh, that should be
coming from the server farm folks, since it's being installed in a
server farm." Okay, that seems stupid to me, but $FedGov, who knows. I
tell him we're stalled out until we get those cables - we have the
routers configured and ready to go, just need the jumpers, can he get
them from the server farm folks? He'll do that.

It took FIFTEEN MONTHS to hash out who was going to pay for and order
the fiber jumpers. Any number of times as the months dragged on, I
seriously considered ordering them on Amazon Prime using my corporate
card. We had them installed a week and a half after we got them. Why
that long? Because we had to completely reconfigure them, and after 15
months, the urgency just wasn't there.

By the way, the project ended up buying them, not the server farm team.

On Tue, Feb 16, 2021 at 2:38 PM John Kristoff <jtk@dataplane.org> wrote:
>
> Friends,
>
> I'd like to start a thread about the most famous and widespread Internet
> operational issues, outages or implementation incompatibilities you
> have seen.
>
> Which examples would make up your top three?
>
> To get things started, I'd suggest the AS 7007 event is perhaps the
> most notorious and likely to top many lists including mine. So if
> that is one for you I'm asking for just two more.
>
> I'm particularly interested in this as the first step in developing a
> future NANOG session. I'd be particularly interested in any issues
> that also identify key individuals that might still be around and
> interested in participating in a retrospective. I already have someone
> that is willing to talk about AS 7007, which shouldn't be hard to guess
> who.
>
> Thanks in advance for your suggestions,
>
> John
Re: Famous operational issues [ In reply to ]
A few I remember:

    . Some monitoring server SCSI drive failed (we're talking
State/Province level govt)... Got a reply back stating there would be a
6 month delay to get a replacement...

        Ended up choosing to use my own drive instead of leaving
something that could have been deadly, unmonitored.

    . Metro interruption during rush hour (for a pop of 4M) due to
an overloaded power bar in a MMR (Meet Me Room) during an unplanned deployment;

    . Cherry red and very angry looking 520-600V bus bar =D;

    . Fire fighters hitting the building generator emergency STOP
button because some neighbor reported smoke on top of the building
during a black out...
    ( not their fault, local gov failure as usual )

    . Some idiots poured gasoline into a large pipe under a bridge... 
ended up demonstrating the lack of diversity to the DCs on that urban
island;

    . Underground transformer blew up in downtown Mtl and took out the
entire fiber bundle, demonstrating to those customers that their
diversity was actually real =D.

        (took them a year to get that fixed)

and

    . Obviously: Any rack cabling I do...

-----
Alain Hebert ahebert@pubnix.net
PubNIX Inc.
50 boul. St-Charles
P.O. Box 26770 Beaconsfield, Quebec H9W 6G7
Tel: 514-990-5911 http://www.pubnix.net Fax: 514-990-9443

On 2/18/21 2:37 PM, tim@pelican.org wrote:
> On Thursday, 18 February, 2021 16:23, "Seth Mattinen" <sethm@rollernet.us> said:
>
>> I had a customer that tried to stack their servers - no rails except the
>> bottom most one - using 2x4's between each server. Up until then I
>> hadn't imagined anyone would want to fill their cabinet with wood, so I
>> made a rule to ban wood and anything tangentially related (cardboard,
>> paper, plastic, etc.). Easier to just ban all things. Fire reasons too
>> but mainly I thought a cabinet full of wood was too stupid to allow.
> On the "stupid racking" front, I give you most of a rack dedicated to a single server. Not all that high a server, maybe 2U or so, but *way* too deep for the rack, so it had been installed vertically. By looping some fairly hefty chain through the handles on either side of the front of the chassis, and then bolting the four chain ends to the four rack posts. I wish I'd kept pictures of that one. Not flammable, but a serious WTF moment.
>
> Cheers,
> Tim.
>
>
Re: Famous operational issues [ In reply to ]
On Thu, Feb 18, 2021 at 01:07:01AM -0800, Eric Kuhnke wrote:
> On that note, I'd be very interested in hearing stories of actual incidents
> that are the cause of why cardboard boxes are banned in many facilities,

the datacenter manager's daughter's cat.

--
Henry Yen Aegis Information Systems, Inc.
Senior Systems Programmer Hicksville, New York
Re: Famous operational issues [ In reply to ]
On Thu, Feb 18, 2021 at 8:31 AM Jared Mauch <jared@puck.nether.net> wrote:

> On Thu, Feb 18, 2021 at 01:07:01AM -0800, Eric Kuhnke wrote:
> > On that note, I'd be very interested in hearing stories of actual
> incidents
> > that are the cause of why cardboard boxes are banned in many facilities,
> > due to loose particulate matter getting into the air and setting off very
> > sensitive fire detection systems.
> >
> > Or maybe it's more mundane and 99% of the reason is people unpack stuff
> and
> > don't always clean up properly after themselves.
>
> We had a plastic bag sucked into the intake of a router in a
> datacenter once that caused it to overheat and take the site down. We
> had cameras in our cage and I remember seeing the photo from the site of
> the colo (I'll protect their name just because) taken as the tech was on
> the phone and pulled the bag out of the router.
>
> The time from the thermal warning syslog that it's getting warm
> to overheat and shutdown is short enough you can't really get a tech to
> the cage in time to prevent it.
>


1: A previous employer was a large customer of a (now defunct) L3 switch
vendor. The AC power inputs were along the bottom of the power supply, and
the big aluminium heatsinks in the power supplies were just above the AC
socket.
Anyway, the subcontractor who made the power supplies for the vendor
realized that they could save a few cents by not installing the little
metal clip that held the heatsink to the MOSFET, and instead relying on the
thermal adhesive to hold it...
This worked fine, until a certain number of hours had passed, at which
point the goop would dry out and the heatsink would fall down, directly
across the AC socket.... This would A: trip the circuit that this was on,
but, more excitingly, set the aluminum on fire, which would then ignite the
other heatsinks in the PSU, leading to much fire...

2: A somewhat similar thing would happen with the Ascend TNT Max, which had
side-to-side airflow. These were dial termination boxes, and so people
would install racks and racks of them. The first one would draw in cool air
on the left, heat it up and ship it out the right. The next one over would
draw in warm air on the left, heat it up further, and ship it out the
right... Somewhere there is a fairly famous photo of a rack of TNT Maxes,
with the final one literally on fire, and still passing packets.
There is a related (and probably apocryphal) story regarding the launch of the
TNT. It was being shipped for a major trade-show, but got stuck in customs.
After many bizarre calls with the customs folk, someone goes to the customs
office to try and sort it out, and gets greeted by customs agents with guns.
They all walk into the warehouse, and discover that there is a large empty
area around the crate, which is a wooden cube, with "TNT" stencilled in big
red letters...

3: I used to work for a small ISP in Yonkers, NY. We had a customer in
Florida, and on a Friday morning their site goes down. We (of course) have
not paid for Cisco 4 hour support (or, honestly, any support) and they have
a strict SLA, so we are a little stuck.
We end up driving to JFK, and lugging a fully loaded Cisco 7507 to the
check in counter. It was just before the last flight of the day, so we
shrugged and said it was my checked bag. The excess baggage charges were
eye-watering, but it rode the conveyor belt with the rest of the luggage
onto the plane. It arrived with just a bent ejector handle, and the rest
was fine.

4: Not too long after I started doing networking (and for the same small
ISP in Yonkers), I'm flying off to install a new customer. I (of course)
think that I'm hot stuff because I'm going to do the install, configure the
router, whee, look at me! Anyway, I don't want to check a bag, and so I
stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all
pre-9/11!). I'm going through security and the TSA[0] person opens my bag
and pulls the router out. "What's this?!" he asks. I politely tell him that
it's a router. He says it's not. I'm still thinking that I'm the new
hotness, and so I tell him in a somewhat condescending way that it is, and
I know what I'm talking about. He tells me that it's not a router, and is
starting to get annoyed. I explain using my "talking to a 5 year old" voice
that it most certainly is a router. He tells me that lying to airport
security is a federal offense, and starts looming at me. I adjust my
attitude and start explaining that it's like a computer and makes the
Internet work. He gruffly hands me back the router, I put it in my bag and
scurry away. As I do so, I hear him telling his colleague that it wasn't a
router, and that he certainly knows what a router is, because he does
woodwork...

5: Another one. In the early 2000s I was working for a dot-com boom
company. We are building out our first datacenter, and I'm installing a
pair of Cisco 7206s in 811 10th Ave. These will run basically the entire
company, we have some transit, we have some peering to configure, we have
an AS, etc. I'm going to be configuring all of this; clearly I'm a
router-god...
Anyway, while I'm getting things configured, this janitor comes past,
wheeling a garbage bin. He stops outside the cage and says "Whatcha
doin'?". I go into this long explanation of how these "routers" <point>
will connect to "the Internet" <wave hands in a big circle> to allow my
"servers" <gesture at big black boxes with blinking lights> to talk to
other "computers" <typing motion> on "the Internet" <again with the waving
of the hands>. He pauses for a second, and says "'K. So, you doing a full
iBGP mesh, or confeds?". I really hadn't intended to be a condescending
ass, but I think of that every time I realize I might be assuming something
about someone based on their attire/job/etc.





W
[0]: Well, technically pre-TSA, but I cannot remember what we used to call
airport security pre-TSA...



>
> I assume also the latter above, which is people have varying
> definitions of clean.
>
> - Jared
>
> --
> Jared Mauch | pgp key available via finger from jared@puck.nether.net
> clue++; | http://puck.nether.net/~jared/ My statements are only
> mine.
>


--
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
-- E. W. Dijkstra
Re: Famous operational issues [ In reply to ]
On Thu, 2021-02-18 at 17:37 -0500, Warren Kumari wrote:
> Anyway, the subcontractor who made the power supplies for the vendor
> realized that they could save a few cents by not installing the
> little metal clip that held the heatsink to the MOSFET

I think it was Machiavelli who said that one should not ascribe to
malice anything adequately explained by incompetence...

> 3: I used to work for a small ISP in Yonkers, NY.

There is actually a place called "Yonkers"?!? I always thought it was a
joke placename. We don't really need joke placenames in Oz, since we
have real ones like Woolloomooloo, Burpengary and Humpty Doo. My
favourite is Numbugga (closely followed by Wonglepong).

> I cannot remember what we used to call airport security pre-TSA...

"Useful"?

Regards, K.

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Karl Auer (kauer@biplane.com.au)
http://www.biplane.com.au/kauer

GPG fingerprint: 2561 E9EC D868 E73C 8AF1 49CF EE50 4B1D CCA1 5170
Old fingerprint: 8D08 9CAA 649A AFEF E862 062A 2E97 42D4 A2A0 616D
Re: Famous operational issues [ In reply to ]
warren> 2: A somewhat similar thing would happen with the Ascend TNT
warren> Max, which had side-to-side airflow. These were dial termination
warren> boxes, and so people would install racks and racks of them. The
warren> first one would draw in cool air on the left, heat it up and
warren> ship it out the right. The next one over would draw in warm air
warren> on the left, heat it up further, and ship it out the
warren> right... Somewhere there is a fairly famous photo of a rack of
warren> TNT Maxes, with the final one literally on fire, and still
warren> passing packets.

The Ascend MAX (TNT was the T3 version, MAX took 2 T1s) was originally
an ISDN device. We got the first V.34 Rockwell modem version for
testing. An individual card had 4 daughter boards. They were burned in
for 24 hours at Ascend, then shipped to us. We were doing stress testing
in Fairfax VA. Turns out that the boards started to overheat at about 30
hours and caught fire a few hours after that... Completely melted the
daughterboards. They did fix that issue and upped the burn-in test period
to 48 hours.

And yeah, they vented side to side. They were designed for enclosed
racks where air flow was forced up. We were colocating at telco POPs so
we had to use center mount open relay racks. The air flow was as you
describe. Good time. Had by all...

Both we (UUNET, for MSN and Earthlink) and AOL were using these for
dialup access. 80k ports before we switched to the TNTs, 3+ million
ports on TNTs by the time I stopped paying attention.
Re: Famous operational issues [ In reply to ]
On 2021-02-17 13:28, John Kristoff wrote:
> On Wed, 17 Feb 2021 14:07:54 -0500
> John Curran <jcurrran@istaff.org> wrote:
>
>> I have no idea what outages were most memorable for others, but the
>> Stanford transfer switch explosion in October 1996 resulted in much
>> of the Internet in the Bay Area simply not being reachable for
>> several days.
>
> Thanks John.
>
> This reminds me of two I've not seen anyone mention yet. Both
> coincidentally in the Chicago area that I learned about before my entry
> into netops full time. One was a flood:
>
> <https://en.wikipedia.org/wiki/Chicago_flood>
>
> The other, at the dawn of an earlier era:
>
>
> <http://telecom-digest.org/telecom-archives/TELECOM_Digest_Online/1309.html>
>
> I wouldn't necessarily put those two in the top 3, but by some standard
> for many they were certainly very significant and noteworthy.
>
> John

Thanks for sharing these links John. I was personally affected by the
Hinsdale CO fire when I was a kid. At the time, my family lived on the
southern border of Hinsdale in the adjacent town of Burr Ridge. It was
weird, like a power outage: you're reminded of the loss of service every
time you perform the simple act of requesting service, picking up the
phone or toggling a light switch. But it lasted a lot longer than any
loss of power: It was six or seven weeks that, to this day, felt a lot
longer.

Anytime we needed to talk to someone long-distance, we had to drive to a
cousin's house to make the call. To talk to anyone local, you'd have to
physically go and show up unannounced. At 11 years old, I was the
bicycle messenger between our house and my great-grandmother, who lived
about two blocks away. My mother and father kept the cars gassed up and
extra fuel on hand in case there was an emergency.

Dad ran a home improvement business out of the house, so new business
ground to a halt. Mom worked for a publishing company, so their release
dates were impacted. The local grocery store's scanners wouldn't work,
so they had to punch the orders into the register by hand, using the
paper sticker prices on the items.

I clearly remember from the local papers that they had to special-order
the replacement 5ESS at enormous cost. I saw the big brick building
after the fire with the burn marks around the front door. In late May
and early June, the Greyhound buses with the workers were parked around
the block, power plants outside with huge cables snaking in right
through the wide open front door.

When we heard that dial tone at last, everyone was happier than an
iPhone with full bars. Lol

We're spoiled for choice in telecom networks these days. Also,
facilities management have learned plenty of lessons since then. Like,
install and maintain an FM-200 fire suppression system. But
nevertheless, sometimes when I step into a colo, I think of that outage
and the impact it had.

-Brian
Re: Famous operational issues [ In reply to ]
On Feb 18, 2021, at 6:10 PM, Karl Auer <kauer@biplane.com.au> wrote:
>
> I think it was Machiavelli who said that one should not ascribe to
> malice anything adequately explained by incompetence…

https://en.wikipedia.org/wiki/Hanlon%27s_razor
Never attribute to malice that which is adequately explained by stupidity.

I personally prefer this version from Robert A. Heinlein:
Never underestimate the power of human stupidity.

And to put it on topic, cover your EPOs

In 1994, there was a major earthquake near the city of Los Angeles. City hall had to be evacuated and it would take over a year to reinforce the building to make it habitable again. My company moved all the systems in the basement of city hall to a new datacenter a mile or so away. After the install, we spent more than a week coaxing their ancient (even for 1994) machines back online, such as a Prime Computer and an AS400 with tons of DASD. Well, tons of cabinets, certainly less storage than my watch has now.

I was in the DC going over something with the lady in charge when someone walked in to ask her something. She said “just a second”. That person took one step to the side of the door and leaned against the wall - right on an EPO which had no cover.

Have you ever heard an entire row of DASD spin down instantly? Or taken 40 minutes to IPL an AS400? In the middle of the business day? For the second most populous city in the country?

Me: Maybe you should get a cover for that?
Her: Good idea.

A couple of weeks later, in the same DC, going over the final checklist. A FedEx guy walks in. (To this day, no idea how he got into a supposedly locked DC.) She says “just a second”, and I get a very strong deja vu feeling. He takes one step to the side and leans against the wall.

Me: Did you order that EPO cover?
Her: Nope.

--
TTFN,
patrick
Re: Famous operational issues [ In reply to ]
when employer had shipped 2xJ to london, had the circuits up, ...
the local office sat on their hands. for weeks. i finally was
pissed enough to throw my toolbag over my shoulder, get on a
plane, and fly over. i walked into the fancy office and said
"hi, i am randy, vp eng, here to help you turn up the routers."
they managed to turn them up pretty quickly.
Re: Famous operational issues [ In reply to ]
> On Feb 18, 2021, at 4:37 PM, Warren Kumari <warren@kumari.net> wrote:
>
> 4: Not too long after I started doing networking (and for the same small ISP in Yonkers), I'm flying off to install a new customer. I (of course) think that I'm hot stuff because I'm going to do the install, configure the router, whee, look at me! Anyway, I don't want to check a bag, and so I stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all pre-9/11!). I'm going through security and the TSA[0] person opens my bag and pulls the router out. "What's this?!" he asks. I politely tell him that it's a router. He says it's not. I'm still thinking that I'm the new hotness, and so I tell him in a somewhat condescending way that it is, and I know what I'm talking about. He tells me that it's not a router, and is starting to get annoyed. I explain using my "talking to a 5 year old" voice that it most certainly is a router. He tells me that lying to airport security is a federal offense, and starts looming at me. I adjust my attitude and start explaining that it's like a computer and makes the Internet work. He gruffly hands me back the router, I put it in my bag and scurry away. As I do so, I hear him telling his colleague that it wasn't a router, and that he certainly knows what a router is, because he does woodwork…

Well, in his defense, he wasn’t wrong… :-)



----
Andy Ringsmuth
5609 Harding Drive
Lincoln, NE 68521-5831
(402) 304-0083
andy@andyring.com

“Better even die free, than to live slaves.” - Frederick Douglass, 1863
Re: Famous operational issues [ In reply to ]
On 2/19/21 00:37, Warren Kumari wrote:

>
> 5: Another one. In the early 2000s I was working for a dot-com boom
> company. We are building out our first datacenter, and I'm installing
> a pair of Cisco 7206s in 811 10th Ave. These will run basically the
> entire company, we have some transit, we have some peering to
> configure, we have an AS, etc. I'm going to be configuring all of
> this; clearly I'm a router-god...
> Anyway, while I'm getting things configured, this janitor comes past,
> wheeling a garbage bin. He stops outside the cage and says "Whatcha
> doin'?". I go into this long explanation of how these "routers"
> <point> will connect to "the Internet" <wave hands in a big circle> to
> allow my "servers" <gesture at big black boxes with blinking lights>
> to talk to other "computers" <typing motion> on "the Internet" <again
> with the waving of the hands>. He pauses for a second, and says "'K.
> So, you doing a full iBGP mesh, or confeds?". I really hadn't intended
> to be a condescending ass, but I think of that every time I realize I
> might be assuming something about someone based on their attire/job/etc.

:-), cute.

Mark.
Re: Famous operational issues [ In reply to ]
On Fri, Feb 19, 2021 at 9:40 AM Warren Kumari <warren@kumari.net> wrote:
> 4: Not too long after I started doing networking (and for the same small ISP in Yonkers), I'm flying off to install a new customer. I (of course) think that I'm hot stuff because I'm going to do the install, configure the router, whee, look at me! Anyway, I don't want to check a bag, and so I stuff the Cisco 2501 in a carryon bag, along with tools, etc (this was all pre-9/11!). I'm going through security and the TSA[0] person opens my bag and pulls the router out. "What's this?!" he asks. I politely tell him that it's a router. He says it's not. I'm still thinking that I'm the new hotness, and so I tell him in a somewhat condescending way that it is, and I know what I'm talking about. He tells me that it's not a router, and is starting to get annoyed. I explain using my "talking to a 5 year old" voice that it most certainly is a router. He tells me that lying to airport security is a federal offense, and starts looming at me. I adjust my attitude and start explaining that it's like a computer and makes the Internet work. He gruffly hands me back the router, I put it in my bag and scurry away. As I do so, I hear him telling his colleague that it wasn't a router, and that he certainly knows what a router is, because he does woodwork...

OK, Warren, achievement unlocked. You've just made a network engineer
google 'router'....

P.S. I guess I'm obliged to tell a story if I respond to this thread...so...
"Servers and the ice cream factory".
Late spring/early summer in Moscow. The temperature was above 30°C (86°F).
I worked for a local content provider.
The aircons in our server room died, and the technician's ETA was 2 days (I
guess we were not the only ones with aircon problems).
So we drove to the nearby ice cream factory and got *a lot* of dry
ice. Then we had a roster: every few hours one person took a deep
breath, grabbed a box of dry ice, ran into the server room and emptied
the box on top of the racks. The backup person was watching through
the glass door - just in case, you know, ready to start the rescue
operation.
We (and the servers) survived till the technician arrived. And we had
a lot of dry ice to cool the beer...

--
SY, Jen Linkova aka Furry
