Mailing List Archive

Level 3 down in Atlanta [ In reply to ]
This is the purpose of learning from your mistakes in the past. Create a
maintenance plan so it doesn't happen again!

Fool me once...

Josh Luthman
Office: 937-552-2340
Direct: 937-552-2343
1100 Wayne St
Suite 1337
Troy, OH 45373

"When you have eliminated the impossible, that which remains, however
improbable, must be the truth."
--- Sir Arthur Conan Doyle


On Thu, Oct 22, 2009 at 10:43 PM, George Herbert
<george.herbert at gmail.com> wrote:

> On Thu, Oct 22, 2009 at 7:03 PM, Jay R. Ashworth <jra at baylink.com> wrote:
> > ----- "Jeremy Chadwick" <outages at jdc.parodius.com> wrote:
> >> On Tue, Oct 20, 2009 at 09:28:21AM -0700, Scott Howard wrote:
> >> > Looks like it's all back up as of about 30 mins ago.
> >> >
> >> > Apparently either a core switch or router failed, which took down much of
> >> > their network in Atlanta, as well as Memphis and Nashville.
> >>
> >> Level 3 has a single router or switch handling packets at a major
> >> POP?
> >> I doubt this, but the outage is confirmation something bad happened.
> >> That said: where's the redundancy, and why didn't it kick in?
> >
> > Oh; you're *always* asking that.
> >
> > :-)
> >
> > The Internet Backbone<tm> has been a commercial, rather than an engineering,
> > construct for over 15 years now.
>
> The RFO that went out somewhat after he asked that was more useful...
> N=2 redundancy was in place. However, when primary had hardware
> failure, secondary had (unknown / unstated) software, config, or
> hardware failure that hadn't been detected or checked, and it didn't
> work either.
>
> It's hard to test clusters of things well when they have near-100%
> uptime requirements. The dependability of the untested failover unit
> is low, as you're not testing it well.
>
> Sometimes you can test failovers in stream. But sometimes those
> supposedly harmless failover tests fail for baroque reasons, taking
> down a service when the primary was in fact just fine.
>
> This isn't (just) an economics problem. Reliability of complex
> systems is a mathematically, exponentially hard problem to crack at
> both the engineering and theoretical levels.
>
> Some people don't try - and get what they deserve - and some people
> give it a good or best commercially reasonable effort, and still fail.
> Doing better than that is really hard.
>
>
> --
> -george william herbert
> george.herbert at gmail.com
> _______________________________________________
> outages mailing list
> outages at outages.org
> https://puck.nether.net/mailman/listinfo/outages
>
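The point above about untested failover units can be put into rough numbers. The following is only a minimal sketch, assuming that latent faults on the standby arrive at a constant rate and that primary and standby failures are independent. The function names and every number in it are hypothetical and are not taken from the Level 3 RFO.

```python
import math


def standby_dead_probability(latent_fault_rate: float, days_since_test: float) -> float:
    """Chance the standby has an undetected fault after going untested.

    Assumes latent faults arrive at a constant rate per day (an exponential
    model), which is a simplification chosen for illustration.
    """
    return 1.0 - math.exp(-latent_fault_rate * days_since_test)


def failover_outage_probability(p_primary_fails: float,
                                latent_fault_rate: float,
                                days_since_test: float) -> float:
    """Chance that the primary fails AND the standby cannot take over.

    Assumes the two failures are independent of each other.
    """
    return p_primary_fails * standby_dead_probability(latent_fault_rate, days_since_test)


if __name__ == "__main__":
    # Hypothetical fault rate of 0.001 per day, for illustration only.
    for days in (7, 30, 365):
        p = standby_dead_probability(0.001, days)
        print(f"untested for {days:3d} days: standby dead with p = {p:.4f}")
```

Under these assumptions, the longer the standby goes without a real failover test, the less the N=2 pair behaves like true redundancy, which is the trade-off described above: testing more often lowers the latent-fault risk but adds the risk of the test itself causing an outage.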
Level 3 down in Atlanta [ In reply to ]
On Thu, 22 Oct 2009, George Herbert wrote:

> On Thu, Oct 22, 2009 at 7:03 PM, Jay R. Ashworth <jra at baylink.com> wrote:

>>> Level 3 has a single router or switch handling packets at a major POP?
>>> I doubt this, but the outage is confirmation something bad happened.
>>> That said: where's the redundancy, and why didn't it kick in?

>> Oh; you're *always* asking that.

> The RFO that went out somewhat after he asked that was more useful...
> N=2 redundancy was in place. However, when primary had hardware
> failure, secondary had (unknown / unstated) software, config, or
> hardware failure that hadn't been detected or checked, and it didn't

I'm not in Atlanta, but from what was mentioned on the list, it was a soft
failure, which is why the other routers didn't fail over w/ HSRP or whatnot:

https://puck.nether.net/pipermail/outages/2009-October/001607.html
https://puck.nether.net/pipermail/outages/2009-October/001608.html

The real question should be why nobody powered down that device the first
or second time, considering it didn't fail over properly the first time.

https://puck.nether.net/pipermail/outages/2009-October/001600.html
https://puck.nether.net/pipermail/outages/2009-October/001611.html

These things happen from time-to-time -- that's the Internet.

--
William R. Lorenz
