Mailing List Archive

Re: outages, quality monitoring, trouble tickets, etc
On Mon, 27 Nov 1995, Dave Siegel wrote:

> Seriously, though. "Acts of God" as they are called are typically not
> counted either. When it comes right down to it, if the customer no longer
> has a premise, then they are probably less worried about their equipment
> being up and operational. Technically, it only matters if you define your
> network as extending up to, and perhaps beyond CPE, or if it is just before
> CPE, or even further, if it ends right at the Bell demarc at your POP.

I think there is a big difference between Customer Premises Equipment, for
which the ISP is responsible (if the equipment is theirs), and the customer
premises, for which the ISP is not, and should not be, responsible.

> Since RTD purchases all the circuits for clients, our network (read, area of
> responsibility) includes everything up to the CPE, but not CPE itself. If
> there are circuits down to individual clients as a result of Bell, we don't
> count it as downtime, even though it is logged and pursued as any other outage.

While telco faults are not the fault of the ISP, doesn't it nonetheless
mean that the customer doesn't have connectivity? I guess this depends on
whether you try to arrive at some sort of a metric from the provider's or
the customer's perspective.

> Why do I believe that? hehe.

Oh, I've seen everything from floods to tornados. High schools seem to be
particularly vulnerable to such things, not to mention things like kids
tripping over power cords and such. :)

-dorian
______________________________________________________________________________
Dorian Kim Email: dorian@cic.net 2901 Hubbard Drive
Network Engineer Phone: (313)998-6976 Ann Arbor MI 48105
CICNet Network Systems Fax: (313)998-6105 http://www.cic.net/~dorian
Re: outages, quality monitoring, trouble tickets, etc
At 5:34 PM 11/27/95, Dave Siegel wrote:
>>
>> It is because of these sorts of questions that I never know what
>> it means when someone says that their network is available 99.9%
>
>Problems with CPE are usually not figured into network uptime.
>

I've been away from the NOC for a couple of years now, but one way that
made sense for reporting a general availability number is to consider the
backbone as one component and each leaf node as an identical independent
component (leased line, router and CSU/DSU).

Then you can define the end-to-end availability as the concatenation of
two independent leaf nodes and the backbone.

Highly redundant backbones always have a high availability (>99%). Leased
lines are abominable (low 90s or less -- don't ask me about circuits to
Italy) and two routers and two CSU/DSUs lead to an overall end-to-end
figure of 95% or less.
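
To make the arithmetic concrete: availabilities of independent components
in series just multiply. A rough Python sketch, using the ballpark figures
above rather than any real measurements:

    # end-to-end path: leaf node -- backbone -- leaf node
    def series(*availabilities):
        a = 1.0
        for x in availabilities:
            a *= x
        return a

    backbone = 0.999   # highly redundant backbone, >99%
    leaf     = 0.93    # leased line + router + CSU/DSU, low 90s
    print("end-to-end: %.1f%%" % (100 * series(leaf, backbone, leaf)))
    # roughly 86%; even 0.97 leaf nodes only reach about 94%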

In my experience, leased lines always dominated this equation, and overall
availability is strictly limited by actual leased-line availability.
So keep that CPE in the equation and figure out how to get highly available
bit pipes. I think actual networks like frame relay have a chance of being
much more available since the network actually understands frames. Leased
lines are simply electrical signalling conduits to the phone company -- not
a network.

I'm skeptical of any end-to-end availability figures over 97%. I don't
think they reflect the reality of leased line circuits today, or else they
don't include the leaf node circuits and only report backbone availability.
For a highly redundant backbone, almost any definition of availability
should result in a number like 99.mumble%. Remember 99.9% availability
means less than 9 hours outage per year. Routing hiccups take that much.
One or two leased lines outages is all you get for 9 hours. The real world
is a lot less available than that.
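
For reference, the downtime budgets behind those percentages (simple
back-of-the-envelope arithmetic, sketched in Python):

    HOURS_PER_YEAR = 8760

    def downtime_hours_per_year(availability):
        return (1.0 - availability) * HOURS_PER_YEAR

    for a in (0.999, 0.985, 0.97):
        print("%.1f%% -> %5.1f hours/year (%4.1f hours/month)"
              % (a * 100, downtime_hours_per_year(a),
                 downtime_hours_per_year(a) / 12))
    # 99.9% -> 8.8 h/yr (0.7 h/month); 98.5% -> 131.4 h/yr (~11 h/month);
    # 97%   -> 262.8 h/yr (~22 h/month)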

But since half the web servers I try to talk to refuse me half the time,
I'm not sure that network availability per se (HWB's complaints duly
acknowledged) is the tallest pole in the tent.

--Kent
Re: outages, quality monitoring, trouble tickets, etc
On Mon, 27 Nov 1995, Jon Zeeff wrote:

> Being in the web hosting business, we measure our own "availability"
> and that of others. The top providers do 99.9%. The median
> in our sample group is 98.5%. That's about 15 times worse and
> 11 hours/month.
>
> It's amazing how many of the companies in the low 98s claim 99.9%.

I'd be terribly interested to know how you obtained these figures...we do
web hosting services as well. I had one of our clients complain angrily
for weeks that his web site was frequently "down" because he couldn't get
to it from AOL. I had to sit him down and show that his site was
operational and accessible from a dozen other sites to convince him that
AOL was the exception, and that our connectivity and server reliability
were not to blame.

I find it hard to believe that many providers could offer only 98%
reliability (assuming, of course, that this is a measurable quantity;
this is shaky ground); this implies that over an average period of 100
hours (less than 4.2 days), there exists a total of 2 _hours_ of
"downtime" (assuming, again, that this, too, is determinable in any
meaningful sense).

> We also offer guarantees to some of our customers. If we don't meet
> x% availability, we refund $xx. It's not enough to break us, but
> it does reassure the customer that we are concerned.

Do you take the customer's word for it?

> If more providers did this, we would probably see much more
> rapid progress towards more reliable networks. As it is, nobody
> has a quantifiable cost for "unreliability".

I think the reason for this is obvious. If a customer complains that
your network is unreliable because he can't reach it from point X, do you
give him a refund? Not all of us can afford that...I know we get more
than a few complaints of this type every month.

// Matt Zimmerman Chief of System Management NetRail, Inc.
// mdz@netrail.net sales@netrail.net
// (703) 524-4800 [voice] (703) 524-4802 [data] (703) 534-5033 [fax]
Re: outages, quality monitoring, trouble tickets, etc
>From: kwe@6SigmaNets.COM (Kent W. England)
>I'm skeptical of any end-to-end availability figures over 97%. I don't
>think they reflect the reality of leased line circuits today, or else they
>don't include the leaf node circuits and only report backbone availability.
>For a highly redundant backbone, almost any definition of availability
>should result in a number like 99.mumble%. Remember 99.9% availability
>means less than 9 hours outage per year. Routing hiccups take that much.
>One or two leased lines outages is all you get for 9 hours. The real world
>is a lot less available than that.

Thank you! I thought I was living in a twilight zone with people
reporting 99.9% network availability. This is the rathole of end-to-end
network usability. The customer is interested in end-to-end usability,
while the network operator can only easily measure intra-network modules.

I can't tell you the answer, but there is definitely something happening
with customer perceptions of Internet usability. Looking at the
numbers, I would agree a single leased circuit should be less reliable
(single point of failure) than a highly redundant backbone. But by
our customers' perceptions, that isn't the case. Either we have better
than "normal" leased circuits, or the highly redundant backbones aren't,
or our customers' needs are based on something we aren't directly measuring.

Highly redundant backbones remain extremely vulnerable to the "glitch."
Human glitches, software glitches, "impossible" data glitches. Redundant
backbones do protect against the backhoe "glitch."

>But since half the web servers I try to talk to refuse me half the time,
>I'm not sure that network availability per se (HWB's complaints duly
>acknowledged) is the tallest pole in the tent.

Part of this problem is the growing number of interdependencies (complexity,
chaos?). Even if each individual module is working 99.9% of the time,
the probabilities start looking pretty bad when all need to be working
at the same time. To make a web connection, you have a string of name
servers, a string of networks to the name servers, a string of routers
on those networks, another string of networks to the web server, another
string of routers, more strings of networks and routers and servers on
the return path.
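
A quick illustration of how fast that compounds (assuming, unrealistically,
that each module is independent and 99.9% available -- a toy Python sketch):

    module = 0.999
    for n in (5, 10, 20, 50):
        print("%2d modules in series -> %.1f%% chance all are up"
              % (n, 100 * module ** n))
    # 10 modules -> ~99.0%, 50 modules -> ~95.1%; drop each module to 99%
    # and 50 of them are all up only about 60.5% of the time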

I'm amazed it even works 50% of the time. Unfortunately our customers
aren't always as understanding.

Since error reporting sucks in most network applications, it becomes
the fault of whatever help desk happens to take the customer's phone call.
--
Sean Donelan, Data Research Associates, Inc, St. Louis, MO
Affiliation given for identification not representation
Re: outages, quality monitoring, trouble tickets, etc
>On Mon, 27 Nov 1995, Jon Zeeff wrote:
>
>> Being in the web hosting business, we measure our own "availability"
>> and that of others. The top providers do 99.9%. The median
>> in our sample group is 98.5%. That's about 15 times worse and
>> 11 hours/month.
>>
>> It's amazing how many of the companies in the low 98s claim 99.9%.
>
>I'd be terribly interested to know how you obtained these figures...we do
>web hosting services as well. I had one of our clients complain angrily
>for weeks that his web site was frequently "down" because he couldn't get
>to it from AOL. I had to sit him down and show that his site was
>operational and accessible from a dozen other sites to convince him that
>AOL was the exception, and that our connectivity and server reliability
>were not to blame.

I've seen several providers claim "it's AOL's fault" because the provider
themselves didn't properly set up an RADB entry so the ANS network would
carry their packets. Given that AOL is on the other side of the ANS
network, that could be a big problem. So, just sitting down and showing
the site is accessible from a dozen other sites proves nothing about it
not being the provider's problem -- it helps to show that it works in
some cases though.

Ed Morin
Northwest Nexus, Inc.
_________________________________________________________________________
Ed Morin edm@halcyon.com
Northwest Nexus - Professional Internet Services Bellevue, WA USA
Voice: 206 455-3505 Web: http://www.halcyon.com/ Info: info@halcyon.com
Re: outages, quality monitoring, trouble tickets, etc
>"it's AOL's fault" because the provider themselves didn't properly
>set up an RADB entry so the ANS network ...

I find it amusing that the only utility ascribed to the RADB
by network operators almost always has to do with being
reachable from ANS.

It's also tragic, actually. I'm glad that my taxes go up to
the receiver-general for Canada sometimes rather than going
to a project which frequently appears to benefit one and only
one U.S.-based network service provider.

Anyway, typically the only time we hear about problems with
respect to the RADB is when someone wants to talk to AOL
and cannot, or when some ANS customer can't get to anything
behind us or somewhere downstream from us.

It'd be very interesting to see what happens to the RADB the
second AOL has Internet connectivity through another
provider... It'd certainly reduce the incidence of people
updating the RADB to fix disconnectivities.

Sean.
Re: outages, quality monitoring, trouble tickets, etc
> I don't know. One of the most frequent problems I see is power outages at
> the sites. I don't think any ISP can be responsible for things like that.

Let's assume that you measure "availability" as being able to ping the
customer site router from some site off your network. Some ISPs install a
UPS on the router at the customer site, some don't. That decision affects
price and reliability and "availability" (especially when the customer
has a UPS on his own equipment).
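
A crude sketch of that kind of measurement (Python; the host name,
interval, and sample count are invented for illustration, and a real
monitor would log timestamps and probe from more than one location):

    import os, time

    HOST = "cust-router.example.net"   # hypothetical customer-site router
    INTERVAL = 300                     # seconds between probes
    SAMPLES = 288                      # one day at 5-minute intervals

    up = 0
    for _ in range(SAMPLES):
        # "ping -c 1" returns exit status 0 if a reply came back
        if os.system("ping -c 1 %s >/dev/null 2>&1" % HOST) == 0:
            up += 1
        time.sleep(INTERVAL)

    print("measured availability: %.2f%%" % (100.0 * up / SAMPLES))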

Or let's look at lightning strikes. Some ISPs say "that's an act of
God, we can't be responsible". Others put in surge suppressors, have
hot-swap backups, etc. and they keep running.

Some ISPs say "that's the fault of my upstream provider, we can't
be responsible". Others tell their upstream provider to do a better
job or they switch providers.

Much of it sounds like a cop-out to me. Ultimately (with enough money),
there are few reliability factors in this business that are truly out of
anyone's control. It's just very convenient to say that there are.
Re: outages, quality monitoring, trouble tickets, etc
On Tue, 28 Nov 1995, Ed Morin wrote:

[AOL unreachability]
> I've seen several providers claim "it's AOL's fault" because the provider
> themselves didn't properly set up an RADB entry so the ANS network would
> carry their packets. Given that AOL is on the other side of the ANS
> network, that could be a big problem. So, just sitting down and showing
> the site is accessible from a dozen other sites proves nothing about it
> not being the provider's problem -- it helps to show that it works in
> some cases though.

Hmm. I was under the impression that our maintainer object
(NETRAIL-NOC), AS object (AS4006), and route object (205.215.0.0/18,
containing the server in question) were sufficient. Do I need to go
through some additional motions to satisfy ANS? Also, this seems to
only be an intermittent problem (I would think that ANS not carrying our
packets would result in complete unreachability).

// Matt Zimmerman Chief of System Management NetRail, Inc.
// mdz@netrail.net sales@netrail.net
// (703) 524-4800 [voice] (703) 524-4802 [data] (703) 534-5033 [fax]
Re: outages, quality monitoring, trouble tickets, etc
In message <Pine.LNX.3.91.951129134921.25369D-100000@netrail.net>, Matt Zimmerman writes:
> On Tue, 28 Nov 1995, Ed Morin wrote:
>
> [AOL unreachability]
> > I've seen several providers claim "it's AOL's fault" because the provider
> > themselves didn't properly set up an RADB entry so the ANS network would
> > carry their packets. Given that AOL is on the other side of the ANS
> > network, that could be a big problem. So, just sitting down and showing
> > the site is accessible from a dozen other sites proves nothing about it
> > not being the provider's problem -- it helps to show that it works in
> > some cases though.
>
> Hmm. I was under the impression that our maintainer object
> (NETRAIL-NOC), AS object (AS4006), and route object (205.215.0.0/18,
> containing the server in question) were sufficient. Do I need to go
> through some additional motions to satisfy ANS? Also, this seems to
> only be an intermittent problem (I would think that ANS not carrying our
> packets would result in complete unreachability).
>
> // Matt Zimmerman Chief of System Management NetRail, Inc.
> // mdz@netrail.net sales@netrail.net
> // (703) 524-4800 [voice] (703) 524-4802 [data] (703) 534-5033 [fax]


The policy toward AS4006 was set to 1:3561 2:1239 based on the
advisories for the 3 AS4006 nets that existed when we froze the
aut-num. If you add prefixes to AS4006, you don't have to do
anything except to make sure to register route objects with the
correct origin AS.
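
For anyone who hasn't done this: a route object is just a handful of
attributes submitted to the IRR, along these lines. The prefix, origin AS,
and maintainer shown are the ones already mentioned in this thread, and the
object is abbreviated (a real submission also needs a changed: attribute):

    route:   205.215.0.0/18
    descr:   NetRail, Inc.
    origin:  AS4006
    mnt-by:  NETRAIL-NOC
    source:  RADB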

For AS that have never registered an AS690 advisory (there were 20 AS
covering 59 prefixes in the IRR) we didn't have any policy. For AS
that have never registered anything in the IRR, we don't have any
import policy and we won't be importing their routes.

We plan to run a perl program to detect new aut-nums and keep md5 sums
of prior aut-nums so we can detect changes (assuming the changed field
won't get changed). We will be basing any new import policy on the
paths seen in the IRR, just sending a notification message to the AS
affected when we change things. This will give us routing as reliable
as we have now, but place less of a burden on others to tell us how
they want us to route towards them.
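
A rough sketch of the change-detection idea (in Python here rather than
perl; the state file and the way objects arrive are simplified -- a real
tool would pull objects from a whois query or a registry dump and parse
the attributes properly):

    import hashlib, json, os

    STATE = "autnum-md5s.json"      # hypothetical local state file

    def detect_changes(objects):
        # objects: dict mapping "AS4006" -> full aut-num object text
        old = json.load(open(STATE)) if os.path.exists(STATE) else {}
        new, changed = [], []
        for asn, text in objects.items():
            digest = hashlib.md5(text.encode()).hexdigest()
            if asn not in old:
                new.append(asn)
            elif old[asn] != digest:
                changed.append(asn)
            old[asn] = digest
        json.dump(old, open(STATE, "w"))
        return new, changed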

I'm hoping to be able to catch things we need to change by noting
changes to the aut-nums. Right now the tools to do this are not
available, so we need to trace paths manually. It might be that
updating prpaths is about all that is needed.

Curtis
Re: outages, quality monitoring, trouble tickets, etc
On Thu, 30 Nov 1995, Curtis Villamizar wrote:

> The policy toward AS4006 was set to 1:3561 2:1239 based on the
> advisories for the 3 AS4006 nets that existed when we froze the
> aut-num. If you add prefixes to AS4006, you don't have to do
> anything except to make sure to register route objects with the
> correct origin AS.

The prefix in question was one of these three, and our networks seem to
be talking just fine. The fact that our system also sends and receives
hundreds of messages to/from AOL customers every day would seem to suggest
further that the original problem was/is with AOL. Of course, the fact
that a good percentage of their servers don't respond to pings from
here makes it difficult to isolate when this is happening. Are they
just broken in this aspect, or is this another symptom of a connectivity
problem between us? (I just tried this from another location outside our
network, and trying to ping a.mx.aol.com produced a _segfault_ (Solaris
box)...what's going on here?)

// Matt Zimmerman Chief of System Management NetRail, Inc.
// mdz@netrail.net sales@netrail.net
// (703) 524-4800 [voice] (703) 524-4802 [data] (703) 534-5033 [fax]
Re: outages, quality monitoring, trouble tickets, etc
In message <Pine.LNX.3.91.951130161916.15338O-100000@netrail.net>, Matt Zimmerman writes:
> On Thu, 30 Nov 1995, Curtis Villamizar wrote:
>
> > The policy toward AS4006 was set to 1:3561 2:1239 based on the
> > advisories for the 3 AS4006 nets that existed when we froze the
> > aut-num. If you add prefixes to AS4006, you don't have to do
> > anything except to make sure to register route objects with the
> > correct origin AS.
>
> The prefix in question was one of these three, and our networks seem to
> be talking just fine. The fact that our system also sends and receives
> hundreds of messages to/from AOL customers every day would seem to suggest
> further that the original problem was/is with AOL. Of course, the fact
> that a good percentage of their servers don't respond to pings from
> here makes it difficult to isolate when this is happening. Are they
> just broken in this aspect, or is this another symptom of a connectivity
> problem between us? (I just tried this from another location outside our
> network, and trying to ping a.mx.aol.com produced a _segfault_ (Solaris
> box)...what's going on here?)
>
> // Matt Zimmerman Chief of System Management NetRail, Inc.
> // mdz@netrail.net sales@netrail.net
> // (703) 524-4800 [voice] (703) 524-4802 [data] (703) 534-5033 [fax]


Try running traceroute rather than, or in addition to, ping. How about
we take this off-line? This may not be a pressing NANOG issue.

Curtis
Re: outages, quality monitoring, trouble tickets, etc
At 3:43 PM 11/28/95, Sean Donelan wrote:
>
>>From: kwe@6SigmaNets.COM (Kent W. England)
>>I'm skeptical of any end-to-end availability figures over 97%. I don't
>>think they reflect the reality of leased line circuits today, or else they
>>don't include the leaf node circuits and only report backbone availability.
>>For a highly redundant backbone, almost any definition of availability
>>should result in a number like 99.mumble%. Remember 99.9% availability
>>means less than 9 hours outage per year. Routing hiccups take that much.
>>One or two leased lines outages is all you get for 9 hours. The real world
>>is a lot less available than that.
>
>Thank you! I thought I was living in a twilight zone with people
>reporting 99.9% network availability. This is the rathole of end-to-end
>network usability. The customer is interested in end-to-end usability,
>while the network operator can only easily measure intra-network modules.
>
>I can't tell you the answer, but there is definitely something happening
>with customer perceptions of Internet usability. Looking at the
>numbers, I would agree a single leased circuit should be less reliable
>(single point of failure) than a highly redundant backbone. But by
>our customers' perceptions, that isn't the case. Either we have better
>than "normal" leased circuits, or the highly redundant backbones aren't,
>or our customers' needs are based on something we aren't directly measuring.


My analysis was meant to apply within a single service provider's
backbone. I believe you can report end-to-end availability within a single
service provider's backbone and have a meaningful number. But leased lines
bring that figure down into the range of 99.7%. I think frame relay has the
promise of getting that up to 99.8 or 99.85, if the regional Bells get good
at FR net mgmt.

The dismal situation you are referring to is the 12-15 service provider
backbones, all mashing together at 7-8 exchange points with lots of paths
that wind back and forth over asymmetric routes from one coast to the other
and back.

It's a lot different than when NSFnet was the default. No more default.

Today routing is more complex, business is lower margin, and engineering
folks don't have time to look up from their router memory overload and
route cache woes to figure out why their routes through ISP#17 aren't
optimal.

--Kent
Re: outages, quality monitoring, trouble tickets, etc
> Since error reporting sucks in most network applications, it becomes
> the fault of whatever help desk happens to take the customer's phone call.

Bingo!

For ten years now, I've wanted *all* my programs to automatically detect
larger-than-usual delays (more than 5ms?) and start giving *exact* status
reports, such as "your host is doing a DNS query", getting more detailed as
the delay gets worse and pulling the status and error information
dynamically from the intermediaries: "host x.y.z is querying the
nameservers of the BAZ.COM domain on behalf of your host flappy.c.e.baz.com,
and out of seven route servers 4 have failed, currently trying ...
a.root-servers.net ... ICMP unreachable from router X-Y-Z based on data
obtained from ..." and so on until either utter failure or success. The
further the delays stray from what's reasonable, the more status
information gets queried and automatically offered; if any point freezes,
you'd know who is responsible.

Programs should use a common library of routines that would pop up a
window naming the organization responsible for the error at hand and the
proper email address to send complaints to. An automated, standard,
computer-interpretable complaint should be registered automatically,
similar to syslog but internetted, with a button the user can press to
add comments, opinions, etc., or even send further email (attaching
CC's, making local copies, etc.).
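
None of that exists today, but the first piece -- noticing an unusual delay
and at least naming the step you are stuck on -- is easy to sketch (Python;
the threshold, host, and messages are invented purely for illustration):

    import socket, threading

    def narrated(step, func, *args, warn_after=5.0):
        # run func(*args); if it takes longer than warn_after seconds,
        # say what we are waiting on instead of sitting there silently
        done = threading.Event()
        def nag():
            if not done.wait(warn_after):
                print("still waiting on: %s ..." % step)
        threading.Thread(target=nag, daemon=True).start()
        try:
            return func(*args)
        finally:
            done.set()

    host = "www.example.com"        # hypothetical target
    addr = narrated("DNS lookup of " + host, socket.gethostbyname, host)
    sock = narrated("TCP connect to " + addr,
                    socket.create_connection, (addr, 80))

The hard part, of course, is the rest: getting the intermediaries to cough
up *their* status so the message can say whose fault the delay really is.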

I'm amazed it even works 50% of the time, too. I understand why it
doesn't work more often; a zillion pieces. We could use the tools (I
mean computers) at hand better to inform us of these problems.

Yes, I want it to be like my Volvo -- a light lights up whenever a
bulb stops working, but better yet the computer can tell me which
light is out, and can automatically order the spare part from the
right factory. When I'm in Mosaic or Netscape, there's no reason it
shouldn't tell me that a diode in a CSU/DSU just blew out in Wyoming,
owned by Joe Bizzlededof, and that his staff has an average response
time of six hours to fix such a problem, and that his functionaries
have been notified, and whether or not I should expect workarounds and
what actions I would have to take in order to use them (mostly, this
is none -- wait n minutes for routers to find another route and the
route will work again?) And, the same thing with a route missing from
a router table.

Is this crazy?

I don't think so! Everything is getting so complex, we need:

1) to know when users can't use the network; thus the automatic feedback
mechanisms from end-to-end
2) to know what things to fix to minimize #1 (we're not all
trillionaires and get to buy redundant everythings; we have to know
which items manufactured by who and which programs maintained by
who are bound to be more reliable than which others, so we can
choose good items or know which ones need redundancy or other
protective measures).
3) users to know exactly who to blame, so that help desks get
*appropriate* calls rather than *inappropriate* ones. (the
get-calls metaphor is starting to get old; get-email will be more
and more appropriate), and so that users know which organizations
not to spend money on.

In many cases, users are help desks fixing other people's problems.
It's all hierarchical, everyone's the top and everyone's the bottom.

I believe programmers should experiment with these things, and
standards should be drafted, specifically for
feedback-to-user-of-actual-problem and end-to-end automatic error
reporting in *both* directions (so that each side of the connection
and each end of the stack of layers knows what to fix), and
responsible party lookup automation (who to *really* bother when
there's just no other way (right now, 95% of the time)).

Bradley Allen
<Ulmo@Q.Net>
