Mailing List Archive

outages, quality monitoring, trouble tickets, etc
A bit of a rambling note, as I catch up with the unusually busy lists....

> From: Scott Huddle <huddle@mci.net>
> I consider this list a place for ISPs to discuss general policy and
> planning issues that affect all of us. It is a very inappropriate
> place to discuss problems with a specific provider.

I firmly DISAGREE! None of us are particularly interested in hearing
every jot and tittle about every network flap, but _BIG_ ones and their
resolution are important to bring to this list! How else to get a
handle on what the real problems are? How else to help each other
avoid repeating the problem in the future?

----

As to the current state of the 'net, I have to agree whole-heartedly
with Hans Werner (something I rarely did when he was around here....)

The mindset in NANOG is pretty useless. It would be nice if folks
stopped beating around the bush, worrying about "competitive" issues,
and started cooperating! Leave the competitive posturing to the
marketing departments.

Fixing problems usually means focusing on a particular case. If
analysis of the problems of a particular ISP/NSP sheds some light on the
resolution of a broader problem, then a little embarrassment is a small
price to pay; it's not fatal -- failing to fix the problem is fatal!

----

As to the earlier discussion about Frame Relay instead of direct links,
my experience is that F-R within a LATA between a few routers is working
reasonably well, but inter-LATA and wider is working poorly, and more
than 5-6 routers is a disaster.

Most of my recent link problems that can be pinpointed enough to trouble
ticket have been directly due to F-R, primarily at PSI. A couple of
weeks ago, they lost the entire Great Lakes area, and didn't notice for
over 4.5 hours. And took another 6 hours to fix. They never did tell
me the final solution.

So, based on experience, I don't recommend F-R for long haul links.
It's just not good enough!

----

One of the reasons that ISPs are flapping is the lack of Link Quality
Monitoring. With LQM, you can easily tell when a link is degrading, with
very accurate reports on a packet or byte basis. This is particularly
important for F-R links, as the switches don't seem to tell each other
when the link is down.
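
As a rough illustration of the idea (not the PPP LQM packet format
itself), the sketch below compares how many packets the peer reports
having sent with how many actually arrived, over successive reporting
intervals. The class and field names, the 5% threshold, and the
three-interval rule are all assumptions made for the example, and
counter wraparound is ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CounterSnapshot:
    """Cumulative counters at one reporting instant (hypothetical names)."""
    peer_sent_packets: int   # packets the peer says it has transmitted
    local_recv_packets: int  # packets we have actually received


def loss_ratio(prev: CounterSnapshot, curr: CounterSnapshot) -> float:
    """Fraction of packets lost between two snapshots (0.0 means none)."""
    sent = curr.peer_sent_packets - prev.peer_sent_packets
    received = curr.local_recv_packets - prev.local_recv_packets
    if sent <= 0:
        return 0.0  # nothing was sent in this interval, so nothing to judge
    return max(0.0, (sent - received) / sent)


def link_is_degraded(snapshots: List[CounterSnapshot],
                     threshold: float = 0.05,
                     bad_intervals: int = 3) -> bool:
    """True if each of the last `bad_intervals` intervals exceeds `threshold`."""
    ratios = [loss_ratio(a, b) for a, b in zip(snapshots, snapshots[1:])]
    recent = ratios[-bad_intervals:]
    return len(recent) == bad_intervals and all(r > threshold for r in recent)
```

Requiring several bad intervals in a row is one simple way to avoid
declaring a link down on a single noisy report; a real implementation
would pick its own policy.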

I was surprised to learn that some folks weren't using PPP LQM on high
speed HDLC links. That's why we originally designed it! PPP also runs
over F-R links, even if all you use it for is LQM.

After 4 years, we are finally getting around to raising PPP LQM for
Draft Standard, but it is pretty widely implemented....

Insist on LQM from your router vendors!

----

Has anybody else noticed how hard it is to get trouble tickets these
days? Once upon a time, I just called the NSF NOC, and got a report to
them in real time, so the problem could be fixed quickly. Nowadays,
NOCs seem to want you to send email with 24 or 48 hour turnaround, or go
through 2 layers of service representatives. Pretty hard to send email
to them when their link is down, or go through "regular" support in the
middle of the night!

We really need more folks like MCI with an 800 number. I've found them
very responsive. But then, I've also found that they have fewer
problems than other ISPs I've dealt with lately. Maybe that's because
they get faster problem reports? (See, I can give compliments, too.)

Bill.Simpson@um.cc.umich.edu
Key fingerprint = 2E 07 23 03 C5 62 70 D3 59 B1 4F 5E 1D C2 C1 A2
Re: outages, quality monitoring, trouble tickets, etc
>From: Scott Huddle <huddle@mci.net>
> I consider this list a place for ISPs to discuss general policy and
> planning issues that affect all of us. It is a very inappropriate
> place to discuss problems with a specific provider.

As a general policy I wish all network providers would implement
at a minimum a network notification list. Bonus points for a network
status WWW page, NetNews, FAX, and PR Newswire distribution.

>From: bsimpson@morningstar.COM (William Allen Simpson)
> I firmly DISAGREE! None of us are particularly interested in hearing
> every jot and tittle about every network flap, but _BIG_ ones and their
> resolution are important to bring to this list! How else to get a
> handle on what the real problems are? How else to help each other
> avoid repeating the problem in the future?

This is always a tough call. When is news no longer news? When it
happens all the time. The slow intermittent problems are the hardest
problems to fix. I end up tracking problems from Europe to Australia
for customers, so I'm interested in network outages all over the world.
On the other hand, I don't (usually) care when Bill's PC is turned off
at night. Maybe take a cue from the (failed) NetNews Distribution: or
MBONE-sublist mechanism.

What to do about backbone telco providers which hate announcing
their network problems? Maybe when the NSFNET backbone fractured
into multiple backbones, the NSF should have taken a lesson from
the breakup of AT&T and established a Network Reliability Council.
Any telephone network outage affecting more than 50,000 (now lowered
to 15,000 I believe) lines has to be reported to the FCC's NRC.

Who cares if you connect to three (plus one) NAPs? In the post-NSFNET
era, Internet-wide reliability requires Internet-wide information.
Reporting network reliability problems is more important than how
many NAPs your network connects to.

>Has anybody else noticed how hard it is to get trouble tickets these
>days? Once upon a time, I just called the NSF NOC, and got a report to
>them in real time, so the problem could be fixed quickly. Nowadays,
>NOCs seem to want you to send email with 24 or 48 hour turnaround, or go
>through 2 layers of service representatives. Pretty hard to send email
>to them when their link is down, or go through "regular" support in the
>middle of the night!

Welcome to the new and improved Internet. More clueless people call
NOCs these days (is it plugged in?) so more caller screening is done.
Likewise there are more clueless people working in NOCs, so there are
more levels to get through before reaching someone who even understands
what the problem is.

NOC-to-NOC communication has been a long standing Internet problem. But
now there are more NOCs, more different ways to contact them, with
no common conventions. Even though it's out of date, I still keep my
Internet Manager's Phonebook published by BBN in 1990.

ANS, MCI, and Sprint issued press releases a few months ago about their
joint agreement. I don't know how well their joint agreement is
working, but I suspect there will be more such agreements between network
operators in the future. As the sheer number of people involved grows,
everyone is going to filter their calls, e-mail, etc. If you aren't on
the list, you get dumped in the "take care of after hell freezes" pile.

In the meantime, keep a stack of business cards and a special rolodex,
with the magic names and telephone numbers that get you directly to
someone who can understand (and maybe even fix) the problem. Interestingly
enough, the people usually don't change; but the employers do.
--
Sean Donelan, Data Research Associates, Inc, St. Louis, MO
Affiliation given for identification not representation
Re: outages, quality monitoring, trouble tickets, etc
......... Sean Donelan is rumored to have said:
]
] >From: Scott Huddle <huddle@mci.net>
] > I consider this list a place for ISPs to discuss general policy and
] > planning issues that affect all of us. It is a very inappropriate
] > place to discuss problems with a specific provider.

Scott, if a certain provider *cough you_know_who* is causing our
connectivity to go to hell through their lame non aggregation
policies (now fixed) then where else would the issue be discussed?

] As a general policy I wish all network providers would implement
] at a minimum a network notification list. Bonus points for a network
] status WWW page, NetNews, FAX, and PR Newswire distribution.

Hmm, I wonder if the Trib' would be interested in knowing when the
DS3 from Pennsauken is down.....

] Who cares if you connect to three (plus one) NAPs. In the post-NSFNET
] era, Internet-wide reliability requires Internet-wide information.
] Reporting network reliability problems is more important than how
] many NAPs your network connects.

At a cursory glance I would agree with you. However, let us
delve into this a bit deeper. Donning my idiot hat may I point out
that the _most_ important thing is network reliability -Period-.

While I agree w/ you that accountability is important I can't
agree that a simple outage list is very terribly useful. With all
due respect to the Sprint folx, their lists are often vague and
noninformative. As of late the MCI tickets have been more and
more coming and less and less useful.

Back to accountability, (with kindest respect) Sean, you haven't a
lamb's foot to stand on when Barrnet isn't looking into a problem. In
my opinion this is an issue that needs to be brought to fruition.
Either develop some world policy of Internet Connectivity or
perhaps we should all realize that the only person we can hold
accountable is Mr. Upstream. There are a few ways I can think of
this working, in (my opinion of the) order of their potential for
success:

o Sprint, MCI, ANS jointly fund an Inet trouble tracking NOC

o The Federal Government's FCC encompasses USA's Inet
Traffic as a medium

o NSPs voluntarily subscribe to a policy of notification of
problems to a global mail list.

Your page at DRA is quite good, however the consensus among
upper management (not just at our site) is "Why should other
people know when we're broke?". And the sad thing is, I am
tempted to agree with them.

If you call our NOC and you ask about a connectivity issue, you
will get a straight answer. Perhaps not from the first person you
get, maybe not the second, but my people will escalate it until
you do.

The fact that we don't advertise this does not detract from the
quality of the information, only from the convenience.

] >Has anybody else noticed how hard it is to get trouble tickets these
] >days? Once upon a time, I just called the NSF NOC, and got a report to
] >them in real time, so the problem could be fixed quickly. Nowadays,
] >NOCs seem to want you to send email with 24 or 48 hour turnaround, or go
] >through 2 layers of service representatives. Pretty hard to send email
] >to them when their link is down, or go through "regular" support in the
] >middle of the night!

I don't know how all the other NSPs work, but if there is ever an
issue wrt connectivity or systems we HAVE a trouble ticket and we
WILL provide it on request. With kindest respect, I understand
your desire to get it "on demand" but with a bit more work you can
get it from our NOC.

] Welcome to the new and improved Internet. More clueless people call
] NOCs these days (is it plugged in?)

You can't imagine how humorous this is.... :) I truly feel sorry
for the poor chaps at INSC....

] so more caller screening is done.

To a point, but if the person on our end of the phone doesn't know
the answer, they aren't allowed to leave it at that; they escalate
the issue until it's resolved. "I don't know" has to be followed by
a promise here.

Is this not common?

] NOC-to-NOC communication has been a long standing Internet problem.

Hmm, I'm not sure I would terribly agree. When MCI or Sprint has
a problem, we have not had any latency issue getting to them.
Likewise w/ wacky issues causing us to get with Sura, Barr, Cerf,
Westnet, etc...

] no common conventions. Even though it's out of date, I still keep my
] Internet Manager's Phonebook published by BBN in 1990.

Sounds like a good market... :)

] In the meantime, keep a stack of business cards and a special rolodex,
] with the magic names and telephone numbers that get you directly to
] someone who can understand (and maybe even fix) the problem. Interesting
] enough, the people usually don't change; but the employers do.
^
You have openings? ;-)

Enter the "Backbone Cabal". I can call you when I need to know
what's up w/ DRA. If apropos, you shoot me to the less clueful
person. You call us, ditto. I've got the same folx at ANS,
Sprint, MCI, etc.. That's why we're important, we know who can
do what and occasionally how to find them.

Do you really want outage and downtime on public record, or do you
want easier access to clueful folx?

-alan
Re: outages, quality monitoring, trouble tickets, etc
> Do you really want outage and downtime on public record, or do you
> want easier access to clueful folx?

A clueful expert system running on a data base collecting information
from all over the place, and answering questions automatically, would
be a good start. Like an automatic responder at something like this FCC
NRC someone mentioned earlier. No need to send me random outage email,
until I perceive a problem. I get enough email even without that. I
care about fixing problems. The problem is that there is no working
procedure if a problem is being perceived, that results in information
to a user in near real-time. The underlying issue is no overall
Internet management, or at least coordination, at this time.

In a computer network we *do* have the technology to do such things,
you know. It requires willingness to do it more than it needs
technology at this point in time.
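
A minimal sketch of what such an automatic responder might look like,
assuming providers submit structured outage reports to a shared
database that can then be queried on demand. Every name here
(OutageReport, OutageDatabase, answer, the "affected" labels, and the
integer timestamps) is made up for illustration; no such service or
format is implied to exist.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OutageReport:
    provider: str        # who submitted the report (hypothetical label)
    affected: str        # free-text label for the affected site or region
    start: int           # start time, minutes since an arbitrary epoch
    end: Optional[int]   # None while the outage is still open
    summary: str         # short human-readable explanation


class OutageDatabase:
    """In-memory store of reports collected from many providers."""

    def __init__(self) -> None:
        self._reports: List[OutageReport] = []

    def submit(self, report: OutageReport) -> None:
        self._reports.append(report)

    def query(self, affected: str, at: int) -> List[OutageReport]:
        """Return reports covering `affected` that were active at time `at`."""
        return [r for r in self._reports
                if r.affected == affected
                and r.start <= at
                and (r.end is None or at <= r.end)]


def answer(db: OutageDatabase, affected: str, at: int) -> str:
    """Answer only when asked, so nobody gets unsolicited outage email."""
    hits = db.query(affected, at)
    if not hits:
        return f"No known outage affecting {affected} at that time."
    lines = []
    for r in hits:
        state = "ongoing" if r.end is None else f"ended at t={r.end}"
        lines.append(f"{r.provider}: {r.summary} (started t={r.start}, {state})")
    return "\n".join(lines)
```

The point of the query-on-demand design is the one made above: the
user asks only after perceiving a problem, and gets an answer in near
real time without a human in the loop.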
Re: outages, quality monitoring, trouble tickets, etc
From: alan@gi.NET (Alan Hannan)
> Hmm, I wonder if the Trib' would be interested in knowing when the
> DS3 from Pennsauken is down.....

Sometimes the Trib would be interested, most of the time they wouldn't.
Who cares if a central office in Hinsdale, IL burns down? Some people
thought it was front page news.

> At a cursory glance I would agree with you. However, let us
> delve into this a bit deeper. Donning my idiot hat may I point out
> that the _most_ important thing is network reliability -Period-.

That would be great, please give me the name of a network provider
which provides perfect network reliability.

In the absence of perfection, please tell me what went wrong when I
can't get an expected level of usability out of the network. You can
greatly reduce your customers' stress levels simply by keeping them
informed. Give me the TCP/IP equivalent of "*beep* *boop* *BEEP* We're
sorry your TCP/IP connection can not be completed due to an earthquake
(software glitch, route table overload, nuclear detonation) in the area.
Please hang up and try your call later."

With an accurate RA database, and a little magic, the route servers
could redirect connections to an intercept message. That should
send a shiver up the spine of your network security folks.

It would be nice if the problem is also fixed quickly, but I realize
that is asking for a lot. In the mean time, keep the customer informed.
As the size of the Internet has grown, keeping the customer informed
is a bigger job. Relying on a 1-800 number doesn't work when a large
NSP's backbone melts down, and all the NSP's customers call the NOC at
the same time.

> Your page at DRA is quite good, however the consensus among
> upper management (not just at our site) is "Why should other
> people know when we're broke?". And the sad thing is, I am
> tempted to agree with them.

Thanks for the compliment. I would point out to your upper management
that other people already know when your network is broken. If they
didn't notice, it wouldn't be a problem. If a network falls over in the
woods, and there is no one to hear it, does it make a sound?

Tell your upper managers, the only time people don't know when your
network is broken is when your network is irrelevant to their work. I
don't know about you, but if I was managing an irrelevant network, I
would be working on my resume. Maybe that's why so many people in
this business keep switching employers? :-)

> Do you really want outage and downtime on public record, or do you
> want easier access to clueful folx?

As a network user (operator, manager):

- Ideally I want a useable network.
- When I can't use the network, I want an explanation.
- I want the problem fixed so I can use the network again.

How you meet those needs, I don't care. If you fix the problem
before I'm affected by it, then I don't care about the intermediate
steps either (tree falling in the forest). If I can get the explanation
from an automated server, then I don't have to bother your clueless or
clueful folk. If your clueless folk can take a log and get it resolved,
I don't have to bother your clueful folk.
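
The tiers described above could be sketched roughly as follows: try
the cheapest source of an explanation first and move up only when it
has nothing to say. The handler names, the example problems, and the
canned answers are hypothetical placeholders, not a description of any
real NOC's procedure.

```python
from typing import Callable, List, Optional, Tuple

# A handler returns an explanation, or None if it cannot help.
Handler = Callable[[str], Optional[str]]


def resolve(problem: str, tiers: List[Tuple[str, Handler]]) -> Tuple[str, str]:
    """Walk the tiers in order; report which tier produced the explanation."""
    for name, handler in tiers:
        explanation = handler(problem)
        if explanation is not None:
            return name, explanation
    # Falling off the end means someone has to start calling people directly.
    return "process failure", "no tier could explain the problem"


def automated_server(problem: str) -> Optional[str]:
    known = {"route flap": "known issue, fix in progress"}
    return known.get(problem)


def first_line_staff(problem: str) -> Optional[str]:
    if problem == "link down":
        return "ticket logged and passed to engineering"
    return None


TIERS: List[Tuple[str, Handler]] = [
    ("automated server", automated_server),
    ("first-line staff", first_line_staff),
]
```

For example, resolve("route flap", TIERS) is answered by the automated
server and never reaches a person, while a problem nobody recognizes
falls through to the "process failure" case.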

When I need to dig out my magic cache of business cards and start
e-mailing/calling the secret members of the "backbone cabal" to get
a problem fixed, I consider it a failure of the process. That is
what I meant when I said NOC-to-NOC communications has been a long-term
Internet problem. It relies on personal contacts, rather than a
reliable process. I try very hard not to handle problems by directly
contacting people I know at other NOCs. I prefer handling problems through
the normal NOC channels. If you do happen to receive a phone call
directly from me about a network problem, something has gone very wrong.
--
Sean Donelan, Data Research Associates, Inc, St. Louis, MO
Affiliation given for identification not representation
Re: outages, quality monitoring, trouble tickets, etc
With all due respect, Alan wrote:
...
> While I agree w/ you that accountability is important I can't
> agree that a simple outage list is very terribly useful. With all
> due respect to the Sprint folx, their lists are often vague and
> noninformative. As of late the MCI tickets have been more and

Sorry, SprintLink has 255/255 in my book.

For instance, long before Gordon Cook sent his flamebait nanog-way,
SPRINT had already identified a power failure at sl-stk-9 and its
one hour downtime.

On at least three occasions that I can recall, Sean has sent
out innage (non-outage ;) notes indicating _in_detail_ what he was
going to install on what routers, where it had been tried, and which
functionality he was expecting to get out of this. Some of the details
are the stuff that Cisco has yet to implement in a production release.

On at least four other occasions, all in the last month, Sean, Elliot
Alby, Peter Lothberg, and one other SPRINT individual [or consultant]
posted details as to an outage, ETR, details, up-to-the-minute stuff,
and a resolution.

While they had their problems before, and will doubtless have them
again, right now SprintLink's backbone has a good trouble-reporting
mechanism.

> Back to accountability, (with kindest respect) Sean, you haven't a
> lambs foot to stand when Barrnet isn't looking into a problem. In

I have had a ticket open with Barrnet since 07/95 regarding simple packet
loss in the Santa Cruz area. When Barrnet [oops, sorry, "call us BBN Planet"]
closes that 4-month-old ticket, talk.

> Do you really want outage and downtime on public record, or do you
> want easier access to clueful folx?

I'll take both, thanks very much. Frankly I'll give up on having
immediate access to clueful folx. We're all f'busy. I'll take access
to folks-clueful-enough-to-fix-it without my having to educate them.
So far, the folks at the SL INSC (Diana, Muhammed, Pat, etc.) have
done fine by me.

Ehud
gavron@Hearts.ACES.COM
p.s. You might be inclined to think "My, how easy it is to impress
him. SL must have really bought him lunch." Well let me tell you --
it is easy to impress me -- with professionalism, quality, competence, and
attention to detail. SL has those.
Re: outages, quality monitoring, trouble tickets, etc
......... Ehud Gavron is rumored to have said:
] > due respect to the Sprint folx, their lists are often vague and
] > noninformative.
]
] Sorry, SprintLink has 255/255 in my book.
]
] On at least three occasions that I can recall, Sean has sent
] out innage (non-outage ;) notes indicating _in_detail_ what he was

Erm, Sean != Sprint. (That's arguable, you know ;) I'm not here
to downgrade anyone, Jove knows my folks have been less than 100%
at times. However, I gain little of substance from the notes.
So we're different. Sure, they're nice, and a bit informative,
but they don't help me fix connectivity problems. They don't
really even help me explain things to my customers. (Don't take
me off SL-outage! :)

] While they had their problems before, and will doubtless have them
] again, right now SprintLink's backbone has a good trouble-reporting
] mechanism.

Agreed, but I direct back to the central issue, that being HOW
does this trouble reporting mechanism improve the quality of the
overall Internet connectivity?

] > Back to accountability, (with kindest respect) Sean, you haven't a
] > lambs foot to stand when Barrnet isn't looking into a problem. In
]
] I have a ticket open with barrnet since 07/95 as to simple loss in
] the Santa Cruz area. When Barrnet [oops, sorry, "call us BBN Planet"]
] closes that 4-month-old ticket, talk.

Why should they talk to you? Do you pay them a service fee?
That's my base issue, there is a hierarchy, and you can't skip
rope to the other guy. It just doesn't work, there's nothing in
the system to encourage it.

[ access to information or do you....]
] > want easier access to clueful folx?
]
] I'll take both, thanks very much.

I'm not sure where my responsibility to you lies. You are another
person on this wacky Internet we've created. Why should I
allocate 60 hours of my staff time to design an integrated web
reporting mechanism for you?

Sean Donelan has a terribly good point, he's my customer, and his
words mean a lot, but I can't agree w/ him that he should/could
demand the same thing from another ex-NSFnet regional, or from
Sprint. I certainly see no reason why I should do this work for
you.

MFS and the RA folx do it because their customers demand it.
Along the way they provide the information to y'all.

So we end up in a socialist system where maybe I'm demanded to
provide this for my customers, and maybe along the way I'll
provide it to the Inet community, but there's no motivation for me
to do it globally. Someone ought to convince me that knowing where
someone else's problem exists makes it easier for me to fix problems of
mine own.

There seems to be this large obsession with linking information to
action. If you get an update you think something's happening.
Perhaps it's needed, but stuff will happen whether your hand is
held or not.

Brash in my Thanksgiving Vegetarianism,

-alan
Re: outages, quality monitoring, trouble tickets, etc
Alan wrote:
[I wrote:]
>] I have a ticket open with barrnet since 07/95 as to simple loss in
>] the Santa Cruz area. When Barrnet [oops, sorry, "call us BBN Planet"]
>] closes that 4-month-old ticket, talk.

> Why should they talk to you? Do you pay them a service fee?

Their customer has instructed Barrnet NOC and staff to treat me as
a consultant/employee of the customer authorized to speak for them.
(4 month open ticket. Problem duplicated at will. Large packet loss.
inexcusable.)

...
> I certainly see no reason why I should do this work for you.

Fair enough... Do no work for other than your customers. This isn't
Atlas Shrugged. If I need something from you I'll get one of your
customers to sign off on it, or I'm just another leech.

> Brash in my Thanksgiving Vegetarianism,

> -alan


Ehud

--
Ehud Gavron (EG76)
gavron@Hearts.ACES.COM
Re: outages, quality monitoring, trouble tickets, etc
Sorry, another long one.

Relevance to NANOG, well, NOCs are customers too. And "remote" NOCs
often report problems that affect your paying customers too. Customer
service should be of interest to operations folks, at least to the extent
the problems are getting reported to the right people to fix.

I doubt I can change anyone's mind that providing explanations to
customers and non-customers when the network has problems is good for
business. In the future I will simply recommend to customers to buy
services from NSPs which do provide explanations when their networks
fail. Since I haven't found a perfect network yet, I suspect it
includes everyone on this list.

> Why should they talk to you? Do you pay them a service fee?
> That's my base issue, there is a hierarchy, and you can't skip
> rope to the other guy. It just doesn't work, there's nothing in
> the system to encourage it.

The hierarchy is dead. None of the old NSFnet regionals has a
monopoly on service in their regions any more. Outside of the US,
there are still a few monopoly providers, but they are a rare breed. If
you aren't providing the level of service I need, I'll go to someone who
can. If XYZ's NOC gives me better service than ABC's NOC, I'll
recommend XYZ to my customers.

> Sean Donelan has a terribly good point, he's my customer, and his
> words mean a lot, but I can't agree w/ him that he should/could
> demand the same thing from another ex-NSFnet regional, or from
> Sprint. I certainly see no reason why I should do this work for
> you.

Because it is in their self-interest? You are correct I can't make
anyone run their network how I would like it run, not even MIDNET (GI).

But I can point out that long-term problems and a code of silence are
costing such providers money, and have already cost them customers. For
example, I really wish my direct providers would stop munging BGP
announcements, or explain why they are doing it. If I have made a mistake,
I would like to fix it. Otherwise I will come to the conclusion those
providers' NOCs are not up to the job and find a different provider that
can do the job.

When someone (anyone) reports a problem affecting connectivity with
your network, more than likely the reverse is also true for your paying
customers. DRA has a bunch of customers connected through just about
every major NSP in North America and a couple of other continents. The
only time "I" call another NSP is when the process has become totally
FUBARed. When I call another NSP, it is usually that NSP's last chance
to keep a paying customer on their network.

I might call BARRNET because the University of California-Davis has
reported problems reaching DRA to DRA's help desk, and the problem hasn't
been resolved. No, BARRNET doesn't *have* to talk to me. And I will
report the same back to the customer. However, I suspect it is in
BARRNET's self-interest to work with me in resolving the problem
to ensure UC-Davis has end-to-end reliability.

I track network reliability by dollars (not packet loss, not latency).
I measure network providers, good and bad, by how many of our customers
have used their own dollars to buy private lines to St. Louis because
they couldn't get the reliability they needed from the network provider.
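
To make the metric concrete, here is a small sketch of the tally,
assuming each customer record notes the provider they used to rely on
and whether they ended up buying a private line. The record layout and
the provider labels are invented for the example and do not reflect
DRA's actual data.

```python
from collections import Counter
from typing import Dict, Iterable


def private_line_defections(customers: Iterable[Dict]) -> Counter:
    """Count, per former provider, the customers who bought a private line."""
    return Counter(c["former_provider"]
                   for c in customers
                   if c["bought_private_line"])


# Hypothetical records: provider labels are placeholders, not real networks.
EXAMPLE_CUSTOMERS = [
    {"name": "library-1", "former_provider": "NSP-A", "bought_private_line": True},
    {"name": "library-2", "former_provider": "NSP-A", "bought_private_line": True},
    {"name": "library-3", "former_provider": "NSP-B", "bought_private_line": False},
    {"name": "library-4", "former_provider": "NSP-C", "bought_private_line": True},
]
```

A higher count means more customers paid out of their own budget to
route around that provider, which is the sense in which reliability is
being measured in dollars rather than in packet loss or latency.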

It is not a pretty picture. <http://dranet.dra.com/dranet.html> has
a picture of where our private line customers are located. If you are an
NSP, every one of those green boxes (some boxes represent many paying
customers) is an arrow through the heart of your (former) customers' view
of your network reliability. If you are an NSP in one of those areas,
not dealing with these problems or providing coherent explanations has
cost you cold, hard cash.

Money is something I expect most upper managers to understand. DRA makes
its profits elsewhere. DRAnet is simply a vertical market VPN used to sell
access to other things. I'm happy to use the Internet and other NSPs to
provide that VPN, when the quality exists. On the other hand, if I have
to manage a not-so-virtual VPN with private lines to achieve the required
level of quality, I do.

Maybe Adam Smith's invisible hand will correct this eventually.

> There seems to be this large obsession with linking information to
> action. If you get an update you think something's happening.
> Perhaps it's needed, but stuff will happen whether your hand is
> held or not.

As I said before: Ideally I want a reliable network. If you can't
provide a perfectly reliable network I want an explanation when I
can't get through. And I want the problem fixed. The better the
explanation, the longer I'm willing to give you to fix the problem. If
I get no explanation, I expect the problem to already be fixed.

The current situation is the customer gets neither the explanation nor
action solving the problem. My proof is the DRAnet map. DRA's customers
take a very, very long time to budget money. Those green boxes represent
customers whose problems went unanswered, and unsolved for a long time
before they gave up on their NSP and expended their own dollars for a
private line to St. Louis.

Since the technicians seem to be having a very difficult time fixing
the network, I thought upper management could meet my other goal. Give
the customer an explanation. I'm not pointing fingers at any particular
NSP, because frankly I don't have enough fingers to point. Everyone had
problems. Yes, even DRA's NOC has fallen down a few times. I'm not asking
for perfection, but an explanation when things don't work, while you
fix the problem.

The Internet is a global cooperative network. If people don't cooperate,
the global nature of the network fails. Since your customers may in fact
want to use the Internet to communicate globally, problems affect customers
globally. When I go to the US Post Office, sometimes there is a sign on
the wall that postal service to Timbuktu may be delayed because Timbuktu's
main post office was blown up. I have no idea how many postal customers
in Olivette, Missouri send mail to Timbuktu. Even though the US Postal
Service has no control over rebuilding Timbuktu's main post office, the US
Postal Service has discovered it is good customer service to inform their
customers why their mail to Timbuktu may be delayed.

Can't NSPs provide their customers an explanation at least as well as
the US Post Office?
--
Sean Donelan, Data Research Associates, Inc, St. Louis, MO
Affiliation given for identification not representation
Re: outages, quality monitoring, trouble tickets, etc
......... Sean Donelan is rumored to have said:

] Customer service should be of interest to operations folks, at least
] to the extent the problems are getting reported to the right people to fix.

It certainly is here.

] I doubt I can change anyone's mind that providing explanations to
] customers and non-customers when the network has problems is good for
] business.

I agree with you that it is important.

] In the future I will simply recommend to customers to buy
] services from NSPs which do provide explanations when their networks
] fail. Since I haven't found a perfect network yet, I suspect it
] includes everyone on this list.

alan> rope to the other guy. It just doesn't work, there's nothing in
alan> the system to encourage it.

What I mean by saying this is NOT that I don't think a per-NSP
trouble reporting mechanism is a good idea. What I'm saying is
that within our Internet arrangement today, I don't see that it's
terribly capitalistically useful for NSP-A to advertise internal
problems to NSP-B. There is no doubt in my mind that it IS
terribly useful for NSP-A to advertise internal problems to
NSP-A's customers, as well as to NSP-B if they inquire on behalf
of NSP-B's customers wrt an outage internal to NSP-A. You're
right the migration of customers is a good metric, but it's hard
to quantify that migration wrt trouble reporting to management.

A friend at MFS brings up a good point, that being that the COREN
agreement stipulated a trouble reporting list.

Perhaps we could work to develop a scalable model of such for
world wide Internet use, or adapt that to this.

Any other suggestions?

] If you aren't providing the level of service I need, I'll go to someone who
] can. If XYZ's NOC gives me better service than ABC's NOC, I'll
] recommend XYZ to my customers.

Adam Smith's rules _will_ follow us into the Internet. Agreed.

] > Sprint. I certainly see no reason why I should do this work for
] > you.
]
] Because it is in their self-interest? You are correct I can't make
] anyone run their network how I would like it run, not even MIDNET (GI).
]
] But I can point out long-term problems and code of silence is costing such
] providers money, and has already cost them customers.

It's not a code of silence. That's my point, that being that
historically when we are asked about problems we give darn good
answers. That we don't directly advertise problem attention or
resolution is not correlative to our response to requests.

1. Should we provide darned good answers?
   - YES

2. Should we provide automated Darned Good Answers to our customers?
   - YES, it would be nice but not a NEED, rather a nifty service (IMHO)

3. Should we provide automated Darned Good Answers to other NSPs?
   - YES, it would be nice but not a NEED, rather a nifty service and
     lower priority than #2.

] I might call BARRNET because the University of California-Davis has
] reported problems reaching DRA to DRA's help desk, and the problem hasn't
] been resolved. No, BARRNET doesn't *have* to talk to me. And I will
] report the same back to the customer. However, I suspect it is in
] BARRNET's self-interest to work with me in resolving the problem
] to ensure UC-Davis has end-to-end reliability.

I agree it is too. However, when I hear people complaining about bad
NOCs, I think it is important to point out that there is no
mechanism in place to hold those other NSPs accountable as the
person complaining is rarely the customer of the NSP. Yes it's in
our long term interest, but that doesn't mean there's something in
place to encourage it other than honest intention.

] I track network reliability by dollars (not packet loss, not latency).
] I measure network providers, good and bad, by how many of our customers
] have used their own dollars to buy private lines to St. Louis because
] they couldn't get the reliability they needed from the network provider.

Ouch.

] As I said before: Ideally I want a reliable network. If you can't
] provide a perfectly reliable network I want an explanation when I
] can't get through. And I want the problem fixed. The better the
] explanation, the longer I'm willing to give you to fix the problem. If
] I get no explanation, I expect the problem to already be fixed.

This is a good point, and I have been more convinced that it is
important.

Because of this discussion I am going to work to develop an
automated WWW status page.

] The current situation is the customer gets neither the explanation nor
] action solving the problem.

I appreciate that NSP response is not always ideal. However, I
would encourage all people who get a less than exceptional
response from a NOC technician to escalate the question so as to
improve the NOC quality. No, this isn't something you should have
to do, and it's not something that makes anyone terribly proud but
it does tend to improve the service by natural tech selection.

] Since the technicians seem to be having a very difficult time fixing
] the network, I thought upper management could meet my other goal. Give
] the customer an explanation.

This is done when they ask, and due to your and others' concern, I
am going to work to develop an automated web page showing down
time problems.

] The Internet is a global cooperative network. If people don't cooperate,
] the global nature of the network fails.

Agreed.

] Can't NSPs provide their customers an explanation at least as well as
] the US Post Office?

Yes, it's possible, and due to this discussion, I am going to work
to build one as nice as FedEx's.... Anyone want to volunteer
joint development? :)

-alan
Re: outages, quality monitoring, trouble tickets, etc [ In reply to ]
On Sat, 25 Nov 1995, Alan Hannan wrote:

> Should we provide automated Darned Good Answers to our customers?
> - YES, it would be nice
> but not a NEED,
> rather a nifty
> service (IMHO)

Automated answers would be great...but what about implementation? "Press
1 for an automated status report...<click>" Keeping customer service
staff well-informed (perhaps via an internal automated system) might be a
better solution.

> Should we provide automated Darned Good Answers to other NSPs?
> - YES, it would be nice
> but not a NEED,
> rather a nifty
> service and
> lower priority
> than #2.

I'm afraid I have to disagree...in a network of the level of complexity
of today's Internet (in fact, in any system where communication between
two points is dependent on more than just an "upstream" entity),
connectivity issues are MORE likely to be caused by interaction with
other NSP's. Dissemination of problem information between providers
helps everyone diagnose difficulties and keep their customers better
informed with respect to current status and predictions for the near
future (solutions).

A mailing list for this purpose seems like overkill...if dozens of NSP's
were to be informed every time JoeNet has a problem, even if their
service were not to be affected, the noise overload would reduce the
informative value of the list, as well as provider attention to it. But
how to determine when a problem is important enough to be distributed?

A more interactive shared system (ticket-based?) makes more sense, but
may prove far more difficult to design. Problem classification, impact,
severity, and location are all issues here, as well as the problem of
associating such a record of a problem with its effects. That is, when
a provider "discovers" a problem, how are they to know if it has already
been "registered", and if so, how to reference the information associated
with it?
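
To make the "already registered?" question concrete, here is a minimal sketch of a shared trouble record with a naive duplicate check. Everything in it is an assumption for illustration: the field names, the TroubleRegistry class, and the idea of matching on provider, location, and classification are hypothetical, and real matching would be much harder, which is exactly the difficulty raised above.

```python
from dataclasses import dataclass, field


@dataclass
class TroubleRecord:
    """One shared problem record (all field names are hypothetical)."""
    provider: str
    location: str
    classification: str
    severity: int
    impact: str
    reports: list = field(default_factory=list)

    def key(self):
        # Naive identity: same provider, place, and problem class.
        return (self.provider.lower(), self.location.lower(),
                self.classification.lower())


class TroubleRegistry:
    """In-memory stand-in for a shared ticket system."""

    def __init__(self):
        self._records = {}

    def register(self, record, reporter):
        # Reuse an existing record if one matches, otherwise store this one.
        existing = self._records.setdefault(record.key(), record)
        existing.reports.append(reporter)
        return existing


registry = TroubleRegistry()
first = registry.register(
    TroubleRecord("JoeNet", "Chicago", "link-down", 2, "regional"), "NSP-A")
second = registry.register(
    TroubleRecord("joenet", "chicago", "link-down", 2, "regional"), "NSP-B")
print(second is first, first.reports)
```

The second report attaches to the first record instead of opening a new one, so both providers end up referencing the same information.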

[need for explanations]
> This is a good point, and I have been more convinced that it is
> important.
>
> Because of this discussion I am going to work to develop an
> automated WWW status page.

Good response, but how sound is the choice of implementation? If there
is a problem with your network, there is no small chance that those most
interested in acquiring this information would not be able to reach your
server to do so.

> ] The current situation is the customer gets neither the explanation nor
> ] action solving the problem.
> I appreciate that NSP response is not always ideal. However, I
> would encourage all people who get a less than exceptional
> response from a NOC technician to escalate the question so as to
> improve the NOC quality. No, this isn't something you should have
> to do, and it's not something that makes anyone terribly proud but
> it does tend to improve the service by natural tech selection.

I hate to say it, but what may be needed here is standardization. NOC
operating procedure varies greatly between providers, and the proper
escalation, etc. of a problem may not be clear.

// Matt Zimmerman Chief of System Management NetRail, Inc.
// Work..........mdz@netrail.net | Play...gemini@alcor.netrail.net
// (703) 524-4800 [voice] (703) 524-4802 [data] (703) 534-5033 [fax]
Re: outages, quality monitoring, trouble tickets, etc [ In reply to ]
On Thu, 23 Nov 1995, Hans-Werner Braun wrote:

> In a computer network we *do* have the technology to do such things,
> you know. It requires willingness to do it more than it needs
> technology at this point of time.

I think the response being generated (if nowhere else) on this list shows
that there IS a willingness to implement such a system. The "best"
design for a global trouble database is unclear at best; a lot of issues
re: data collection, information distribution, etc. will have to be resolved.

I, for one, am willing to devote whatever time and computing resources
are required of me to support a project like this, and I think most other
providers share my position. Being better informed is an advantage to
everyone.

// Matt Zimmerman Chief of System Management NetRail, Inc.
// Work..........mdz@netrail.net | Play...gemini@alcor.netrail.net
// (703) 524-4800 [voice] (703) 524-4802 [data] (703) 534-5033 [fax]
Re: outages, quality monitoring, trouble tickets, etc [ In reply to ]
On Sat, 25 Nov 1995, Matt Zimmerman wrote:

> connectivity issues are MORE likely to be caused by interaction with
> other NSP's. Dissemination of problem information between providers
> helps everyone diagnose difficulties and keep their customers better
> informed with respect to current status and predictions for the near
> future (solutions).

Agreed, but it has to be done in an "easy" manner. I'm sure that several
of the NSPs have concerns as to what this information will be used
for. Everyone likes to portray the image of having a 99.98%
uptime whenever possible, even though most folks realize that it just
plain isn't possible, at least today. This sort of leads into the
question of the various NOCs integration with whatever central repository of
information we are shooting to provide. When provider X opens a ticket,
will it automatically be reflected in the 'central' database? I doubt
folks will go for that based on security alone. Or how about provider
X's NOC staff fire off an Email to incident-report@outages.com? How will
they be trained or reimbursed for their time spent on this service?

[..facts about how useless mailing lists are removed..]
> A more interactive shared system (ticket-based?) makes more sense, but
> may prove far more difficult to design. Problem classification, impact,
> severity, and location are all issues here, as well as the problem of
> associating such a record of a problem with its effects. That is, when
> a provider "discovers" a problem, how are they to know if it has already
> been "registered", and if so, how to reference the information associated
> with it?

Such an idea is already being discussed in several smoke filled rooms. :)
Remedy/ARS has the ability to accept input for incident reports and
queries to its database via an Email form. One could write a Web page
containing the necessary parameters in a form, and then transpose that to an
Email sent to the AR system. Implementing such a system is really based
around cost issues, as the coding is relatively trivial. (CGIs come to mind)
(I used the above example because it's something we've done in the past
and I know works, there are probably others)
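
A rough sketch of the form-to-Email step, using only the Python standard library. It does not reproduce the actual Remedy/ARS Email template; the form field names, the sender address, and the form_to_email helper are hypothetical, and the recipient is simply the example address mentioned earlier in this thread.

```python
from email.message import EmailMessage


def form_to_email(form, sender, recipient):
    """Turn submitted form fields into a plain-text incident Email."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Incident report: {form['provider']} (severity {form['severity']})"
    # One "name: value" line per field, sorted so the layout is predictable.
    body = "\n".join(f"{name}: {value}" for name, value in sorted(form.items()))
    msg.set_content(body)
    return msg


# Hypothetical values a Web form might submit.
form = {
    "provider": "JoeNet",
    "severity": "2",
    "location": "Chicago",
    "description": "Loss of connectivity to peers",
}
msg = form_to_email(form, "noc@example.net", "incident-report@outages.com")
print(msg)
```

Handing the message to a mail server (for example with smtplib) is left out, since that part depends on the local setup.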

On the issue of connectivity -- agreed; some lonely site should not
be allowed to be the only host. However -- if connectivity between
certain NSPs also falls apart, you're equally screwed. Some sort of
distribution of the "centralized" source of information would be needed.

I foresee the most difficult part of the process being (a) convincing all of
the associated Operations groups to share their outage information.
(b) Providing a simple mechanism for either the customer service or operations
staff to disseminate outage information to the "server" would be equally
challenging. If step (a) were overcome, I would assume that writing a
procedure to fit (b) would follow.


-jh-
Re: outages, quality monitoring, trouble tickets, etc [ In reply to ]
>>>>> "Jonathan" == Jonathan Heiliger <loco@mfst.com> writes:

Jonathan> Everyone likes to portray
Jonathan> the image of having a 99.98% uptime whenever
Jonathan> possible, even though most folks realize
Jonathan> that it just plain isn't possible

Well, more importantly, what on earth does a number like
that mean?

Sean.
Re: outages, quality monitoring, trouble tickets, etc [ In reply to ]
In article <951123045818.17481@SDG.DRA.COM> Sean Donelan <SEAN@SDG.DRA.COM> writes:

> Welcome to the new and improved Internet. More clueless people call
> NOCs these days (is it plugged in?) so more caller screening is done.
> Likewise there are more clueless people working in NOCs so more levels
> before reaching someone who even understands what the problem is.

I've had an idea in the back of my head for a long time about dealing
with this, ever since I tried to call an 800 customer assistance
number and got a nice message saying "the current average wait is 45
minutes"...

I guess it could apply equally well to NOCs, so here it is. Provide a
"secret" layer 2 NOC. When someone calls the NOC 800 number who is
obviously more clueful than the NOC operator/screening person, pass
them onto the next level of experts, and also give them a new "secret"
800 number. Then in the future, that caller will skip the first layer
of screeners who handle the "is-it-plugged-in" problems.

Of course, there is no reason to only have two levels. Callers with
proven technical competence (over time) automatically get shunted to
help at the right level for their know-how. Periodically, you
probably want to recycle the 800- numbers down to the lowest level so
that word of the secret numbers doesn't get too widespread.

I'd sure like to see someone implement this (or something like it). A
few companies we have worked with have had individual techies give us
their direct number and say "from now on, just call me direct", but
this is rare and probably would be a bad idea for a NOC.

--Jamshid
Re: outages, quality monitoring, trouble tickets, etc [ In reply to ]
On Nov 27, 9:17am, Sean Doran wrote:
> Jonathan> Everyone likes to portray
> Jonathan> the image of having a 99.98% uptime whenever
> Jonathan> possible, even though most folks realize
> Jonathan> that it just plain isn't possible
>
> Well, more importantly, what on earth does a number like
> that mean?

Sorry, bad choice of words. Rather than uptime, availability would be the
proper word. Availability tends to be the amount of time the network is
"available" for the customer to receive their expected service (whether
guaranteed in writing or not), and during which the customer's expectation of
how the service will perform when it is considered "in-service" is met.


-jh-
Re: outages, quality monitoring, trouble tickets, etc [ In reply to ]
Jonathan Heiliger <loco@mfst.com> writes
> On Nov 27, 9:17am, Sean Doran wrote:
> > Jonathan> Everyone likes to portray
> > Jonathan> the image of having a 99.98% uptime whenever
> > Jonathan> possible, even though most folks realize
> > Jonathan> that it just plain isn't possible
> >
> > Well, more importantly, what on earth does a number like
> > that mean?
>
> Sorry, bad choice of words. Rather than uptime, availability would be the
> proper word. Availability tends to be the amount of time the network is
> "available" for the customer to receive their expected service (whether
> guaranteed in writing or not), and for the customers expectation of how the
> service will perform when it is considered "in-service" is met.

This is a reasonable expectation, but unfortunately it's pretty difficult
to measure what "the network" is. Do you mean your NSP's backbone? Its
connections to other NSPs? Connections to a specific site? Global
connectivity? How do you factor in cases where you or a target site is
singly connected?

-scott
Re: outages, quality monitoring, trouble tickets, etc [ In reply to ]
Being in the web hosting business, we measure our own "availability"
and that of others. The top providers do 99.9%. The median
in our sample group is 98.5%. That's about 15 times worse and
11 hours/month.
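
The arithmetic behind those figures can be checked with a few lines. This is only a sketch and assumes a 30-day month (720 hours); the function name is made up for illustration.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month (720 hours)


def downtime_hours(availability_pct, hours=HOURS_PER_MONTH):
    """Hours of unavailability implied by an availability percentage."""
    return (100.0 - availability_pct) / 100.0 * hours


top = downtime_hours(99.9)     # top providers
median = downtime_hours(98.5)  # median of the sample group
ratio = median / top           # how many times worse the median is
print(f"top: {top:.2f} h, median: {median:.2f} h, ratio: {ratio:.0f}x")
```

Under that assumption 98.5% works out to roughly 10.8 hours of downtime a month against about 0.72 hours for 99.9%, which is where the "15 times worse" and "11 hours/month" come from.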

It's amazing how many of the companies in the low 98s claim 99.9%.

We also offer guarantees to some of our customers. If we don't meet
x% availability, we refund $xx. It's not enough to break us, but
it does reassure the customer that we are concerned.

If more providers did this, we would probably see much more
rapid progress towards more reliable networks. As it is, nobody
has a quantifiable cost for "unreliability".

> > Jonathan> the image of having a 99.98% uptime whenever
> > Jonathan> possible, even though most folks realize
Re: outages, quality monitoring, trouble tickets, etc [ In reply to ]
> > Jonathan> Everyone likes to portray
> > Jonathan> the image of having a 99.98% uptime whenever
> > Jonathan> possible, even though most folks realize
> > Jonathan> that it just plain isn't possible
> >
> > Well, more importantly, what on earth does a number like
> > that mean?
>
> Sorry, bad choice of words. Rather than uptime, availability would be the
> proper word. Availability tends to be the amount of time the network is
> "available" for the customer to receive their expected service (whether
> guaranteed in writing or not), and for the customers expectation of how the
> service will perform when it is considered "in-service" is met.

heh.

Depending on how the service organization views customer expectations, there
is a little too much room for differences in perceived uptime using that
definition.

Dave

--
Dave Siegel President, RTD Systems & Networking, Inc.
(520)623-9663 Network Engineer -- Regional/National NSPs (Cisco)
dsiegel@rtd.com User Tracking & Acctg -- "Written by an ISP,
http://www.rtd.com/~dsiegel/ for an ISP."
Re: outages, quality monitoring, trouble tickets, etc [ In reply to ]
available *to* whom? the customer themself? other customers of the same
local provider? other customers of the same regional provider? of one of
their local/regional peers? to my friend Serge in Odessa Ukraine?

randy
Re: outages, quality monitoring, trouble tickets, etc [ In reply to ]
On Nov 27, 7:11pm, Jeff.Ogden@um.cc.umich.edu wrote:
> It is because of these sorts of questions that I never know what
> it means when someone says that their network is available 99.9%
> of the time. Hell at least part of MichNet is available 100% of
> the time, but that never seems to cut it with the folks who don't
> have service right now.

I hate to say it, but the point in my Email about carriers not disclosing
network outages is that the motive could potentially be to maintain the guise
of a better than real-life network availability.

If we want to enter the tangent discussion that seems inevitable at this point,
perhaps we should at least change the subject. :-)


-jh-
Re: outages, quality monitoring, trouble tickets, etc [ In reply to ]
> service that due to various problems is available only 50% of the
> time. So what is the availability that I should report to potential
> new customers? 100% 50% or 75% Does it matter that the reason the
> one customer has only 50% availability is that the room where the
> router at that customer's site is underwater every other day due to
> no fault of my own?
>
> It is because of these sorts of questions that I never know what
> it means when someone says that their network is available 99.9%

Problems with CPE are usually not figured into network uptime.

Now, if it was *your* router at the customer prem, then it would be your
responsibility that you allowed your equipment to, uh, become waterlogged. ;-)

Dave

--
Dave Siegel President, RTD Systems & Networking, Inc.
(520)623-9663 Network Engineer -- Regional/National NSPs (Cisco)
dsiegel@rtd.com User Tracking & Acctg -- "Written by an ISP,
http://www.rtd.com/~dsiegel/ for an ISP."
Re: outages, quality monitoring, trouble tickets, etc [ In reply to ]
OK. I am a small network service provider and I have two customers.
One has service that is available 100% of the time. The other has
service that due to various problems is available only 50% of the
time (not sure why they stay with me, but I am happy to take their
money). So what is the availability that I should report to potential
new customers? 100% 50% or 75% Does it matter that the reason the
one customer has only 50% availability is that the room where the
router at that customer's site is underwater every other day due to
no fault of my own?
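
The same two-customer case, worked three ways. This is just a sketch to show that all three numbers in the question are defensible depending on the aggregation chosen; the customer labels and the summarize helper are hypothetical.

```python
from statistics import mean

# Per-customer availability in percent, from the two-customer example.
customers = {"customer_a": 100.0, "customer_b": 50.0}


def summarize(avail):
    """Return best-case, worst-case, and unweighted mean availability."""
    values = list(avail.values())
    return {"best": max(values), "worst": min(values), "mean": mean(values)}


summary = summarize(customers)
print(summary)
```

Which figure gets quoted is a policy choice, not a measurement, and a mean weighted by traffic or revenue would give yet another answer.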

It is because of these sorts of questions that I never know what
it means when someone says that their network is available 99.9%
of the time. Hell at least part of MichNet is available 100% of
the time, but that never seems to cut it with the folks who don't
have service right now.

-Jeff Ogden
Merit/MichNet
Re: outages, quality monitoring, trouble tickets, etc [ In reply to ]
On Mon, 27 Nov 1995, Dave Siegel wrote:

> Now, if it was *your* router at the customer prem, then it would be your
> responsibility that you allowed your equipment to, uh, become waterlogged. ;-)

I don't know. One of the most frequent problems I see is power outages at
the sites. I don't think any ISP can be responsible for things like that.

How about a tornado knocking out much of a building including your CPE? How
should such things be counted?

<this did actually happen, btw>

-dorian
______________________________________________________________________________
Dorian Kim Email: dorian@cic.net 2901 Hubbard Drive
Network Engineer Phone: (313)998-6976 Ann Arbor MI 48105
CICNet Network Systems Fax: (313)998-6105 http://www.cic.net/~dorian
Re: outages, quality monitoring, trouble tickets, etc [ In reply to ]
> > Now, if it was *your* router at the customer prem, then it would be your
> > responsibility that you allowed your equipment to, uh, become waterlogged. ;-)
>
> I don't know. One of the most frequent problems I see is power outages at
> the sites. I don't think any ISP can be responsible for things like that.

Fixable. UPS + AC/DC Power generators.

> How about a tornado knocking out much of a building including your CPE? How
> should such things be counted?

Um, you ought to know better than to put your damn equipment in a building
that's going to get destroyed by a tornado, I guess. ;-)

Seriously, though. "Acts of God" as they are called are typically not
counted either. When it comes right down to it, if the customer no longer
has premises, then they are probably less worried about their equipment
being up and operational. Technically, it all depends on whether you define
your network as extending up to, and perhaps beyond, CPE, or ending just
before CPE, or even further back, right at the Bell demarc at your POP.

Since RTD purchases all the circuits for clients, our network (read, area of
responsibility) includes everything up to the CPE, but not CPE itself. If
there are circuits down to individual clients as a result of Bell, we don't
count it as downtime, even though it is logged and pursued as any other outage.
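
How much the choice of boundary moves the number can be sketched as follows. The cause labels, the outage hours, and the 720-hour month are all made-up illustrative values, not measurements from RTD or anyone else.

```python
# Hypothetical causes a provider might exclude from its own downtime figure.
EXCLUDED = {"cpe", "act_of_god", "telco_circuit"}


def availability_pct(outages, period_hours, excluded=EXCLUDED):
    """Availability over a period, ignoring outages with excluded causes."""
    counted = sum(hours for cause, hours in outages if cause not in excluded)
    return 100.0 * (period_hours - counted) / period_hours


# (cause, hours) pairs for one invented 720-hour month.
outages = [("backbone", 2.0), ("telco_circuit", 6.0), ("cpe", 4.0)]

reported = availability_pct(outages, 720)                   # provider's view
experienced = availability_pct(outages, 720, excluded=set())  # customer's view
print(f"reported: {reported:.2f}%  experienced: {experienced:.2f}%")
```

The same month reads as about 99.7% from the provider's side and about 98.3% from the customer's side, which is the gap the earlier posts are arguing about.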

As for connectivity to the rest of the 'net, it's *really* subjective. We
sort of consider 50%+ of estimated Internet connectivity "up," even though
we probably shouldn't. Internet connectivity uptime can be increased simply
by adding more links to NSP's, so you can't just throw responsibility for that
entirely onto your provider.

> <this did actually happen, btw>

Why do I believe that? hehe.

Dave

--
Dave Siegel President, RTD Systems & Networking, Inc.
(520)623-9663 Network Engineer -- Regional/National NSPs (Cisco)
dsiegel@rtd.com User Tracking & Acctg -- "Written by an ISP,
http://www.rtd.com/~dsiegel/ for an ISP."
