Mailing List Archive

rampant disk failure
I have recently seen a surge in the number of disk failures on an F760
cluster in our datacenter. We have four F760s purchased a few months
apart; one cluster has seen 12 disks fail in the past three months while
the other cluster has seen two disks fail (I can only actually remember
one, but I'm allowing for my failing memory as well).

The clusters sit only a few feet apart so I am discounting
environmental problems. Both clusters are running NetApp Release
6.1R1P1. The "bad" cluster is using mostly Seagate ST318203FC 18 GB
disks, while the "good" cluster is mostly the Seagate ST118202FC drives
(the infamous spin-up problem disks). The good cluster is unbalanced;
one head has 52 disks and the other has 32. The bad cluster is evenly
balanced (or it was before we started losing disks en masse) with 42 on
each side. Both clusters are running disk firmware NA10 for the
ST318203FC disks (I know, I just discovered that it's one rev out of
date) and NA27 for the ST118202FC disks.
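
As a rough sanity check on these figures, the short Python sketch below
turns the failure counts into an annualized loss rate per cluster. It
assumes both counts cover the same three-month window (only the bad
cluster's window is actually stated) and uses the per-head disk counts
quoted above; the function name annualized_loss_rate is made up for
illustration.

```python
# Rough annualized disk-loss rate per cluster, using the figures quoted above.
# Assumption: both failure counts cover the same three-month window.

def annualized_loss_rate(failures, disks, months):
    """Return the fraction of the disk population lost per year at this pace."""
    return (failures / months) * 12 / disks

# Disk counts per head, as quoted in the message.
bad_disks = 42 + 42    # evenly balanced cluster
good_disks = 52 + 32   # unbalanced cluster

bad_rate = annualized_loss_rate(failures=12, disks=bad_disks, months=3)
good_rate = annualized_loss_rate(failures=2, disks=good_disks, months=3)

print(f"bad cluster:  {bad_rate:.1%} of disks per year")
print(f"good cluster: {good_rate:.1%} of disks per year")
print(f"ratio: {bad_rate / good_rate:.1f}x")
```

Both clusters hold the same number of disks, so the difference in rate
comes entirely from the failure counts and not from population size.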

The strangest part of this whole situation is that the disks rarely
fail; they disappear and the partner complains that there is a cluster
mismatch, breaks clustering, and sends out an email. The filer with the
missing disk starts to rebuild (if it was a data disk) or merrily goes
on its way (if it was a spare), but nothing ever shows up as broken.
The disk just disappears.

Short of going through and replacing every piece of hardware in the
"bad" filers, I am at a loss as to how to proceed. I've spent the morning
searching NOW without luck. [Someone just pointed out to me that we
have a few X221_ST318304FC disks with NA06 firmware in several of our
filers, not just the good and bad clusters I've been describing, opening
us up to bug 27068 (we're trying to schedule downtime to upgrade the
firmware on all our filers now).] I am going to try upgrading the disk
firmware on the filers as a first step, but if anyone else has seen this
problem, or something similar, I would appreciate any input.

Geoff Hardin
geoff.hardin@dalsemi.com
If it's glowing, don't eat it...
Re: rampant disk failure
This sounds more like a general Fibre Channel error, perhaps
from a bad LRC, cable, or card. You should open a ticket
with Network Appliance, or at the very least boot into maintenance
mode and run some of the detailed Fibre Channel tests from
the 1-5 menu.


On Wed, 24 Jul 2002, Geoff Hardin wrote:

| I have recently seen a surge in the number of disk failures on an F760
| cluster in our datacenter. We have four F760s purchased a few months
| apart; one cluster has seen 12 disks fail in the past three months while
| the other cluster has seen two disks fail (I can only actually remember
| one, but I'm allowing for my failing memory as well).
|
| The clusters sit only a few feet apart so I am discounting
| environmental problems. Both clusters are running NetApp Release
| 6.1R1P1. The "bad" cluster is using mostly Seagate ST318203FC 18 GB
| disks, while the "good" cluster is mostly the Seagate ST118202FC drives
| (the infamous spin-up problem disks). The good cluster is unbalanced;
| one head has 52 disks and the other has 32. The bad cluster is evenly
| balanced (or it was before we started losing disks en masse) with 42 on
| each side. Both clusters are running disk firmware NA10 for the
| ST318203FC disks (I know, I just discovered that it's one rev out of
| date) and NA27 for the ST118202FC disks.
|
| The strangest part of this whole situation is that the disks rarely
| fail; they disappear and the partner complains that there is a cluster
| mismatch, breaks clustering, and sends out an email. The filer with the
| missing disk starts to rebuild (if it was a data disk) or merrily goes
| on its way (if it was a spare), but nothing ever shows up as broken.
| The disk just disappears.
|
| Short of going through and replacing every piece of hardware in the
| "bad" filers, I am at a loss as to how to proceed. I've spent the morning
| searching NOW without luck. [Someone just pointed out to me that we
| have a few X221_ST318304FC disks with NA06 firmware in several of our
| filers, not just the good and bad clusters I've been describing, opening
| us up to bug 27068 (we're trying to schedule downtime to upgrade the
| firmware on all our filers now).] I am going to try upgrading the disk
| firmware on the filers as a first step, but if anyone else has seen this
| problem, or something similar, I would appreciate any input.
|
| Geoff Hardin
| geoff.hardin@dalsemi.com
| If it's glowing, don't eat it...
|