Mailing List Archive: Random scsi disk disappearing

Random scsi disk disappearing

Aug 17, 2006, 3:55 AM

Post #1 of 9 (897 views)

An old problem, very annoying.

From time to time, an scsi disk just disappears from
the bus, without any [error] messages whatsoever.
The only relevant stuff in dmesg is logging from md
(softraid) layer, about "error updating superblock"
and later "giving up and removing the disk from the
array" - not even error number.

When I try to access such a disk (/dev/sdX device),
I got "No such device or address" error back.

It's still listed in /sys/block and /proc/scsi/scsi,
but any access to the device gives this error.

But the disk is here, I know it is. Deleting it from
kernel:

echo y > /sys/block/sdX/device/delete

and adding it back:

echo scsi add-single-device x y z > /proc/scsi/scsi

works just fine, linux finds "new" scsi device and it
happily works again.

This happens on alot of different machines, with different
disk drives (ok, most of them are from Seagate, but not
all). I can't say for sure that it happens on different
scsi controllers - at least majority of them are adaptecs,
using aic7xxx or aix79xx driver.

I suspected the disks are too hot - nope, according to
smartctrl, the themp is far from bad (typically about
25..35 Celsius, and the themperature is not changing much).
Bad cables, bad power supply, bad anything else? Not sure
either, at least I can't guess more: the machines are
really different, some has good, under-loaded power supplies
(and server chassis/motherboards/allthestuff) some has less
good ones - makes no difference. And the thing is - having
in mind really sporadic disappearing, not depending on current
load, time of day (eg, during nights, there's no one on site
so no one to touch cables etc), ... Well, I just can't think
of any reason, at all.

But one thing bothers me most: there's NO LOGGING from scsi
layer. None, zero, not at all.

Has anyone else seen something similar? Any pointers on how
to debug the issue?

Thanks.

/mjt
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: Random scsi disk disappearing [ In reply to ]

Aug 17, 2006, 4:35 AM

Post #2 of 9 (890 views)

On Thu, Aug 17, 2006 at 02:55:58PM +0400, Michael Tokarev wrote:
> From time to time, an scsi disk just disappears from
> the bus, without any [error] messages whatsoever.
> The only relevant stuff in dmesg is logging from md
> (softraid) layer, about "error updating superblock"
> and later "giving up and removing the disk from the
> array" - not even error number.
>
> When I try to access such a disk (/dev/sdX device),
> I got "No such device or address" error back.
>
> It's still listed in /sys/block and /proc/scsi/scsi,
> but any access to the device gives this error.
>
> But the disk is here, I know it is. Deleting it from
> kernel:
>
> echo y > /sys/block/sdX/device/delete
>
> and adding it back:
>
> echo scsi add-single-device x y z > /proc/scsi/scsi
>
> works just fine, linux finds "new" scsi device and it
> happily works again.
>
> This happens on alot of different machines, with different
> disk drives (ok, most of them are from Seagate, but not
> all). I can't say for sure that it happens on different
> scsi controllers - at least majority of them are adaptecs,
> using aic7xxx or aix79xx driver.
>
> I suspected the disks are too hot - nope, according to
> smartctrl, the themp is far from bad (typically about
> 25..35 Celsius, and the themperature is not changing much).
> Bad cables, bad power supply, bad anything else? Not sure
> either, at least I can't guess more: the machines are
> really different, some has good, under-loaded power supplies
> (and server chassis/motherboards/allthestuff) some has less
> good ones - makes no difference. And the thing is - having
> in mind really sporadic disappearing, not depending on current
> load, time of day (eg, during nights, there's no one on site
> so no one to touch cables etc), ... Well, I just can't think
> of any reason, at all.
>
> But one thing bothers me most: there's NO LOGGING from scsi
> layer. None, zero, not at all.
>
> Has anyone else seen something similar? Any pointers on how
> to debug the issue?

I'd recommend turning on scsi logging; it might give you a clue about
which bit of scanning is failing to work properly.

Try booting with scsi_mod.scsi_logging_level = 448 (I think I have that
number right; 7 shifted left by 6) and then you can compare failing and
non-failing runs and see if there's any difference.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: Random scsi disk disappearing [ In reply to ]

Aug 17, 2006, 4:43 AM

Post #3 of 9 (889 views)

Matthew Wilcox wrote:
> On Thu, Aug 17, 2006 at 02:55:58PM +0400, Michael Tokarev wrote:
[sporadic disk disappearing, no logging]
>
> I'd recommend turning on scsi logging; it might give you a clue about
> which bit of scanning is failing to work properly.
>
> Try booting with scsi_mod.scsi_logging_level = 448 (I think I have that
> number right; 7 shifted left by 6) and then you can compare failing and
> non-failing runs and see if there's any difference.

It should be the same as
echo $((7<<6)) > /sys/module/scsi_mod/parameters/scsi_logging_level
(which indeed is 448) at runtime, right? (And yes, CONFIG_SCSI_LOGGING
is set to y).

Heh oh those magic numbers!.. ;)

Ok, I've turned on the logging on a bunch of machines (using the sysfs
method), let's see what will happen next. Thank you!

By the way, should kernel pefrorm at least *some* "minimal" logging of
such a serious events by default? Well ok, ok, it's not known yet what
the event really is, so I'm shutting up now, at least for a while.. ;)

/mjt
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: Random scsi disk disappearing [ In reply to ]

Aug 17, 2006, 4:53 AM

Post #4 of 9 (881 views)

On Thu, Aug 17, 2006 at 03:43:55PM +0400, Michael Tokarev wrote:
> Matthew Wilcox wrote:
> > On Thu, Aug 17, 2006 at 02:55:58PM +0400, Michael Tokarev wrote:
> [sporadic disk disappearing, no logging]
> >
> > I'd recommend turning on scsi logging; it might give you a clue about
> > which bit of scanning is failing to work properly.
> >
> > Try booting with scsi_mod.scsi_logging_level = 448 (I think I have that
> > number right; 7 shifted left by 6) and then you can compare failing and
> > non-failing runs and see if there's any difference.
>
> It should be the same as
> echo $((7<<6)) > /sys/module/scsi_mod/parameters/scsi_logging_level
> (which indeed is 448) at runtime, right? (And yes, CONFIG_SCSI_LOGGING
> is set to y).

That's right.

> Heh oh those magic numbers!.. ;)

Yeah, but the alternative is an in-kernel named symbol parser ... which
we have in some drivers, but boy is it ugly.

> Ok, I've turned on the logging on a bunch of machines (using the sysfs
> method), let's see what will happen next. Thank you!
>
> By the way, should kernel pefrorm at least *some* "minimal" logging of
> such a serious events by default? Well ok, ok, it's not known yet what
> the event really is, so I'm shutting up now, at least for a while.. ;)

That's the problem -- if it turns out the event is a reasonable thing to
happen for some devices, we'll annoy everyone with those devices. It's
hard to please everybody ;-)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: Random scsi disk disappearing [ In reply to ]

jengelh at linux01

Aug 17, 2006, 6:41 AM

Post #5 of 9 (885 views)

>> It should be the same as
>> echo $((7<<6)) > /sys/module/scsi_mod/parameters/scsi_logging_level
>> (which indeed is 448) at runtime, right? (And yes, CONFIG_SCSI_LOGGING
>> is set to y).
>
>> Heh oh those magic numbers!.. ;)
>
>> By the way, should kernel pefrorm at least *some* "minimal" logging of
>> such a serious events by default? Well ok, ok, it's not known yet what
>> the event really is, so I'm shutting up now, at least for a while.. ;)
>
>That's the problem -- if it turns out the event is a reasonable thing to
>happen for some devices, we'll annoy everyone with those devices. It's
>hard to please everybody ;-)

Since 7<<6 seems to indicate a flag, it would be best to have some sysfs
variable that you can flip using 0 and 1.

Jan Engelhardt
--
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: Random scsi disk disappearing [ In reply to ]

Aug 17, 2006, 9:48 AM

Post #6 of 9 (876 views)

On 17.08.2006 15:41 Jan Engelhardt <jengelh@linux01.gwdg.de> wrote:

> >> It should be the same as
> >> echo $((7<<6)) >
/sys/module/scsi_mod/parameters/scsi_logging_level
> >> (which indeed is 448) at runtime, right? (And yes,
CONFIG_SCSI_LOGGING
> >> is set to y).
> >
> >> Heh oh those magic numbers!.. ;)
> >

...

> Since 7<<6 seems to indicate a flag, it would be best to have some sysfs

> variable that you can flip using 0 and 1.

It's not a flag. This one sets loglevel 7 for SCSI_LOG_SCAN.
So all SCSI_LOG_SCAN messages might show up.
The loglevel can also be set using sysctl dev.scsi.logging_level.

Anyone interested in a script to conveniently interpret or change the
SCSI logging level? Such a script (scsi_logging_level) exists in the
s390-tools package (version 1.5.3).

If others show interest for this script, maybe a better place can be
found than s390-tools (because it is not really s390-specific).

Regards,

Andreas
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: Random scsi disk disappearing [ In reply to ]

stefanr at s5r6

Aug 17, 2006, 3:33 PM

Post #7 of 9 (884 views)

Andreas Herrmann wrote:
> Anyone interested in a script to conveniently interpret or change the
> SCSI logging level? Such a script (scsi_logging_level) exists in the
> s390-tools package (version 1.5.3).

That would be very welcome.

> If others show interest for this script, maybe a better place can be
> found than s390-tools (because it is not really s390-specific).

It could be put into linux/Documentation/scsi/. People who are
confronted with a debugging problem probably look into Documentation/.
Also, scripts which demonstrate usage of certain kernel interfaces do
count as valuable documentation.
--
Stefan Richter
-=====-=-==- =--- =--=-
http://arcgraph.de/sr/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: Random scsi disk disappearing [ In reply to ]

Aug 18, 2006, 2:11 AM

Post #8 of 9 (859 views)

On 18.08.2006 00:33 Stefan Richter <stefanr@s5r6.in-berlin.de> wrote:
> Andreas Herrmann wrote:
> > Anyone interested in a script to conveniently interpret or change the
> > SCSI logging level? Such a script (scsi_logging_level) exists in the
> > s390-tools package (version 1.5.3).

> That would be very welcome.

See script below. To set SCSI_LOG_SCAN as discussed in this thread you
can use:

# scsi_logging_level -s --scan 7
New scsi logging level:
dev.scsi.logging_level = 448
SCSI_LOG_ERROR=0
SCSI_LOG_TIMEOUT=0
SCSI_LOG_SCAN=7
SCSI_LOG_MLQUEUE=0
SCSI_LOG_MLCOMPLETE=0
SCSI_LOG_LLQUEUE=0
SCSI_LOG_LLCOMPLETE=0
SCSI_LOG_HLQUEUE=0
SCSI_LOG_HLCOMPLETE=0
SCSI_LOG_IOCTL=0

> > If others show interest for this script, maybe a better place can be
> > found than s390-tools (because it is not really s390-specific).

> It could be put into linux/Documentation/scsi/. People who are
> confronted with a debugging problem probably look into Documentation/.
> Also, scripts which demonstrate usage of certain kernel interfaces do
> count as valuable documentation.

I am not sure whehter this script should be added to
Documentation/scsi. I think it would be better to just document the
SCSI logging feature at all in Documentation/scsi and put the script
somewhere else.

Maybe Douglas Gilbert has a suggestion where such a script will best
fit in? He knows best which packages with scripts and utilities for
storage and SCSI are available.

Regards,

Andreas

--
#! /bin/bash
###############################################################################
# Conveniently create and set scsi logging level, show SCSI_LOG fields in human
# readable form.
#
# Copyright (C) IBM Corp. 2006
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or (at
# your option) any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
# 02110-1301, USA.
###############################################################################

SCRIPTNAME="scsi_logging_level"

declare -i LOG_ERROR=0
declare -i LOG_TIMEOUT=0
declare -i LOG_SCAN=0
declare -i LOG_MLQUEUE=0
declare -i LOG_MLCOMPLETE=0
declare -i LOG_LLQUEUE=0
declare -i LOG_LLCOMPLETE=0
declare -i LOG_HLQUEUE=0
declare -i LOG_HLCOMPLETE=0
declare -i LOG_IOCTL=0

declare -i LEVEL=0

_ERROR_SHIFT=0
_TIMEOUT_SHIFT=3
_SCAN_SHIFT=6
_MLQUEUE_SHIFT=9
_MLCOMPLETE_SHIFT=12
_LLQUEUE_SHIFT=15
_LLCOMPLETE_SHIFT=18
_HLQUEUE_SHIFT=21
_HLCOMPLETE_SHIFT=24
_IOCTL_SHIFT=27

SET=0
GET=0
CREATE=0

OPTS=`getopt -o hvcgsa:E:T:S:I:M:L:H: --long \
help,version,create,get,set,all:,error:,timeout:,scan:,ioctl:,\
midlevel:,mlqueue:,mlcomplete:,lowlevel:,llqueue:,llcomplete:,\
highlevel:,hlqueue:,hlcomplete: -n \'$SCRIPTNAME\' -- "$@"`
eval set -- "$OPTS"

# print version info
printversion()
{
cat <<EOF
$SCRIPTNAME (s390-tools) %S390_TOOLS_VERSION%
(C) Copyright IBM Corp. 2006
EOF
}

# print usage and help
printhelp()
{
cat <<EOF
Usage: $SCRIPTNAME [OPTIONS]

Create, get or set scsi logging level.

Options:

-h, --help print this help
-v, --version print version information
-s, --set create and set logging level as specified on
command line
-g, --get get current logging level and display it
-c, --create create logging level as specified on command line
-a, --all specify value for all SCSI_LOG fields
-E, --error specify SCSI_LOG_ERROR
-T, --timeout specify SCSI_LOG_TIMEOUT
-S, --scan specify SCSI_LOG_SCAN
-M, --midlevel specify SCSI_LOG_MLQUEUE and SCSI_LOG_MLCOMPLETE
--mlqueue specify SCSI_LOG_MLQUEUE
--mlcomplete specify SCSI_LOG_MLCOMPLETE
-L, --lowlevel specify SCSI_LOG_LLQUEUE and SCSI_LOG_LLCOMPLETE
--llqueue specify SCSI_LOG_LLQUEUE
--llcomplete specify SCSI_LOG_LLCOMPLETE
-H, --highlevel specify SCSI_LOG_HLQUEUE and SCSI_LOG_HLCOMPLETE
--hlqueue specify SCSI_LOG_HLQUEUE
--hlcomplete specify SCSI_LOG_HLCOMPLETE
-I, --ioctl specify SCSI_LOG_IOCTL

Exactly one of the options "-c", "-g" and "-s" has to be specified.
Valid values for SCSI_LOG fields are integers from 0 to 7.

Note: Several SCSI_LOG fields can be specified using several options.
When multiple options specify same SCSI_LOG field the most specific
option has precedence.

Example: "scsi_logging_level --hlqueue 3 --hlcomplete 2 --all 1 -s" sets
SCSI_LOG_HLQUEUE=3, SCSI_LOG_HLCOMPLETE=2 and assigns all other SCSI_LOG
fields the value 1.
EOF
}

check_level()
{
if [ `echo -n $1 | tr --complement [:digit:] 'a' | grep -s 'a'` ]
then
invalid_cmdline "log level '$1' out of range [0, 7]"
fi

if [ $1 -lt 0 -o $1 -gt 7 ]
then
invalid_cmdline "log level '$1' out of range [0, 7]"
fi
}

# check cmd line arguments
check_cmdline()
{
while true ; do
case "$1" in
-a|--all) _ALL=$2; check_level $2
shift 2;;
-c|--create) CREATE=1;
shift 1;;
-g|--get) GET=1
shift 1;;
-h|--help) printhelp
exit 0;;
-s|--set) SET=1
shift 1;;
-v|--version) printversion
exit 0;;
-E|--error) _ERROR=$2; check_level $2
shift 2;;
-T|--timeout) _TIMEOUT=$2; check_level $2
shift 2;;
-S|--scan) _SCAN=$2; check_level $2
shift 2;;
-M|--midlevel) _ML=$2; check_level $2
shift 2;;
--mlqueue) _MLQUEUE=$2; check_level $2
shift 2;;
--mlcomplete) _MLCOMPLETE=$2; check_level $2
shift 2;;
-L|--lowlevel) _LL=$2; check_level $2
shift 2;;
--llqueue) _LLQUEUE=$2; check_level $2
shift 2;;
--llcomplete) _LLCOMPLETE=$2; check_level $2
shift 2;;
-H|--highlevel) _HL=$2; check_level $2
shift 2;;
--hlqueue) _HLQUEUE=$2; check_level $2
shift 2;;
--hlcomplete) _HLCOMPLETE=$2; check_level $2
shift 2;;
-I|--ioctl) _IOCTL=$2; check_level $2
shift 2;;
--) shift; break;;
*) echo "Internal error!" ; exit 1;;
esac
done

if [ -n "$*" ]
then
invalid_cmdline invalid parameter $*
fi

if [ $GET = "1" -a $SET = "1" ]
then
invalid_cmdline options \'-c\', \'-g\' and \'-s\' are mutual exclusive
elif [ $GET = "1" -a $CREATE = "1" ]
then
invalid_cmdline options \'-c\', \'-g\' and \'-s\' are mutual exclusive
elif [ $SET = "1" -a $CREATE = "1" ]
then
invalid_cmdline options \'-c\', \'-g\' and \'-s\' are mutual exclusive
fi

LOG_ERROR=${_ERROR:-${_ALL:-0}}
LOG_TIMEOUT=${_TIMEOUT:-${_ALL:-0}}
LOG_SCAN=${_SCAN:-${_ALL:-0}}
LOG_MLQUEUE=${_MLQUEUE:-${_ML:-${_ALL:-0}}}
LOG_MLCOMPLETE=${_MLCOMPLETE:-${_ML:-${_ALL:-0}}}
LOG_LLQUEUE=${_LLQUEUE:-${_LL:-${_ALL:-0}}}
LOG_LLCOMPLETE=${_LLCOMPLETE:-${_LL:-${_ALL:-0}}}
LOG_HLQUEUE=${_HLQUEUE:-${_HL:-${_ALL:-0}}}
LOG_HLCOMPLETE=${_HLCOMPLETE:-${_HL:-${_ALL:-0}}}
LOG_IOCTL=${_IOCTL:-${_ALL:-0}}
}

invalid_cmdline()
{
echo "$SCRIPTNAME: $*"
echo "$SCRIPTNAME: Try '$SCRIPTNAME --help' for more information."
exit 1
}

get_logging_level()
{
echo "Current scsi logging level:"
LEVEL=`sysctl -n dev.scsi.logging_level`
if [ $? != 0 ]
then
echo "$SCRIPTNAME: could not read scsi logging level" \
"(kernel probably without SCSI_LOGGING support)"
exit 1
fi
}

show_logging_level()
{
echo "dev.scsi.logging_level = $LEVEL"

LOG_ERROR=$((($LEVEL>>$_ERROR_SHIFT) & 7))
LOG_TIMEOUT=$((($LEVEL>>$_TIMEOUT_SHIFT) & 7))
LOG_SCAN=$((($LEVEL>>$_SCAN_SHIFT) & 7))
LOG_MLQUEUE=$((($LEVEL>>$_MLQUEUE_SHIFT) & 7))
LOG_MLCOMPLETE=$((($LEVEL>>$_MLCOMPLETE_SHIFT) & 7))
LOG_LLQUEUE=$((($LEVEL>>$_LLQUEUE_SHIFT) & 7))
LOG_LLCOMPLETE=$((($LEVEL>>$_LLCOMPLETE_SHIFT) & 7))
LOG_HLQUEUE=$((($LEVEL>>$_HLQUEUE_SHIFT) & 7))
LOG_HLCOMPLETE=$((($LEVEL>>$_HLCOMPLETE_SHIFT) & 7))
LOG_IOCTL=$((($LEVEL>>$_IOCTL_SHIFT) & 7))

echo "SCSI_LOG_ERROR=$LOG_ERROR"
echo "SCSI_LOG_TIMEOUT=$LOG_TIMEOUT"
echo "SCSI_LOG_SCAN=$LOG_SCAN"
echo "SCSI_LOG_MLQUEUE=$LOG_MLQUEUE"
echo "SCSI_LOG_MLCOMPLETE=$LOG_MLCOMPLETE"
echo "SCSI_LOG_LLQUEUE=$LOG_LLQUEUE"
echo "SCSI_LOG_LLCOMPLETE=$LOG_LLCOMPLETE"
echo "SCSI_LOG_HLQUEUE=$LOG_HLQUEUE"
echo "SCSI_LOG_HLCOMPLETE=$LOG_HLCOMPLETE"
echo "SCSI_LOG_IOCTL=$LOG_IOCTL"
}

set_logging_level()
{
echo "New scsi logging level:"
sysctl -q -w dev.scsi.logging_level=$LEVEL
if [ $? != 0 ]
then
echo "$SCRIPTNAME: could not write scsi logging level" \
"(kernel probably without SCSI_LOGGING support)"
exit 1
fi
}

create_logging_level()
{
LEVEL=$((($LOG_ERROR & 7)<<$_ERROR_SHIFT))
LEVEL=$(($LEVEL|(($LOG_TIMEOUT & 7)<<$_TIMEOUT_SHIFT)))
LEVEL=$(($LEVEL|(($LOG_SCAN & 7)<<$_SCAN_SHIFT)))
LEVEL=$(($LEVEL|(($LOG_MLQUEUE & 7)<<$_MLQUEUE_SHIFT)))
LEVEL=$(($LEVEL|(($LOG_MLCOMPLETE & 7)<<$_MLCOMPLETE_SHIFT)))
LEVEL=$(($LEVEL|(($LOG_LLQUEUE & 7)<<$_LLQUEUE_SHIFT)))
LEVEL=$(($LEVEL|(($LOG_LLCOMPLETE & 7)<<$_LLCOMPLETE_SHIFT)))
LEVEL=$(($LEVEL|(($LOG_HLQUEUE & 7)<<$_HLQUEUE_SHIFT)))
LEVEL=$(($LEVEL|(($LOG_HLCOMPLETE & 7)<<$_HLCOMPLETE_SHIFT)))
LEVEL=$(($LEVEL|(($LOG_IOCTL & 7)<<$_IOCTL_SHIFT)))
}

check_cmdline $*

if [ $SET = "1" ]
then
create_logging_level
set_logging_level
show_logging_level
elif [ $GET = "1" ]
then
get_logging_level
show_logging_level
elif [ $CREATE = "1" ]
then
create_logging_level
show_logging_level
else
invalid_cmdline missing option \'-g\', \'-s\' or \'-c\'
fi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: Random scsi disk disappearing [ In reply to ]

Aug 30, 2006, 12:40 PM

Post #9 of 9 (860 views)

Michael Tokarev wrote:
> Matthew Wilcox wrote:
>> On Thu, Aug 17, 2006 at 02:55:58PM +0400, Michael Tokarev wrote:
> [sporadic disk disappearing, no logging]
>> I'd recommend turning on scsi logging; it might give you a clue about
>> which bit of scanning is failing to work properly.
>>
>> Try booting with scsi_mod.scsi_logging_level = 448 (I think I have that
>> number right; 7 shifted left by 6) and then you can compare failing and
>> non-failing runs and see if there's any difference.

Ok, yesterday it happened again. This machine is running 2.6.11 still
(leftover - I'm updating it to current 2.6.17 now).

The controller is, according to lspci:

0000:04:04.0 SCSI storage controller: Adaptec AIC-7902 U320 (rev 03)

Here's the logging. Too bad I don't understand most of this stuff ;)

Is it possible to say something from this or should I try different
log level or kernel version?

Thanks.

/mjt

16:25:35 SCSI error : <0 0 0 0> return code = 0x10000
16:25:36 end_request: I/O error, dev sda, sector 3003999
16:25:36 md: write_disk_sb failed for device sda2
16:25:36 SCSI error : <0 0 0 0> return code = 0x10000
16:25:36 end_request: I/O error, dev sda, sector 6263238
16:25:36 raid5: Disk failure on sda5, disabling device. Operation continuing on 3 devices
16:25:36 md: errors occurred during superblock update, repeating
16:25:36 SCSI error : <0 0 0 0> return code = 0x10000
16:25:36 end_request: I/O error, dev sda, sector 3003999
16:25:36 md: write_disk_sb failed for device sda2

.....repeated sequence of the above lines, with different sectors...

16:26:04 end_request: I/O error, dev sda, sector 3003999
16:26:04 scsi0:0:0:0: Attempting to abort cmd dbc10680: 0x2a 0x0 0x4 0x0 0x29 0x49md: write_disk_sb failed for device sda2
16:26:04 0x0 0x0 0x8 0x0

BTW, this should go in one line. The machine is SMP... ;)

16:26:04 scsi0: At time of recovery, card was not paused
16:26:04 >>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
16:26:04 scsi0: Dumping Card State at program address 0x0 Mode 0x22
16:26:04 Card was paused
16:26:04 HS_MAILBOX[0x0] INTCTL[0xc0]:(SWTMINTEN|SWTMINTMASK)
16:26:04 SEQINTSTAT[0x10]:(SEQ_SWTMRTO) SAVED_MODE[0x11] DFFSTAT[0x31]:(CURRFIFO_1|FIFO0FREE|FIFO1FREE)
16:26:04 SCSISIGI[0x0]:(P_DATAOUT) SCSIPHASE[0x0] SCSIBUS[0x0]
16:26:04 LASTPHASE[0x1]:(P_DATAOUT|P_BUSFREE) SCSISEQ0[0x0]
16:26:04 SCSISEQ1[0x12]:(ENAUTOATNP|ENRSELI) SEQCTL0[0x10]:(FASTMODE)
16:26:04 SEQINTCTL[0x0] SEQ_FLAGS[0xc0]:(NO_CDB_SENT|NOT_IDENTIFIED)
16:26:04 SEQ_FLAGS2[0x0] SSTAT0[0x0] SSTAT1[0x8]:(BUSFREE)
16:26:04 SSTAT2[0x0] SSTAT3[0x0] PERRDIAG[0x8]:(AIPERR) SIMODE1[0xa4]:(ENSCSIPERR|ENSCSIRST|ENSELTIMO)
16:26:04 LQISTAT0[0x0] LQISTAT1[0x0] LQISTAT2[0x0] LQOSTAT0[0x0]
16:26:04 LQOSTAT1[0x0] LQOSTAT2[0x1]:(LQOSTOP0)
16:26:04
16:26:04 SCB Count = 128 CMDS_PENDING = 1 LASTSCB 0x4f CURRSCB 0x4f NEXTSCB 0xff40
16:26:04 qinstart = 34129 qinfifonext = 34129
16:26:04 QINFIFO:
16:26:04 WAITING_TID_QUEUES:
16:26:04 Pending list:
16:26:04 41 FIFO_USE[0x0] SCB_CONTROL[0x60]:(TAG_ENB|DISCENB) SCB_SCSIID[0x7]
16:26:04 Total 1
16:26:04 Kernel Free SCB list: 126 78 79 12 127 10 93 113 30 96 49 37 73 33 52 95 103 27 72 106 71 18 61 85 83 110 20 87 94 75 115 8 51 99 45 22 3 92 50 86 125 28 48 15 122 63 107 80 114 70 36 23 104 26 2 16 68 42 21 64 118 7 81 67 59 88 117 34 24 82 105 25 31 84 4 76 58 91 66 11 121 102 14 1 97 57 44 120 29 65 39 100 89 74 6 124 32 35 54 17 69 19 111 55 47 46 5 53 77 0 112 109 9 119 60 101 108 13 38 40 98 62 56 116 43 123 90
16:26:04 Sequencer Complete DMA-inprog list:
16:26:04 Sequencer Complete list:
16:26:04 Sequencer DMA-Up and Complete list:
16:26:04
16:26:04 scsi0: FIFO0 Free, LONGJMP == 0x80ff, SCB 0x49
16:26:04 SEQIMODE[0x3f]:(ENCFG4TCMD|ENCFG4ICMD|ENCFG4TSTAT|ENCFG4ISTAT|ENCFG4DATA|ENSAVEPTRS)
16:26:04 SEQINTSRC[0x0] DFCNTRL[0x0] DFSTATUS[0x89]:(FIFOEMP|HDONE|PRELOAD_AVAIL)
16:26:04 SG_CACHE_SHADOW[0x2]:(LAST_SEG) SG_STATE[0x0] DFFSXFRCTL[0x0]
16:26:04 SOFFCNT[0x0] MDFFSTAT[0x5]:(FIFOFREE|DLZERO) SHADDR = 0x00, SHCNT = 0x0
16:26:04 HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]:(SG_CACHE_AVAIL)
16:26:04 scsi0: FIFO1 Free, LONGJMP == 0x8277, SCB 0x7e
16:26:04 SEQIMODE[0x3f]:(ENCFG4TCMD|ENCFG4ICMD|ENCFG4TSTAT|ENCFG4ISTAT|ENCFG4DATA|ENSAVEPTRS)
16:26:04 SEQINTSRC[0x0] DFCNTRL[0x4]:(DIRECTION) DFSTATUS[0x89]:(FIFOEMP|HDONE|PRELOAD_AVAIL)
16:26:04 SG_CACHE_SHADOW[0x2]:(LAST_SEG) SG_STATE[0x0] DFFSXFRCTL[0x0]
16:26:04 SOFFCNT[0x0] MDFFSTAT[0x5]:(FIFOFREE|DLZERO) SHADDR = 0x00, SHCNT = 0x0
16:26:04 HADDR = 0x00, HCNT = 0x0 CCSGCTL[0x10]:(SG_CACHE_AVAIL)
16:26:04 LQIN: 0x55 0x0 0x0 0x7e 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0 0x0
16:26:04 scsi0: LQISTATE = 0x0, LQOSTATE = 0x0, OPTIONMODE = 0x42
16:26:04 scsi0: OS_SPACE_CNT = 0x20 MAXCMDCNT = 0x1
16:26:04
16:26:04 SIMODE0[0xc]:(ENOVERRUN|ENIOERR)
16:26:04 CCSCBCTL[0x0]
16:26:04 scsi0: REG0 == 0x4f, SINDEX = 0x122, DINDEX = 0xe1
16:26:04 scsi0: SCBPTR == 0x7e, SCB_NEXT == 0xff80, SCB_NEXT2 == 0xff62
16:26:04 CDB 2a 0 0 80 8 c8
16:26:04 STACK: 0x125 0x125 0x125 0x25e 0x240 0x25e 0x29 0x15
16:26:04 <<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
16:26:04 DevQ(0:0:0): 0 waiting
16:26:04 DevQ(0:1:0): 0 waiting
16:26:04 DevQ(0:2:0): 0 waiting
16:26:04 DevQ(0:4:0): 0 waiting
16:26:04 (scsi0:A:0:0): Device is disconnected, re-queuing SCB
16:26:04 Recovery code sleeping
16:26:04 md: errors occurred during superblock update, repeating
16:26:04 Recovery SCB completes
16:26:04 Recovery code awake
16:26:14 scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
16:26:14 SCSI error : <0 0 0 0> return code = 0x8000002
16:26:14 sda: Current: sense key: Aborted Command
16:26:14 Additional sense: No additional sense information
16:26:14 Info fld=0x0
16:26:14 end_request: I/O error, dev sda, sector 67119433
16:26:14 scsi0 (0:0): rejecting I/O to offline device
16:26:14 md: write_disk_sb failed for device sda6
16:26:14 md: write_disk_sb failed for device sda1
16:26:14 (scsi0:A:0): 320.000MB/s transfers (160.000MHz DT|IU|QAS, 16bit)
16:26:14 md: errors occurred during superblock update, repeating
16:26:14 scsi0 (0:0): rejecting I/O to offline device
16:26:14 md: write_disk_sb failed for device sda6
.....

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/