Mailing List Archive: [BUG] ide dma_timer_expiry, then hard lockup

[BUG] ide dma_timer_expiry, then hard lockup

linas at austin

Jun 18, 2007, 10:57 AM

Post #1 of 27 (1344 views)

I've got a hard lockup in the ide subsystem, probably
due to some irq spew or something like that.

I've just bought a brand new Maxtor 320GB disk drive
for the insane price of $70 US to replace another
failing drive. It works well under light load;
I was able to copy about 60GB to it. However,
under heavy load, such as reconstruction of an MD
RAID-1 array, it'll lock up the kernel. Which means
that my system won't boot :-(

I'm running 2.6.21.1, although the problem seems to occur
in 2.6.19 and 2.6.18 too; it's been there a while; I vaguely
remember similar problems in 2.6.5 or 2.6.10.

I get an
"hdc: dma_timer_expiry: dma status == 0x21"

and 10 seconds later,

"hdc: DMA Timeout error"

at which point the system is locked up hard.
Magic SysRq does not work at all. The hard drive activity light
stays fully lit. Inserting printk's into the kernel, I find the
hang to be in a surprising place:

ide_dma_timeout_retry() in ide-io.c
prints the "hdc: DMA Timeout error" then calls
HWIF(drive)->ide_dma_end(drive);
which returns, and then calls
hwif->INB(IDE_STATUS_REG) which is needed as an argument to ide_error()

But this hangs! -- The INB never returns.
Now: hwif->INB = ide_inb; in ide-iops.c

So putting a printk into ide_inb() shows that
the printk before the readb() is printed, and the
printk after the readb is not (!!)

I find this rather surprising, as I can't imagine how the
readb can fail. My current vague theory is that doing this
readb makes the hard drive go really nuts, and it probably
ties some interrupt line high, and so the linux kernel
gets stuck trying to handle the irq flood. I just don't know
enough about the i386 architecture, or about interrupts, to
prove or disprove this.

Background: this is on an old dual-cpu intel (coppermine??)
box; the controller is a HighPoint HPT366 on the motherboard.
This is an old parallel ATA (80-conductor cable) setup.

I can get the system to boot by sneaking in an
"hdparm -d0 /dev/hdc" early in the boot process, to turn off
the use of DMA, but it seems that PIO is so slow, that it takes
forever to get NFS started.

I can also get it to boot by unplugging /dev/hdc. Unfortunately,
given the RAID mirroring, the only usable copy of /, /usr is on
/dev/hda and the only usable copy of /home is on /dev/hdc, so
I'm screwed ...

Any suggestions, experiments, experimental patches, data gathering,
etc. is welcome. The sooner, the better...

--linas

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

RE: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

Stuart_Hayes at Dell

Jun 18, 2007, 11:11 AM

Post #2 of 27 (1336 views)

I think reading the IDE status register clears the interrupt in the IDE
device, which might be causing the drive to think it's OK to generate
another interrupt. This could either cause it to get stuck trying to
service an interrupt that is never getting cleared as you suggested, or
possibly when the next IRQ comes in the IDE IRQ handler gets stuck
waiting for a spinlock that the code you're looking at already owns...?

Perhaps a printk in the IDE IRQ handler would be informative? It
wouldn't help you figure out how it got where it is, but it might help
you figure out why the system is hanging.

Stuart

-----Original Message-----
From: linux-ide-owner@vger.kernel.org
[mailto:linux-ide-owner@vger.kernel.org] On Behalf Of Linas Vepstas
Sent: Monday, June 18, 2007 12:57 PM
To: linux-ide@vger.kernel.org; linux-kernel@vger.kernel.org
Subject: [BUG] ide dma_timer_expiry, then hard lockup

I've got a hard lockup in the ide subsystem, probably due to some irq
spew or something like that.

I've just bought a brand new Maxtor 320GB disk drive for the insane
price of $70 US to replace another failing drive. It works well under
light load; I was able to copy about 60GB to it. However, under heavy
load, such as reconstruction of an MD
RAID-1 array, it'll lock up the kernel. Which means that my system
won't boot :-(

I'm running 2.6.21.1, although the problem seems to occur in 2.6.19 and
2.6.18 too; it's been there a while; I vaguely remember similar problems
in 2.6.5 or 2.6.10.

I get an
"hdc: dma_timer_expiry: dma status == 0x21"

and 10 seconds later,

"hdc: DMA Timeout error"

at which point the system is locked up hard.
Magic SysRq does not work at all. The hard drive activity light stays
fully lit. Inserting printk's into the kernel, I find the hang to be in
a surprising place:

ide_dma_timeout_retry() in ide-io.c
prints the "hdc: DMA Timeout error" then calls
HWIF(drive)->ide_dma_end(drive);
which returns, and then calls
hwif->INB(IDE_STATUS_REG) which is needed as an argument to
ide_error()

But this hangs! -- The INB never returns.
Now: hwif->INB = ide_inb; in ide-iops.c

So putting a printk into ide_inb() shows that
the printk before the readb() is printed, and the
printk after the readb is not (!!)

I find this rather surprising, as I can't imagine how the
readb can fail. My current vague theory is that doing this
readb makes the hard drive go really nuts, and it probably
ties some interrupt line high, and so the linux kernel
gets stuck trying to handle the irq flood. I just don't know
enough about the i386 architecture, or about interrupts, to
prove or disprove this.

Background: this is on an old dual-cpu intel (coppermine??)
box; the controller is a HighPoint HPT366 on the motherboard.
This is an old parallel ATA (80-conductor cable) setup.

I can get the system to boot by sneaking in an
"hdparm -d0 /dev/hdc" early in the boot process, to turn off
the use of DMA, but it seems that PIO is so slow, that it takes
forever to get NFS started.

I can get it to boot, by unplugging /dev/hdc. Unfortunately,
given the RAID mirroring, the only usable copy of /, /usr is on
/dev/hda and the only usable copy of /home is on /dev/hdc, so
I'm screwed ...

Any suggestions, experiments, experimental patches, data gathering,
etc. is welcome. The sooner, the better...

--linas


Re: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

alan at lxorguk

Jun 18, 2007, 1:27 PM

Post #3 of 27 (1319 views)

> ide_dma_timeout_retry() in ide-io.c
> prints the "hdc: DMA Timeout error" then calls
> HWIF(drive)->ide_dma_end(drive);
> which returns, and then calls
> hwif->INB(IDE_STATUS_REG) which is needed as an argument to ide_error()
>
> But this hangs! -- The INB never returns.
> Now: hwif->INB = ide_inb; in ide-iops.c

Yep and the I/O cycle never completes so the box hangs. This occurs if
the drive blows up and never switches IORDY to indicate completion. The
hpt will also do this sometimes if it gets addled by a confused drive,
while an intel one often won't.

Alan

Re: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

linas at austin

Jun 18, 2007, 1:46 PM

Post #4 of 27 (1347 views)

On Mon, Jun 18, 2007 at 09:27:04PM +0100, Alan Cox wrote:
> > ide_dma_timeout_retry() in ide-io.c
> > prints the "hdc: DMA Timeout error" then calls
> > HWIF(drive)->ide_dma_end(drive);
> > which returns, and then calls
> > hwif->INB(IDE_STATUS_REG) which is needed as an argument to ide_error()
> >
> > But this hangs! -- The INB never returns.
> > Now: hwif->INB = ide_inb; in ide-iops.c
>
> Yep and the I/O cycle never completes so the box hangs. This occurs if
> the drive blows up and never switches IORDY to indicate completion. The
> hpt will also do this sometimes if it gets addled by a confused drive,
> while an intel one often won't.

So what do you suggest? (I could buy an alternate ide controller,
and hope that goes away, or just buy a different hard drive. But
that's beside the point).

I can prepare a patch, but only with a lot of guidance. I can test
& debug, I'm highly motivated just right now ...

--linas

Re: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

alan at lxorguk

Jun 18, 2007, 2:04 PM

Post #5 of 27 (1323 views)

> So what do you suggest? (I could buy an alternate ide controller,
> and hope that goes away, or just buy a different hard drive. But
> that's beside the point).

The DMA timeout itself could be all sorts of things - crap driver, crap
hardware, PCI bus contention, noise, problem disk, phase of the moon.

> I can prepare a patch, but only with a lot of guidance. I can test
> & debug, I'm highly motivated just right now ...

If you've got a nice repeatable problem please try using the libata
driver. That handles the error paths differently and doesn't try a FIFO
drain which might matter in this case I guess.

Alan

Re: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

linas at austin

Jun 18, 2007, 2:22 PM

Post #6 of 27 (1316 views)

On Mon, Jun 18, 2007 at 10:04:41PM +0100, Alan Cox wrote:
>
> If you've got a nice repeatable problem

Very highly repeatable :-(

> please try using the libata
> driver. That handles the error paths differently and doesn't try a FIFO
> drain which might matter in this case I guess.

Dohh, yes, of course. Completely forgot about that. (I assume you mean
CONFIG_ATA). Will report tomorrow.

--linas

Re: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

sshtylyov at ru

Jun 19, 2007, 7:07 AM

Post #7 of 27 (1326 views)

Hello.

Stuart_Hayes@Dell.com wrote:
> I think reading the IDE status register clears the interrupt in the IDE
> device, which might be causing the drive to think it's OK to generate
> another interrupt.

This is not how IDE drives are supposed to act -- they won't proceed any
further until "interrupt pending" condition is cleared, so these aren't
supposed to be "stacked". This behavior however is not strictly specified by
ATA standards IIRC, but I can't readily imagine such a situation anyway unless
tagged command queueing (which is not supported by IDE core) and/or ATAPI
command overlapping is in action...

> This could either cause it to get stuck trying to
> service an interrupt that is never getting cleared as you suggested, or
> possibly when the next IRQ comes in the IDE IRQ handler gets stuck
> waiting for a spinlock that the code you're looking at already owns...?

I could also imagine the HPT366 chip going mad and stalling the reads of
the taskfile regs forever because of the incomplete DMA or even the drive
going mad and not replying to I/O cycles with proper -IORDY handshake (i.e.
holding it low all the time)...

> Perhaps a printk in the IDE IRQ handler would be informative? It
> wouldn't help you figure out how it got where it is, but it might help
> you figure out why the system is hanging.

> Stuart

> -----Original Message-----
> From: linux-ide-owner@vger.kernel.org
> [mailto:linux-ide-owner@vger.kernel.org] On Behalf Of Linas Vepstas
> Sent: Monday, June 18, 2007 12:57 PM
> To: linux-ide@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [BUG] ide dma_timer_expiry, then hard lockup

> I've got a hard lockup in the ide subsystem, probably due to some irq
> spew or something like that.
>
> I've just bought a brand new Maxtor 320GB disk drive for the insane
> price of $70 US to replace another failing drive. It works well under
> light load; I was able to copy about 60GB to it. However, under heavy
> load, such as reconstruction of an MD
> RAID-1 array, it'll lock up the kernel. Which means that my system
> won't boot :-(
>
> I'm running 2.6.21.1, although the problem seems to occur in 2.6.19 and
> 2.6.18 too; it's been there a while; I vaguely remember similar problems
> in 2.6.5 or 2.6.10.
>
> I get an
> "hdc: dma_timer_expiry: dma status == 0x21"

This means "DMA not complete".

> and 10 seconds later,

The above condition causes another, 10 sec timeout...
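For reference, 0x21 can be decoded with the usual bus-master IDE status bit layout (bit 0 = DMA engine active, bit 1 = error, bit 2 = interrupt, bit 5 = drive 0 DMA-capable, bit 6 = drive 1 DMA-capable). What follows is only a user-space sketch under that assumption; `struct bm_status`, `bm_decode` and `bm_dma_incomplete` are names made up for illustration and are not kernel identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bus-master IDE status register bits (assumed standard layout). */
enum {
    BM_STAT_ACTIVE   = 0x01, /* DMA engine still running          */
    BM_STAT_ERROR    = 0x02, /* bus-master error                  */
    BM_STAT_INTR     = 0x04, /* drive raised its interrupt        */
    BM_STAT_DRV0_DMA = 0x20, /* drive 0 flagged as DMA-capable    */
    BM_STAT_DRV1_DMA = 0x40  /* drive 1 flagged as DMA-capable    */
};

/* Hypothetical helper type: one flag per status bit of interest. */
struct bm_status {
    bool active, error, intr, drv0_dma, drv1_dma;
};

struct bm_status bm_decode(uint8_t stat)
{
    struct bm_status s;
    s.active   = (stat & BM_STAT_ACTIVE) != 0;
    s.error    = (stat & BM_STAT_ERROR) != 0;
    s.intr     = (stat & BM_STAT_INTR) != 0;
    s.drv0_dma = (stat & BM_STAT_DRV0_DMA) != 0;
    s.drv1_dma = (stat & BM_STAT_DRV1_DMA) != 0;
    return s;
}

/* "DMA not complete": engine active, no error, no interrupt seen. */
bool bm_dma_incomplete(uint8_t stat)
{
    struct bm_status s = bm_decode(stat);
    return s.active && !s.error && !s.intr;
}
```

With stat = 0x21 this reports the engine as active with neither the error nor the interrupt bit set, i.e. the controller still thinks the transfer is in flight and the drive never signalled completion.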

> "hdc: DMA Timeout error"

> at which point the system is locked up hard.
> Magic SysRq does not work at all. The hard drive activity light stays
> fully lit. Inserting printk's into the kernel, I find the hang to be in
> a surprising place:

> ide_dma_timeout_retry() in ide-io.c
> prints the "hdc: DMA Timeout error" then calls
> HWIF(drive)->ide_dma_end(drive);
> which returns, and then calls
> hwif->INB(IDE_STATUS_REG) which is needed as an argument to
> ide_error()

> But this hangs! -- The INB never returns.
> Now: hwif->INB = ide_inb; in ide-iops.c

> So putting a printk into ide_inb() shows that
> the printk before the readb() is printed, and the
> printk after the readb is not (!!)

> I find this rather surprising, as I can't imagine how the
> readb can fail. My current vague theory is that doing this
> readb makes the hard drive go really nuts, and it probably

As I said, this is not the only way how it all might have gone nuts... :-)

> ties some interrupt line high, and so the linux kernel
> gets stuck trying to handle the irq flood. I just don't know
> enough about the i386 architecture, or about interrupts, to
> prove or disprove this.

> Any suggestions, experiments, experimental patches, data gathering,
> etc. is welcome. The sooner, the better...

> --linas

MBR, Sergei

Re: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

sshtylyov at ru

Jun 19, 2007, 7:10 AM

Post #8 of 27 (1312 views)

Hello.

Alan Cox wrote:
>>I can prepare a patch, but only with a lot of guidance. I can test
>>& debug, I'm highly motivated just right now ...

> If you've got a nice repeatable problem please try using the libata
> driver. That handles the error paths differently and doesn't try a FIFO
> drain which might matter in this case I guess.

FIFO drain for DMA commands?

MBR, Sergei

Re: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

alan at lxorguk

Jun 19, 2007, 7:19 AM

Post #9 of 27 (1316 views)

On Tue, 19 Jun 2007 18:10:04 +0400
Sergei Shtylyov <sshtylyov@ru.mvista.com> wrote:

> Hello.
>
> Alan Cox wrote:
> >>I can prepare a patch, but only with a lot of guidance. I can test
> >>& debug, I'm highly motivated just right now ...
>
> > If you've got a nice repeatable problem please try using the libata
> > driver. That handles the error paths differently and doesn't try a FIFO
> > drain which might matter in this case I guess.
>
> FIFO drain for DMA commands?

Welcome to the old IDE layer which I am so glad I left behind 8)

ide_ata_error will try and do a PIO flush regardless of the command type
if DRQ_STAT is asserted. See ide_dma_intr -> ide_error -> ...
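A rough user-space model of that decision, assuming the standard ATA status bit values that the old IDE headers name BUSY_STAT, DRQ_STAT and ERR_STAT. `wants_pio_drain` and `enum cmd_type` are hypothetical, not kernel code, and the real error path has more conditions than this sketch shows.

```c
#include <stdbool.h>
#include <stdint.h>

/* ATA status register bits (standard values). */
#define ERR_STAT  0x01 /* error                      */
#define DRQ_STAT  0x08 /* drive wants a data transfer */
#define BUSY_STAT 0x80 /* drive busy                  */

/* Hypothetical tag for how the failed command moved its data. */
enum cmd_type { CMD_PIO, CMD_DMA };

/*
 * Model of the behaviour described above: the drain is keyed off DRQ
 * alone, so a failed DMA command is treated the same as a PIO one.
 */
bool wants_pio_drain(uint8_t status, enum cmd_type type)
{
    (void)type; /* deliberately ignored; that is the point */
    return (status & DRQ_STAT) != 0;
}
```

So a status of 0x58 after a DMA command would still trigger the drain, while 0x50 would not. Either way the decision needs the status byte to have been read first.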

Alan

Re: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

sshtylyov at ru

Jun 19, 2007, 7:24 AM

Post #10 of 27 (1313 views)

Alan Cox wrote:

>>>>I can prepare a patch, but only with a lot of guidance. I can test
>>>>& debug, I'm highly motivated just right now ...

>>>If you've got a nice repeatable problem please try using the libata
>>>driver. That handles the error paths differently and doesn't try a FIFO
>>>drain which might matter in this case I guess.

>> FIFO drain for DMA commands?

> Welcome to the old IDE layer which I am so glad I left behind 8)

> ide_ata_error will try and do a PIO flush regardless of the command type
> if DRQ_STAT is asserted. See ide_dma_intr -> ide_error -> ...

Indeed... but the thing is we don't know what's asserted in this case --
remember, it's reading the status register that locks everything up...

> Alan

MBR, Sergei

Re: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

linas at austin

Jun 19, 2007, 8:05 AM

Post #11 of 27 (1338 views)

Hi Sergei,

On Tue, Jun 19, 2007 at 06:07:07PM +0400, Sergei Shtylyov wrote:
>
> Stuart_Hayes@Dell.com wrote:
> >I think reading the IDE status register clears the interrupt in the IDE
> >device, which might be causing the drive to think it's OK to generate
> >another interrupt.
>
> This is not how IDE drives are supposed to act -- they won't proceed any
> further until "interrupt pending" condition is cleared, so these aren't
> supposed to be "stacked". This behavior however is not strictly specified
> by ATA standards IIRC, but I can't readily imagine such a situation anyway
> unless tagged command queueing (which is not supported by IDE core) and/or
> ATAPI command overlapping is in action...

The problem only manifests during high io load; perhaps a missing mutex
somewhere is blasting one thing too many out to the hard drive?

> > This could either cause it to get stuck trying to
> >service an interrupt that is never getting cleared as you suggested, or
> >possibly when the next IRQ comes in the IDE IRQ handler gets stuck
> >waiting for a spinlock that the code you're looking at already owns...?
>
> I could also imagine the HPT366 chip going mad and stalling the reads of
> the taskfile regs forever because of the incomplete DMA or even the drive
> going mad and not replying to I/O cycles with proper -IORDY handshake (i.e.
> holding it low all the time)...

In my case, ctrl-alt-sysrq doesn't work, which makes it hard to debug.

I'm thinking that trying to debug libata is a better idea, rather than
investing time in ide, right? Although at the moment, libata works even
less; see other email.

--linas


Re: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

Jun 19, 2007, 8:38 AM

Post #12 of 27 (1313 views)

Sergei Shtylyov wrote:
> Alan Cox wrote:
>
>>>>> I can prepare a patch, but only with a lot of guidance. I can test
>>>>> & debug, I'm highly motivated just right now ...
>
>>>> If you've got a nice repeatable problem please try using the libata
>>>> driver. That handles the error paths differently and doesn't try a FIFO
>>>> drain which might matter in this case I guess.
>
>>> FIFO drain for DMA commands?
>
>> Welcome to the old IDE layer which I am so glad I left behind 8)
>
>> ide_ata_error will try and do a PIO flush regardless of the command type
>> if DRQ_STAT is asserted. See ide_dma_intr -> ide_error -> ...
>
> Indeed... but the thing is we don't know what's asserted in this case
> -- remember, it's reading the status register that locks everything up...

Exactly. And IORDY shouldn't really apply there,
unless some nitwit standards person wrote it into a spec..

-ml

Re: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

sshtylyov at ru

Jun 19, 2007, 8:51 AM

Post #13 of 27 (1310 views)

Mark Lord wrote:

>>>>>> I can prepare a patch, but only with a lot of guidance. I can test
>>>>>> & debug, I'm highly motivated just right now ...

>>>>> If you've got a nice repeatable problem please try using the libata
>>>>> driver. That handles the error paths differently and doesn't try a
>>>>> FIFO
>>>>> drain which might matter in this case I guess.

>>>> FIFO drain for DMA commands?

>>> Welcome to the old IDE layer which I am so glad I left behind 8)

>>> ide_ata_error will try and do a PIO flush regardless of the command type
>>> if DRQ_STAT is asserted. See ide_dma_intr -> ide_error -> ...

>> Indeed... but the thing is we don't know what's asserted in this
>> case -- remember, it's reading the status register that locks
>> everything up...

> Exactly. And IORDY shouldn't really apply there,
> unless some nitwit standards person wrote it into a spec..

Wrote what? IORDY throttling does *apply* to both data and non-data
register accesses, of course.

> -ml

MBR, Sergei

Re: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

sshtylyov at ru

Jun 19, 2007, 9:10 AM

Post #14 of 27 (1313 views)

Hello.

Linas Vepstas wrote:

>>Stuart_Hayes@Dell.com wrote:

>>>I think reading the IDE status register clears the interrupt in the IDE
>>>device, which might be causing the drive to think it's OK to generate
>>>another interrupt.

>> This is not how IDE drives are supposed to act -- they won't proceed any
>>further until "interrupt pending" condition is cleared, so these aren't
>>supposed to be "stacked". This behavior however is not strictly specified
>>by ATA standards IIRC, but I can't readily imagine such a situation anyway
>>unless tagged command queueing (which is not supported by IDE core) and/or
>>ATAPI command overlapping is in action...

> The problem only manifests during high io load; perhaps a missing mutex
> somewhere is blasting one thing too many out to the hard drive?

Hm... not sure about this.

>>>This could either cause it to get stuck trying to
>>>service an interrupt that is never getting cleared as you suggested, or
>>>possibly when the next IRQ comes in the IDE IRQ handler gets stuck
>>>waiting for a spinlock that the code you're looking at already owns...?

>> I could also imagine the HPT366 chip going mad and stalling the reads of
>>the taskfile regs forever because of the incomplete DMA or even the drive
>>going mad and not replying to I/O cycles with proper -IORDY handshake (i.e.
>>holding it low all the time)...

> In my case, ctrl-alt-sysrq doesn't work, which makes it hard to debug.

> I'm thinking that trying to debug libata is a better idea, rather than
> investing time in ide, right? Although at the moment, libata works even
> less; see other email.

Which makes me think this really is some *hardware* issue.

> --linas

MBR, Sergei

Re: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

alan at lxorguk

Jun 19, 2007, 9:17 AM

Post #15 of 27 (1311 views)

> > Indeed... but the thing is we don't know what's asserted in this case
> > -- remember, it's reading the status register that locks everything up...
>
> Exactly. And IORDY shouldn't really apply there,
> unless some nitwit standards person wrote it into a spec..

Could it be we need to reset the state machine at this point before we
touch the registers again - that wouldn't be the first controller with
this limit and undocumented.

On the 370 we already

Linas: for the debug on the libata one, turn on ATA_DEBUG and
ATA_VERBOSE_DEBUG in include/linux/libata.h and it should spew
diagnostics before the freeze. I suspect that's a different problem to the
hang you see now but I'd like to debug both.
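"Turning on" here means flipping compile-time switches and rebuilding. The pattern is sketched below in plain user-space C; it is an analogue only, assuming the header gates its DPRINTK/VPRINTK macros on those two symbols. `debug_emit`, `debug_lines_emitted` and `fake_queue_command` are made-up names for the illustration.

```c
#include <stdarg.h>
#include <stdio.h>

#define ATA_DEBUG               /* remove to silence DPRINTK too    */
/* #define ATA_VERBOSE_DEBUG */ /* add to enable VPRINTK as well    */

int debug_lines_emitted;        /* counts messages actually printed */

/* Hypothetical sink: prefix the message with the calling function. */
void debug_emit(const char *func, const char *fmt, ...)
{
    va_list ap;

    fprintf(stderr, "%s: ", func);
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    debug_lines_emitted++;
}

#ifdef ATA_DEBUG
#define DPRINTK(...) debug_emit(__func__, __VA_ARGS__)
#else
#define DPRINTK(...) ((void)0)
#endif

/* Verbose output needs both switches, mirroring the nested gating. */
#if defined(ATA_DEBUG) && defined(ATA_VERBOSE_DEBUG)
#define VPRINTK(...) debug_emit(__func__, __VA_ARGS__)
#else
#define VPRINTK(...) ((void)0)
#endif

/* Stand-in for a driver function sprinkled with debug calls. */
void fake_queue_command(int tag)
{
    DPRINTK("ENTER, tag %d\n", tag);
    VPRINTK("per-command detail for tag %d\n", tag);
}
```

As written, only the DPRINTK line is emitted; defining ATA_VERBOSE_DEBUG as well adds the second one. Disabled macros compile to nothing, so there is no runtime cost when both are off.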

Alan

Re: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

sshtylyov at ru

Jun 19, 2007, 9:32 AM

Post #16 of 27 (1315 views)

Alan Cox wrote:
>>> Indeed... but the thing is we don't know what's asserted in this case
>>>-- remember, it's reading the status register that locks everything up...

>>Exactly. And IORDY shouldn't really apply there,
>>unless some nitwit standards person wrote it into a spec..

> Could it be we need to reset the state machine at this point before we
> touch the registers again - that wouldn't be the first controller with
> this limit and undocumented.

> On the 370 we already

Yeah, that could be. And because IORDY pin becomes DSTROBE for UltraDMA it
might have stuck low due to this (if the chip never asserted STOP)...

> Alan

MBR, Sergei

Re: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

linas at austin

Jun 19, 2007, 9:48 AM

Post #17 of 27 (1311 views)

On Tue, Jun 19, 2007 at 08:10:25PM +0400, Sergei Shtylyov wrote:
>
> >I'm thinking that trying to debug libata is a better idea, rather than
> >investing time in ide, right? Although at the moment, libata works even
> >less; see other email.
>
> Which makes me think this really is some *hardware* issue.

There are two distinct issues.
-- libata locks up in partition table read on an hpt366+old maxtor disk
that has been working fine for many years with old ide driver. (It
still works fine when I boot to the alternate ide-based kernel).

-- ide driver locks up on hpt366+new maxtor disk under heavy
i/o load. I was able to copy 60GB from old to new disk without a
problem; however, raid reconstruction locks it up, maybe after 5-15
seconds.

This probably is "hardware related"; it's something that the new
hard drive does. Given that it's being sold at a big discount, it
may even be that the sellers know that this is a crappy disk. :-)

All I want is some way of resetting the disk, and continuing on.

I'm stalled in debugging; I'm not sure what I'm looking for.

--linas


Re: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

bzolnier at gmail

Jun 19, 2007, 11:43 AM

Post #18 of 27 (1313 views)

Hi,

On Tuesday 19 June 2007, Linas Vepstas wrote:
> On Tue, Jun 19, 2007 at 08:10:25PM +0400, Sergei Shtylyov wrote:
> >
> > >I'm thinking that trying to debug libata is a better idea, rather than
> > >investing time in ide, right? Although at the moment, libata works even
> > >less; see other email.
> >
> > Which makes me think this really is some *hardware* issue.

Linas, have you checked that there are no firmware updates available
for this drive?

> There are two distinct issues.
> -- libata locks up in partition table read on an hpt366+old maxtor disk
> that has been working fine for many years with old ide driver. (It
> still works fine when I boot to the alternate ide-based kernel).
>
> -- ide driver locks up on hpt366+new maxtor disk under heavy
> i/o load. I was able to copy 60GB from old to new disk without a
> problem; however, raid reconstruction locks it up, maybe after 5-15
> seconds.
>
> This probably is "hardware related"; it's something that the new
> hard drive does. Given that it's being sold at a big discount, it
> may even be that the sellers know that this is a crappy disk. :-)
>
> All I want is some way of resetting the disk, and continuing on.

It would be useful to see hdparm --Istdout output for *both* disks.

> I'm stalled in debugging; I'm not sure what I'm looking for.

Sergei, do you think that testing the drive with DMA disabled may
tell us something new?

Thanks,
Bart

Re: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

sshtylyov at ru

Jun 19, 2007, 1:07 PM

Post #19 of 27 (1312 views)

Bartlomiej Zolnierkiewicz wrote:

>>There are two distinct issues.
>>-- libata locks up in partition table read on an hpt366+old maxtor disk
>> that has been working fine for many years with old ide driver. (It
>> still works fine when I boot to the alternate ide-based kernel).

>>-- ide driver locks up on hpt366+new maxtor disk under heavy
>> i/o load. I was able to copy 60GB from old to new disk without a
>> problem; however, raid reconstruction locks it up, maybe after 5-15
>> seconds.

>> This probably is "hardware related"; it's something that the new
>> hard drive does. Given that it's being sold at a big discount, it
>> may even be that the sellers know that this is a crappy disk. :-)

>> All I want is some way of resetting the disk, and continuing on.

> It would be useful to see hdparm --Istdout output for *both* disks.

>>I'm stalled in debugging; I'm not sure what I'm looking for.

> Sergei, do you think that testing the drive with DMA disabled may
> tell us something new?

Not sure. I'll try to come up with a patch resetting the state machine in
dma_timeout() method (following Alan's idea) -- HPT366 regs are too
different to reuse the one for HPT370.

> Thanks,
> Bart

MBR, Sergei

Re: [BUG] ide dma_timer_expiry, then hard lockup [ In reply to ]

linas at austin

Jun 20, 2007, 9:28 AM

Post #20 of 27 (1311 views)

On Wed, Jun 20, 2007 at 12:07:19AM +0400, Sergei Shtylyov wrote:
> Bartlomiej Zolnierkiewicz wrote:
>
> [...firmware...]

Google seems to show that there are no publicly available
firmware updates for Maxtor disks.

> >It would be useful to see hdparm --Istdout output for *both* disks.

Let's do one at a time. Appended below is the one for the
older, "known good" disk.

> >Sergei, do you think that testing the drive with DMA disabled may
> >tell us something new?

FWIW, the "buggy" disk seems to work fine with DMA turned off (with
hdparm). I just copied 60GB from it; although this did take about 16
hours at high cpu usage.... There were maybe a dozen DriveReady
SeekComplete Timeout errors clustered a few minutes apart.
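
For anyone reading those messages: "DriveReady" and "SeekComplete" are the names the IDE driver prints for bits 0x40 and 0x10 of the ATA status register. A minimal sketch of that decoding follows. The bit values are the standard ATA ones; the name spellings and the helper decode_status are assumptions for illustration, not the driver's own code.

```python
# ATA status register bits, highest bit first. The names mirror what the
# old IDE driver prints in its log messages (spellings are an assumption).
ATA_STATUS_BITS = [
    (0x80, "Busy"),
    (0x40, "DriveReady"),
    (0x20, "DeviceFault"),
    (0x10, "SeekComplete"),
    (0x08, "DataRequest"),
    (0x04, "CorrectedError"),
    (0x02, "Index"),
    (0x01, "Error"),
]


def decode_status(status):
    """Return the names of all bits set in an 8-bit ATA status value."""
    return [name for mask, name in ATA_STATUS_BITS if status & mask]


if __name__ == "__main__":
    # 0x50 is the status an idle drive normally reports.
    print(decode_status(0x50))
```

A status of 0x50 with neither Busy nor Error set means the drive itself claims to be idle and healthy when the timeout fires.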

----
Re: The libata problem. This is a hang during the read of the
partition table during boot, of the "known good" disk. I turned
on scsi and libata debugging, reproduced the hang, diligently
copied to a piece of paper, but then left the darned piece of
paper at home.

From what I remember, the ata command was translated to scsi,
by ata_queuecommand, and then handed off to the scsi subsystem.
Presumably, it's sent to the drive, but the drive does not respond.

30 seconds later, the scsi eh runs, and hands the error back to
libata, which takes a few ineffectual shots at recovery, and
then hangs.

I'll try to get the details later.

Is there a way of viewing the contents of the command queue on
the hard drive, to see if the command actually made it across?

--linas

/dev/hda:
multcount = 16 (on)
IO_support = 0 (default 16-bit)
unmaskirq = 0 (off)
using_dma = 1 (on)
keepsettings = 0 (off)
readonly = 0 (off)
readahead = 256 (on)
geometry = 24792/255/63, sectors = 398297088, start = 0
0040 3fff c837 0010 0000 0000 003f 0000
0000 0000 5936 3130 4d45 4345 2020 2020
2020 2020 2020 2020 0003 3e00 0039 5941
5234 3142 5730 4d61 7874 6f72 2036 5932
3030 5030 2020 2020 2020 2020 2020 2020
2020 2020 2020 2020 2020 2020 2020 8010
0000 2f00 4000 0200 0000 0007 ffff 0001
003f ffc1 003e 0110 ffff 0fff 0000 0007
0003 0078 0078 0078 0078 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
00fe 001e 7c6b 7f09 4003 7c69 3e01 4003
107f 0000 0000 0000 fffe 600d c0fe 0000
0000 0000 0000 0000 8800 17bd 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0001 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0001 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 60a5
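
The hex block above is the raw IDENTIFY DEVICE data, 256 16-bit words. A rough sketch of pulling the interesting fields out of it is below. The word offsets (10-19 serial, 23-26 firmware, 27-46 model, 60-61 LBA28 sectors, 88 UDMA modes, 100-103 LBA48 sectors) are the standard ATA ones; the function names are made up for illustration, and only the first 104 words of the dump are pasted in.

```python
# First 104 words (13 rows) of the hdparm --Istdout dump quoted above.
DUMP = """
0040 3fff c837 0010 0000 0000 003f 0000
0000 0000 5936 3130 4d45 4345 2020 2020
2020 2020 2020 2020 0003 3e00 0039 5941
5234 3142 5730 4d61 7874 6f72 2036 5932
3030 5030 2020 2020 2020 2020 2020 2020
2020 2020 2020 2020 2020 2020 2020 8010
0000 2f00 4000 0200 0000 0007 ffff 0001
003f ffc1 003e 0110 ffff 0fff 0000 0007
0003 0078 0078 0078 0078 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
00fe 001e 7c6b 7f09 4003 7c69 3e01 4003
107f 0000 0000 0000 fffe 600d c0fe 0000
0000 0000 0000 0000 8800 17bd 0000 0000
"""


def parse_words(dump):
    """Turn whitespace-separated hex words into a list of integers."""
    return [int(tok, 16) for tok in dump.split()]


def ata_string(words, first, last):
    """Decode an ATA string field: two ASCII chars per word, high byte first."""
    raw = b"".join(w.to_bytes(2, "big") for w in words[first:last + 1])
    return raw.decode("ascii", errors="replace").strip()


def summarize(words):
    """Extract a few well-known IDENTIFY DEVICE fields."""
    udma = words[88]
    return {
        "serial": ata_string(words, 10, 19),
        "firmware": ata_string(words, 23, 26),
        "model": ata_string(words, 27, 46),
        "lba28_sectors": words[60] | (words[61] << 16),
        "lba48_sectors": (words[100] | (words[101] << 16)
                          | (words[102] << 32) | (words[103] << 48)),
        # Low byte of word 88: supported UDMA modes; high byte: selected mode.
        "udma_supported": [m for m in range(7) if udma & (1 << m)],
        "udma_selected": [m for m in range(7) if udma & (1 << (8 + m))],
    }


if __name__ == "__main__":
    for key, value in summarize(parse_words(DUMP)).items():
        print(key, "=", value)
```

The LBA48 sector count decoded this way agrees with the "sectors = 398297088" line that hdparm printed, which is a quick check that the dump was not mangled in transit.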


Re: [BUG] ide dma_timer_expiry, then hard lockup

alan at lxorguk

Jun 20, 2007, 10:01 AM

Post #21 of 27 (1316 views)

> Google seems to show that there are no publicly available
> firmware updates for Maxtor disks.

There are for some but only if you irritate the tech support people.

> hours at high cpu usage.... There were maybe a dozen DriveReady
> SeekComplete Timeout errors clustered a few minutes apart.

That suggests the drive is having problems occasionally and that the DMA
path code then blows up when they occur.

> Is there a way of viewing the contents of the command queue on
> the hard drive, to see if the command actually made it across?

queue ? You are overestimating IDE ;)

When the command is written you wait 400ns and then BSY is supposed to be
asserted. DRQ and other bits then handshake the data at the software
level with IORDY doing it at the hardware level for PIO (except in early
drives/low speeds where it's done by the prayer and timing tolerance
approach)
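
The handshake described above can be sketched as a polling loop. This is a userspace simulation only, not the kernel's code: read_status is a hypothetical callable standing in for the status register read, and the default timeout is an arbitrary assumption.

```python
import time

ATA_BUSY = 0x80  # BSY: drive owns the task file registers
ATA_DRQ = 0x08   # DRQ: drive is ready to transfer data
ATA_ERR = 0x01   # ERR: previous command failed


def wait_not_busy(read_status, timeout_s=1.0, settle_s=400e-9):
    """Poll until BSY clears; return the final status or raise TimeoutError.

    read_status is a hypothetical callable returning the 8-bit status value.
    """
    # Give the drive time to assert BSY after the command was written.
    time.sleep(settle_s)
    deadline = time.monotonic() + timeout_s
    while True:
        status = read_status()
        if not status & ATA_BUSY:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("drive still busy, status=0x%02x" % status)


if __name__ == "__main__":
    # Simulated drive: busy for two reads, then ready with DRQ set.
    fake_drive = iter([0x80, 0x80, 0x58])
    status = wait_not_busy(fake_drive.__next__)
    print("final status 0x%02x, DRQ set: %s" % (status, bool(status & ATA_DRQ)))
```

A loop like this only guards against a drive that stays busy. It cannot help when the register read itself never returns, which is the failure reported in this thread.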

It's unlikely the command got lost. The IRQ could have done but the error
path tries to spot that case by reading the status register - which
hangs. So in theory it could be a lost IRQ and if the reset works we'll
find that out.

Alan


Re: [BUG] ide dma_timer_expiry, then hard lockup

sshtylyov at ru

Jun 21, 2007, 10:58 AM

Post #22 of 27 (1303 views)

Hello.

Alan Cox wrote:

>>Google seems to show that there are no publicly available
>>firmware updates for Maxtor disks.

> There are for some but only if you irritate the tech support people.

>>hours at high cpu usage.... There were maybe a dozen DriveReady
>>SeekComplete Timeout errors clustered a few minutes apart.

> That suggests the drive is having problems occasionally and that the DMA
> path code then blows up when they occur.

>>Is there a way of viewing the contents of the command queue on
>>the hard drive, to see if the command actually made it across?

> queue ? You are overestimating IDE ;)

He's not -- there has been queued command support since ATA[PI]-5. I'm not sure
why but Linux decided not to support it.

MBR, Sergei

Re: [BUG] ide dma_timer_expiry, then hard lockup

linas at austin

Jun 21, 2007, 12:47 PM

Post #23 of 27 (1298 views)

On Wed, Jun 20, 2007 at 06:01:23PM +0100, Alan Cox wrote:
>
> It's unlikely the command got lost. The IRQ could have done but the error
> path tries to spot that case by reading the status register - which
> hangs. So in theory it could be a lost IRQ and if the reset works we'll
> find that out.

OK, here's the libata trace info (transcribed by hand, may have typos,
the numerical values should be correct).

This is during the first read of the partition table, during boot.

ata_scsi_dump_cdb: CDB(:1:0,0,0) 28 00 00 00 00 00 00 00 08
ata_scsi_translate: ENTER
scsi_10_lba_len: ten-byte command
ata_sg_setup: ENTER, ata1
ata_sg_setup: 1 sg elements mapped
ata_fill_sg: PRD[0] = (0x2FEEF000, 0x1000)
ata1: ata_dev_select: ENTER, device 0, wait 1
ata_tf_load: feat 0x0 nsect 0x8 lba 0x0 0x0 0x0
ata_tf_load: device 0xE0
ata_exec_command: ata1: cmd 0xc8
ata_scsi_translate: EXIT

then, 30 seconds later:

sd 0:0:0:0 [sda] Done: 0xeff3aba0 TIMEOUT
sd 0:0:0:0 [sda] Result: host_byte=DID_OK driver_byte=DRV_OK, SUG_OK
sd 0:0:0:0 [sda] CDB: Read(10): 28 00 00 ... 00 08 00
sd 0:0:0:0 [sda] scsi host busy 1 failed 0
ata_scsi_timed_out: ENTER
ata_scsi_timed_out: EXIT, ret=0
ata_port_flush_task: ENTER
ata_port_flush_task: flush #1
ata1: ata_port_flush_task: flush #2
ata_port_flush_task: EXIT

Then a hard hang here.

This was on 2.6.22-rc5-git1
Again, this disk and controller combo work spotlessly when using
the ide drivers.
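
The first trace can be cross-checked by hand: the CDB is a SCSI READ(10) for 8 blocks at LBA 0, which matches the taskfile (nsect 0x8, lba 0, cmd 0xc8 = READ DMA) and the single 0x1000-byte PRD entry. A small sketch of that check follows; parse_read10 is a hypothetical helper, and the 512-byte sector size is an assumption.

```python
SECTOR_SIZE = 512  # assumed logical sector size

# CDB bytes as printed in the trace (only the first 9 bytes are dumped).
TRACE_CDB = bytes.fromhex("28 00 00 00 00 00 00 00 08")


def parse_read10(cdb):
    """Decode LBA and block count from a SCSI READ(10) CDB.

    Accepts the first 9 bytes, since the trailing control byte is not
    needed for this check.
    """
    if len(cdb) < 9 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


if __name__ == "__main__":
    lba, blocks = parse_read10(TRACE_CDB)
    print("lba=%d blocks=%d bytes=0x%x" % (lba, blocks, blocks * SECTOR_SIZE))
```

So the translation from SCSI to ATA looks consistent in the trace; whatever goes wrong happens after the command is issued to the drive.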

--linas

Re: [BUG] ide dma_timer_expiry, then hard lockup

alan at lxorguk

Jun 21, 2007, 2:41 PM

Post #24 of 27 (1307 views)

> > queue ? You are overestimating IDE ;)
>
> He's not -- there has been queued command support since ATA[PI]-5. I'm not sure
> why but Linux decided not to support it.

Almost no hardware supports it and the functionality is really really
ugly to use when it works at all - NCQ is rather more elegant.

Alan

Re: [BUG] ide dma_timer_expiry, then hard lockup

alan at lxorguk

Jun 21, 2007, 3:04 PM

Post #25 of 27 (1307 views)

> sd 0:0:0:0 [sda] Done: 0xeff3aba0 TIMEOUT
> sd 0:0:0:0 [sda] Result: host_byte=DID_OK driver_byte=DRV_OK, SUG_OK
> sd 0:0:0:0 [sda] CDB: Read(10): 28 00 00 ... 00 08 00
> sd 0:0:0:0 [sda] scsi host busy 1 failed 0
> ata_scsi_timed_out: ENTER
> ata_scsi_timed_out: EXIT, ret=0
> ata_port_flush_task: ENTER
> ata_port_flush_task: flush #1
> ata1: ata_port_flush_task: flush #2
> ata_port_flush_task: EXIT
>
> Then a hard hang here.

Thanks

Added to my bug collection to peer at.