Mailing List Archive

Re: Serial related oops [ In reply to ]
On Wed, Feb 21, 2007 at 02:13:15PM +0000, Jose Goncalves wrote:
> <1>[18840.304048] Unable to handle kernel NULL pointer dereference at virtual address 00000012
> <1>[18840.313046] printing eip:
> <4>[18840.321687] c01bfa7a
> <1>[18840.321714] *pde = 00000000
> <0>[18840.331287] Oops: 0000 [#1]
> <4>[18840.340687] Modules linked in:
> <0>[18840.349749] CPU: 0
> <4>[18840.349767] EIP: 0060:[<c01bfa7a>] Not tainted VLI
> <4>[18840.349782] EFLAGS: 00010202 (2.6.16.41-mtm5-debug1 #1)
> <0>[18840.377277] EIP is at serial_in+0xa/0x4a
> <0>[18840.387221] eax: 00000060 ebx: 00000000 ecx: 00000000 edx: 00000000
> <0>[18840.397805] esi: 00000000 edi: 00000040 ebp: c728fe1c esp: c728fe18
> <0>[18840.408579] ds: 007b es: 007b ss: 0068
> <0>[18840.419624] Process gp_position (pid: 11629, threadinfo=c728e000 task=c7443a90)
> <0>[18840.420509] Stack: <0>00000000 00000000 c01c0f88 00000000 00000000 c031fef0 00000005 00000202
> <0>[18840.445655] c7161a1c c031fef0 c124b510 c728fe60 c01bd97d c031fef0 c124b510 c124b510
> <0>[18840.460540] 00000000 c773dbcc c728fe7c c01befe7 c124b510 00000000 ffffffed c773dbcc

Okay, this one is even more plainly "not a coding error".

> <0>[18840.566645] [<c01c0f88>] serial8250_startup+0x28f/0x2a9

The code around this point (with the return point marked) is:

> c01c0f78: 6a 05 push $0x5
> c01c0f7a: 53 push %ebx
> c01c0f7b: e8 f0 ea ff ff call c01bfa70 <serial_in>
> c01c0f80: 6a 00 push $0x0
> c01c0f82: 53 push %ebx
> c01c0f83: e8 e8 ea ff ff call c01bfa70 <serial_in>
> c01c0f88<<< 6a 02 push $0x2
> c01c0f8a: 53 push %ebx
> c01c0f8b: e8 e0 ea ff ff call c01bfa70 <serial_in>

and corresponds with this C code:

(void) serial_inp(up, UART_LSR);
(void) serial_inp(up, UART_RX);
(void) serial_inp(up, UART_IIR);

Now let's look at the words pushed on the stack around this code:

00000000
00000000
c01c0f88 <- return address for serial_in (serial8250_startup+0x28f/0x2a9)
00000000 <- from push %ebx at c01c0f82
00000000 <- from push $0x0 at c01c0f80
c031fef0 <- from push %ebx at c01c0f7a
00000005 <- from push $0x5 at c01c0f78

Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
First thing to notice is this violates the C code - "up" can not
change.
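
The comparison above can be written out as a small C sketch. The array holds the first seven words of the "Stack:" line from the oops; the index names and the function name are hypothetical and exist only for this illustration. Labelling the first two words as the saved %ebx and saved %ebp is an inference from esp (c728fe18) and ebp (c728fe1c) in the register dump, not something stated in the thread.

```c
#include <stdint.h>

/* First seven words of the "Stack:" line in the oops, lowest address first. */
static const uint32_t stack_words[] = {
    0x00000000, /* saved %ebx (inferred: this slot is at esp)            */
    0x00000000, /* saved %ebp (inferred: this slot is at ebp)            */
    0xc01c0f88, /* return address into serial8250_startup                */
    0x00000000, /* "up" for the second call  (push %ebx at c01c0f82)     */
    0x00000000, /* offset for the second call (push $0x0 at c01c0f80)    */
    0xc031fef0, /* "up" for the first call   (push %ebx at c01c0f7a)     */
    0x00000005, /* offset for the first call (push $0x5 at c01c0f78)     */
};

/* Hypothetical index names, used only in this sketch. */
enum { SECOND_CALL_UP = 3, FIRST_CALL_UP = 5 };

/* Value of %ebx that was pushed as "up" for the first serial_in() call. */
uint32_t first_call_up(void)
{
    return stack_words[FIRST_CALL_UP];
}

/* Value of %ebx that was pushed as "up" for the second serial_in() call. */
uint32_t second_call_up(void)
{
    return stack_words[SECOND_CALL_UP];
}

/* Returns 1 if %ebx differed between the two pushes, 0 otherwise. */
int ebx_changed_between_calls(void)
{
    return first_call_up() != second_call_up();
}
```

Both pushes come from the same register with no write to %ebx between them in serial8250_startup(), so a result of 1 here means the register's value changed somewhere inside, or on return from, the first serial_in() call.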

Now let's look at serial_in:

c01bfa70: 55 push %ebp
c01bfa71: 89 e5 mov %esp,%ebp
c01bfa73: 53 push %ebx
...
c01bfab7: 5b pop %ebx
c01bfab8: 5d pop %ebp
c01bfab9: c3 ret

This code tells the CPU to preserve %ebx and %ebp. But we know %ebx
_wasn't_ preserved. Ergo, your CPU is plainly not doing what the code
told it to do.

Moreover, serial_in() has preserved %ebx in the past, otherwise we'd
never have got past all the other serial_in()s in serial8250_startup().

So I think it's very demonstrably a hardware fault, and not software
related.

For all we know, it could be a one-off fault on the hardware you
happen to have - other identical units may not behave the same (can
you check?)

If it is a one off case, you are welcome to patch that test out in
your kernel build to remove the problem, and if it's an isolated case
I encourage you to do this. This is one of the great advantages of
open source - if you hit such a problem rather than throwing the
hardware away you can work around such issues.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Serial related oops [ In reply to ]
Are you using an unpatched gcc 4.1.1? Its optimizer did nasty things
to us, at least on an ARM target ...
Re: Serial related oops [ In reply to ]
Russell King wrote:

>
> Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
> First thing to notice is this violates the C code - "up" can not
> change.
>
> Now let's look at serial_in:
>
> c01bfa70: 55 push %ebp
> c01bfa71: 89 e5 mov %esp,%ebp
> c01bfa73: 53 push %ebx
> ...
> c01bfab7: 5b pop %ebx
> c01bfab8: 5d pop %ebp
> c01bfab9: c3 ret
>
> This code tells the CPU to preserves %ebx and %ebp. But we know %ebx
> _wasn't_ preserved. Ergo, your CPU is plainly not doing what the code
> told it to do.
>

... assuming nothing else clobbered the stack slot (which would be a
compiler error, or a wild pointer.)
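
The caveat can be illustrated with a toy model in C. This is only a sketch under stated assumptions: `struct toy_cpu`, `toy_callee` and `ebx_after_call` are hypothetical names, the "stack" is a small array, and the stray write is simulated explicitly. It is not kernel code and makes no claim about what actually happened on the reporter's machine.

```c
#include <stdint.h>

/* Toy model of a descending stack: a few 32-bit slots plus one register. */
struct toy_cpu {
    uint32_t stack[8];
    int sp;          /* index of the current top-of-stack slot */
    uint32_t ebx;
};

static void toy_push(struct toy_cpu *c, uint32_t v)
{
    c->stack[--c->sp] = v;
}

static uint32_t toy_pop(struct toy_cpu *c)
{
    return c->stack[c->sp++];
}

/*
 * Models a callee that saves %ebx on entry and restores it on exit,
 * like the push %ebx / pop %ebx pair in serial_in.  If clobber is
 * non-zero, a stray write overwrites the save slot in between.
 */
static void toy_callee(struct toy_cpu *c, int clobber)
{
    toy_push(c, c->ebx);        /* push %ebx                         */
    c->ebx = 0x12345678;        /* callee uses %ebx for its own work */
    if (clobber)
        c->stack[c->sp] = 0;    /* stray write hits the saved slot   */
    c->ebx = toy_pop(c);        /* pop %ebx                          */
}

/* Returns the caller's %ebx as seen after the call returns. */
uint32_t ebx_after_call(uint32_t ebx_before, int clobber)
{
    struct toy_cpu c = { .sp = 8, .ebx = ebx_before };

    toy_callee(&c, clobber);
    return c.ebx;
}
```

With `clobber` set to 0 the caller gets its value back; with `clobber` set to 1 the push and pop both execute correctly and the caller still sees 0. That is the case the caveat describes: the instructions behave as written, but the saved copy was changed by something else.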

Got a disassembly of the whole function?

-hpa
Re: Serial related oops [ In reply to ]
On Wed, Feb 21, 2007 at 09:57:50PM -0800, H. Peter Anvin wrote:
> Russell King wrote:
>
> >Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
> >First thing to notice is this violates the C code - "up" can not
> >change.
> >Now let's look at serial_in:
> >c01bfa70: 55 push %ebp
> >c01bfa71: 89 e5 mov %esp,%ebp
> >c01bfa73: 53 push %ebx
> >...
> >c01bfab7: 5b pop %ebx
> >c01bfab8: 5d pop %ebp
> >c01bfab9: c3 ret
> >This code tells the CPU to preserves %ebx and %ebp. But we know %ebx
> >_wasn't_ preserved. Ergo, your CPU is plainly not doing what the code
> >told it to do.
>
> ... assuming nothing else clobbered the stack slot (which would be a compiler
> error, or a wild pointer.)
>
> Got a disassembly of the whole function?
>
Jose posted it higher in the thread:
http://lkml.org/lkml/2007/2/21/139

Regards,
Frederik
Re: Serial related oops [ In reply to ]
On Wed, Feb 21, 2007 at 09:57:50PM -0800, H. Peter Anvin wrote:
> Russell King wrote:
>
> >
> >Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
> >First thing to notice is this violates the C code - "up" can not
> >change.
> >
> >Now let's look at serial_in:
> >
> >c01bfa70: 55 push %ebp
> >c01bfa71: 89 e5 mov %esp,%ebp
> >c01bfa73: 53 push %ebx
> >...
> >c01bfab7: 5b pop %ebx
> >c01bfab8: 5d pop %ebp
> >c01bfab9: c3 ret
> >
> >This code tells the CPU to preserves %ebx and %ebp. But we know %ebx
> >_wasn't_ preserved. Ergo, your CPU is plainly not doing what the code
> >told it to do.
> >
>
> ... assuming nothing else clobbered the stack slot (which would be a
> compiler error, or a wild pointer.)
>
> Got a disassembly of the whole function?

See Jose's subsequent message to the one I replied to.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
Re: Serial related oops [ In reply to ]
On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:
> Are you using an unpatched gcc 4.1.1? Its optimizer did nasty things
> to us, at least on an ARM target ...

That's ruled out. Please think about it for a moment - serial_in()
managed to work correctly most of the time, and then spontaneously
changed its well-defined ABI behaviour in a way that analysis of the
asm doesn't allow.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
Re: Serial related oops [ In reply to ]
Russell King wrote:
> On Wed, Feb 21, 2007 at 02:13:15PM +0000, Jose Goncalves wrote:
>
>> <1>[18840.304048] Unable to handle kernel NULL pointer dereference at virtual address 00000012
>> <1>[18840.313046] printing eip:
>> <4>[18840.321687] c01bfa7a
>> <1>[18840.321714] *pde = 00000000
>> <0>[18840.331287] Oops: 0000 [#1]
>> <4>[18840.340687] Modules linked in:
>> <0>[18840.349749] CPU: 0
>> <4>[18840.349767] EIP: 0060:[<c01bfa7a>] Not tainted VLI
>> <4>[18840.349782] EFLAGS: 00010202 (2.6.16.41-mtm5-debug1 #1)
>> <0>[18840.377277] EIP is at serial_in+0xa/0x4a
>> <0>[18840.387221] eax: 00000060 ebx: 00000000 ecx: 00000000 edx: 00000000
>> <0>[18840.397805] esi: 00000000 edi: 00000040 ebp: c728fe1c esp: c728fe18
>> <0>[18840.408579] ds: 007b es: 007b ss: 0068
>> <0>[18840.419624] Process gp_position (pid: 11629, threadinfo=c728e000 task=c7443a90)
>> <0>[18840.420509] Stack: <0>00000000 00000000 c01c0f88 00000000 00000000 c031fef0 00000005 00000202
>> <0>[18840.445655] c7161a1c c031fef0 c124b510 c728fe60 c01bd97d c031fef0 c124b510 c124b510
>> <0>[18840.460540] 00000000 c773dbcc c728fe7c c01befe7 c124b510 00000000 ffffffed c773dbcc
>>
>
> Okay, this one is even more plainly "not a coding error".
>
>
>> <0>[18840.566645] [<c01c0f88>] serial8250_startup+0x28f/0x2a9
>>
>
> The code around this point (with the return point marked) is:
>
>
>> c01c0f78: 6a 05 push $0x5
>> c01c0f7a: 53 push %ebx
>> c01c0f7b: e8 f0 ea ff ff call c01bfa70 <serial_in>
>> c01c0f80: 6a 00 push $0x0
>> c01c0f82: 53 push %ebx
>> c01c0f83: e8 e8 ea ff ff call c01bfa70 <serial_in>
>> c01c0f88<<< 6a 02 push $0x2
>> c01c0f8a: 53 push %ebx
>> c01c0f8b: e8 e0 ea ff ff call c01bfa70 <serial_in>
>>
>
> and corresponds with this C code:
>
> (void) serial_inp(up, UART_LSR);
> (void) serial_inp(up, UART_RX);
> (void) serial_inp(up, UART_IIR);
>
> Now let's look at the words pushed on the stack around this code:
>
> 00000000
> 00000000
> c01c0f88 <- return address for serial_in (serial8250_startup+0x28f/0x2a9)
> 00000000 <- from push %ebx at c01c0f82
> 00000000 <- from push $0x0 at c01c0f80
> c031fef0 <- from push %ebx at c01c0f7a
> 00000005 <- from push %0x5 at c01c0f78
>
> Plainly, %ebx changed across the call to serial_in() at c01c0f7b.
> First thing to notice is this violates the C code - "up" can not
> change.
>
> Now let's look at serial_in:
>
> c01bfa70: 55 push %ebp
> c01bfa71: 89 e5 mov %esp,%ebp
> c01bfa73: 53 push %ebx
> ...
> c01bfab7: 5b pop %ebx
> c01bfab8: 5d pop %ebp
> c01bfab9: c3 ret
>
> This code tells the CPU to preserves %ebx and %ebp. But we know %ebx
> _wasn't_ preserved. Ergo, your CPU is plainly not doing what the code
> told it to do.
>
> Moreover, serial_in() has preserved %ebx in the past otherwise we'd
> never got past all the other serial_in()s in serial8250_startup().
>
> So I think it's very demonstrably a hardware fault, and not software
> related.
>

It could be a silly question (bear with me, as I'm not familiar with
such low-level programming), but couldn't an interrupt hit in the
middle of the serial_in() calls and mess with %ebx?

What I find really hard to understand is why a hardware fault always
happens at the same instruction! I would expect a hardware fault
to hit randomly...

I left my application running overnight with a 2.6.16.41 kernel whose
serial driver was unpatched (my last Oops report was with Frederik's
patch to remove the insertion made in 2.6.12), and it crashed again at
exactly the same point!

> For all we know, it could be a one-off fault on the hardware you
> happen to have - other identical units may not behave the same (can
> you check?)
>

Yes, I have other units that I can test on. I'll do that to see if it's
really a one-off fault on the hardware.
If it continues to crash with other units, I will then test with the
msleep(10) before the "And clear the interrupt registers again for
luck." comment, as you suggested earlier.

> If it is a one off case, you are welcome to patch that test out in
> your kernel build to remove the problem, and if it's an isolated case
> I encourage you to do this. This is one of the great advantages of
> open source - if you hit such a problem rather than throwing the
> hardware away you can work around such issues.
>

I didn't understand what you mean by "you are welcome to patch that test
out in your kernel build to remove the problem". Which test are you
talking about?

Regards,
José Gonçalves

Re: Serial related oops [ In reply to ]
Russell King wrote:
> On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:
>
>> Are you using an unpatched gcc 4.1.1? Its optimizer did nasty things
>> to us, at least on an ARM target ...
>>
>
> That's ruled out. Please think about it for a moment - serial_in()
> managed to work correctly most of the time, and then spontaneously
> changes its well-defined ABI behaviour in a way that analysis of the
> asm doesn't allow it to.
>

I'm using gcc 3.4.6.
But I agree with Russell, if it was such a problem it would hit on the
first iteration of my application and not after 1 day of executing the
same piece of code...

Regards,
José Gonçalves

Re: Serial related oops [ In reply to ]
On Thu, Feb 22, 2007 at 03:07:18PM +0000, Jose Goncalves wrote:
> Russell King wrote:
> > On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:
> >
> >> Are you using an unpatched gcc 4.1.1? Its optimizer did nasty things
> >> to us, at least on an ARM target ...
> >>
> >
> > That's ruled out. Please think about it for a moment - serial_in()
> > managed to work correctly most of the time, and then spontaneously
> > changes its well-defined ABI behaviour in a way that analysis of the
> > asm doesn't allow it to.
> >
>
> I'm using gcc 3.4.6.
> But I agree with Russell, if it was such a problem it would hit on the
> first iteration of my application and not after 1 day of executing the
> same piece of code...

One thing you might think about is running memtest86 on the machine
for the same kind of time interval, just in case it's something trivial
like bad ram.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
Re: Serial related oops [ In reply to ]
On Thu, Feb 22, 2007 at 03:02:46PM +0000, Jose Goncalves wrote:
> It could be a silly question (tamper with me as I'm not familiar with
> such low level programming), but couldn't it be possible for a interrupt
> to hit in the middle of the serial_in() calls and mess with %ebx?

I'm no expert on x86, but if an interrupt was messing with %ebx, you'd
have random crashes everywhere - userspace, kernel space - in unpredictable
ways.

> What I find real hard to understand is why a hardware fault happens
> always in the same software instruction! I would expect a hardware fault
> to hit randomly...

Well, compared with your previous report, your latest report is different.
Your first report had both EIP and %ebx being zero (because they got
corrupted when returning from serial_in). This time only %ebx was
corrupted.

Consequently, this time we oopsed in the subsequent serial_in() rather
than trying to return to serial8250_startup() as last time.

> I left my application running this night, with a 2.6.16.41 kernel
> unpatched on the serial driver (my last Oops report was with Frederik
> patch to remove the insertion made in 2.6.12) and it crashed again on
> exactly the same point!

From that I take it that you removed the test in serial8250_startup which
sets UART_BUG_TXEN, and the problem persisted. That tends to suggest
that it's not the culprit.

> > For all we know, it could be a one-off fault on the hardware you
> > happen to have - other identical units may not behave the same (can
> > you check?)
>
> Yes I have other units that I can test it. I'll do that to see if it's
> really a one-off fault on the hardware.

Would be nice to know.

> If it continues to crash with other units I will then test with the
> msleep(10) before the "And clear the interrupt registers again for
> luck.", as you suggested earlier.
>
> > If it is a one off case, you are welcome to patch that test out in
> > your kernel build to remove the problem, and if it's an isolated case
> > I encourage you to do this. This is one of the great advantages of
> > open source - if you hit such a problem rather than throwing the
> > hardware away you can work around such issues.
>
> I didn't understand what you mean by "you are welcome to patch that test
> out in your kernel build to remove the problem". Which test are you
> talking about?

The one which sets UART_BUG_TXEN.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
Re: Serial related oops [ In reply to ]
Quoting Russell King <rmk+lkml@arm.linux.org.uk>:

> On Thu, Feb 22, 2007 at 03:02:46PM +0000, Jose Goncalves wrote:
>> It could be a silly question (tamper with me as I'm not familiar with
>> such low level programming), but couldn't it be possible for a interrupt
>> to hit in the middle of the serial_in() calls and mess with %ebx?
>
> I'm no expert on x86, but if an interrupt was messing with %ebx, you'd
> have random crashes verywhere - userspace, kernel space in unpredicatable
> ways.
>
>> What I find real hard to understand is why a hardware fault happens
>> always in the same software instruction! I would expect a hardware fault
>> to hit randomly...
>
> Well, compared with your previous report, your latest report is different.
> Your first report had both EIP and %ebx being zero (because they got
> corrupted when returning from serial_in). This time only %ebx was
> corrupted.
>
> Consequently, this time we oopsed in the subsequent serial_in() rather
> than trying to return to serial8250_startup() as last time.

But there was also another difference. I CONFIGed the kernel to produce
more debug info. This should influence the Oops report...

>
>> I left my application running this night, with a 2.6.16.41 kernel
>> unpatched on the serial driver (my last Oops report was with Frederik
>> patch to remove the insertion made in 2.6.12) and it crashed again on
>> exactly the same point!
>
> From that I take it that you removed the test in serial8250_startup which
> sets UART_BUG_TXEN, and the problem persisted. That tends to suggest
> that it's not the culpret.

By that I mean that with or without this code -
http://lkml.org/lkml/2007/2/19/124 - the problem persisted. The
difference is that, without it, the crashes happen more sparsely.

José Gonçalves


Re: Serial related oops [ In reply to ]
Quoting Russell King <rmk+lkml@arm.linux.org.uk>:

> On Thu, Feb 22, 2007 at 03:07:18PM +0000, Jose Goncalves wrote:
>> Russell King wrote:
>> > On Wed, Feb 21, 2007 at 04:34:15PM -0800, Michael K. Edwards wrote:
>> >
>> >> Are you using an unpatched gcc 4.1.1? Its optimizer did nasty things
>> >> to us, at least on an ARM target ...
>> >>
>> >
>> > That's ruled out. Please think about it for a moment - serial_in()
>> > managed to work correctly most of the time, and then spontaneously
>> > changes its well-defined ABI behaviour in a way that analysis of the
>> > asm doesn't allow it to.
>> >
>>
>> I'm using gcc 3.4.6.
>> But I agree with Russell, if it was such a problem it would hit on the
>> first iteration of my application and not after 1 day of executing the
>> same piece of code...
>
> One thing you might think about is running memtest86 on the machine
> for the same kind of time interval, just in case it's something trivial
> like bad ram.
>

OK. That's another thing to do.

Meanwhile, I've switched to another SBC and I'm now running my
application on the new unit. Let's wait and see...

José Gonçalves


Re: Serial related oops [ In reply to ]
On Thu, Feb 22, 2007 at 03:02:46PM +0000, Jose Goncalves wrote:
> What I find real hard to understand is why a hardware fault happens
> always in the same software instruction! I would expect a hardware fault
> to hit randomly...

I've experienced just such a hardware fault.

The Infineon DSCC4 serial controller has a hardware bug
in the PCI request/grant handling that can lead to the
device driving the PCI bus in conflict with another device.

While the results were random (as the oops in this problem
seem to be), the trigger was always activating certain
devices in combination.

In your case, altering the timing/behavior of the serial
device during open may be provoking the hardware fault.

--
Paul Fulghum
Microgate Systems, Ltd.
Re: Serial related oops [ In reply to ]
Russell, thanks again for offering to look at this; the more oopses
and soft lockups I see on this board, the more I think you're right
and we have an IRQ handling race.

Here's the struct irqchip setup:

/* mask irq, refer to section 2.6 of the chip 8618 document */
static void mv88w8xx8_mask_irq(unsigned int irq)
{
	MV88W8XX8_REG_WRITE(MV88W8XX8_INT_ENABLE_CLR, (1 << irq));
}

/* unmask irq, refer to section 2.6 of the chip 8618 document */
static void mv88w8xx8_unmask_irq(unsigned int irq)
{
	MV88W8XX8_REG_WRITE(MV88W8XX8_INT_ENABLE_SET, (1 << irq));
}

/* ack to CPU interrupts and also individual timer interrupts */
static void mv88w8xx8_mask_ack_irq(unsigned int irq)
{
	mv88w8xx8_mask_irq(irq);

	/* only the timer interrupts need an explicit source clear */
	if (irq < IRQ_TIMER1 || irq > IRQ_TIMER4)
		return;

	/* write 0 to clear interrupt and re-enable further interrupts */
	MV88W8XX8_REG_WRITE(MV88W8XX8_TIMER_INT_SOURCE, ~(1 << (irq - 4)));
}

static struct irqchip mv88w8xx8_chip = {
	.ack	= mv88w8xx8_mask_ack_irq,
	.mask	= mv88w8xx8_mask_irq,
	.unmask	= mv88w8xx8_unmask_irq,
};

/**
 * called by core.c to initialize the IRQ module
 */
void mv88w8xx8_init_irq(void)
{
	int irq;

	for (irq = 0; irq < NR_IRQS; irq++) {
		set_irq_chip(irq, &mv88w8xx8_chip);
		set_irq_handler(irq, do_level_IRQ);
		set_irq_flags(irq, IRQF_VALID | IRQF_PROBE);
	}
}
Re: Serial related oops [ In reply to ]
Hi again Russell,

I'm back, after some more testing. Here goes my report.

I've switched to another SBC and the kernel still oopses, so it is not a
one-off fault on the hardware.

I've also run memtest86+ on this board for the longest period it has taken
to hit an Oops with my application (24 hours), and it did not detect any
fault (in 21 passes).

As I've said earlier, our hardware has an extra serial controller
(TL16C554A). To isolate the problem, I've removed the board with this
extra controller and used only the SBC (Vortex86-6070 -
http://www.icop.com.tw/products_detail.asp?ProductID=70). Still, with
that setup and with my application using only ttyS1, I get a kernel Oops,
and always at the same point:

<1>[43477.986867] Unable to handle kernel NULL pointer dereference at virtual address 00000012
<1>[43477.995067] printing eip:
<4>[43478.003087] c01bfa7a
<1>[43478.003116] *pde = 00000000
<0>[43478.011231] Oops: 0000 [#1]
<4>[43478.019188] Modules linked in:
<0>[43478.027308] CPU: 0
<4>[43478.027325] EIP: 0060:[<c01bfa7a>] Not tainted VLI
<4>[43478.027341] EFLAGS: 00010202 (2.6.16.41-mtm6-debug1 #1)
<0>[43478.052490] EIP is at serial_in+0xa/0x4a
<0>[43478.061448] eax: 00000060 ebx: 00000000 ecx: 00000000 edx: 00000000
<0>[43478.070945] esi: 00000000 edi: 00000040 ebp: c7237e1c esp: c7237e18
<0>[43478.080720] ds: 007b es: 007b ss: 0068
<0>[43478.090470] Process gp_position (pid: 26205, threadinfo=c7236000 task=c775dab0)
<0>[43478.091319] Stack: <0>00000000 00000000 c01c0f88 00000000 00000000 c031fef0 00000005 00000202
<0>[43478.113464] c717fa1c c031fef0 c124b510 c7237e60 c01bd97d c031fef0 c124b510 c124b510
<0>[43478.126484] 00000000 c760c52c c7237e7c c01befe7 c124b510 00000000 ffffffed c760c52c
<0>[43478.139984] Call Trace:
<0>[43478.152627] [<c0102a35>] show_stack_log_lvl+0xa5/0xad
<0>[43478.166200] [<c0102b70>] show_registers+0x106/0x16f
<0>[43478.179852] [<c0102d06>] die+0xb6/0x127
<0>[43478.193589] [<c0109677>] do_page_fault+0x380/0x4b3
<0>[43478.207616] [<c01026bf>] error_code+0x4f/0x60
<0>[43478.221803] [<c01c0f88>] serial8250_startup+0x28f/0x2a9
<0>[43478.236340] Code: 38 43 78 75 02 b2 01 89 d0 eb 10 8b 41 70 39 43 70 0f 94 c0 0f b6 c0 eb 02 31 c0 5b 5d c3 90 90 90 55 89 e5 53 8b 5d 08 8b 55 0c <0f> b6 4b 12 0f b6 43 13 d3 e2 83 f8 02 74 1a 7f 05 48 74 09 eb
<4>[43478.322255] BUG: gp_position/26205, lock held at task exit time!
<4>[43478.341721] [c124b528] {uart_register_driver}
<4>[43478.359168] .. held by: gp_position:26205 [c775dab0, 117]
<4>[43478.377112] ... acquired at: uart_get+0x28/0xde
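
For readers following the dump: the faulting address 00000012 together with ebx: 00000000 is what a byte load at offset 0x12 from a NULL pointer would produce. The bytes marked in the Code: line, `<0f> b6 4b 12`, appear to decode as `movzbl 0x12(%ebx),%ecx`, and the next four as `movzbl 0x13(%ebx),%eax`. The sketch below shows only that arithmetic. `struct toy_port` is a hypothetical stand-in, not the kernel's port structure, and the field names `regshift` and `iotype` are assumptions about what lives at those offsets.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the structure that "up" points to.
 * Only the two byte offsets (0x12 and 0x13) come from the oops;
 * the field names are assumptions made for this illustration.
 */
struct toy_port {
    uint8_t pad[0x12];
    uint8_t regshift;   /* byte read by: movzbl 0x12(%ebx),%ecx */
    uint8_t iotype;     /* byte read by: movzbl 0x13(%ebx),%eax */
};

/* Address the first load touches for a given value of %ebx ("up"). */
uintptr_t fault_address(uintptr_t ebx)
{
    return ebx + offsetof(struct toy_port, regshift);
}
```

With a NULL `up`, `fault_address(0)` is 0x12, matching the "virtual address 00000012" line. With the `up` value seen on the stack for the first call (c031fef0), the same load would have touched c031ff02.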

I've also followed your suggestion and inserted "msleep(10);" just
before the "And clear the interrupt registers again for luck." comment,
and my application has now been running without problems for more than
24 hours! So, inserting a delay at this point definitely makes some
difference (as did adding some extra printk() calls at several points in
serial8250_startup()).

That said, for me, this is definitely a software problem. The question
is where?
I would appreciate it if you (or anyone) could give me any pointers on how
to find the cause of my kernel Oops (perhaps by enabling extra kernel
debugging?)

Thanks,
José Gonçalves


Re: Serial related oops [ In reply to ]
On Thu, Mar 01, 2007 at 01:33:28PM +0000, Jose Goncalves wrote:
> I've also done your suggestion and I've inserted "msleep(10);" just
> before the "And clear the interrupt registers again for luck." and my
> application is now running without problems fore more than 24H! So,
> inserting a delay in this point definitely makes some difference (has
> was with adding some extra printk() in several points of
> serial8250_startup()).
>
> This said, for me, this is definitely a software problem. The question
> is were?

I'm personally convinced it's hardware because, according to my analysis,
your CPU is behaving in a way that the code is not asking it to.

Maybe others have some further insight; I certainly don't.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of:
Re: Serial related oops [ In reply to ]
Russell King wrote:
> On Thu, Mar 01, 2007 at 01:33:28PM +0000, Jose Goncalves wrote:
>
>> I've also done your suggestion and I've inserted "msleep(10);" just
>> before the "And clear the interrupt registers again for luck." and my
>> application is now running without problems fore more than 24H! So,
>> inserting a delay in this point definitely makes some difference (has
>> was with adding some extra printk() in several points of
>> serial8250_startup()).
>>
>> This said, for me, this is definitely a software problem. The question
>> is were?
>>
>
> I'm personally convinced it's hardware because according to my analysis
> your CPU behaving in a way that the code is not asking it to do so.
>

Isn't it possible that an interrupt is hitting just after interrupts are
enabled with "serial_outp(up, UART_IER, up->ier);", triggering the
execution of some code that is not reported by the Oops dump (at least
with my current configuration)?

José Gonçalves
