Mailing List Archive: Lockups

Lockups - lost interrupt

Sep 10, 1999, 10:50 AM

Post #1 of 35

I've been having some lockups on numerous versions of the 2.2 series
(10,11,12) in both SMP and non-SMP mode.
Usually the machine is just locked up hard (alt-sys-req does not work).
This last time, though, some SCSI timeouts, IDE "lost interrupt" messages,
and Ethernet timeouts were showing up on the screen. I still could not do
anything with alt-sys-req.
When an IDE timeout occurred, the IDE drive showed some activity followed by
the "hda lost interrupt" message.
It looks like the system was losing ALL interrupts, which would also explain
why the keyboard wasn't working.
Does anybody have any idea what could be causing this? Is this a bad
motherboard? CPU?
________________________________________
Michael D. Black Principal Engineer
mblack@csi.cc 407-676-2923,x203
http://www.csi.cc Computer Science Innovations
http://www.csi.cc/~mike My home page
FAX 407-676-2355
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/

Re: Lockups - lost interrupt

whampton at staffnet

Sep 10, 1999, 12:22 PM

Post #2 of 35

Mike Black wrote:
>
> I've been having some lockups on numerous versions of the 2.2 series
> (10,11,12) in both SMP and non-SMP mode.
I too have been having lockups and have reported them here, to Penguin,
and on the redhat-list.
The machines are two Penguin dual PIIIs (network- or load-related lockups)
and a Dell WS400 dual PII/300 (locks up when sound is active and the floppy is then accessed).
>
> Usually the machine is just locked up hard (alt-sys-req does not work).
Same. The Dell has the kdb code (from SGI) and I can't access that
either.
>
> This last time though, some scsi timeouts, ide "lost interrupt" messages,
> and ethernet timeouts were showing up on the screen. I still could not do
> anything with alt-sys-req.
I saw some unexpected interrupts from the floppy drive, but it had
not locked up at that time -- it locked up later....
>
> When an IDE timeout occurred the IDE drive showed some activity followed by
> the "hda lost interrupt" message.
>
> It looks like the system was losing ALL interrupts which would also explain
> why the keyboard wasn't working.
>
> Does anybody have any idea what could be causing this? Is this a bad
> motherboard? CPU?
Since I have 3 systems doing it, and others have also reported similar problems,
I think it could be kernel related....
--
W. Wade, Hampton <whampton@staffnet.com>

Re: Lockups - lost interrupt

alan at lxorguk

Sep 10, 1999, 1:53 PM

Post #3 of 35

> > Does anybody have any idea what could be causing this? Is this a bad
> > motherboard? CPU?
> Since I have 3 systems doing it, others have also reported similar,
> I think it could be kernel related....
I'm still digging into this. I do have some ideas about what may be involved. There
are about three different cases here:
1. VIA chipset bug - known, understood, non-SMP
2. A few Triton boards - probably a hardware issue
3. SMP - looks like a lock bug.
Running the ikd patch is the best help here. I think it will show you a
spinlock deadlock. The trace from that should find the guilty party.
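Alan's suggestion relies on a simple idea: a CPU caught in a spinlock deadlock spins forever, so a debugging kernel can count the spins, complain once a threshold is passed, and report who holds the lock. The following is a user-space C11 model of that idea, not IKD's code; `debug_spinlock_t`, `debug_spin_lock()` and the threshold parameter are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical debug spinlock: remembers who took it, for the report. */
typedef struct {
    atomic_flag locked;
    _Atomic(const char *) owner;   /* last acquirer, diagnostics only */
} debug_spinlock_t;

#define DEBUG_SPINLOCK_INIT { ATOMIC_FLAG_INIT, NULL }

/* Spin at most `threshold` times. On success record the owner and return
 * true; otherwise print who holds the lock (the "guilty party") and give up
 * instead of hanging the machine. */
static bool debug_spin_lock(debug_spinlock_t *lock, const char *who,
                            unsigned long threshold)
{
    assert(threshold > 0);
    for (unsigned long spins = 0; spins < threshold; spins++) {
        if (!atomic_flag_test_and_set_explicit(&lock->locked,
                                               memory_order_acquire)) {
            atomic_store_explicit(&lock->owner, who, memory_order_relaxed);
            return true;
        }
    }
    const char *holder = atomic_load_explicit(&lock->owner,
                                              memory_order_relaxed);
    fprintf(stderr, "possible spinlock deadlock: %s spun %lu times, held by %s\n",
            who, threshold, holder ? holder : "(unknown)");
    return false;
}

static void debug_spin_unlock(debug_spinlock_t *lock)
{
    atomic_store_explicit(&lock->owner, NULL, memory_order_relaxed);
    atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}
```

Taking the same lock twice without releasing it (for example, an interrupt handler grabbing a lock that the code it interrupted on the same CPU already holds) makes the second call fail after the threshold rather than spin forever.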

Re: Lockups - lost interrupt

andre at suse

Sep 11, 1999, 12:21 AM

Post #4 of 35

Alan, I know that part of the lost-interrupt problem is directly related to
drive timings. There is a spinlock somewhere to be caught, but I cannot
find the beastie. Lob me that "ikd patch", please.
I can forcibly wreck the timing values and invoke the error.
Also, some of the quick drives must be forcibly slowed down on slower
chipsets. There are also a few chipsets that get really nasty because of
their extremely tight tolerances (well -- almost zero margin).
Lastly, there are several drives that claim ATA-66 but time out, and there
is nothing to date to catch this and downgrade the transfer rate.
With the huge cache buffers and dual-processor drives (yes, there is a
vendor out there throttling its drives with two onboard processors), the
fun is only beginning.
To give an idea of the variables:
Leading or lagging interrupts from the chipset.
Variations based on the size limit imposed on the chipset FIFO.
Command queuing in the drives is on this side of the horizon and moving
fast, i.e. SCSI workhorse power is coming to IDE.
There is more, but it is late and I am now only one kernel behind the
curve in catching up on patching.
OT: do you want the very latest back-port code for 2.2.13 that I have to finish
creating for 2.3.18/9?
Andre Hedrick
The Linux IDE guy
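Andre notes that nothing yet catches a drive that claims ATA-66 but times out and steps it down to a slower transfer rate. A minimal sketch of such a fallback policy follows, assuming a hypothetical `on_lost_interrupt()` hook called on each timeout; it only tracks the chosen mode, while a real driver would also have to reprogram the drive and the chipset timings.

```c
#include <assert.h>
#include <stdio.h>

/* Transfer modes, fastest first; nominal ATA rates in MB/s. */
enum xfer_mode { XFER_UDMA4, XFER_UDMA2, XFER_MWDMA2, XFER_PIO4, XFER_PIO0 };

static const char *const mode_name[] = {
    "UDMA4 (66)", "UDMA2 (33)", "MWDMA2 (16.6)", "PIO4 (16.6)", "PIO0 (3.3)"
};

struct drive_state {
    enum xfer_mode mode;   /* mode currently in use */
    int timeouts;          /* lost interrupts seen at this mode */
};

/* Hypothetical hook: after max_timeouts lost interrupts at one mode,
 * fall back to the next slower mode. Returns the mode now in use. */
static enum xfer_mode on_lost_interrupt(struct drive_state *d, int max_timeouts)
{
    assert(max_timeouts > 0);
    if (++d->timeouts >= max_timeouts && d->mode != XFER_PIO0) {
        d->mode = (enum xfer_mode)(d->mode + 1);   /* next enumerator is slower */
        d->timeouts = 0;
        printf("hda: lost interrupt, dropping to %s\n", mode_name[d->mode]);
    }
    return d->mode;
}
```

Requiring more than one timeout before stepping down is a design choice to avoid penalizing a drive for a single glitch; the threshold is an assumption.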
On Fri, 10 Sep 1999, Alan Cox wrote:
> > > Does anybody have any idea what could be causing this? Is this a bad
> > > motherboard? CPU?
> > Since I have 3 systems doing it, others have also reported similar,
> > I think it could be kernel related....
>
> I'm still digging into this. I do have some ideas what may be involved. There
> are about three different cases here
>
> 1. VIA chipset bug - known, understood, non SMP
> 2. A few triton boards - probably a hardware issue
> 3. SMP - looks like a lock bug.
>
> Running the ikd patch is the best help here. I think it will show you a
> spinlock deadlock. The trace from that should find the guilty party

Re: Lockups - lost interrupt

mblack at csihq

Sep 11, 1999, 4:36 PM

Post #5 of 35

Running 2.2.12 with raid0145-19990824-2.2.11 and 2.2.12-ikd1.
I had one RAID5 array resyncing and got an oops... fortunately I still had a
console open and could get the oops from dmesg. I see the restore_flags but
no save_flags call... could this be the problem? The preprocessed source
for raid5.c looks like this:
static void raid5_kfree_bh(struct stripe_head *sh, struct buffer_head *bh)
{
    unsigned long flags;
    (( flags )=__global_save_flags()) ;
    __global_cli() ;
    put_free_bh(sh, bh);
    __global_restore_flags( flags ) ;
}
Unable to handle kernel NULL pointer dereference at virtual address 00000000
current->tss.cr3 = 00101000, %cr3 = 00101000
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c0147cf7>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010006
eax: 00000001 ebx: cff2e000 ecx: 00000004 edx: c0284954
esi: cfb22600 edi: 00000001 ebp: cff2fe3c esp: cff2fe1c
ds: 0018 es: 0018 ss: 0018
Process raid5d (pid: 9, process nr: 10, stackpage=cff2f000)
Stack: cff2fe3c c023d8b1 cfb0fc00 cffa0800 cff2fe4c c0119fcb cfb0fc00 00000000
       cff2fe4c c010d574 cfb0fc00 cffd8078 cff2fe8c c01e6fc4 00000001 cfb22600
       cff2fe8c c01e6fbb cfb22600 cfb0fc00 cff2fe8c c01e6fb1 c31d6440 00000004
Call Trace: [<c023d8b1>] [<c0119fcb>] [<c010d574>] [<c01e6fc4>] [<c01e6fbb>] [<c01e6fb1>] [<c01e8306>]
       [<c01e82f1>] [<c01e8247>] [<c021c555>] [<c01e8568>] [<c01e8456>] [<c0109a39>] [<c0119ddf>] [<c01e92cc>]
       [<c01e93a2>] [<c01e92e3>] [<c01cd5e2>] [<c01cd5cc>] [<c01e0018>] [<c0109669>]
Code: c6 05 00 00 00 00 00 8b 5d e8 89 ec 5d c3 8d 76 00 55 89 e5
>>EIP; c0147cf7 <mcount+20b/21c> <=====
Trace; c023d8b1 <__udelay+49/50>
Trace; c0119fcb <__wake_up+13/7c>
Trace; c010d574 <__global_restore_flags+10/58>
Trace; c01e6fc4 <raid5_kfree_bh+38/44>
Trace; c01e6fbb <raid5_kfree_bh+2f/44>
Trace; c01e6fb1 <raid5_kfree_bh+25/44>
Trace; c01e8306 <complete_stripe+d2/19c>
Trace; c01e82f1 <complete_stripe+bd/19c>
Trace; c01e8247 <complete_stripe+13/19c>
Trace; c021c555 <requeue_sd_request+b4d/b5c>
Trace; c01e8568 <handle_stripe+128/c34>
Trace; c01e8456 <handle_stripe+16/c34>
Trace; c0109a39 <__switch_to+11/d0>
Trace; c0119ddf <schedule+1ff/3d8>
Trace; c01e92cc <unplug_devices+10/14>
Trace; c01e93a2 <raid5d+d2/12c>
Trace; c01e92e3 <raid5d+13/12c>
Trace; c01cd5e2 <md_thread+fa/1e0>
Trace; c01cd5cc <md_thread+e4/1e0>
Trace; c01e0018 <do_rw_disk+290/2a0>
Trace; c0109669 <kernel_thread+35/48>
Code; c0147cf7 <mcount+20b/21c>
00000000 <_EIP>:
Code; c0147cf7 <mcount+20b/21c> <=====
0: c6 05 00 00 00 00 00 movb $0x0,0x0 <=====
Code; c0147cfe <mcount+212/21c>
7: 8b 5d e8 movl 0xffffffe8(%ebp),%ebx
Code; c0147d01 <mcount+215/21c>
a: 89 ec movl %ebp,%esp
Code; c0147d03 <mcount+217/21c>
c: 5d popl %ebp
Code; c0147d04 <mcount+218/21c>
d: c3 ret
Code; c0147d05 <mcount+219/21c>
e: 8d 76 00 leal 0x0(%esi),%esi
Code; c0147d08 <mcount_internal+0/1ea>
11: 55 pushl %ebp
Code; c0147d09 <mcount_internal+1/1ea>
12: 89 e5 movl %esp,%ebp
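On the save_flags question: the save is present in the preprocessed source, as the `(( flags )=__global_save_flags())` assignment, which is what save_flags() expands to in this SMP build. The pattern pairs a save with a restore, rather than using cli()/sti(), so that the function hands back exactly the interrupt state its caller had. The model below demonstrates this in plain C with a simulated interrupt flag; the `sim_*` functions and `free_bh_locked()` are hypothetical stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated interrupt-enable flag; user space cannot touch the real IF bit. */
static bool irq_enabled = true;

static unsigned long sim_save_flags(void)          { return irq_enabled; }
static void sim_cli(void)                          { irq_enabled = false; }
static void sim_restore_flags(unsigned long flags) { irq_enabled = (flags != 0); }

static int free_list_len;   /* stands in for the stripe's free buffer list */

/* Same shape as raid5_kfree_bh(): save, disable, touch shared data, restore. */
static void free_bh_locked(void)
{
    unsigned long flags = sim_save_flags();
    sim_cli();
    free_list_len++;            /* put_free_bh() would go here */
    sim_restore_flags(flags);   /* back to the caller's state, not blindly "on" */
}
```

Had the function used cli()/sti() instead, calling it with interrupts already disabled would re-enable them behind the caller's back.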

Re: Lockups - lost interrupt

andrea at suse

Sep 11, 1999, 5:34 PM

Post #6 of 35

The Oops you saw is forced by the IKD debugging code and is not a bug
(it's just there to get a stack trace). But there must also be one other line
before the Oops giving the reason for it. You didn't quote it.
You probably also want to disable everything in the IKD patch except the
NMI watchdog; if the NMI watchdog isn't helpful, you can then use the other
features of the IKD patch. Actually, I am not 100% sure that the other
debugging code is completely reliable, as I haven't used it for some time.
I'll have a look...
Andrea

Re: Lockups - lost interrupt

whampton at staffnet

Sep 11, 1999, 5:40 PM

Post #7 of 35

Mike Black wrote:
>
> I'm installing the ikd patch on 2.2.12 with Mingo's 2.2.12 raid patch. With
> some luck, it shouldn't take long to find the problem. I've got one script
> that takes about 10 minutes to lockup the machine (locks up in either UP or
> SMP mode).
Dumb question -- where can I get the ikd patch? I could put it on the
Dell (the most repeatable crash) and see if I could find the problem.
One of the Penguins (dual PIII) is running under load over the weekend
on 2.2.11 (yes, I went backwards, but I did not have any crashes with
2.2.11). I'll let you know how well it does Monday AM.
>
> There
> are about three different cases here
>
> 1. VIA chipset bug - known, understood, non SMP
> 2. A few triton boards - probably a hardware issue
> 3. SMP - looks like a lock bug.
>
> Running the ikd patch is the best help here. I think it will show you a
> spinlock deadlock. The trace from that should find the guilty party
My guess is 3 - a lock bug, based on the fact that the Dell locks up
with the floppy and the others with NFS/network -- it could be in
multiple places (same code copied, maybe)?
I've been at the kids' soccer, piano, etc. today and won't be able to do
any tests until Monday -- the computers that are having problems are
at work. If you have any luck, please let me know!
Cheers,
--
W. Wade, Hampton <whampton@staffnet.com>

Re: Lockups - lost interrupt

andrea at suse

Sep 11, 1999, 5:42 PM

Post #8 of 35

On Sun, 12 Sep 1999, Andrea Arcangeli wrote:
>completly reliable, as I don't use it since some time ago. I'll have a
Did you enable the SOFTLOCKUP feature? If so, disable it and enable only
the NMI watchdog. ;)
Andrea

Re: Lockups - lost interrupt

andrea at suse

Sep 11, 1999, 6:08 PM

Post #9 of 35

On Sat, 11 Sep 1999, Wade Hampton wrote:
>Dumb question -- where can I get the ikd patch? I could put it on the
ftp://ftp.suse.com/pub/people/andrea/kernel-patches/ikd/2.2.12-ikd1.gz
Right now enable _only_ the NMI oopser and wait for a stack trace.
Andrea

Re: Lockups - lost interrupt

alan at lxorguk

Sep 11, 1999, 6:15 PM

Post #10 of 35

> My guess is 3 - a lock bug. Based on the fact that the Dell locks up
> with the floppy and the others with NFS/network -- it could be in
> multiple places (same code copied maybe)?
No. The only locks in the floppy code relating to sound are the ISA DMA
locks. Those wouldn't affect NFS/Ethernet cards.

Re: Lockups - lost interrupt

mblack at csihq

Sep 11, 1999, 7:15 PM

Post #11 of 35

OK... I didn't think this one line mattered:
Sep 11 11:54:55 medusa kernel: Deadlock threshold exceeded, forcing Oops.
Sep 11 11:54:55 medusa kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
Sep 11 11:54:55 medusa kernel: current->tss.cr3 = 00101000, %cr3 = 00101000
Sep 11 11:54:55 medusa kernel: *pde = 00000000
Sep 11 11:54:55 medusa kernel: Oops: 0002
----- Original Message -----
From: Andrea Arcangeli <andrea@suse.de>
To: Mike Black <mblack@csihq.com>
Cc: Andre Hedrick <andre@suse.com>; Alan Cox <alan@lxorguk.ukuu.org.uk>;
Wade Hampton <whampton@staffnet.com>; <linux-kernel@vger.rutgers.edu>
Sent: Saturday, September 11, 1999 8:34 PM
Subject: Re: Lockups - lost interrupt
The Oops you seen is forced by the IKD debugging code and it's not a bug
(it's just to get a stack trace). But you must have also one other line
before the Oops to tell us the reason of the Oops. You didn't quoted it.
Also probably you want to disable everything in the IKD patch except the
NMI watchdog, if the NMI won't be helpful then you can use other things in
the IKD patch. Actually I am not 100% sure if the other debugging code is
completly reliable, as I don't use it since some time ago. I'll have a
look...
Andrea

Re: Lockups - lost interrupt

mingo at chiara

Sep 12, 1999, 1:17 AM

Post #12 of 35

On Sun, 12 Sep 1999, Andrea Arcangeli wrote:
> On Sat, 11 Sep 1999, Wade Hampton wrote:
>
> >Dumb question -- where can I get the ikd patch? I could put it on the
>
> ftp://ftp.suse.com/pub/people/andrea/kernel-patches/ikd/2.2.12-ikd1.gz
>
> Right now enable _only_ the NMI oopser and wait for a stack trace.
FYI, I'm working on a patch for 2.3 that adds the NMI oopser to the kernel
(optionally, because 1) it doesn't work on all SMP boxes, and 2) it forces the
timer IRQ to the BP). It looks like one of the most common uses of
IKD is lockup detection; the rest is mostly used by kernel hackers. I'll
post it together with some other x86 APIC fixes and IRQ cleanups soon.
-- mingo
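The NMI oopser works because a non-maskable interrupt still arrives on a CPU that is spinning with interrupts disabled; the NMI handler only needs to notice that the CPU's timer interrupt count has stopped advancing. A user-space sketch of that check follows; the array names, `NMI_HZ` and the five-second limit are assumptions for illustration, not the IKD or 2.3 values.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 2
#define NMI_HZ 1            /* assumed: one watchdog NMI per second per CPU */
#define LOCKUP_SECONDS 5    /* assumed: report after 5 s without a timer tick */

static unsigned long timer_irqs[NR_CPUS];  /* bumped by each CPU's timer interrupt */
static unsigned long last_seen[NR_CPUS];   /* value seen at the previous NMI */
static unsigned int stuck_ticks[NR_CPUS];  /* consecutive NMIs with no progress */

/* Runs from the (simulated) NMI handler on `cpu`. Returns true when a real
 * watchdog would force an oops to capture the stuck CPU's stack trace. */
static bool nmi_watchdog_tick(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    if (timer_irqs[cpu] != last_seen[cpu]) {
        last_seen[cpu] = timer_irqs[cpu];   /* timer still ticking: all is well */
        stuck_ticks[cpu] = 0;
        return false;
    }
    return ++stuck_ticks[cpu] >= LOCKUP_SECONDS * NMI_HZ;
}
```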

Re: Lockups - lost interrupt

andrea at suse

Sep 12, 1999, 5:51 AM

Post #13 of 35

On Sat, 11 Sep 1999, Mike Black wrote:
>OK...didn't think this one line mattered:
>
>Sep 11 11:54:55 medusa kernel: Deadlock threshold exceeded, forcing Oops.
OK, disable the SOFTLOCKUP detection. Thanks.
Andrea

Re: Lockups - lost interrupt

andrea at suse

Sep 12, 1999, 5:59 AM

Post #14 of 35

Re: PCI patch for 2.3.18

mj at ucw

Sep 12, 1999, 12:15 PM

Post #15 of 35

Hi,
> o Don't link syscall.o and setup.o on the PC, they aren't used anyway.
>
> This is an unneeded change I really think. drivers/pci/pci.a is an
> archive, and the symbols+code won't be pulled into the kernel if they
> are unreferenced.
Yes, I know that, but I want to be sure it won't be pulled in on an i386
accidentally. Also, it saves a few seconds of compilation time, but
that probably doesn't matter.
Have a nice fortnight
--
Martin `MJ' Mares <mj@ucw.cz> http://atrey.karlin.mff.cuni.cz/~mj/
Faculty of Math and Physics, Charles University, Prague, Czech Rep., Earth
"Outside of a dog, a book is man's best friend. Inside a dog, it's too dark to read."

Re: Lockups - lost interrupt

sdw at lig

Sep 12, 1999, 6:10 PM

Post #16 of 35

Another data point:
I installed 2.2.12 on two servers, both Pentium Celeron 433/128K cache, 256MB
RAM, no IDE, no floppy, no sound, text-only console, onboard Adaptec 2990
Ultra2 Wide SCSI, 1 WD 18300 LVD Ultra2 Wide SCSI drive (18GB) and onboard
Realtek 100Base-T Ethernet. All of this is on the motherboard with a passive
backplane.
One ISA 3Com 515 Ethernet card.
One server had no problems; the other locked up hard, requiring a manual fsck.
Thereafter it crashed and rebooted every few minutes.
Unfortunately these are headless, being co-located; however, I was able to
install the 2.2.13pre6 patch from Alan's directory and the problem disappeared
completely.
I also have the LVS (Linux Virtual Server/LinuxDirector) patch applied.
It's working great, except that I'm having trouble keeping the loopback aliases from
generating ARP replies... (the subject of my next message).
sdw
Mike Black wrote:
> I've been having some lockups on numerous versions of the 2.2 series
> (10,11,12) in both SMP and non-SMP mode.
>
> Usually the machine is just locked up hard (alt-sys-req does not work).
>
> This last time though, some scsi timeouts, ide "lost interrupt" messages,
> and ethernet timeouts were showing up on the screen. I still could not do
> anything with alt-sys-req.
>
> When an IDE timeout occurred the IDE drive showed some activity followed by
> the "hda lost interrupt" message.
>
> It looks like the system was losing ALL interrupts which would also explain
> why the keyboard wasn't working.
>
> Does anybody have any idea what could be causing this? Is this a bad
> motherboard? CPU?
--
OptimaLogic - Finding Optimal Solutions Web/Crypto/OO/Unix/Comm/Video/DBMS
sdw@lig.net Stephen D. Williams Senior Consultant/Architect http://sdw.st
43392 Wayside Cir,Ashburn,VA 20147-4622 703-724-0118W 703-995-0407Fax 5Jan1999

Re: Lockups - lost interrupt

yodaiken at chelm

Sep 12, 1999, 7:55 PM

Post #17 of 35

There are parts of the new irq.c that exist to support RTLinux, even if
that isn't obvious. Please don't chop them. Especially important: the functions
in the low-level handler structure do not invoke any spinlocks, and
there are labels on the low-level IRQ catch code that allow the RTL
module to patch in and take control.
On Sun, Sep 12, 1999 at 10:17:04AM +0200, mingo@chiara.csoma.elte.hu wrote:
>
> On Sun, 12 Sep 1999, Andrea Arcangeli wrote:
>
> > On Sat, 11 Sep 1999, Wade Hampton wrote:
> >
> > >Dumb question -- where can I get the ikd patch? I could put it on the
> >
> > ftp://ftp.suse.com/pub/people/andrea/kernel-patches/ikd/2.2.12-ikd1.gz
> >
> > Right now enable _only_ the NMI oopser and wait for a stack trace.
>
> FYI, i'm working on a patch for 2.3 that adds the NMI oopser (optionally,
> because 1) it doesnt work on all SMP boxes, and 2) it forces the timer irq
> to the BP) to the 2.3 kernel. Looks like one of the most common uses of
> IKD is lockup detection - the rest is mostly used by kernel hackers. I'll
> post it together with some other x86 APIC fixes and irq cleanups soon.
>
> -- mingo

Re: Lockups - lost interrupt

mingo at chiara

Sep 12, 1999, 11:42 PM

Post #18 of 35

On Sun, 12 Sep 1999 yodaiken@chelm.cs.nmt.edu wrote:
> There are parts of the new irq.c that are not obviously there to support
> RTLinux. Please don't chop em. Especially important is: the functions
> in the low level handler structure do not invoke any spinlocks and
> there are labels on the low level irq catch code that allow the RTL
> module to patch to take control.
I'm not touching the architecture part; that's pretty clean right
now, I think. I meant minor stuff like moving the no_irq controller definition out
of i8259.c and the like. (no_irq_type is not really Intel-dependent. The
#if SMP thing in ack_none is just an expression of 'what should we do if
the vector is illegal', which is architecture-dependent. But this doesn't
make the no_irq_type controller truly architecture-dependent.) Another,
more generic thing I'm thinking about (not done yet) is to move the vector-building
defines next to each controller's source code section. This
makes the thing a little bit more modular. Not all controllers are truly
independent (there are obvious interactions between the 8259A and the first
IOAPIC in the system), but this is not a problem. The APIC/IOAPIC code,
OTOH, has major modifications/fixes.
What labels do you mean?
-- mingo
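Mingo's point about no_irq_type is that the dummy controller itself is generic, and only ack_none() embodies an architecture-dependent choice. The sketch below shows that shape with a simplified descriptor loosely modeled on the kernel's controller structure; `struct irq_controller`, its trimmed field set and the `spurious_vectors` counter are illustrative assumptions, not the kernel's definitions.

```c
#include <assert.h>
#include <stdio.h>

/* Simplified controller descriptor (illustrative, trimmed field set). */
struct irq_controller {
    const char *typename;
    unsigned int (*startup)(unsigned int irq);
    void (*shutdown)(unsigned int irq);
    void (*enable)(unsigned int irq);
    void (*disable)(unsigned int irq);
    void (*ack)(unsigned int irq);
    void (*end)(unsigned int irq);
};

static unsigned int spurious_vectors;   /* how many unassigned vectors fired */

static unsigned int startup_none(unsigned int irq) { (void)irq; return 0; }
static void nothing(unsigned int irq) { (void)irq; }

/* The one policy decision: what to do when a vector with no controller fires.
 * This choice is architecture-dependent; the rest of the type is generic. */
static void ack_none(unsigned int irq)
{
    spurious_vectors++;
    fprintf(stderr, "unexpected IRQ %u\n", irq);
}

static const struct irq_controller no_irq_type = {
    "none", startup_none, nothing, nothing, nothing, ack_none, nothing
};
```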

Re: Lockups - lost interrupt

yodaiken at chelm

Sep 13, 1999, 6:12 AM

Post #19 of 35

On Mon, Sep 13, 1999 at 08:42:20AM +0200, mingo@chiara.csoma.elte.hu wrote:
>
> On Sun, 12 Sep 1999 yodaiken@chelm.cs.nmt.edu wrote:
>
> > There are parts of the new irq.c that are not obviously there to support
> > RTLinux. Please don't chop em. Especially important is: the functions
> > in the low level handler structure do not invoke any spinlocks and
> > there are labels on the low level irq catch code that allow the RTL
> > module to patch to take control.
>
> i'm not touching the architecture part, thats i think pretty clean right
> now. I ment minor stuff like moving the no_irq controller definition out
> of i8259.c and the like. (no_irq_type is not really Intel-dependent. The
> #if SMP thing in ack_none is just an expression of 'what should we do if
> the vector is illegal', which is architecture dependent. But this doesnt
Yes. But it was not obvious to me what to do when it was illegal, so
I left it there.
> make the no_irq_type controller truly architecture-dependent.) Another
> more generic thing i'm thinking about (not done yet), to move the vector
> building defines near to every controller's source code section. This
> makes the thing a little bit more modular. Not all controllers are truly
> independent (there are obvious interactions between 8259A and the first
> IOAPIC in the system), but this is not a problem. The APIC/IOAPIC code
> OTOH has major modifications/fixes.
>
> what labels do you mean?
If you look at the build irq macros, you will see that the common IRQ code
has a label on the line of code that does "call do_IRQ" (unless Linus
removed that later), and the other low-level routines are similarly
labeled. RTLinux is going to be nearly totally modular, and the init
code in the rtl module will patch that code to call rtl_intercept
instead of do_IRQ. What's missing for me now is an ifdef'ed code
section that will fill in a data structure with the pointers:
#ifdef RTL_CONFIG
struct rtl_code {
        void *call_common, *do_irq, *call_smpx, *do_smpx ...
        irq_desc, ... }
= initialize
#endif
Then there are similarly sized ifdef'ed parts in system.h:
#ifdef RTL_CONFIG
struct irq_control { do_cli,do_sti .... }
#define __cli() irq_control.do_cli()
...
and the export in ksyms.
On insert, the rtl module will patch the code, make a copy
of the irq_desc handler list, replace irq_desc.handler in each IRQ
with a soft handler pointer, and change the irq_control structure
to point to soft cli, soft sti, etc.
Anyway, that's the theory.
On module cleanup, the rtl module will unpatch and restore everything
to the original. Interestingly enough, lmbench shows no performance
consequences of replacing cli/sti with indirect function calls.
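The irq_control proposal amounts to routing cli/sti through a table of function pointers that the rtl module swaps on insert and restores on cleanup. A user-space sketch of that mechanism follows, with simulated flags instead of real instructions; `linux_cli()`, `soft_cli()` and the two-field struct are hypothetical, and the proposal's "...." would cover more operations than these two.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated interrupt flags so the sketch runs in user space. */
static bool hw_if = true;     /* stands in for the CPU's IF bit */
static bool soft_if = true;   /* Linux's "virtual" IF under RTL */

static void hard_cli(void) { hw_if = false; }   /* kernel: the cli instruction */
static void hard_sti(void) { hw_if = true; }    /* kernel: the sti instruction */
static void soft_cli(void) { soft_if = false; } /* only a memory write */
static void soft_sti(void) { soft_if = true; }  /* real RTL would also replay pending IRQs */

/* Hypothetical layout of the proposed jump table. */
struct irq_control {
    void (*do_cli)(void);
    void (*do_sti)(void);
};

static struct irq_control irq_control = { hard_cli, hard_sti };

/* What __cli()/__sti() would expand to in an RTL_CONFIG kernel. */
#define linux_cli() (irq_control.do_cli())
#define linux_sti() (irq_control.do_sti())

static void rtl_module_init(void)          /* on insert: switch to soft versions */
{
    irq_control.do_cli = soft_cli;
    irq_control.do_sti = soft_sti;
}

static void rtl_module_cleanup(void)       /* on cleanup: restore the originals */
{
    assert(irq_control.do_cli == soft_cli);  /* only valid after init */
    irq_control.do_cli = hard_cli;
    irq_control.do_sti = hard_sti;
}
```

With the module loaded, Linux's "cli" leaves the hardware flag untouched, which is what lets the real-time layer keep receiving interrupts.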

Re: Lockups - lost interrupt

yodaiken at chelm

Sep 13, 1999, 6:51 AM

Post #20 of 35

On Mon, Sep 13, 1999 at 04:39:32PM +0200, mingo@chiara.csoma.elte.hu wrote:
>
> On Mon, 13 Sep 1999 yodaiken@chelm.cs.nmt.edu wrote:
>
> > > what labels do you mean?
> >
> > If you look at the build irq macros, you will see that common irq
> > has a label on the line of code that does "call do_IRQ" [...]
>
> oh, ok, i see it.
>
> > #define __cli() irq_control.do_cli()
>
> i'm not sure whether this will ever be accepted into the main kernel -
> __cli()/__sti()/etc. right now is heavily used and inlined (mostly via
The price is paid only if you select RTL in config.
My idea is that system.h does:
#define do_not_use_this_cli_directly() __asm__ __volatile__("cli" : : : "memory")
#ifndef RTL_CONFIG
#define __cli() do_not_use_this_cli_directly()
...
#else
struct irq_control ...
Even with the indirect jump,
lmbench can't detect any performance loss -- remember that cli and sti
are not cheap instructions anyway.
RTL should actually show a slight gain, because in operation __cli and
__sti will be
call x
set memory value
return
which is cheaper on a modern processor than __asm__("cli");
> spinlocks) and it's a single instruction. Maybe building a table of 'cli,
> sti, popfl, pushfl' addresses into a special section can do the trick
> without interfering with the 'normal' kernel? A single-instruction 'int 3'
> could be patched into those places, or something like that.
The "int" would cost too much in the RTL case. On
the other hand, I had thought of
a section. Not sure what the advantage would be. With the structure,
the compiler generates
movl N+irq_desc,%eax
call *%eax
>
> -- mingo
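Why a soft cli can be cheap: under RTL it is just a memory write, and interrupts that arrive while Linux has them "disabled" are recorded and replayed at the next soft sti. The sketch below models this virtual interrupt flag; all names are hypothetical, pending interrupts on the same line coalesce into one bit (much like a pending-request bit in an interrupt controller), and it ignores the race between an arriving interrupt and soft_sti() that a real implementation must close.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool soft_enabled = true;     /* Linux's virtual interrupt flag */
static uint32_t pending;             /* one bit per IRQ line, 0..31 */
static unsigned int delivered[32];   /* how often Linux's handler ran */

static void linux_handler(unsigned int irq) { delivered[irq]++; }

static void soft_cli(void) { soft_enabled = false; }   /* "set memory value" */

/* Called by the real-time layer whenever hardware raises `irq`. */
static void rtl_intercept(unsigned int irq)
{
    assert(irq < 32);
    if (soft_enabled)
        linux_handler(irq);
    else
        pending |= UINT32_C(1) << irq;   /* remember it for later */
}

static void soft_sti(void)
{
    soft_enabled = true;
    for (unsigned int irq = 0; irq < 32; irq++) {   /* replay what arrived meanwhile */
        if (pending & (UINT32_C(1) << irq)) {
            pending &= ~(UINT32_C(1) << irq);
            linux_handler(irq);
        }
    }
}
```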

Re: Lockups - lost interrupt

yodaiken at chelm

Sep 13, 1999, 7:24 AM

Post #21 of 35

On Mon, Sep 13, 1999 at 05:10:49PM +0200, mingo@chiara.csoma.elte.hu wrote:
>
> On Mon, 13 Sep 1999 yodaiken@chelm.cs.nmt.edu wrote:
>
> > > i'm not sure whether this will ever be accepted into the main kernel -
> > > __cli()/__sti()/etc. right now is heavily used and inlined (mostly via
> >
> > The price is paid only if you select RTL in config.
> > my idea is that system.h does
>
> oh, ok. I thought you want to do this 'runtime', by patching a pretty
> normal kernel dynamically.
I'll patch an RTL-configured kernel. The idea is that if you select RTL_CONFIG you
get calls via the patch table, and if you then insert the rtl module
the patch is made. If you don't want the possibility of RTL, you get
inlined cli/sti. Note that on other architectures there is not even
a bloat price, since they don't have single-byte cli/sti instructions.
> > movl N+irq_desc,%eax
> > call *%eax
>
> if the int3 solution is implementable (it's a tough problem i think), the
> advantage would be a completely unaffected 'main kernel'. RTL could be
> switched on/off runtime. OTOH, people recompile kernels routinely anyway.
My feeling is that the cost is undetectable, and that if you don't want
RTL you run standard code; if you do want it, you run a jump table that
allows RTL to be turned on and off.
But I do like the idea of a section; I just don't know how to do it.
Wouldn't it compile out to similar code?
...
jmp 1f
.section cli_stuff
1: cli
.text
...
>
> -- mingo

Re: Lockups - lost interrupt

mingo at chiara

Sep 13, 1999, 7:39 AM

Post #22 of 35

On Mon, 13 Sep 1999 yodaiken@chelm.cs.nmt.edu wrote:
> > what labels do you mean?
>
> If you look at the build irq macros, you will see that common irq
> has a label on the line of code that does "call do_IRQ" [...]
oh, ok, i see it.
> #define __cli() irq_control.do_cli()
I'm not sure whether this will ever be accepted into the main kernel -
__cli()/__sti()/etc. are heavily used and inlined right now (mostly via
spinlocks), and each is a single instruction. Maybe building a table of 'cli,
sti, popfl, pushfl' addresses in a special section could do the trick
without interfering with the 'normal' kernel? A single-instruction 'int 3'
could be patched into those places, or something like that.
-- mingo

Re: Lockups - lost interrupt

yodaiken at chelm

Sep 13, 1999, 8:06 AM

Post #23 of 35

On Mon, Sep 13, 1999 at 05:41:31PM +0200, mingo@chiara.csoma.elte.hu wrote:
> > My feeling is that the cost is undetectable [...]
>
> i understand what you mean, but Linux kernel's speed is a 'sum' of many
> such 'undetectable' improvements. You cannot remove any of those speedups
> just because the speedup is undetectable.
Sure. That's why it should be a config option unless we can figure out
this table idea. I'll have to look at the exceptions code -- but I
think there are some jumps necessary.
>
> > [...] and that if you don't want
> > rtl you run standard code, if you do want, you run a jump table that
> > allows rtl to be turned on and off.
> >
> > But I do like the idea of a section, I just don't know how to do it
> > wouldn't it compile out to similar code?
> > ...
> > jmp 1f
> > .section cli_stuff
> > 1: cli
> > .text
> > ...
>
> that's the hard part, I think, too. One way to do it is like for exceptions
> (check out how exceptions build their tables, Documentation/exception.txt):
> patch int3 into the necessary places if RT is enabled (int3 [or
> equivalent] in this case is a full replacement for all 4 types of
> instructions, cli, sti, popfl and pushfl), then search the 'exception
> table' for the address. (the return address is pushed onto the stack by
int3 for cli
int4 for sti
int5 for pushfl
int6 for popfl
> int3) This search can be rather slow though, and thats the main problem i
> think. There is no cost to the main kernel, apart from the (presumably not
> very big) kernel-resident address-tables.
>
> -- mingo

Re: Lockups - lost interrupt

mingo at chiara

Sep 13, 1999, 8:10 AM

Post #24 of 35 (1235 views)

On Mon, 13 Sep 1999 yodaiken@chelm.cs.nmt.edu wrote:
> > i'm not sure whether this will ever be accepted into the main kernel -
> > __cli()/__sti()/etc. right now is heavily used and inlined (mostly via
>
> The price is paid only if you select RTL in config.
> my idea is that system.h does
oh, ok. I thought you wanted to do this at runtime, by patching a pretty
normal kernel dynamically.
> Lmbench can't detect any performance loss -- remember that cli and sti
> are not cheap instructions anyways.
they are ~7 cycles, but the real cost is the slight kernel bloat (== more
cache footprint) caused by the inlined function call. Anyway, this is of
course not a problem for an optional thing; i thought you were trying to do
this at runtime as well.
> > spinlocks) and it's a single instruction. Maybe building a table of 'cli,
> > sti, popfl, pushfl' addresses into a special section can do the trick
> > without interfering with the 'normal' kernel? A single-instruction 'int 3'
> > could be patched into those places, or something like that.
>
> The "int" would cost too much in the rtl case. On
the int is basically a function call if you do it on ring 0, but yes it's
more expensive than a normal function call.
> the other hand, I had thought of
> a section. Not sure what the advantage would be. With the structure,
> the compiler generates
>
> movl N+irq_desc,%eax
> call *%eax
if the int3 solution is implementable (it's a tough problem i think), the
advantage would be a completely unaffected 'main kernel'. RTL could be
switched on/off runtime. OTOH, people recompile kernels routinely anyway.
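
For comparison, a minimal sketch of the indirection yodaiken describes, where `__cli()` expands to a call through the `irq_control` structure. The `irq_control.do_cli()` form comes from the quoted `#define`; `do_sti`, `rt_enable`, the struct tag and the simulated flags are hypothetical, and in userspace the "hardware" interrupt flag is just a variable. Each call site becomes an indirect call (the load and `call *%eax` pair quoted above) instead of a one-byte instruction, which is the cache-footprint cost mingo mentions; the payoff is that swapping two pointers switches every call site at runtime.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated CPU interrupt flag, standing in for the real EFLAGS.IF. */
static bool hw_if = true;
/* Hypothetical "soft" flag an RT layer would consult instead. */
static bool soft_if = true;

static void hw_cli(void)   { hw_if = false; }
static void hw_sti(void)   { hw_if = true; }
static void soft_cli(void) { soft_if = false; }   /* hardware IF untouched */
static void soft_sti(void) { soft_if = true; }

/* Hypothetical ops table; only irq_control.do_cli is from the thread. */
struct irq_control_ops {
    void (*do_cli)(void);
    void (*do_sti)(void);
};

static struct irq_control_ops irq_control = { hw_cli, hw_sti };

#define __cli() irq_control.do_cli()
#define __sti() irq_control.do_sti()

/* Switch every call site at once by swapping the pointers. */
static void rt_enable(bool on)
{
    irq_control.do_cli = on ? soft_cli : hw_cli;
    irq_control.do_sti = on ? soft_sti : hw_sti;
}
```

With RT off, `__cli()` clears the simulated hardware flag; after `rt_enable(true)` the same call sites only clear the soft flag.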
-- mingo

Re: Lockups - lost interrupt

mingo at chiara

Sep 13, 1999, 8:41 AM

Post #25 of 35 (1234 views)

On Mon, 13 Sep 1999 yodaiken@chelm.cs.nmt.edu wrote:
> > advantage would be a completely unaffected 'main kernel'. RTL could be
> > switched on/off runtime. OTOH, people recompile kernels routinely anyway.
>
> My feeling is that the cost is undetectable [...]
i understand what you mean, but Linux kernel's speed is a 'sum' of many
such 'undetectable' improvements. You cannot remove any of those speedups
just because the speedup is undetectable.
> [...] and that if you don't want
> rtl you run standard code, if you do want, you run a jump table that
> allows rtl to be turned on and off.
>
> But I do like the idea of a section, I just don't know how to do it
> wouldn't it compile out to similar code?
> ...
> jmp 1f
> .section cli_stuff
> 1: cli
> .text
> ...
that's the hard part i think too. One way to do it is like for exceptions
(check out how exceptions build their tables, Documentation/exception.txt):
patch int3 into the necessary places if RT is enabled (int3 [or
equivalent] in this case is a full replacement for all 4 types of
instructions, cli, sti, popfl and pushfl), then search the 'exception
table' for the address. (the return address is pushed onto the stack by
int3) This search can be rather slow though, and that's the main problem i
think. There is no cost to the main kernel, apart from the (presumably not
very big) kernel-resident address-tables.
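
One way to keep the trap-time search cheap, sketched here under assumptions: sort the address table once (when RT is switched on) and binary-search it in the int3 handler, so each lookup costs O(log n) comparisons instead of a scan. `rt_site`, `rt_sort_table` and `rt_find_site` are hypothetical names, and the standard C `qsort`/`bsearch` stand in for whatever the kernel would actually use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical table entry: address of a patched cli/sti/pushfl/popfl
 * and which of the four instructions it replaced. */
struct rt_site {
    unsigned long addr;
    int           insn;   /* 0 = cli, 1 = sti, 2 = pushfl, 3 = popfl */
};

/* qsort comparator: order two entries by address. */
static int cmp_sites(const void *x, const void *y)
{
    unsigned long a = ((const struct rt_site *)x)->addr;
    unsigned long b = ((const struct rt_site *)y)->addr;
    return (a > b) - (a < b);
}

/* bsearch comparator: compare a bare address key with an entry. */
static int cmp_key(const void *key, const void *elem)
{
    unsigned long a = *(const unsigned long *)key;
    unsigned long b = ((const struct rt_site *)elem)->addr;
    return (a > b) - (a < b);
}

/* Done once, e.g. when RT mode is switched on. */
static void rt_sort_table(struct rt_site *tab, size_t n)
{
    qsort(tab, n, sizeof *tab, cmp_sites);
}

/* Done on every trap: int3 leaves the address *after* the patched byte,
 * so the site itself is return_addr - 1. Returns NULL for foreign traps. */
static const struct rt_site *rt_find_site(const struct rt_site *tab, size_t n,
                                          unsigned long return_addr)
{
    unsigned long site = return_addr - 1;
    assert(tab != NULL || n == 0);
    return bsearch(&site, tab, n, sizeof *tab, cmp_key);
}
```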
-- mingo

Re: Lockups - lost interrupt

mingo at chiara

Sep 13, 1999, 9:26 AM

Post #26 of 35 (392 views)

On Mon, 13 Sep 1999 yodaiken@chelm.cs.nmt.edu wrote:
> int3 for cli
> int4 for sti
> int5 for pushfl
> int6 for popfl
only int3 is a single-byte opcode.
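
The encoding difference behind this remark, as a small sketch: int3 has its own one-byte opcode (0xCC), while any other vector needs the two-byte `int imm8` form (0xCD followed by the vector). Since cli, sti, pushfl and popfl are each one byte long, a two-byte `int 4` patched over one of them would also overwrite the first byte of the following instruction. (Vectors 4, 5 and 6 are in any case already used by the CPU for the overflow, BOUND-range and invalid-opcode exceptions.) `encode_int` and `patch_fits` are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode "int n" into buf and return its length in bytes. Vector 3 has
 * the dedicated one-byte int3 opcode; all others need the two-byte
 * "int imm8" form. */
static size_t encode_int(unsigned vector, uint8_t buf[2])
{
    assert(vector < 256);
    if (vector == 3) {
        buf[0] = 0xCC;              /* int3 */
        return 1;
    }
    buf[0] = 0xCD;                  /* int imm8 */
    buf[1] = (uint8_t)vector;
    return 2;
}

/* The instructions to be replaced are all one byte long. */
static const uint8_t one_byte_ops[] = { 0xFA /* cli */, 0xFB /* sti */,
                                        0x9C /* pushfl */, 0x9D /* popfl */ };

/* A patch fits in place only if it is no longer than the original. */
static int patch_fits(unsigned vector)
{
    uint8_t buf[2];
    return encode_int(vector, buf) <= sizeof one_byte_ops[0];
}
```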
-- mingo

Re: Lockups - lost interrupt

slurn at griffin

Sep 13, 1999, 10:23 AM

Post #27 of 35 (387 views)

> On Sun, 12 Sep 1999 mingo@chiara.csoma.elte.hu wrote:
>
> >FYI, i'm working on a patch for 2.3 that adds the NMI oopser (optionally,
> >because 1) it doesn't work on all SMP boxes, and 2) it forces the timer irq
> >to the BP) to the 2.3 kernel. Looks like one of the most common uses of
> >IKD is lockup detection - the rest is mostly used by kernel hackers. I'll
>
> Yes. Also the print-eip patch may be useful to sort out lockups.
>
> >post it together with some other x86 APIC fixes and irq cleanups soon.
>
> If you agree I can merge it into the ikd patch when your new one will be
> ready. Also I think it's the time to port the IKD patch to 2.3.18ac1... ;)
>
> I have a question about debuggers. Is kdb GPL'd? If so, and if Scott
Yes, KDB is gpl'ed.
> agrees, I can merge it into the IKD patch while porting the current ikd
Please feel free.
> patch to 2.3.x. If I merge kdb, I can remove the old debugger that
> doesn't work for people (SIGTRAP problem, and I haven't had the time to go
> into that myself).
Thanks.
scott
>
> Andrea
>

Re: Lockups - lost interrupt

whampton at staffnet

Sep 13, 1999, 11:12 AM

Post #28 of 35 (386 views)

Wade Hampton wrote:
>
> Mike Black wrote:
> >
> > I'm installing the ikd patch on 2.2.12 with Mingo's 2.2.12 raid patch. With
> > some luck, it shouldn't take long to find the problem. I've got one script
> > that takes about 10 minutes to lockup the machine (locks up in either UP or
> > SMP mode).
> Dumb question -- where can I get the ikd patch? I could put it on the
> Dell (the most repeatable crash) and see if I could find the problem.
Results so far on the Dell WS400 (dual PII/300):
1. stock kernel 2.2.12 with kdb 0.5 patch
2. installed ikd patch
   a) had to manually fix the Makefile and arch/i386/config.in (.rej files)
   b) set up for serial console
   c) set up for: Detect software lockups,
                  Print %eip,
                  SMP-IOAPIC NMI SW watchdog,
                  IRQ 0
Serial console worked fine. Control-A allowed me to break on the
serial console. All seemed fine. Started X, sound, x11amp on the
Sound Blaster, and a play loop on the Crystal. Started mformat a: and a
dd if=junk of=/dev/fd0 loop on the floppy. After about 5 minutes
the system did not hang, so I started an NFS tar from another machine
of about 1GB of MP3 files. After about 5 minutes of this abuse,
the system hung. No response on the serial console, no OOPSes,
nothing.
As my display was on X, the "Print %eip" did not help....
During the dd'ing, I was getting:
floppy0: unexpected interrupt
floppy0: sensi repl[0]=80
Trial #2, same as above, but no floppy activity. Did a make xconfig
and it hung....
I am on trial #3.... The kernel is being built to NMI on IRQ 3
(ttyS1)...

> One of the Penguins (dual PIII) is running under load over the weekend
> on 2.2.11 (yes, I went backwards, but I did not have any crashes with
> 2.2.11). I'll let you know how well it does Monday AM.
This machine is still up and running fine. It was the machine doing the
massive NFS copy from the Dell and it is functioning without any errors.
I am continuing to load this machine on 2.2.11 and will keep you posted
if it crashes.
> >
> > There
> > are about three different cases here
> >
> > 1. VIA chipset bug - known, understood, non SMP
> > 2. A few triton boards - probably a hardware issue
> > 3. SMP - looks like a lock bug.
> >
> > Running the ikd patch is the best help here. I think it will show you a
> > spinlock deadlock. The trace from that should find the guilty party
If I ever get a trace.... Should I move the NMI to another IRQ, for example
the one from the serial console? This is an older PII motherboard!
Cheers,
--
W. Wade, Hampton <whampton@staffnet.com>

Re: Lockups - lost interrupt

whampton at staffnet

Sep 13, 1999, 12:48 PM

Post #29 of 35 (385 views)

Andrea Arcangeli wrote:
>
> On Mon, 13 Sep 1999, Wade Hampton wrote:
>
> >If I ever get a trace.... Should I move the NMI to another IRQ, for example
> >the one from the serial console? This is an older PII motherboard!
>
> Yes, you should move it to the IRQ 0 (default). Why were you using the
> serial irq?
I tried twice on IRQ 0, but never got the OOPS. I tried on IRQ 3 (serial)
and it never booted. I tried on IRQ 1, but the keyboard appeared hung and
I never got the OOPS either.
I am now able to lock up my Dell on a regular basis.
The only information I have been able to get so far
was the addresses via the %eip:
CPU1: c010ced0-4014 in do_IRQ (c010ce90)
CPU2: c010a712-2c56 in ret_from_sys_call (c010a70c)
I am basically at a loss but am continuing my tests....
Could there be a problem between the ikd patch and the kdb patch?
Also, I am now using a serial console (ttyS1, 9600 baud).
I am going to remake my kernel from make clean.... reinstall it,
use the default NMI of 0, etc.
Any help or pointers would be appreciated!
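
Raw %eip values like the two above can be mapped back to functions by looking each one up in the kernel's System.map: the owning symbol is the one with the greatest address not above the EIP. A minimal sketch, assuming the symbol table is already sorted by address (as System.map is); the two addresses in the usage below are taken from the trace above, and `ksym`/`lookup_eip` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct ksym {
    unsigned long addr;
    const char   *name;
};

/* Find the symbol containing `eip` in a table sorted by address: the
 * entry with the greatest addr <= eip. Stores the offset into the
 * function in *off; returns NULL if eip precedes every symbol. */
static const struct ksym *lookup_eip(const struct ksym *syms, size_t n,
                                     unsigned long eip, unsigned long *off)
{
    size_t lo = 0, hi = n;
    assert(off != NULL);
    while (lo < hi) {                   /* binary search */
        size_t mid = lo + (hi - lo) / 2;
        if (syms[mid].addr <= eip)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* Now lo is the first entry whose addr > eip. */
    if (lo == 0)
        return NULL;
    *off = eip - syms[lo - 1].addr;
    return &syms[lo - 1];
}
```

With just those two symbols in the table, 0xc010ced0 resolves to do_IRQ+0x40 and 0xc010a712 to ret_from_sys_call+0x6.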
--
W. Wade, Hampton <whampton@staffnet.com>

Re: Lockups - lost interrupt

andrea at suse

Sep 13, 1999, 1:23 PM

Post #30 of 35 (386 views)

On Mon, 13 Sep 1999, Wade Hampton wrote:
>I tried twice on IRQ 0, but never got the OOPS. I tried on IRQ 3
While you are running with IRQ 0 as the NMI source, try a `cat
/proc/interrupts`: you should see NMI irqs, and you should also see timer
interrupts _only_ on CPU0, marked as an XT (8259) source of irqs. If you
get both NMI and timer irqs, then the NMI watchdog should work
correctly on your hardware.
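
A small sketch of the check described above, for a two-CPU box: the timer row should be marked XT-PIC and count only on CPU0, and the NMI row should be non-zero. The row layout used here (label, colon, one counter per CPU, then controller type and device name) is an assumption modelled on 2.2-era output and may not match every kernel, and requiring NMIs on every CPU is one interpretation of "you should get NMI irqs"; `parse_irq_row` and `nmi_watchdog_looks_ok` are hypothetical helpers.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CPUS 8

/* Read the per-CPU counters of one /proc/interrupts-style row, e.g.
 * "  0:     123456          0    XT-PIC  timer". Parsing stops at the
 * first non-numeric field (the controller type). Returns how many
 * counters were read, at most max. */
static int parse_irq_row(const char *line, unsigned long *counts, int max)
{
    const char *p = strchr(line, ':');
    int n = 0;

    if (p == NULL)
        return 0;
    p++;
    while (n < max) {
        char *end;
        unsigned long v = strtoul(p, &end, 10);
        if (end == p)          /* no digits: reached "XT-PIC" or end */
            break;
        counts[n++] = v;
        p = end;
    }
    return n;
}

/* The check as interpreted here: the timer is an XT (8259) interrupt
 * that ticks only on CPU0, and every CPU is taking NMIs. */
static int nmi_watchdog_looks_ok(const char *timer_row, const char *nmi_row,
                                 int ncpus)
{
    unsigned long t[MAX_CPUS], nmi[MAX_CPUS];

    assert(ncpus > 0 && ncpus <= MAX_CPUS);
    if (parse_irq_row(timer_row, t, ncpus) != ncpus ||
        parse_irq_row(nmi_row, nmi, ncpus) != ncpus)
        return 0;
    if (strstr(timer_row, "XT") == NULL || t[0] == 0)
        return 0;
    for (int i = 1; i < ncpus; i++)
        if (t[i] != 0)
            return 0;          /* timer also arriving on another CPU */
    for (int i = 0; i < ncpus; i++)
        if (nmi[i] == 0)
            return 0;          /* this CPU never saw a watchdog NMI */
    return 1;
}
```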
If it's working correctly and you get no oops on the screen, then probably
the local-APIC irq is still running on the CPUs. NOTE: the NMI oopser can
help _only_ if at least one CPU is locked up with irqs _disabled_. If the
irqs are enabled on both CPUs, then the NMI won't do anything. In that case
you don't need the NMI patch; you only need to enable the SYSRQ keys in
the kernel configuration and then press SYSRQ+P at lockup time to get the
interesting debugging information out of the kernel. You should write down
on paper the EIP addresses you get from SYSRQ+P. I am not sure whether you
already tried the SYSRQ+P approach (I supposed so, because you were already
using the NMI code).
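
Why the watchdog stays silent when interrupts are still enabled can be seen from the core of a typical NMI lockup detector, sketched here under assumptions: on each periodic NMI, the handler compares a per-CPU interrupt count with the value it saw last time, and only reports a lockup if the count has not moved for several seconds. The one-NMI-per-second rate, the five-second threshold and all names are hypothetical; this is not the ikd code.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS        2  /* hypothetical two-CPU box           */
#define NMI_HZ         1  /* hypothetical watchdog NMIs per sec */
#define LOCKUP_SECONDS 5  /* hypothetical lockup threshold      */

static unsigned long last_irq_count[NR_CPUS];
static unsigned int  stuck_ticks[NR_CPUS];

/* Called from the (simulated) periodic NMI on `cpu` with the number of
 * interrupts that CPU has serviced so far. Returns true once the count
 * has been frozen for LOCKUP_SECONDS worth of NMIs, i.e. the CPU is
 * spinning with interrupts disabled and should be oopsed. */
static bool nmi_watchdog_tick(int cpu, unsigned long irq_count)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    if (irq_count != last_irq_count[cpu]) {
        last_irq_count[cpu] = irq_count;   /* progress: CPU is alive */
        stuck_ticks[cpu] = 0;
        return false;
    }
    return ++stuck_ticks[cpu] >= LOCKUP_SECONDS * NMI_HZ;
}
```

A CPU stuck in a loop with interrupts enabled still services interrupts, so its count keeps changing and `nmi_watchdog_tick` never returns true; that is the case where SYSRQ+P is the right tool.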
Andrea

Re: Lockups - lost interrupt

macro at ds2

Sep 16, 1999, 7:28 AM

Post #31 of 35 (393 views)

On Sun, 12 Sep 1999 mingo@chiara.csoma.elte.hu wrote:
> FYI, i'm working on a patch for 2.3 that adds the NMI oopser (optionally,
> because 1) it doesnt work on all SMP boxes, and 2) it forces the timer irq
> to the BP) to the 2.3 kernel. Looks like one of the most common uses of
> IKD is lockup detection - the rest is mostly used by kernel hackers. I'll
> post it together with some other x86 APIC fixes and irq cleanups soon.
Why wouldn't the NMI oopser work on a given SMP system? And why does the
timer have to be delivered to the BP if these NMIs are exploited? I don't
see any reasons, but I might be missing something.
Note that even if IRQ0 is not connected to any I/O APIC, the LINT0 line
is usually broadcasted, and even if it's not, NMI may be distributed by the
catcher using an "all excluding self" IPI. It is theoretically possible that
there exist i82489DX-based systems that have the NMI output of local APICs
not connected to the NMI input of CPUs, but have you ever seen or heard of
such a brain-damaged system? It wouldn't be MPS-compliant, anyway.
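
The "all excluding self" IPI mentioned above is a single write of the local APIC's Interrupt Command Register using its destination shorthand field. A sketch of how the low ICR word is composed; the constant values follow the Intel local APIC layout (delivery mode in bits 8-10, shorthand in bits 18-19) and the names mirror Linux's <asm/apicdef.h>, while `icr_low` itself is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Interrupt Command Register, low dword (Intel local APIC layout). */
#define APIC_DM_FIXED    0x00000u  /* delivery mode 000            */
#define APIC_DM_NMI      0x00400u  /* delivery mode 100            */
#define APIC_DEST_SELF   0x40000u  /* shorthand 01: self           */
#define APIC_DEST_ALLINC 0x80000u  /* shorthand 10: all incl. self */
#define APIC_DEST_ALLBUT 0xC0000u  /* shorthand 11: all excl. self */

/* Compose the ICR low word for a shorthand IPI. For NMI delivery the
 * vector field is ignored, so it is left as zero. */
static uint32_t icr_low(uint32_t shorthand, uint32_t mode, uint8_t vector)
{
    assert((shorthand & ~0xC0000u) == 0 && (mode & ~0x700u) == 0);
    return shorthand | mode | (mode == APIC_DM_NMI ? 0u : (uint32_t)vector);
}
```

With a shorthand, the destination field in the ICR high word is not used, so forwarding the NMI is one write of `icr_low(APIC_DEST_ALLBUT, APIC_DM_NMI, 0)`, i.e. 0xC0400.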
I might add a modified NMI oopser to my pending APIC patches (I should
make them available tomorrow, BTW -- they are almost ready but I need to
perform some additional tests), if you don't object, that would handle all
legal cases of timer configuration. The current implementation included
in the ikd patch is somewhat simplistic, indeed.
I don't object to the oopser being optional, of course.
--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: macro@ds2.pg.gda.pl, PGP key available +

Re: Lockups - lost interrupt

mingo at chiara

Sep 16, 1999, 7:58 AM

Post #32 of 35 (395 views)

On Thu, 16 Sep 1999, Maciej W. Rozycki wrote:
> Why wouldn't the NMI oopser work on a given SMP system? And why does the
> timer have to be delivered to the BP if these NMIs are exploited? I don't
> see any reasons, but I might be missing something.
>
> Note that even if IRQ0 is not connected to any I/O APIC, the LINT0 line
> is usually broadcasted and even if it's not, NMI may be distributed by the
> catcher using "all excluding self" IPI. It is theoretically possible that
> there exist i82489DX based systems that have the NMI output of local APICs
> not connected to the NMI input of CPUs, but have you ever seen or heard of
> such a brain-damaged system? It wouldn't be MPS-compliant, anyway.
the NMI input of the CPUs is turned off once the local APIC is turned on.
So the only input signals to the CPU are LINT0, LINT1 and the APIC bus.
LINT1 is historically the NMI signal of the motherboard, broadcasted to
all CPUs. We knew this pretty well. I'm right now experimenting with
configuring LINT0 as NMI as well, although i feel uneasy about this hack:
we do have legitimate cases of through-local-APIC interrupts, which would
thus become NMIs. It's also horribly dangerous because i have to ack the
8259A IRQ0 interrupt from within the NMI handler - shudders. And it
violates the MP standard. I've tried some other hacks as well, and there
is yet another trick to be tried: we can also set up the local APIC's
performance counter LVT into NMI mode, and switch on the performance
counter that counts timer cycles ... then we'll get periodic NMIs. I'm
pretty much dedicated now to getting the NMI oopser done without impacting
IRQ0 distribution.
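
Both tricks above amount to writing a local APIC LVT entry whose delivery mode is NMI, either in the LINT0 register or in the performance-counter register. A sketch of how such an entry is composed, assuming the Intel LVT layout (vector in bits 0-7, delivery mode in bits 8-10, mask in bit 16); `lvt_entry` and the `LVT_*` names are hypothetical. Whether the performance-counter LVT accepts NMI delivery is exactly what is disputed later in the thread, so treat that use as an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Local APIC LVT entry fields (Intel layout). */
#define LVT_DM_FIXED  0x000u    /* delivery mode 000 */
#define LVT_DM_NMI    0x400u    /* delivery mode 100 */
#define LVT_DM_EXTINT 0x700u    /* delivery mode 111 */
#define LVT_MASKED    0x10000u  /* bit 16 */

/* Compose an LVT value. The vector field is only used for Fixed
 * delivery; for NMI and ExtINT it is not taken from the LVT. */
static uint32_t lvt_entry(uint8_t vector, uint32_t mode, int masked)
{
    uint32_t v;

    assert((mode & ~0x700u) == 0);
    v = mode | (masked ? LVT_MASKED : 0u);
    if (mode == LVT_DM_FIXED)
        v |= vector;
    return v;
}
```

For example, `lvt_entry(0, LVT_DM_EXTINT, 0)` is the usual virtual-wire LINT0 setting, and `lvt_entry(0, LVT_DM_NMI, 0)` is the value either trick would write.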
> I might add a modified NMI oopser to my pending APIC patches (I should
don't bother, i've already put the NMI oopser into my APIC tree (and
modified it), and hacked away on it. Please for now just send your timer
and 486 fixes and i'll merge them in - after that we can still look for
better ways of doing NMI-broadcasts.
> make them available tomorrow, BTW -- they are almost ready but I need to
> perform some additional tests), if you don't object, that would handle all
> legal cases of timer configuration. The current implementation included
> in the ikd patch is somewhat simplistic, indeed.
yep.
-- mingo

Re: Lockups - lost interrupt

macro at ds2

Sep 16, 1999, 8:40 AM

Post #33 of 35 (400 views)

On Thu, 16 Sep 1999 mingo@chiara.csoma.elte.hu wrote:
> the NMI input of the CPUs is turned off once the local APIC is turned on.
> So the only input signals to the CPU are LINT0, LINT1 and the APIC bus.
> LINT1 is historically the NMI signal of the motherboard, broadcasted to
> all CPUs. We knew this pretty well. I'm right now experimenting with
Not necessarily -- per MPS, NMI needs only to be delivered to the BSP.
Exact LINT0 and LINT1 routing should be described by the MP Configuration
Table (provided it's correct, sigh).
> configuring LINT0 as NMI as well, although i feel uneasy about this hack,
> we do have legitimate cases of through-local-APIC interrupts - which would
> thus become NMIs. It's also horribly dangerous because i have to ack the
> 8259A IRQ0 interrupt from within the NMI handler - shudders. And it
No, no, no. Although MPS does not explicitly forbid leaving any IRQ lines
unconnected to the I/O APICs, in real life the only IRQs that may be
unconnected are IRQ0 and IRQ13 due to legacy EISA chipsets. As we do not
care about EISA DMA chaining interrupts (do we?), we may set up IRQ0 as a
"through-8259A" interrupt, which is more reliable and needs no acking at
all. This is the Intel-recommended way of handling DMA chaining
interrupts, BTW -- see the 82489DX datasheet. In short, after merging in
my patches, no board should ever use ExtINTA interrupts when in
symmetric mode.
In fact there is no difference between configuring LINT0 as Fixed and NMI
-- neither makes INTA cycles reach the 8259As.
Of course, there might exist a board that would need ExtINTA interrupts
but it's pretty unlikely and even if it existed it would be an ancient
proprietary design and it really would not be able to receive periodic
NMIs in a standard manner. I doubt such boards exist. If you know of
such one, let me know.
> violates the MP standard. I've tried some other hacks as well, and there
AFAIK, the MPS does not impose any specific requirements on delivery modes
in the symmetric mode -- it does specify the routing of IRQ lines and the
AT compatibility. If you know of a specific paragraph of MPS that stands
in contradiction to my proposals, please name it!
> is yet another trick to be tried: we can also set up the local APIC's
> performance counter LVT into NMI mode, and switch on the performance
> counter that counts timer cycles ... then we'll get periodic NMIs. I'm
> pretty much dedicated now to get the NMI oopser done without impacting
> IRQ0 distribution.
How would you perform this? The PC does not provide a configurable
delivery mode -- it's always "Fixed". And even if it were, it's
completely unportable -- it does not exist on i486 and Pentium systems at
all.
> dont bother, i've already put the NMI oopser into my APIC tree (and
> modified it), and hacked away on it. Please for now just send your timer
> and 486 fixes and i'll merge them in - after that we can still look for
> better ways of doing NMI-broadcasts.
Tomorrow, as I wrote. The oopser is not an absolute necessity -- it may
as well go into 2.5 -- I'd just like to look at it while I am at i386/SMP.
--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: macro@ds2.pg.gda.pl, PGP key available +

Re: Lockups - lost interrupt

mingo at chiara

Sep 16, 1999, 9:08 AM

Post #34 of 35 (402 views)

On Thu, 16 Sep 1999, Maciej W. Rozycki wrote:
> > configuring LINT0 as NMI as well, although i feel uneasy about this hack,
> > we do have legitimate cases of through-local-APIC interrupts - which would
> > thus become NMIs. It's also horribly dangerous because i have to ack the
> > 8259A IRQ0 interrupt from within the NMI handler - shudders. And it
>
> No, no, no. Although MPS does not explicitly forbid leaving any IRQ lines
> unconnected to the I/O APICs, in real life the only IRQs that may be
> unconnected are IRQ0 and IRQ13 due to legacy EISA chipsets. As we do not
... except for the fact that there _are_ motherboards which have no IRQ0
connected to the IOAPIC, and their mptable lies about this fact. This is
why we need mixed mode, or at least this is one of the reasons why we do
not want to kill LINT0-based interrupts yet.
> care about EISA DMA chaining interrupts (do we?), we may set up IRQ0 as a
> "through-8259A" interrupt which is more reliable and it needs no acking at
> all. [...]
what exactly do you mean here? To set up LINT0 as ExtINT, unmask
IRQ0 in the 8259A, and mask the IOAPIC pin? Or to set up the corresponding
IOAPIC routing entry as ExtINT? The latter is completely pointless - if
there is an IOAPIC pin for IRQ0 then we want that to be a LowPrio IRQ. If
we set it up through the local APIC's LINT0 pin, then we lose the ability
to route to multiple CPUs. (it makes no difference that the 8259A's INTR
output signal is driven to all CPUs; there is no mechanism to distribute
this between CPUs, so we'd get an interrupt on all CPUs at once and would
have to do some sort of software selection - clearly complex and suboptimal.)
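
To make the contrast concrete, here is a sketch of the 64-bit I/O APIC redirection entry for the case preferred above: IRQ0 programmed as Lowest Priority with a logical destination covering several CPUs, so the hardware picks one CPU per interrupt. The ExtINT variant is shown for comparison. Field positions follow the Intel 82093AA layout; the helper, macro names and the 0x03 destination set (two CPUs in flat logical mode) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* I/O APIC redirection table entry fields (82093AA layout). */
#define IOAPIC_DM_FIXED        (0ull << 8)
#define IOAPIC_DM_LOWEST       (1ull << 8)
#define IOAPIC_DM_NMI          (4ull << 8)
#define IOAPIC_DM_EXTINT       (7ull << 8)
#define IOAPIC_DESTMOD_LOGICAL (1ull << 11)
#define IOAPIC_TRIGGER_LEVEL   (1ull << 15)
#define IOAPIC_MASKED          (1ull << 16)

/* Compose one 64-bit redirection entry. `dest` goes in bits 56-63 and is
 * a physical APIC ID or, in logical mode, a set of logical IDs. Polarity
 * (bit 13) is left at 0, i.e. active high, as for ISA IRQ0. */
static uint64_t ioapic_rte(uint8_t vector, uint64_t mode, int logical,
                           uint8_t dest, int level, int masked)
{
    uint64_t e;

    assert((mode & ~(7ull << 8)) == 0);
    e = (uint64_t)vector | mode;
    if (logical)
        e |= IOAPIC_DESTMOD_LOGICAL;
    if (level)
        e |= IOAPIC_TRIGGER_LEVEL;
    if (masked)
        e |= IOAPIC_MASKED;
    e |= (uint64_t)dest << 56;
    return e;
}
```

`ioapic_rte(vec, IOAPIC_DM_LOWEST, 1, 0x03, 0, 0)` lets the I/O APIC choose between the two CPUs on each tick, whereas an ExtINT entry carries no vector of its own and relies on the 8259A.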
> > is yet another trick to be tried: we can also set up the local APIC's
> > performance counter LVT into NMI mode, and switch on the performance
> > counter that counts timer cycles ... then we'll get periodic NMIs. I'm
> > pretty much dedicated now to get the NMI oopser done without impacting
> > IRQ0 distribution.
>
> How would you perform this? The PC does not provide a configurable
> delivery mode -- it's always "Fixed". And even if it were, it's
no, delivery mode can be configured for PCINT.
> completely unportable -- it does not exist on i486 and Pentium systems at
> all.
and? The NMI oopser doesn't work on UP boxes either.
-- mingo

Re: Lockups - lost interrupt

macro at ds2

Sep 16, 1999, 9:48 AM

Post #35 of 35 (404 views)

On Thu, 16 Sep 1999 mingo@chiara.csoma.elte.hu wrote:
> ... except for the fact that there _are_ motherboards which have no IRQ0
> connected to the IOAPIC, and their mptable lies about this fact. This is
> why we need mixed mode, or at least this is one of the reasons why we do
> not want to kill LINT0-based interrupts yet.
I've already written to you about it -- my patch solves this issue no matter
what the MP table reports about IRQ0.
> > care about EISA DMA chaining interrupts (do we?), we may set up IRQ0 as a
> > "through-8259A" interrupt which is more reliable and it needs no acking at
> > all. [...]
>
> what exactly do you mean here? To set up LINT0 as ExtINT and to unmask
> IRQ0 in the 8259A, and to mask the IOAPIC pin? Or to set up the corresponding
Please see my patches against 2.3.13 -- the new version doesn't introduce
anything new in this matter -- just code rearrangements to fit the current
layout. Or wait till tomorrow.
> IOAPIC routing entry as ExtINT? The latter is completely pointless - if
> there is an IOAPIC pin for IRQ0 then we want that to be a LowPrio IRQ. If
Lowest priority can be done in the "through-8259A" mode, and this is the
default operating mode set by my patch when IRQ0 is not connected to
INTIN2.
> we set it up through the local APIC's LINT0 pin, then we lose the ability
> to route to multiple CPUs. (it makes no difference that the 8259A's INTR
> output signal is driven to all CPUs; there is no mechanism to distribute
> this between CPUs, so we'd get an interrupt on all CPUs at once and would
> have to do some sort of software selection - clearly complex and suboptimal.)
If the "through-8259A" mode fails (quite possible, as the output of the
master 8259A need not be connected to any APIC either in Virtual Wire or
in PIC mode), I propose to use a "local through-8259A" mode which sets the
timer interrupt as "Fixed" to one of processors. It has the advantage of
not involving the ExtINTA trap (less circuitry means less chance for
errors, especially as the trap is a kind of a hardware hack) and getting
rid of INTA cycles on the external bus.
But instead of using a unicast "Fixed" delivery, we may use a broadcast
"NMI" (if the I/O APIC can be involved), or try to broadcast through LINT0
(which need not succeed). If neither of these can be set up, we may
fall back to unicast NMI delivery and distribute the NMI further using an IPI
(which is a single APIC write and is not an extreme performance hit).
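
The fallback order proposed above, sketched as a selection function plus a simulation of which CPUs take an NMI on each watchdog tick. The probe results are hypothetical inputs (how a working broadcast would be detected is not specified here), as are all names. The simulation shows that every route gives each CPU one NMI per tick; they differ only in reliability and cost, the last one costing one extra APIC write on the catching CPU.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4   /* hypothetical machine size */

/* Delivery routes in the order of preference proposed above. */
enum nmi_route {
    NMI_IOAPIC_BROADCAST,  /* I/O APIC pin programmed as broadcast NMI */
    NMI_LINT0_BROADCAST,   /* LINT0 of every CPU programmed as NMI     */
    NMI_UNICAST_PLUS_IPI   /* NMI to one CPU, forwarded by IPI         */
};

/* Hypothetical probe results: which broadcast paths were verified. */
struct nmi_probe {
    bool ioapic_broadcast_ok;
    bool lint0_broadcast_ok;
};

static enum nmi_route choose_nmi_route(const struct nmi_probe *p)
{
    if (p->ioapic_broadcast_ok)
        return NMI_IOAPIC_BROADCAST;
    if (p->lint0_broadcast_ok)
        return NMI_LINT0_BROADCAST;
    /* Last resort: NMI reaches one CPU, which forwards it. */
    return NMI_UNICAST_PLUS_IPI;
}

static unsigned nmi_count[NR_CPUS];

/* Simulate one watchdog tick: which CPUs end up taking an NMI. */
static void deliver_nmi(enum nmi_route r, int catcher)
{
    assert(catcher >= 0 && catcher < NR_CPUS);
    if (r == NMI_UNICAST_PLUS_IPI) {
        nmi_count[catcher]++;             /* the unicast NMI itself   */
        for (int c = 0; c < NR_CPUS; c++) /* "all excluding self" IPI */
            if (c != catcher)
                nmi_count[c]++;
    } else {
        for (int c = 0; c < NR_CPUS; c++) /* hardware broadcast */
            nmi_count[c]++;
    }
}
```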
> > How would you perform this? The PC does not provide a configurable
> > delivery mode -- it's always "Fixed". And even if it were, it's
>
> no, delivery mode can be configured for PCINT.
I recall it was hardwired. I'll look at the docs.
> > completely unportable -- it does not exist on i486 and Pentium systems at
> > all.
>
> and? The NMI oopser doesn't work on UP boxes either.
Well, for UP boxes it's mostly impossible to implement it (though EISA
systems used to have a second 8254 as a watchdog timer on NMI -- that
might be worthwhile to handle). But it's pretty easy to do this for SMP
systems in a compatible way. And what's more important, it would be
neither complicated nor time consuming.
--
+ Maciej W. Rozycki, Technical University of Gdansk, Poland +
+--------------------------------------------------------------+
+ e-mail: macro@ds2.pg.gda.pl, PGP key available +