Mailing List Archive

Update: 2.0 kernels, tulip driver, crashes and reboots (long)
Here are the data points we've gathered over the weekend on our systems
that have been rebooting/locking:
The 9 systems that are running kernel 2.0.31 still have not locked or
rebooted. I'd have to say now that this is statistically significant. These
are a mixture of PNIC and ne2000 systems. Identical systems running 2.0.36
reboot an average of once every 8 unit days. These 9 systems have run over
6 days now (54 unit days without a crash).
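As a rough check on the "statistically significant" claim, here is a minimal sketch of the arithmetic. It assumes reboots arrive as a Poisson process at the 2.0.36 rate quoted above; that model is an assumption for illustration, not something measured in this thread. Under it, the chance of seeing no reboots at all in 54 unit days comes out on the order of 0.1%.

```python
import math

# Figures quoted in the message above.
systems = 9                  # machines running 2.0.31
days_up = 6                  # days each has run so far
unit_days_per_reboot = 8.0   # observed average for identical 2.0.36 systems

unit_days = systems * days_up                         # total exposure
expected_reboots = unit_days / unit_days_per_reboot   # at the 2.0.36 rate

# Assumption (not from the message): reboots follow a Poisson process,
# so P(no reboots) = exp(-expected count).
p_zero = math.exp(-expected_reboots)

print(f"unit days: {unit_days}")
print(f"expected reboots at the 2.0.36 rate: {expected_reboots:.2f}")
print(f"P(zero reboots) at that rate: {p_zero:.4f}")
```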
We now appear to have a good method for inducing a reboot failure: running
netperf between two PNIC-tulip.89K/2.0.36 systems will cause a reboot in
12-24 hours. Throwing a hdparm -f -t /dev/hda3 into the background on the
same system will reduce the time to a failure reboot to 5-15 minutes.
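For anyone who wants to reproduce the load, the combination above could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, the netperf invocation uses just the -H (remote host) and -l (test length in seconds) options, and it assumes netserver is already running on the other machine and that hdparm is run as root. The exact netperf options used in our runs are not given in this message.

```python
import subprocess

def netperf_cmd(server, seconds):
    """Build a netperf run against `server` lasting `seconds` seconds."""
    return ["netperf", "-H", server, "-l", str(seconds)]

def hdparm_cmd(device="/dev/hda3"):
    """Build the disk read-timing command quoted in the message."""
    return ["hdparm", "-f", "-t", device]

def stress(server, seconds, device="/dev/hda3"):
    """Run netperf and keep re-running hdparm until netperf exits.

    Returns netperf's exit status and the number of hdparm rounds.
    """
    net = subprocess.Popen(netperf_cmd(server, seconds))
    rounds = 0
    while net.poll() is None:
        # check=False: a failed hdparm round should not stop the test
        subprocess.run(hdparm_cmd(device), check=False)
        rounds += 1
    return net.returncode, rounds
```

Calling stress("10.0.0.2", 900) would run the pair for 15 minutes; the address is a placeholder for the netperf server on your own network.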
My guess is that the VIA chipset is going into a state that ends up
asserting the processor reset signal. I just set up a digital o-scope to
trigger on the processor reset. After running the test for about 15
minutes, I plugged a USB keyboard into the client netperf system and it
rebooted (this is an interesting datapoint in itself). The digital o-scope
got the trigger on reset. We should have schematics for the motherboard
tomorrow to figure out exactly what is tied into reset to trace it back.
Epox is also contacting VIA about errata for the chipset and should get
back to us tomorrow.
Other tests we're starting today:
Run netperf/hdparm test on Intel TX-based motherboards with
PNIC-tulip.89K/2.0.36 (running now, no failures after 2 hours)
Run netperf/hdparm test on VPX-based motherboards with ne2000/2.0.36
(running now, no failures after 2+ hours)
Run netperf/hdparm test on VPX-based motherboards with PNIC-tulip.89K/2.0.31
Run netperf test on VPX-based motherboards with Win98/PNIC-driver
I think we're getting close on this. Your ideas and comments on the above
are welcome and appreciated.
Thanks,
Al Youngwerth
alberty@apexxtech.com
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/
Update: 2.0 kernels, tulip driver, crashes and reboots (long)
I think we're starting to close in on something here...
We saw a patch on linuxmama that recognizes buggy Award PCI BIOSes and
disables the PCI BIOS in this case. Although that patch didn't recognize
our BIOS as being buggy, we went ahead and compiled up a 2.0.36 kernel.
Running the test that normally reboots a stock 2.0.36 kernel within 30
minutes, we ran overnight without any problems.
We're starting up more systems with this combination. I'll post the results
later.
Thanks,
Al Youngwerth
alberty@apexxtech.com
Update: 2.0 kernels, tulip driver, crashes and reboots (long)
Sorry about the long delay on the update. I had to run off to the U.K. last
week and got my laptop stolen at Heathrow, making communications difficult
for me.
Just as a reminder, we were seeing random, occasional lockups in our VIA
VPX motherboard based systems. Through testing, the lockups appeared to be
related to kernel versions after 2.0.31 and perhaps various tulip drivers.
After finally getting the right equipment for the job (a Tektronix TLA 704
logic analyzer, my nominee for best Win95 application ever), we were able
to catch an instruction trace of the failure mode.
Turns out that when the system locks up, the processor is in SMM (System
Management Mode) with no instruction fetches and what appear to be fairly
random bus cycles. It didn't seem to make much sense that we were in SMM
because we had APM disabled in the BIOS (more on this later).
With this data we were able to set up an end trace on the failure condition
to see what led up to the crash. The next crash trace showed an SMI (System
Management Interrupt) being serviced correctly, resumed, then a whole bunch
of OS/Application code executing, then another SMI and the crash. In this
last SMI, when the processor goes to fetch the SMI handler instructions, it
pulls garbage out of RAM and consequently the processor goes into the weeds
(locks or reboots).
Working with the motherboard vendor and reading the schematics and the data
sheet for the chipset, this should not happen. The SMI code lives in system
RAM in the same place video RAM is normally decoded. At startup, the BIOS
programs the chipset to map system RAM into the space normally occupied by
video RAM to hold the SMM code. After BIOS copies the SMM code into system
RAM, it sets a register in the chipset to protect this RAM so it can only
be read from or written to when in SMM mode. When an SMI is generated, the
chipset then maps in the system RAM (over the video RAM) to execute the SMM
code. So, if the SMM code is getting whacked, it has to be the fault of either
the SMM code itself (bad self-modifying code) or the chipset or motherboard
design (improper decoding of the memory space).
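To make the decoding argument concrete, here is a toy model of the scheme as described. It is a sketch under stated assumptions, not the VIA chipset's actual register interface: the class and field names are hypothetical, and the 0xA0000-0xBFFFF window is the usual legacy video range on PCs, which this message does not state explicitly.

```python
from dataclasses import dataclass, field

# Assumption: the legacy video window; the message gives no addresses.
WINDOW_START, WINDOW_END = 0xA0000, 0xBFFFF

@dataclass
class Chipset:
    smram_locked: bool = False   # protection register set by the BIOS
    in_smm: bool = False         # follows the SMIACT# signal
    system_ram: dict = field(default_factory=dict)
    video_ram: dict = field(default_factory=dict)

    def _target(self, addr):
        """Pick the memory that a cycle at `addr` is decoded to."""
        if not (WINDOW_START <= addr <= WINDOW_END):
            return self.system_ram
        # Inside the window, system RAM is visible only in SMM, or
        # before the BIOS has set the protection register.
        if self.in_smm or not self.smram_locked:
            return self.system_ram
        return self.video_ram

    def read(self, addr):
        return self._target(addr).get(addr, 0)

    def write(self, addr, value):
        self._target(addr)[addr] = value

chip = Chipset()
chip.write(0xA0000, 0x0F)    # BIOS copies a handler byte (made-up value)
chip.smram_locked = True     # BIOS protects the SMM code
chip.write(0xA0000, 0xFF)    # later OS write is decoded to video RAM
chip.in_smm = True           # SMI arrives
handler_byte = chip.read(0xA0000)
chip.in_smm = False          # resume from SMM
video_byte = chip.read(0xA0000)
```

In this model a write from outside SMM can never reach the handler once it is protected, which is why garbage at fetch time points at either the handler overwriting itself or the decode not behaving as documented.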
So then we get back to, "why are we generating SMIs in the first place if
APM is disabled?" Turns out the USB on the motherboard uses SMM to poll for
dumb devices like keyboards. (With the Award BIOS, if you plug in a regular
keyboard, it seems to shut off the USB polling.)
We still don't know exactly what causes the problem or why different
kernels seem to affect the problem. What we do know is that with APM and
USB disabled we have no problems.
We are still working with Epox (the motherboard vendor) to find the real
problem. When we find it, I'll post to the list. In the meantime, if you're
seeing lockups with a VIA VPX based motherboard, try turning off APM and
USB support in the BIOS setup and your problems should go away. If you
suspect you are seeing the problem, you can confirm it by very carefully
probing the SMIACT# signal (pin 58) on the VIA VT82C585VPX chipset with a
DMM. If it's stuck low, you've got the problem.
Thanks to everyone for their help and suggestions on this one.
Cheers,
Al Youngwerth
alberty@apexxtech.com
Re: Update: 2.0 kernels, tulip driver, crashes and reboots (long)
> > We still don't know exactly what causes the problem or why different
> > kernels seem to affect the problem. What we do know is that with APM and
> > USB disabled we have no problems.
> There are some background hints on how the SMI stuff works with USB devices
> and USB "legacy mode". That's the mode where SMM code is used to fake I/O
> to the keyboard port and talk USB hidbp to the USB devices from DOS. With
> no keyboard present it's reasonable to expect the USB polls to be active.
The box that I was having the problems with had an attached keyboard with
APM and USB off.
djweis
--
David Weis
djweis@plconline.com
Re: Update: 2.0 kernels, tulip driver, crashes and reboots (long)
> So then we get back to, "why are we generating SMIs in the first place if
> APM is disabled?" Turns out the USB on the motherboard uses SMM to poll for
> dumb devices like keyboards. (With the Award BIOS, if you plug in a regular
> keyboard, it seems to shut off the USB polling.)
> We still don't know exactly what causes the problem or why different
> kernels seem to affect the problem. What we do know is that with APM and
> USB disabled we have no problems.
There are some background hints on how the SMI stuff works with USB devices
and USB "legacy mode". That's the mode where SMM code is used to fake I/O
to the keyboard port and talk USB hidbp to the USB devices from DOS. With
no keyboard present it's reasonable to expect the USB polls to be active.
If you are curious, get the OHCI and UHCI specs.
And yes the nasty explosion in SMM mode is going to be unfortunate and must
be "their bug" (tm)
Alan