Mailing List Archive: Capturing hard hang info?

markknecht at gmail

Oct 19, 2013, 3:01 PM

Post #1 of 9

Hello all,
I'm having a bit of an issue here with a machine that's completely
hanging and unresponsive. No magic sys request keys, keyboard and
mouse are dead, cannot shell in or even ping from another machine on
the network. For all intents and purposes it's as dead as one of
Dexter's targets. I believe the problem is virtualbox-4.3 related. I
didn't see it with vb-4.2.18 or any earlier version. However there
were 3.10.16 & 17 updates, along with whatever else Gentoo has
provided in the last week or two so there's a chance it could be
anything, including hardware.

Anyway, I've dropped back to gentoo-sources-3.10.15 which never
exhibited a problem, along with vb-4.2.18. I'm running memtest86 and
looking for anything it might spot. My RAID6 appears to have no
problems through any of these hangs, or at least a scrub of the RAID
doesn't show it.

There's nothing in /var/log/messages or Xorg.0.log that I see.

My question, after this long setup, is what can I do, if anything,
to capture any other info about what's really causing the problem?

I'm currently just doing obvious regression stuff. Installing old
versions, running for a while, etc., trying to figure out if it's a
single package. Not real effective but maybe it's all I can do with
this sort of problem?

Thanks in advance,
Mark

Re: Capturing hard hang info?

rich0 at gentoo

Oct 19, 2013, 4:49 PM

Post #2 of 9

On Sat, Oct 19, 2013 at 6:01 PM, Mark Knecht <markknecht@gmail.com> wrote:
> No magic sys request keys, keyboard and
> mouse are dead, cannot shell in or even ping from another machine on
> the network.

These types of situations are really annoying to debug. Do you get
anything on the console? Try leaving it at a text console with no screen
saver so that you have a chance to see any panic message/etc that
might be left there. If you have something set to put your monitor to
sleep then after the panic your system will not wake up.

Serial console is another option, albeit not exactly convenient.

I have on my blog somewhere instructions for setting up kdump, but to
be honest with recent kernel versions it hasn't been working (that
could have changed). You can configure your kernel to auto-reboot to
a panic kernel which you can then use to dump core to disk, then you
can reboot back into your normal system to examine it at your leisure.
That should tell you what was going on when it crashed, but only if
the kernel actually detected a panic (usually it does).

Note that logs are useless in a panic (unless you're using kdump) as
the kernel will not write anything to disk following a panic. If you
get an oops/bug you might or might not get anything in your logs
depending on whether it affected the filesystem/disk/etc subsystems.
If the kernel knows its internals are scrambled the last thing you
want it doing is trying to write to your filesystems. With kdump it
does a reboot into a new kernel which fully re-initializes everything
and then dumps ram safely to disk.

Rich

Re: Capturing hard hang info?

markknecht at gmail

Oct 19, 2013, 5:25 PM

Post #3 of 9

On Sat, Oct 19, 2013 at 4:49 PM, Rich Freeman <rich0@gentoo.org> wrote:
> On Sat, Oct 19, 2013 at 6:01 PM, Mark Knecht <markknecht@gmail.com> wrote:
>> No magic sys request keys, keyboard and
>> mouse are dead, cannot shell in or even ping from another machine on
>> the network.
>
> These types of situations are really annoying to debug. Do you get
> anything on the console? Try leaving it at a text console with no screen
> saver so that you have a chance to see any panic message/etc that
> might be left there. If you have something set to put your monitor to
> sleep then after the panic your system will not wake up.
>

OK, it's a good idea just to have a Konsole terminal open. That might
catch something. Only issue is I'm running KDE, 6 desktops, 2
monitors, so I need to make sure it's always visible and always on
top.

> Serial console is another option, albeit not exactly convenient.
>

OK, so I remember years ago debugging something for Ingo Molnar using
the serial console, but in those days it was a real serial console on
a real serial port. None of my machines have those ports anymore. There
must be a more modern version of doing that. I'll go look for info.
Ethernet? USB? We've recently moved and the only other machine I've
got here at the apartment is a Gentoo laptop.

> I have on my blog somewhere instructions for setting up kdump, but to
> be honest with recent kernel versions it hasn't been working (that
> could have changed). You can configure your kernel to auto-reboot to
> a panic kernel which you can then use to dump core to disk, then you
> can reboot back into your normal system to examine it at your leisure.
> That should tell you what was going on when it crashed, but only if
> the kernel actually detected a panic (usually it does).
>

There's a wiki.gentoo.org page here:

http://wiki.gentoo.org/wiki/Kernel_Crash_Dumps

The setup looks reasonably straightforward so I've reconfigured
3.10.17 following those instructions.

One question for now. In the Kernel Hacking section there's an option
for "Detect Hard and Soft Lockups" which on the surface looks like a
good thing to turn on but it's not mentioned in these instructions.
When turned on it has options for Panic (Reboot) for both types. Seems
like I probably want that all turned on?

Comments?

> Note that logs are useless in a panic (unless you're using kdump) as
> the kernel will not write anything to disk following a panic. If you
> get an oops/bug you might or might not get anything in your logs
> depending on whether it affected the filesystem/disk/etc subsystems.
> If the kernel knows its internals are scrambled the last thing you
> want it doing is trying to write to your filesystems. With kdump it
> does a reboot into a new kernel which fully re-initializes everything
> and then dumps ram safely to disk.
>
> Rich
>

As I expected about the logs. If the machine's dead then I don't want
stuff getting written to disk anyway. kdump sounds like the best
solution going right now. I'll try and see if I can get it working.

Thanks very much Rich! Great ideas.

Cheers,
Mark

Re: Capturing hard hang info?

rich0 at gentoo

Oct 19, 2013, 6:18 PM

Post #4 of 9

On Sat, Oct 19, 2013 at 8:25 PM, Mark Knecht <markknecht@gmail.com> wrote:
> OK, it's a good idea just to have a Konsole terminal open. That might
> catch something.

I'm not sure if panics show up in konsole. With a virtual console the
kernel actually outputs the message. Konsole under X11 is entirely
user-mode and I'm not sure that ANY user-mode code can ever run after
a panic.

I think a virtual console is a better bet.

> OK, so I remember years ago debugging something for Ingo Molnar using
> the serial console, but in those days it was a real serial console on
> a real serial port. None of my machines have those ports anymore. There
> must be a more modern version of doing that. I'll go look for info.
> Ethernet? USB? We've recently moved and the only other machine I've
> got here at the apartment is a Gentoo laptop.

That you'd have to look into. I'm not sure if the kernel can handle a
serial console on a PL2303/etc. It might - it is all kernel-mode I
think. You'd have to attach it to another device running a terminal
emulator, assuming you don't have a vt100/etc lying around.

> There's a wiki.gentoo.org page here:
>
> http://wiki.gentoo.org/wiki/Kernel_Crash_Dumps
>
> The setup looks reasonably straightforward so I've reconfigured
> 3.10.17 following those instructions.

Yeah, I forgot - that was actually started based on my blog entry.
It may very well have been improved on since.

>
> One question for now. In the Kernel Hacking section there's an option
> for "Detect Hard and Soft Lockups" which on the surface looks like a
> good thing to turn on but it's not mentioned in these instructions.

Probably not a bad idea.

> When turned on it has options for Panic (Reboot) for both types. Seems
> like I probably want that all turned on?

You could try setting it to no and see if you actually can capture any
meaningful logs that way - there is a chance you could recover your
system without rebooting. However, a panic would be the only real
sure way to ensure a dump.
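
As a rough illustration of the options being discussed, the sketch below
reports whether the lockup detector and its panic-on-lockup choices are set
in a kernel .config. The function name and the example path are made up for
illustration; the option names are the ones I believe sit behind the "Detect
Hard and Soft Lockups" menu entry, so check them against your own tree.

```shell
# Report whether the lockup-detector options are enabled in a kernel .config.
# Usage (path is only an example): lockup_opts /usr/src/linux/.config
lockup_opts() {
    for opt in LOCKUP_DETECTOR BOOTPARAM_HARDLOCKUP_PANIC BOOTPARAM_SOFTLOCKUP_PANIC; do
        if grep -q "^CONFIG_${opt}=y" "$1"; then
            echo "$opt=y"
        else
            echo "$opt=n"
        fi
    done
}

# The same behaviour can, as far as I know, also be requested at boot time
# without rebuilding, via the kernel command line:
#   softlockup_panic=1 nmi_watchdog=panic
```

With both panic options enabled, a detected lockup becomes a panic, which is
what kdump needs in order to fire.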

Oh, and don't forget that there is a magic sysrq that triggers a
panic. Only issue with that is that you'll have to hunt around for
whatever caused the actual hangup because it won't be in the panic
backtrace (that will just lead you to the sysrq code).

> As I expected about the logs. If the machine's dead then I don't want
> stuff getting written to disk anyway. kdump sounds like the best
> solution going right now. I'll try and see if I can get it working.

Yeah - one of these days I'll see if I can get kdump working again.
What it really needs is an initramfs that will automatically capture
the dump and reboot. That's how other distros handle it. The dumps
are pretty big though - the size of your RAM.

If you get a dump there are a bunch of tools that can be used to analyze it.

Rich

Re: Capturing hard hang info?

thegeezer at thegeezer

Oct 20, 2013, 10:19 AM

Post #5 of 9

On 10/19/2013 11:01 PM, Mark Knecht wrote:
> Hello all,
> I'm having a bit of an issue here with a machine that's completely
> hanging and unresponsive. No magic sys request keys, keyboard and
> mouse are dead, cannot shell in or even ping from another machine on
> the network. For all intents and purposes it's as dead as one of
> Dexter's targets. I believe the problem is virtualbox-4.3 related. I
> didn't see it with vb-4.2.18 or any earlier version. However there
> were 3.10.16 & 17 updates, along with whatever else Gentoo has
> provided in the last week or two so there's a chance it could be
> anything, including hardware.
>
> Anyway, I've dropped back to gentoo-sources-3.10.15 which never
> exhibited a problem, along with vb-4.2.18. I'm running memtest86 and
> looking for anything it might spot. My RAID6 appears to have no
> problems through any of these hangs, or at least a scrub of the RAID
> doesn't show it.
>
> There's nothing in /var/log/messages or Xorg.0.log that I see.
>
> My question, after this long setup, is what can I do, if anything,
> to capture any other info about what's really causing the problem?
>
> I'm currently just doing obvious regression stuff. Installing old
> versions, running for a while, etc., trying to figure out if it's a
> single package. Not real effective but maybe it's all I can do with
> this sort of problem?
>
> Thanks in advance,
> Mark
>
http://forums.gentoo.org/viewtopic-p-3682124.html

Check out netconsole -- it's fiddly to set up, but essentially in its
dying breath the kernel sends dmesg entries over the network (LAN only).
It really is "last breath of death" - you need to add the IP and MAC address
and have netcat listening on the other end.
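
A minimal sketch of what that setup tends to look like. The netconsole
module takes one parameter string of the form
src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac; the helper name, the addresses
and the interface below are placeholders, not values from this thread.

```shell
# Assemble the netconsole= parameter string.
# Usage: netconsole_param SRC_IP DEV TGT_IP TGT_MAC [SRC_PORT] [TGT_PORT]
netconsole_param() {
    src_ip=$1; dev=$2; tgt_ip=$3; tgt_mac=$4
    # 6665/6666 are the module's documented default ports.
    src_port=${5:-6665}; tgt_port=${6:-6666}
    printf '%s@%s/%s,%s@%s/%s\n' \
        "$src_port" "$src_ip" "$dev" "$tgt_port" "$tgt_ip" "$tgt_mac"
}

# On the machine that hangs (as root), something along these lines:
#   modprobe netconsole \
#       netconsole="$(netconsole_param 192.168.1.10 eth0 192.168.1.20 00:11:22:33:44:55)"
# On the listening machine (option syntax differs between netcat variants):
#   nc -u -l -p 6666
```

The MAC address is that of the listener (or of the gateway, if the listener
is not on the same segment), since the dying kernel cannot do ARP lookups.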

Re: Capturing hard hang info?

markknecht at gmail

Oct 20, 2013, 10:29 AM

Post #6 of 9

Hi Rich,
Some progress, but questions/comments also

On Sat, Oct 19, 2013 at 6:18 PM, Rich Freeman <rich0@gentoo.org> wrote:
> On Sat, Oct 19, 2013 at 8:25 PM, Mark Knecht <markknecht@gmail.com> wrote:
>> OK, it's a good idea just to have a Konsole terminal open. That might
>> catch something.
>
> I'm not sure if panics show up in konsole. With a virtual console the
> kernel actually outputs the message. Konsole under X11 is entirely
> user-mode and I'm not sure that ANY user-mode code can ever run after
> a panic.
>
> I think a virtual console is a better bet.
>

I suspect you're right about Konsole sitting on the KDE desktop. I
only meant that sometimes that catches a few messages and was hopeful
it might do that here but it's certainly not a real solution.

That said I'm not clear about the virtual console point. I thought the
virtual consoles were Alt-Ctl-F[1,2,3,..]. When this event occurs my
keyboard isn't working so I don't know how to get there. You must mean
something else?

>> OK, so I remember years ago debugging something for Ingo Molnar using
>> the serial console, but in those days it was a real serial console on
>> a real serial port. None of my machines have those ports anymore. There
>> must be a more modern version of doing that. I'll go look for info.
>> Ethernet? USB? We've recently moved and the only other machine I've
>> got here at the apartment is a Gentoo laptop.
>
> That you'd have to look into. I'm not sure if the kernel can handle a
> serial console on a PL2303/etc. It might - it is all kernel-mode I
> think. You'd have to attach it to another device running a terminal
> emulator, assuming you don't have a vt100/etc lying around.
>
>> There's a wiki.gentoo.org page here:
>>
>> http://wiki.gentoo.org/wiki/Kernel_Crash_Dumps
>>
>> The setup looks reasonably straightforward so I've reconfigured
>> 3.10.17 following those instructions.
>
> Yeah, I forgot - that was actually started based on my blog entry.
> It may very well have been improved on since.
>

To make progress with /etc/local.d/kdump.start it turns out I also
needed to enable

File Systems -> Pseudo File Systems -> /proc/kcore

The Gentoo wiki only talked about vmcore.

<SNIP>
>> When turned on it has options for Panic (Reboot) for both types. Seems
>> like I probably want that all turned on?
>
> You could try setting it to no and see if you actually can capture any
> meaningful logs that way - there is a chance you could recover your
> system without rebooting. However, a panic would be the only real
> sure way to ensure a dump.
>
> Oh, and don't forget that there is a magic sysrq that triggers a
> panic. Only issue with that is that you'll have to hunt around for
> whatever caused the actual hangup because it won't be in the panic
> backtrace (that will just lead you to the sysrq code).
>

At this point I'm a bit beyond my depth. If the hang created by
Virtualbox isn't a panic, but my keyboard is completely locked up,
then I don't know how I'm going to issue the magic sysrq to get the
dump process going.

As a test however, with all of this stuff set up, I logged out of KDE,
switched to a console, disabled X and tried

echo c > /proc/sysrq-trigger

I get an error screen and the system reboots. The first time I did it I
had a bunch of disk activity - presumably stuff being copied to either
kcore or vmcore - and then much later X & KDE came up running a single
processor. This seemed like a positive result.

I then thought maybe I shouldn't start xdm in my init scripts so I
disabled it, rebooted the box, logged in as root and tried again. This
time after the error screen and apparent reboot I gave the machine 45
minutes but never got back to a login screen.

QUESTION 1: This machine has 24GB DRAM. I've set crashkernel=256M and
hoped for the best but don't know if that's a good setting.

QUESTION 2: Am I correct that the captured dump output is going to be
a file that's roughly 24GB? Maybe this takes hours or something being
that it's that big and I'm presumably saving it to a RAID6 which is
doing a lot more parity calcs all in single processor mode. Is there a
way to estimate how long I'd have to wait to even get to a login
prompt?

So far I've been unable to save anything from /proc/vmcore.

>> As I expected about the logs. If the machine's dead then I don't want
>> stuff getting written to disk anyway. kdump sounds like the best
>> solution going right now. I'll try and see if I can get it working.
>
> Yeah - one of these days I'll see if I can get kdump working again.
> What it really needs is an initramfs that will automatically capture
> the dump and reboot. That's how other distros handle it. The dumps
> are pretty big though - the size of your RAM.

That would be helpful certainly.

Thanks for all the info and support!

Cheers,
Mark

Re: Capturing hard hang info?

markknecht at gmail

Oct 20, 2013, 10:30 AM

Post #7 of 9

On Sun, Oct 20, 2013 at 10:19 AM, thegeezer <thegeezer@thegeezer.net> wrote:
> On 10/19/2013 11:01 PM, Mark Knecht wrote:
>
<SNIP>
>
> http://forums.gentoo.org/viewtopic-p-3682124.html
>
> Check out netconsole -- it's fiddly to set up, but essentially in its
> dying breath the kernel sends dmesg entries over the network (LAN only).
> It really is "last breath of death" - you need to add the IP and MAC address
> and have netcat listening on the other end.
>
>

Thanks!

Re: Capturing hard hang info?

thegeezer at thegeezer

Oct 20, 2013, 10:46 AM

Post #8 of 9

On 10/20/2013 06:30 PM, Mark Knecht wrote:
> On Sun, Oct 20, 2013 at 10:19 AM, thegeezer <thegeezer@thegeezer.net> wrote:
>> On 10/19/2013 11:01 PM, Mark Knecht wrote:
>>
> <SNIP>
>> http://forums.gentoo.org/viewtopic-p-3682124.html
>>
>> Check out netconsole -- it's fiddly to set up, but essentially in its
>> dying breath the kernel sends dmesg entries over the network (LAN only).
>> It really is "last breath of death" - you need to add the IP and MAC address
>> and have netcat listening on the other end.
>>
>>
> Thanks!
>
http://www.cyberciti.biz/tips/linux-netconsole-log-management-tutorial.html

might be a better tutorial. Rather than keep rebooting, you can modprobe
netconsole with new parameters until it is right.
Be warned: it is fiddly, but oh so very useful.

Re: Capturing hard hang info?

rich0 at gentoo

Oct 20, 2013, 12:03 PM

Post #9 of 9

On Sun, Oct 20, 2013 at 1:29 PM, Mark Knecht <markknecht@gmail.com> wrote:
> That said I'm not clear about the virtual console point. I thought the
> virtual consoles were Alt-Ctl-F[1,2,3,..]. When this event occurs my
> keyboard isn't working so I don't know how to get there. You must mean
> something else?

It will only be helpful if the console is displayed when the panic
occurs. This is helpful when the panics tend to happen when you're
away from the system.

> To make progress with /etc/local.d/kdump.start it turns out I also
> needed to enable
>
> File Systems -> Pseudo File Systems -> /proc/kcore
>
> The Gentoo wiki only talked about vmcore.

Feel free to update it, after you're sure you have everything figured out.

> At this point I'm a bit beyond my depth. If the hang created by
> Virtualbox isn't a panic, but my keyboard is completely locked up,
> then I don't know how I'm going to issue the magic sysrq to get the
> dump process going.

Are you SURE that it is COMPLETELY locked up? As in alt-sysrq-b
doesn't reboot? I've found that this almost always works, even if
sysrq otherwise appears to not work (most of the other options won't
appear to do anything in a panic with the display not on a virtual
console while in a graphics mode).
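
One thing worth ruling out first is whether the keyboard sysrq functions are
enabled at all. A small sketch, assuming the usual bitmask meaning of
/proc/sys/kernel/sysrq (1 enables everything, 128 allows reboot/poweroff,
8 allows the debugging dumps); the helper name is invented for illustration.

```shell
# Return success if the given sysrq setting permits the given function bit.
# Usage: sysrq_allows VALUE BIT
sysrq_allows() {
    value=$1; bit=$2
    [ "$value" -eq 1 ] && return 0   # 1 means every function is enabled
    [ $(( value & bit )) -ne 0 ]     # otherwise test the individual bit
}

# Example use on a live system:
#   if sysrq_allows "$(cat /proc/sys/kernel/sysrq)" 128; then
#       echo "alt-sysrq-b (reboot) is allowed from the keyboard"
#   fi
```

Writes to /proc/sysrq-trigger by root are not subject to this mask, so a
successful "echo c" test does not prove the keyboard combination is enabled.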

> I get an error screen and the system reboots. The first time I did it I
> had a bunch of disk activity - presumably stuff being copied to either
> kcore or vmcore - and then much later X & KDE came up running a single
> processor. This seemed like a positive result.

Nothing gets "copied" to kcore/vmcore. The state of the previous
system is already in RAM, or at least it was until KDE launched and
overwrote everything. You need to have it boot to a crash kernel and
rescue shell for it to be of any use, unless you just want it to
auto-reboot to minimize downtime.

> QUESTION 1: This machine has 24GB DRAM. I've set crashkernel=256M and
> hoped for the best but don't know if that's a good setting.

It is probably fine - it just needs enough space to hold the kernel
and initramfs.
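
To confirm the reservation actually took effect, it can help to check what
the running kernel was booted with. A small sketch; the helper name is
hypothetical, and it only parses the command line rather than proving the
memory was reserved.

```shell
# Print the value of crashkernel= from a kernel command line string.
# Returns non-zero if the option is absent.
# Usage: crashkernel_arg "$(cat /proc/cmdline)"
crashkernel_arg() {
    for word in $1; do
        case $word in
            crashkernel=*)
                echo "${word#crashkernel=}"
                return 0
                ;;
        esac
    done
    return 1
}

# The amount the kernel really set aside (in bytes) should be readable from
#   /sys/kernel/kexec_crash_size
# A value of 0 there would mean nothing was reserved.
```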

>
> QUESTION 2: Am I correct that the captured dump output is going to be
> a file that's roughly 24GB? Maybe this takes hours or something being
> that it's that big and I'm presumably saving it to a RAID6 which is
> doing a lot more parity calcs all in single processor mode. Is there a
> way to estimate how long I'd have to wait to even get to a login
> prompt?

You won't get a login prompt unless you have some script set to
auto-save the dump file. Typically you'll set things up to just
launch a root shell. Saving a 24GB file shouldn't take more than a
few minutes with nothing else touching the hard drive.
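
As a back-of-the-envelope check on that estimate, here is the arithmetic in
script form. The 100 MB/s figure in the example is purely an assumed
sustained write rate, not a measurement of this RAID6, and the function name
is made up.

```shell
# Seconds needed to write SIZE_GB gigabytes at RATE_MBS megabytes per second
# (integer arithmetic, rounded up).
# Usage: dump_seconds SIZE_GB RATE_MBS
dump_seconds() {
    size_mb=$(( $1 * 1024 ))
    echo $(( (size_mb + $2 - 1) / $2 ))
}

# dump_seconds 24 100   prints 246, i.e. roughly four minutes
```

Even at half that rate the copy stays under ten minutes, so a 45-minute wait
points at something other than the dump being slow.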

Per the wiki you should be running:
kexec -p /[path-to-kernel] --append="root=[root-device] single irqpoll maxcpus=1 reset_devices"

Note the single option in there - you're not going to get a login
prompt. It will just dump you at a shell. If you've booted into that
kernel then /proc/vmcore will contain a core file from the panicked
kernel.
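
Once at that shell, saving the dump is essentially a file copy. A minimal
sketch, assuming /var/crash as the destination (an arbitrary choice, not
something the wiki prescribes) and an invented helper name:

```shell
# Copy the crash dump to disk with a timestamped name.
# Usage: save_vmcore [SOURCE] [DEST_DIR]
save_vmcore() {
    src=${1:-/proc/vmcore}
    dest_dir=${2:-/var/crash}
    if [ ! -r "$src" ]; then
        echo "no readable $src; probably not running in the crash kernel" >&2
        return 1
    fi
    mkdir -p "$dest_dir" || return 1
    cp "$src" "$dest_dir/vmcore-$(date +%Y%m%d-%H%M%S)"
}

# If makedumpfile is installed, it can write a much smaller file by
# compressing and skipping uninteresting pages, for example:
#   makedumpfile -c -d 31 /proc/vmcore /var/crash/dump
```

The destination filesystem has to be mounted read-write in the crash kernel
first, which single-user mode may not do for you.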

If you just rebooted normally then I don't think /proc/vmcore will
even exist. That was the problem I was having a while back - kexec
wasn't actually having any effect. I have no idea if whatever broke
it has been fixed.
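
A quick way to tell whether the kexec -p step had any effect is to ask the
kernel whether a crash kernel is currently loaded. A small sketch; the
helper name is hypothetical, and the sysfs file is the one I would expect on
a kexec-enabled kernel.

```shell
# Succeed if a crash (panic) kernel is currently loaded.
# Usage: crash_kernel_loaded [STATUS_FILE]
crash_kernel_loaded() {
    f=${1:-/sys/kernel/kexec_crash_loaded}
    [ -r "$f" ] && [ "$(cat "$f")" = "1" ]
}

# Example, run right after the kexec -p command:
#   crash_kernel_loaded && echo "crash kernel is armed" \
#                       || echo "kexec -p did not take effect"
```

If this reports nothing loaded, a test panic will simply reboot normally and
/proc/vmcore will not appear.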

Rich