Last week, the LMTP daemon on our mail server (HP DL360 G6) crashed.
People noticed that the mail stopped coming in, so I SSHed in to check
on it, and there were some weird traces in the dmesg. While trying to
investigate, I noticed some more badness:
# emerge -1 openntpd
Calculating dependencies... done!
>>> Verifying ebuild manifests
Killed
At that point I'm thinking, "hardware problem, there goes the weekend."
Most of my tools are committing suicide so I surrender and reboot. The
thing comes up fine and has been working ever since.
Today, another of our servers, a web server (HP DL360 G5?), did the same
thing. The nightly log report was empty because the syslog daemon was no
longer running. This morning, dmesg shows:
> [Fri May 9 11:00:42 2014] PAX: refcount overflow detected in: syslog-ng:21823, uid/euid: 0/0
> [Fri May 9 11:00:42 2014] CPU: 2 PID: 21823 Comm: syslog-ng Not tainted 3.11.7-hardened-r1 #1
> [Fri May 9 11:00:42 2014] task: ffff8802cffca080 ti: ffff8802cffca488 task.ti: ffff8802cffca488
> [Fri May 9 11:00:42 2014] RIP: 0010:[<ffffffff810e311e>] [<ffffffff810e311e>] 0xffffffff810e311e
> [Fri May 9 11:00:42 2014] RSP: 0018:ffff880416f21c78 EFLAGS: 00000a96
> [Fri May 9 11:00:42 2014] RAX: ffff88041f0048a0 RBX: ffff88041a1edf00 RCX: 0000000040276333
> [Fri May 9 11:00:42 2014] RDX: 0000000040276332 RSI: 0000000000000000 RDI: ffff88041d858720
> [Fri May 9 11:00:42 2014] RBP: 0000000000000008 R08: 0000000000010bc0 R09: ffff88042fb10bc0
> [Fri May 9 11:00:42 2014] R10: 8000000000000000 R11: ffffea000fec3040 R12: ffff88041f0048a0
> [Fri May 9 11:00:42 2014] R13: ffff88026628ef00 R14: ffff88041d858720 R15: ffff88041a1edf10
> [Fri May 9 11:00:42 2014] FS: 0000000000000000(0000) GS:ffff88042fb00000(0000) knlGS:0000000000000000
> [Fri May 9 11:00:42 2014] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [Fri May 9 11:00:42 2014] CR2: 0000035fb5abf850 CR3: 000000000138a000 CR4: 00000000000006b0
> [Fri May 9 11:00:42 2014] Stack:
> [Fri May 9 11:00:42 2014] 0000000000000000 ffffffff818dde60 ffff8804140ac100 ffff8802cffca570
> [Fri May 9 11:00:42 2014] ffff8802cffca080 ffff880416eb4200 ffff8802cffca080 ffffffff81052750
> [Fri May 9 11:00:42 2014] 0000000000000000 0000000000000001 ffff88038e6260d8 ffff8802cffca598
> [Fri May 9 11:00:42 2014] Call Trace:
> [Fri May 9 11:00:42 2014] [<ffffffff81052750>] ? 0xffffffff81052750
> [Fri May 9 11:00:42 2014] [<ffffffff81036e10>] ? 0xffffffff81036e10
> [Fri May 9 11:00:42 2014] [<ffffffff810371e8>] ? 0xffffffff810371e8
> [Fri May 9 11:00:42 2014] [<ffffffff810449cc>] ? 0xffffffff810449cc
> [Fri May 9 11:00:42 2014] [<ffffffff8100241f>] ? 0xffffffff8100241f
> [Fri May 9 11:00:42 2014] [<ffffffff81002a89>] ? 0xffffffff81002a89
> [Fri May 9 11:00:42 2014] [<ffffffff8137c212>] ? 0xffffffff8137c212
> [Fri May 9 11:00:42 2014] Code: e9 68 fd 01 00 0f 1f 84 00 00 00 00 00 48 8b 43 18 48 8b 7b 10 48 8b 40 30 f0 ff 88 30 01 00 00 71 09 f0 ff 80 30 01 00 00 cd 04 <0f> b7 00 89 c2 66 81 e2 00 b0 66 81 fa 00 20 0f 84 53 ff ff ff
> [Fri May 9 11:00:42 2014] PAX: refcount overflow detected in: syslog-ng:21823, uid/euid: 0/0
> [Fri May 9 11:00:42 2014] CPU: 2 PID: 21823 Comm: syslog-ng Not tainted 3.11.7-hardened-r1 #1
> [Fri May 9 11:00:42 2014] task: ffff8802cffca080 ti: ffff8802cffca488 task.ti: ffff8802cffca488
> [Fri May 9 11:00:42 2014] RIP: 0010:[<ffffffff810e311e>] [<ffffffff810e311e>] 0xffffffff810e311e
> [Fri May 9 11:00:42 2014] RSP: 0018:ffff880416f21c78 EFLAGS: 00000a96
> [Fri May 9 11:00:42 2014] RAX: ffff88041f0048a0 RBX: ffff88041a1edc00 RCX: 0000000040c384f8
> [Fri May 9 11:00:42 2014] RDX: 0000000040c384f7 RSI: 0000000000000000 RDI: ffff88041d858720
> [Fri May 9 11:00:42 2014] RBP: 0000000000000008 R08: 0000000000010b60 R09: ffff88042fb10b60
> [Fri May 9 11:00:42 2014] R10: 8000000000000000 R11: ffffea000f26a840 R12: ffff88041f0048a0
> [Fri May 9 11:00:42 2014] R13: ffff88026628e000 R14: ffff88041d858720 R15: ffff88041a1edc10
> [Fri May 9 11:00:42 2014] FS: 0000000000000000(0000) GS:ffff88042fb00000(0000) knlGS:0000000000000000
> [Fri May 9 11:00:42 2014] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [Fri May 9 11:00:42 2014] CR2: 0000035fb5abf850 CR3: 000000000138a000 CR4: 00000000000006b0
> [Fri May 9 11:00:42 2014] Stack:
> [Fri May 9 11:00:42 2014] 0000000000000000 ffffffff818dde60 ffff88041a1ed400 ffff8802cffca570
> [Fri May 9 11:00:42 2014] ffff8802cffca080 ffff880416eb4200 ffff8802cffca080 ffffffff81052750
> [Fri May 9 11:00:42 2014] 0000000000000000 0000000000000001 ffff88038e6260d8 ffff8802cffca598
> [Fri May 9 11:00:42 2014] Call Trace:
> [Fri May 9 11:00:42 2014] [<ffffffff81052750>] ? 0xffffffff81052750
> [Fri May 9 11:00:42 2014] [<ffffffff81036e10>] ? 0xffffffff81036e10
> [Fri May 9 11:00:42 2014] [<ffffffff810371e8>] ? 0xffffffff810371e8
> [Fri May 9 11:00:42 2014] [<ffffffff810449cc>] ? 0xffffffff810449cc
> [Fri May 9 11:00:42 2014] [<ffffffff8100241f>] ? 0xffffffff8100241f
> [Fri May 9 11:00:42 2014] [<ffffffff81002a89>] ? 0xffffffff81002a89
> [Fri May 9 11:00:42 2014] [<ffffffff8137c212>] ? 0xffffffff8137c212
> [Fri May 9 11:00:42 2014] Code: e9 68 fd 01 00 0f 1f 84 00 00 00 00 00 48 8b 43 18 48 8b 7b 10 48 8b 40 30 f0 ff 88 30 01 00 00 71 09 f0 ff 80 30 01 00 00 cd 04 <0f> b7 00 89 c2 66 81 e2 00 b0 66 81 fa 00 20 0f 84 53 ff ff ff
Other processes are segfaulting at random, too. These machines, like all
of our servers, have been running 3.11.7-hardened-r1 since 2014-01-03
without issue until now, so two failures about a week apart look like
more than a coincidence.
If it's not hardware (two different machines...), does this look like a
kernel bug? Should I upgrade over the weekend and pray?