Mailing List Archive

Weird coincidental PAX crashes
Last week, the LMTP daemon on our mail server (HP DL360 G6) crashed.
People noticed that the mail stopped coming in, so I SSHed in to check
on it, and there were some weird traces in the dmesg. While trying to
investigate, I noticed some more badness:

# emerge -1 openntpd
Calculating dependencies... done!

>>> Verifying ebuild manifests
Killed

At that point I'm thinking, "hardware problem, there goes the weekend."
Most of my tools are committing suicide so I surrender and reboot. The
thing comes up fine and has been working ever since.

Today, another one of our web servers (HP DL360 G5?) does the same
thing. The nightly log report was empty, because there's no syslog
daemon running. This morning dmesg shows:

> [Fri May 9 11:00:42 2014] PAX: refcount overflow detected in: syslog-ng:21823, uid/euid: 0/0
> [Fri May 9 11:00:42 2014] CPU: 2 PID: 21823 Comm: syslog-ng Not tainted 3.11.7-hardened-r1 #1
> [Fri May 9 11:00:42 2014] task: ffff8802cffca080 ti: ffff8802cffca488 task.ti: ffff8802cffca488
> [Fri May 9 11:00:42 2014] RIP: 0010:[<ffffffff810e311e>] [<ffffffff810e311e>] 0xffffffff810e311e
> [Fri May 9 11:00:42 2014] RSP: 0018:ffff880416f21c78 EFLAGS: 00000a96
> [Fri May 9 11:00:42 2014] RAX: ffff88041f0048a0 RBX: ffff88041a1edf00 RCX: 0000000040276333
> [Fri May 9 11:00:42 2014] RDX: 0000000040276332 RSI: 0000000000000000 RDI: ffff88041d858720
> [Fri May 9 11:00:42 2014] RBP: 0000000000000008 R08: 0000000000010bc0 R09: ffff88042fb10bc0
> [Fri May 9 11:00:42 2014] R10: 8000000000000000 R11: ffffea000fec3040 R12: ffff88041f0048a0
> [Fri May 9 11:00:42 2014] R13: ffff88026628ef00 R14: ffff88041d858720 R15: ffff88041a1edf10
> [Fri May 9 11:00:42 2014] FS: 0000000000000000(0000) GS:ffff88042fb00000(0000) knlGS:0000000000000000
> [Fri May 9 11:00:42 2014] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [Fri May 9 11:00:42 2014] CR2: 0000035fb5abf850 CR3: 000000000138a000 CR4: 00000000000006b0
> [Fri May 9 11:00:42 2014] Stack:
> [Fri May 9 11:00:42 2014] 0000000000000000 ffffffff818dde60 ffff8804140ac100 ffff8802cffca570
> [Fri May 9 11:00:42 2014] ffff8802cffca080 ffff880416eb4200 ffff8802cffca080 ffffffff81052750
> [Fri May 9 11:00:42 2014] 0000000000000000 0000000000000001 ffff88038e6260d8 ffff8802cffca598
> [Fri May 9 11:00:42 2014] Call Trace:
> [Fri May 9 11:00:42 2014] [<ffffffff81052750>] ? 0xffffffff81052750
> [Fri May 9 11:00:42 2014] [<ffffffff81036e10>] ? 0xffffffff81036e10
> [Fri May 9 11:00:42 2014] [<ffffffff810371e8>] ? 0xffffffff810371e8
> [Fri May 9 11:00:42 2014] [<ffffffff810449cc>] ? 0xffffffff810449cc
> [Fri May 9 11:00:42 2014] [<ffffffff8100241f>] ? 0xffffffff8100241f
> [Fri May 9 11:00:42 2014] [<ffffffff81002a89>] ? 0xffffffff81002a89
> [Fri May 9 11:00:42 2014] [<ffffffff8137c212>] ? 0xffffffff8137c212
> [Fri May 9 11:00:42 2014] Code: e9 68 fd 01 00 0f 1f 84 00 00 00 00 00 48 8b 43 18 48 8b 7b 10 48 8b 40 30 f0 ff 88 30 01 00 00 71 09 f0 ff 80 30 01 00 00 cd 04 <0f> b7 00 89 c2 66 81 e2 00 b0 66 81 fa 00 20 0f 84 53 ff ff ff
> [Fri May 9 11:00:42 2014] PAX: refcount overflow detected in: syslog-ng:21823, uid/euid: 0/0
> [Fri May 9 11:00:42 2014] CPU: 2 PID: 21823 Comm: syslog-ng Not tainted 3.11.7-hardened-r1 #1
> [Fri May 9 11:00:42 2014] task: ffff8802cffca080 ti: ffff8802cffca488 task.ti: ffff8802cffca488
> [Fri May 9 11:00:42 2014] RIP: 0010:[<ffffffff810e311e>] [<ffffffff810e311e>] 0xffffffff810e311e
> [Fri May 9 11:00:42 2014] RSP: 0018:ffff880416f21c78 EFLAGS: 00000a96
> [Fri May 9 11:00:42 2014] RAX: ffff88041f0048a0 RBX: ffff88041a1edc00 RCX: 0000000040c384f8
> [Fri May 9 11:00:42 2014] RDX: 0000000040c384f7 RSI: 0000000000000000 RDI: ffff88041d858720
> [Fri May 9 11:00:42 2014] RBP: 0000000000000008 R08: 0000000000010b60 R09: ffff88042fb10b60
> [Fri May 9 11:00:42 2014] R10: 8000000000000000 R11: ffffea000f26a840 R12: ffff88041f0048a0
> [Fri May 9 11:00:42 2014] R13: ffff88026628e000 R14: ffff88041d858720 R15: ffff88041a1edc10
> [Fri May 9 11:00:42 2014] FS: 0000000000000000(0000) GS:ffff88042fb00000(0000) knlGS:0000000000000000
> [Fri May 9 11:00:42 2014] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [Fri May 9 11:00:42 2014] CR2: 0000035fb5abf850 CR3: 000000000138a000 CR4: 00000000000006b0
> [Fri May 9 11:00:42 2014] Stack:
> [Fri May 9 11:00:42 2014] 0000000000000000 ffffffff818dde60 ffff88041a1ed400 ffff8802cffca570
> [Fri May 9 11:00:42 2014] ffff8802cffca080 ffff880416eb4200 ffff8802cffca080 ffffffff81052750
> [Fri May 9 11:00:42 2014] 0000000000000000 0000000000000001 ffff88038e6260d8 ffff8802cffca598
> [Fri May 9 11:00:42 2014] Call Trace:
> [Fri May 9 11:00:42 2014] [<ffffffff81052750>] ? 0xffffffff81052750
> [Fri May 9 11:00:42 2014] [<ffffffff81036e10>] ? 0xffffffff81036e10
> [Fri May 9 11:00:42 2014] [<ffffffff810371e8>] ? 0xffffffff810371e8
> [Fri May 9 11:00:42 2014] [<ffffffff810449cc>] ? 0xffffffff810449cc
> [Fri May 9 11:00:42 2014] [<ffffffff8100241f>] ? 0xffffffff8100241f
> [Fri May 9 11:00:42 2014] [<ffffffff81002a89>] ? 0xffffffff81002a89
> [Fri May 9 11:00:42 2014] [<ffffffff8137c212>] ? 0xffffffff8137c212
> [Fri May 9 11:00:42 2014] Code: e9 68 fd 01 00 0f 1f 84 00 00 00 00 00 48 8b 43 18 48 8b 7b 10 48 8b 40 30 f0 ff 88 30 01 00 00 71 09 f0 ff 80 30 01 00 00 cd 04 <0f> b7 00 89 c2 66 81 e2 00 b0 66 81 fa 00 20 0f 84 53 ff ff ff


And things are segfaulting randomly. These machines have been running
3.11.7-hardened-r1 since 2014-01-03 without issue until now -- all of
our servers have. So the timing seems a little coincidental.

If it's not hardware (two different machines...), does this look like a
kernel bug? Should I upgrade over the weekend and pray?
Re: Weird coincidental PAX crashes
Maybe a bug somewhere else too. Which kernel/grsec/PaX combination was used?

On 05/09/2014 05:15 PM, Michael Orlitzky wrote:
> Last week, the LMTP daemon on our mail server (HP DL360 G6) crashed.
> People noticed that the mail stopped coming in, so I SSHed in to check
> on it, and there were some weird traces in the dmesg. While trying to
> investigate, I noticed some more badness:
>
> # emerge -1 openntpd
> Calculating dependencies... done!
>
> Killed
>
> At that point I'm thinking, "hardware problem, there goes the weekend."
> Most of my tools are committing suicide so I surrender and reboot. The
> thing comes up fine and has been working ever since.
>
> Today, another one of our web servers (HP DL360 G5?) does the same
> thing. The nightly log report was empty, because there's no syslog
> daemon running. This morning dmesg shows:
>
>> [...]
>
> And things are segfaulting randomly. These machines have been running
> 3.11.7-hardened-r1 since 2014-01-03 without issue until now -- all of
> our servers have. So the timing seems a little coincidental.
>
> If it's not hardware (two different machines...), does this look like a
> kernel bug? Should I upgrade over the weekend and pray?
>
Re: Weird coincidental PAX crashes
On 05/09/2014 11:29 AM, Mark Gomersbach wrote:
> Maybe a bug somewhere else too. Which kernel/grsec/PaX combination was used?
>

Whatever came with sys-kernel/hardened-sources-3.11.7-r1:

# uname -a
Linux mmmc2 3.11.7-hardened-r1 #1 SMP Fri Jan 3 23:13:48 EST 2014
x86_64 Intel(R) Xeon(R) CPU 5160 @ 3.00GHz GenuineIntel GNU/Linux

Here's the hardened portion of the kernel .config for the web server
that blew up today. The config for the mail server should be almost
identical. I maintain the kernel configs for different hardware in
different repos, but unless I've made a mistake, the hardening options
should be the same.


#
# Security options
#

#
# Grsecurity
#
CONFIG_PAX_KERNEXEC_PLUGIN=y
CONFIG_PAX_PER_CPU_PGD=y
CONFIG_TASK_SIZE_MAX_SHIFT=42
CONFIG_PAX_USERCOPY_SLABS=y
CONFIG_GRKERNSEC=y
# CONFIG_GRKERNSEC_CONFIG_AUTO is not set
CONFIG_GRKERNSEC_CONFIG_CUSTOM=y

#
# Customize Configuration
#

#
# PaX
#
CONFIG_PAX=y

#
# PaX Control
#
# CONFIG_PAX_SOFTMODE is not set
# CONFIG_PAX_PT_PAX_FLAGS is not set
CONFIG_PAX_XATTR_PAX_FLAGS=y
CONFIG_PAX_NO_ACL_FLAGS=y
# CONFIG_PAX_HAVE_ACL_FLAGS is not set
# CONFIG_PAX_HOOK_ACL_FLAGS is not set

#
# Non-executable pages
#
CONFIG_PAX_NOEXEC=y
CONFIG_PAX_PAGEEXEC=y
# CONFIG_PAX_EMUTRAMP is not set
CONFIG_PAX_MPROTECT=y
# CONFIG_PAX_MPROTECT_COMPAT is not set
# CONFIG_PAX_ELFRELOCS is not set
CONFIG_PAX_KERNEXEC=y
# CONFIG_PAX_KERNEXEC_PLUGIN_METHOD_BTS is not set
CONFIG_PAX_KERNEXEC_PLUGIN_METHOD_OR=y
CONFIG_PAX_KERNEXEC_PLUGIN_METHOD="or"

#
# Address Space Layout Randomization
#
CONFIG_PAX_ASLR=y
# CONFIG_PAX_RANDKSTACK is not set
CONFIG_PAX_RANDUSTACK=y
CONFIG_PAX_RANDMMAP=y

#
# Miscellaneous hardening features
#
# CONFIG_PAX_MEMORY_SANITIZE is not set
# CONFIG_PAX_MEMORY_STACKLEAK is not set
CONFIG_PAX_MEMORY_STRUCTLEAK=y
CONFIG_PAX_MEMORY_UDEREF=y
CONFIG_PAX_REFCOUNT=y
CONFIG_PAX_CONSTIFY_PLUGIN=y
CONFIG_PAX_USERCOPY=y
# CONFIG_PAX_USERCOPY_DEBUG is not set
CONFIG_PAX_SIZE_OVERFLOW=y
# CONFIG_PAX_LATENT_ENTROPY is not set

#
# Memory Protections
#
CONFIG_GRKERNSEC_KMEM=y
CONFIG_GRKERNSEC_IO=y
CONFIG_GRKERNSEC_PERF_HARDEN=y
CONFIG_GRKERNSEC_RAND_THREADSTACK=y
CONFIG_GRKERNSEC_PROC_MEMMAP=y
CONFIG_GRKERNSEC_BRUTE=y
CONFIG_GRKERNSEC_MODHARDEN=y
# CONFIG_GRKERNSEC_HIDESYM is not set
# CONFIG_GRKERNSEC_KERN_LOCKOUT is not set

#
# Role Based Access Control Options
#
CONFIG_GRKERNSEC_NO_RBAC=y
# CONFIG_GRKERNSEC_ACL_HIDEKERN is not set
CONFIG_GRKERNSEC_ACL_MAXTRIES=3
CONFIG_GRKERNSEC_ACL_TIMEOUT=30

#
# Filesystem Protections
#
CONFIG_GRKERNSEC_PROC=y
CONFIG_GRKERNSEC_PROC_USER=y
CONFIG_GRKERNSEC_PROC_ADD=y
CONFIG_GRKERNSEC_LINK=y
# CONFIG_GRKERNSEC_SYMLINKOWN is not set
CONFIG_GRKERNSEC_FIFO=y
CONFIG_GRKERNSEC_SYSFS_RESTRICT=y
# CONFIG_GRKERNSEC_ROFS is not set
CONFIG_GRKERNSEC_DEVICE_SIDECHANNEL=y
CONFIG_GRKERNSEC_CHROOT=y
CONFIG_GRKERNSEC_CHROOT_MOUNT=y
CONFIG_GRKERNSEC_CHROOT_DOUBLE=y
CONFIG_GRKERNSEC_CHROOT_PIVOT=y
CONFIG_GRKERNSEC_CHROOT_CHDIR=y
CONFIG_GRKERNSEC_CHROOT_CHMOD=y
CONFIG_GRKERNSEC_CHROOT_FCHDIR=y
CONFIG_GRKERNSEC_CHROOT_MKNOD=y
CONFIG_GRKERNSEC_CHROOT_SHMAT=y
CONFIG_GRKERNSEC_CHROOT_UNIX=y
CONFIG_GRKERNSEC_CHROOT_FINDTASK=y
CONFIG_GRKERNSEC_CHROOT_NICE=y
CONFIG_GRKERNSEC_CHROOT_SYSCTL=y
CONFIG_GRKERNSEC_CHROOT_CAPS=y
# CONFIG_GRKERNSEC_CHROOT_INITRD is not set

#
# Kernel Auditing
#
# CONFIG_GRKERNSEC_AUDIT_GROUP is not set
# CONFIG_GRKERNSEC_EXECLOG is not set
CONFIG_GRKERNSEC_RESLOG=y
# CONFIG_GRKERNSEC_CHROOT_EXECLOG is not set
# CONFIG_GRKERNSEC_AUDIT_PTRACE is not set
# CONFIG_GRKERNSEC_AUDIT_CHDIR is not set
# CONFIG_GRKERNSEC_AUDIT_MOUNT is not set
CONFIG_GRKERNSEC_SIGNAL=y
CONFIG_GRKERNSEC_FORKFAIL=y
# CONFIG_GRKERNSEC_TIME is not set
CONFIG_GRKERNSEC_PROC_IPADDR=y
CONFIG_GRKERNSEC_RWXMAP_LOG=y

#
# Executable Protections
#
CONFIG_GRKERNSEC_DMESG=y
CONFIG_GRKERNSEC_HARDEN_PTRACE=y
CONFIG_GRKERNSEC_PTRACE_READEXEC=y
# CONFIG_GRKERNSEC_SETXID is not set
# CONFIG_GRKERNSEC_TPE is not set

#
# Network Protections
#
CONFIG_GRKERNSEC_RANDNET=y
# CONFIG_GRKERNSEC_BLACKHOLE is not set
CONFIG_GRKERNSEC_NO_SIMULT_CONNECT=y
# CONFIG_GRKERNSEC_SOCKET is not set

#
# Physical Protections
#
# CONFIG_GRKERNSEC_DENYUSB is not set

#
# Sysctl Support
#
# CONFIG_GRKERNSEC_SYSCTL is not set

#
# Logging Options
#
CONFIG_GRKERNSEC_FLOODTIME=1
CONFIG_GRKERNSEC_FLOODBURST=4
# CONFIG_KEYS is not set
CONFIG_SECURITY_DMESG_RESTRICT=y
CONFIG_SECURITY=y
# CONFIG_SECURITYFS is not set
# CONFIG_SECURITY_NETWORK is not set
# CONFIG_SECURITY_PATH is not set
# CONFIG_INTEL_TXT is not set
# CONFIG_SECURITY_SMACK is not set
# CONFIG_SECURITY_TOMOYO is not set
# CONFIG_SECURITY_APPARMOR is not set
# CONFIG_IMA is not set
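
With a config this long it is easy to lose track of which hardening options are actually switched on. The following is a minimal sketch, not part of the original thread, of how a .config fragment could be reduced to its enabled options. The function name `enabled_options` and the sample fragment are made up for illustration; the option lines in the fragment are copied from the config above, where CONFIG_PAX_REFCOUNT=y is the feature that produced the "refcount overflow detected" report.

```python
def enabled_options(config_text):
    """Map option name -> value for every option that is set in .config text."""
    options = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            # Skips blank lines and comments, including "# CONFIG_FOO is not set".
            continue
        name, sep, value = line.partition("=")
        if sep:
            options[name] = value.strip('"')
    return options


# Sample fragment: lines taken from the hardened section quoted above.
fragment = """\
CONFIG_PAX_MEMORY_UDEREF=y
CONFIG_PAX_REFCOUNT=y
# CONFIG_PAX_USERCOPY_DEBUG is not set
CONFIG_PAX_KERNEXEC_PLUGIN_METHOD="or"
"""

opts = enabled_options(fragment)
print(opts.get("CONFIG_PAX_REFCOUNT"))
print("CONFIG_PAX_USERCOPY_DEBUG" in opts)
```

Running the same function over the configs of both machines and comparing the resulting dictionaries would be one way to confirm that the hardening options really are identical.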
Re: Weird coincidental PAX crashes
I encourage you to upgrade your kernel to the latest available in the
tree, even if it's currently keyworded. Such things pop up sometimes, come
and go. The Grsec/PaX developers (spender/pipacs/ephox) fix most of these
pretty quickly. I would also check out the grsecurity support forums.
--
dr Tóth Attila, Radiológus, 06-20-825-8057
Attila Toth MD, Radiologist, +36-20-825-8057

On Friday, May 9, 2014 at 17:39, Michael Orlitzky wrote:
> On 05/09/2014 11:29 AM, Mark Gomersbach wrote:
>> Maybe a bug somewhere else too. Which kernel/grsec/PaX combination was
>> used?
>>
>
> Whatever came with sys-kernel/hardened-sources-3.11.7-r1:
>
> # uname -a
> Linux mmmc2 3.11.7-hardened-r1 #1 SMP Fri Jan 3 23:13:48 EST 2014
> x86_64 Intel(R) Xeon(R) CPU 5160 @ 3.00GHz GenuineIntel GNU/Linux
>
> Here's the hardened portion of the kernel .config for the web server
> that blew up today. The config for the mail server should be almost
> identical. I maintain the kernel configs for different hardware in
> different repos, but unless I've made a mistake, the hardening options
> should be the same.
>
Re: Weird coincidental PAX crashes
On 05/09/2014 13:46, "Tóth Attila" wrote:
> I encourage you to upgrade your kernel to the latest available in the
> tree, even if it's currently keyworded. Such things pop up sometimes, come
> and go. The Grsec/PaX developers (spender/pipacs/ephox) fix most of these
> pretty quickly. I would also check out the grsecurity support forums.

I think I ran into this, too, in 3.11. It takes a few days of uptime before
it happens. Running 3.13.x now on my x64 machine and haven't run into it
again. So I second the suggestion to upgrade your kernel.

--
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
4096R/D25D95E3 2011-03-28

"The past tempts us, the present confuses us, the future frightens us. And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic
Re: Weird coincidental PAX crashes
On 05/10/2014 07:14 AM, Joshua Kinard wrote:
>
> I think I ran into this, too, in 3.11. It takes a few days of uptime before
> it happens. Running 3.13.x now on my x64 machine and haven't run into it
> again. So I second the suggestion to upgrade your kernel.
>

I couldn't come up with a better idea, so last night I upgraded
everything to hardened-sources-3.13.6-r3.
Re: Weird coincidental PAX crashes
On 05/10/14 07:39, Michael Orlitzky wrote:
> On 05/10/2014 07:14 AM, Joshua Kinard wrote:
>>
>> I think I ran into this, too, in 3.11. It takes a few days of uptime before
>> it happens. Running 3.13.x now on my x64 machine and haven't run into it
>> again. So I second the suggestion to upgrade your kernel.
>>
>
> I couldn't come up with a better idea, so last night I upgraded
> everything to hardened-sources-3.13.6-r3.
>

Unfortunately I don't know what the "this" is because that trace has
all its symbols stripped, at least in kernel land. In userland it's pretty
obvious: refcount overflow detected in: syslog-ng. So some syscall
initiated by syslog-ng is hitting up against the overflow. An strace
might tell us which syscall, but it might be hard to go from there to
where in the kernel the overflow happens.

Not to sound like a snotty dev, but please open bugs with these oopses
so we have a record in bugzilla. Email just buries this info.
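
When filing such a bug, the userland side of the report (which process, which PID, which PaX check) can be pulled out of a saved dmesg mechanically. Below is a minimal sketch, assuming the log lines look exactly like the "PAX: ... in: comm:pid, uid/euid: u/e" line quoted in this thread; the regular expression and the name `pax_events` are illustrative and are not part of any PaX or Gentoo tool.

```python
import re

# Matches report lines of the form seen in this thread, e.g.
# [Fri May 9 11:00:42 2014] PAX: refcount overflow detected in: syslog-ng:21823, uid/euid: 0/0
PAX_RE = re.compile(
    r"\[(?P<when>[^\]]+)\]\s+PAX: (?P<what>.+?) in: (?P<comm>.+):(?P<pid>\d+), "
    r"uid/euid: (?P<uid>\d+)/(?P<euid>\d+)"
)


def pax_events(lines):
    """Return one dict per PaX report line found in an iterable of dmesg lines."""
    events = []
    for line in lines:
        match = PAX_RE.search(line)
        if match:
            events.append(match.groupdict())
    return events


# Two lines copied from the trace in the original message.
sample = [
    "[Fri May 9 11:00:42 2014] PAX: refcount overflow detected in: syslog-ng:21823, uid/euid: 0/0",
    "[Fri May 9 11:00:42 2014] CPU: 2 PID: 21823 Comm: syslog-ng Not tainted 3.11.7-hardened-r1 #1",
]

events = pax_events(sample)
for ev in events:
    print(ev["when"], ev["comm"], ev["pid"], ev["what"])
```

This only summarizes who was killed and why. As noted above, it says nothing about where in the kernel the overflow happened.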

--
Anthony G. Basile, Ph. D.
Chair of Information Technology
D'Youville College
Buffalo, NY 14201
(716) 829-8197
Re: Weird coincidental PAX crashes
A refcount overflow would normally be a fair indicator of a real problem.
But when it goes off only after some months of uptime, AND on different
machines, that could hint at false positives piling up.
Re: Weird coincidental PAX crashes
On 05/10/2014 09:43, Anthony G. Basile wrote:
> Unfortunately I don't know what the "this" is because that trace has all
> its symbols stripped, at least in kernel land. In userland it's pretty
> obvious: refcount overflow detected in: syslog-ng. So some syscall
> initiated by syslog-ng is hitting up against the overflow. An strace might
> tell us which syscall, but it might be hard to go from there to where in
> the kernel the overflow happens.
>
> Not to sound like a snotty dev, but please open bugs with these oopses so we
> have a record in bugzilla. Email just buries this info.

For me, I never had an actual oops. Just a note in dmesg that pax was
killing command-line processes at random. Running services didn't seem to
be affected, but I could go run grep or something and it'd just abruptly
terminate.

Kinda hard to file a bug on that and not have it closed as WONTFIX :)

--
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
4096R/D25D95E3 2011-03-28
Re: Weird coincidental PAX crashes
On 05/13/14 15:39, Joshua Kinard wrote:
> For me, I never had an actual oops. Just a note in dmesg that pax was
> killing command-line processes at random. Running services didn't seem to
> be affected, but I could go run grep or something and it'd just abruptly
> terminate.
>
> Kinda hard to file a bug on that and not have it closed as WONTFIX :)
>

That's not true. I take many approaches here and often don't find the
underlying problem. At the very least, I end up with a window of "bad
versions", even if I don't track down why they are bad, so I know what is
safe to stabilize.

--
Anthony G. Basile, Ph. D.
Chair of Information Technology
D'Youville College
Buffalo, NY 14201
(716) 829-8197
Re: Weird coincidental PAX crashes
On 9 May 2014 at 11:15, Michael Orlitzky wrote:

> > [Fri May 9 11:00:42 2014] PAX: refcount overflow detected in: syslog-ng:21823, uid/euid: 0/0

this is the key message, the REFCOUNT feature triggered as it detected
an overflow somewhere.

> > [Fri May 9 11:00:42 2014] CPU: 2 PID: 21823 Comm: syslog-ng Not tainted 3.11.7-hardened-r1 #1

as a sidenote, this is a very old kernel and it's quite possible that
it's a false positive that we fixed since.
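
To make the idea of an overflow-checked counter concrete, here is a conceptual model in Python. It is only a sketch of the general technique (refuse an update that would wrap a 32-bit signed counter, leave the old value in place, and report it) and is not the PaX implementation; the class and exception names are invented for illustration.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


class RefcountOverflow(Exception):
    """Raised when an update would wrap the 32-bit signed counter."""


class CheckedRefcount:
    """Toy reference counter that refuses to wrap around."""

    def __init__(self, value=1):
        self.value = value

    def _add(self, delta):
        new = self.value + delta
        if new > INT32_MAX or new < INT32_MIN:
            # The counter keeps its old value; the caller is told about it.
            raise RefcountOverflow(f"refcount overflow detected at {self.value}")
        self.value = new

    def inc(self):
        self._add(1)

    def dec(self):
        self._add(-1)


rc = CheckedRefcount(INT32_MAX)
try:
    rc.inc()
except RefcountOverflow as exc:
    print(exc)
print(rc.value == INT32_MAX)
```

In this model a false positive would be a counter that is legitimately allowed to wrap, for instance one used purely as a statistic, but is checked anyway. Such a counter could plausibly take months of uptime to reach the limit, which would fit the timing described in this thread, though that is speculation rather than a diagnosis.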

> > [Fri May 9 11:00:42 2014] Call Trace:
> > [Fri May 9 11:00:42 2014] [<ffffffff81052750>] ? 0xffffffff81052750
> > [Fri May 9 11:00:42 2014] [<ffffffff81036e10>] ? 0xffffffff81036e10
> > [Fri May 9 11:00:42 2014] [<ffffffff810371e8>] ? 0xffffffff810371e8
> > [Fri May 9 11:00:42 2014] [<ffffffff810449cc>] ? 0xffffffff810449cc
> > [Fri May 9 11:00:42 2014] [<ffffffff8100241f>] ? 0xffffffff8100241f
> > [Fri May 9 11:00:42 2014] [<ffffffff81002a89>] ? 0xffffffff81002a89
> > [Fri May 9 11:00:42 2014] [<ffffffff8137c212>] ? 0xffffffff8137c212

unfortunately the backtrace is not usable as is due to lack of symbols.
if you still have the original vmlinux around (or can reproduce it with
all the debug symbols) then i can take a look and perhaps figure out
where the refcount overflow was detected (and whether it was a false
positive or not).
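
For readers wondering what symbols would add: given a symbol table from the same kernel build, each raw address in the trace can be mapped to the nearest symbol at or below it. The sketch below assumes a System.map-style listing ("address type name" per line) and a kernel running at its link-time addresses. The thread itself only mentions vmlinux, and the symbol names and start addresses in the sample table are entirely hypothetical.

```python
import bisect


def load_symbols(text):
    """Parse System.map-style lines ("address type name") into sorted lists."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            entries.append((int(parts[0], 16), parts[2]))
    entries.sort()
    addrs = [addr for addr, _ in entries]
    names = [name for _, name in entries]
    return addrs, names


def resolve(addr, addrs, names):
    """Return "name+0xoffset" for the nearest symbol at or below addr, else None."""
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    return f"{names[i]}+{addr - addrs[i]:#x}"


# HYPOTHETICAL symbol table: these names and addresses are invented and do
# not come from the kernel discussed in this thread.
HYPOTHETICAL_MAP = """\
ffffffff810e3000 T example_func_a
ffffffff810e3100 T example_func_b
ffffffff810e3200 T example_func_c
"""

addrs, names = load_symbols(HYPOTHETICAL_MAP)
# The RIP value from the trace, resolved against the made-up table.
print(resolve(0xFFFFFFFF810E311E, addrs, names))
```

The lookup is only meaningful against the symbol table of the exact kernel image that produced the trace, which is why the original vmlinux (or an identical rebuild) is being asked for.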

> If it's not hardware (two different machines...), does this look like a
> kernel bug? Should I upgrade over the weekend and pray?

it's not a hardware issue (at least not directly) but a software one and
regardless of it being perhaps a false positive you should be using a newer
kernel that we support ;).
Re: Weird coincidental PAX crashes
On 13 May 2014 at 15:39, Joshua Kinard wrote:

> For me, I never had an actual oops. Just a note in dmesg that pax was
> killing command-line processes at random. Running services didn't seem to
> be affected, but I could go run grep or something and it'd just abruptly
> terminate.

when PaX kills something then there're always logs about the reason, so
you could post those and CC me on the bugs.
Re: Weird coincidental PAX crashes
On 05/15/2014 09:48 AM, PaX Team wrote:
>
> unfortunately the backtrace is not usable as is due to lack of symbols.
> if you still have the original vmlinux around (or can reproduce it with
> all the debug symbols) then i can take a look and perhaps figure out
> where the refcount overflow was detected (and whether it was a false
> positive or not).
>
>> If it's not hardware (two different machines...), does this look like a
>> kernel bug? Should I upgrade over the weekend and pray?
>
> it's not a hardware issue (at least not directly) but a software one and
> regardless of it being perhaps a false positive you should be using a newer
> kernel that we support ;).
>

Thanks, I wasn't sure if it was a kernel issue at first (which is why I
posted here instead of filing a bug). But it took out a production
server in the middle of the day, and I was out of the office, so I had
to just shut-up-and-fix-it quickly.

I've since upgraded to a newer kernel and everything looks OK, but the
issue only showed up after around 5 months of uptime. If I go another 5
months without an upgrade or reboot, maybe I'll find out if the issue is
reproducible =)