Hi all,
Recently?I run bro cluster with PF_RING 6.4.1 as the package capturing
framework. However, I found a strange scene: sometimes when the cluster
restarts, the kernel module of PF_RING will block. The problem seems to
generate when operates /proc to apply read-write lock, causing kernel CPU
soft lockup and then the whole server is stuck. I want to know whether the
deadlock is caused by the resource competition when the cluster progress
applies read-write lock at the same time, or other reasons.
Here is the log in /var/log/message:
Oct 22 01:26:10 TEST kernel: BUG: soft lockup - CPU#16 stuck for 22s!
[bro:17375]
Oct 22 01:26:10 TEST kernel: Modules linked in: binfmt_misc xt_CHECKSUM
iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat
nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
nf_conntrack ipt_REJECT tun bridge stp llc ebtable_filter ebtables
ip6table_filter ip6_tables iptable_filter intel_powerclamp coretemp
intel_rapl kvm_intel iTCO_wdt kvm mei_me iTCO_vendor_support mxm_wmi mei
lpc_ich ipmi_ssif sb_edac crc32_pclmul ghash_clmulni_intel aesni_intel lrw
gf128mul glue_helper ablk_helper edac_core ipmi_si cryptd ioTESTma
ipmi_msghandler shpchp wmi sg dca i2c_i801 pcspkr mfd_core nfsd auth_rpcgss
nfs_acl lockd grace pf_ring(OE) sunrpc ip_tables ext4 mbcache jbd2 sd_mod
crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt
i2c_algo_bit drm_kms_helper ttm crct10dif_pclmul crct10dif_common
Oct 22 01:26:10 TEST kernel: crc32c_intel drm isci e1000e libsas ahci
scsi_transport_sas libahci ptp i2c_core libata pps_core ntb
Oct 22 01:26:10 TEST kernel: CPU: 16 PID: 17375 Comm: bro Tainted: G
OE ------------ 3.10.0-327.13.1.el7.x86_64 #1
Oct 22 01:26:10 TEST kernel: Hardware name: Supermicro
X9DRL-3F/iF/X9DRL-3F/iF, BIOS 3.2 09/22/2015
Oct 22 01:26:10 TEST kernel: task: ffff880fdad73980 ti: ffff8800287b8000
task.ti: ffff8800287b8000
Oct 22 01:26:10 TEST kernel: RIP: 0010:[<ffffffff81301909>]
[<ffffffff81301909>] __write_lock_failed+0x9/0x20
Oct 22 01:26:10 TEST kernel: RSP: 0018:ffff8800287bbe88 EFLAGS: 00000297
Oct 22 01:26:10 TEST kernel: RAX: 0000000000000000 RBX: 0000000092e79345
RCX: 0000000000000000
Oct 22 01:26:10 TEST kernel: RDX: 0000000000000000 RSI: ffff88104aabb000
RDI: ffffffffa03c6324
Oct 22 01:26:10 TEST kernel: RBP: ffff8800287bbe88 R08: 0000000000017600
R09: 0000000000000080
Oct 22 01:26:10 TEST kernel: R10: 0000000000000000 R11: 000000000000001b
R12: 00007f21f432e000
Oct 22 01:26:10 TEST kernel: R13: 0000000000000032 R14: ffff8800000000a8
R15: 0000000000000000
Oct 22 01:26:10 TEST kernel: FS: 00007f21f4e56880(0000)
GS:ffff88085fd00000(0000) knlGS:0000000000000000
Oct 22 01:26:10 TEST kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Oct 22 01:26:10 TEST kernel: CR2: 00007f21f432eee0 CR3: 0000000648323000
CR4: 00000000000407e0
Oct 22 01:26:10 TEST kernel: DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
Oct 22 01:26:10 TEST kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0
DR7: 0000000000000400
Oct 22 01:26:10 TEST kernel: Stack:
Oct 22 01:26:10 TEST kernel: ffff8800287bbe98 ffffffff8163ce37
ffff8800287bbeb8 ffffffffa03afe1b
Oct 22 01:26:10 TEST kernel: ffff88104aabb000 ffff88101a631800
ffff8800287bbee8 ffffffffa03b6903
Oct 22 01:26:10 TEST kernel: 0000000000000003 0000000000000300
0000000000000000 ffffffff81a25e00
Oct 22 01:26:10 TEST kernel: Call Trace:
Oct 22 01:26:10 TEST kernel: [<ffffffff8163ce37>] _raw_write_lock+0x17/0x20
Oct 22 01:26:10 TEST kernel: [<ffffffffa03afe1b>] ring_proc_add+0x1b/0xb0
[pf_ring]
Oct 22 01:26:10 TEST kernel: [<ffffffffa03b6903>] ring_create+0x2e3/0x4a0
[pf_ring]
Oct 22 01:26:10 TEST kernel: [<ffffffff815103e0>] __sock_create+0x110/0x260
Oct 22 01:26:10 TEST kernel: [<ffffffff81511ab1>] SyS_socket+0x61/0xf0
Oct 22 01:26:10 TEST kernel: [<ffffffff8163d948>] ? page_fault+0x28/0x30
Oct 22 01:26:10 TEST kernel: [<ffffffff81645ec9>]
system_call_fastpath+0x16/0x1b
Oct 22 01:26:10 TEST kernel: Code: 66 90 48 89 01 31 c0 66 66 90 c3 b8 f2
ff ff ff 66 66 90 c3 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5
f0 ff 07 f3 90 <83> 3f 01 75 f9 f0 ff 0f 75 f1 5d c3 66 66 2e 0f 1f 84 00
00 00
Info in /proc:
PF_RING Version : 6.4.1 ((detached:eded7ce625e60620639050575b1fba
0aa0412374)
Total rings : 16
Standard (non ZC) Options
Ring slots : 65534
Slot version : 16
Capture TX : No [RX only]
IP Defragment : No
Socket Mode : Standard
Total plugins : 0
Cluster Fragment Queue : 0
Cluster Fragment Discard : 0
cpu and memory info:
Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz * 2
total used free shared buff/cache
available
Mem: 62G 33G 25G 784M 4.0G
27G
Swap: 31G 0B 31G
Any insight would be helpful.
Recently?I run bro cluster with PF_RING 6.4.1 as the package capturing
framework. However, I found a strange scene: sometimes when the cluster
restarts, the kernel module of PF_RING will block. The problem seems to
generate when operates /proc to apply read-write lock, causing kernel CPU
soft lockup and then the whole server is stuck. I want to know whether the
deadlock is caused by the resource competition when the cluster progress
applies read-write lock at the same time, or other reasons.
Here is the log in /var/log/message:
Oct 22 01:26:10 TEST kernel: BUG: soft lockup - CPU#16 stuck for 22s!
[bro:17375]
Oct 22 01:26:10 TEST kernel: Modules linked in: binfmt_misc xt_CHECKSUM
iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat
nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
nf_conntrack ipt_REJECT tun bridge stp llc ebtable_filter ebtables
ip6table_filter ip6_tables iptable_filter intel_powerclamp coretemp
intel_rapl kvm_intel iTCO_wdt kvm mei_me iTCO_vendor_support mxm_wmi mei
lpc_ich ipmi_ssif sb_edac crc32_pclmul ghash_clmulni_intel aesni_intel lrw
gf128mul glue_helper ablk_helper edac_core ipmi_si cryptd ioTESTma
ipmi_msghandler shpchp wmi sg dca i2c_i801 pcspkr mfd_core nfsd auth_rpcgss
nfs_acl lockd grace pf_ring(OE) sunrpc ip_tables ext4 mbcache jbd2 sd_mod
crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt
i2c_algo_bit drm_kms_helper ttm crct10dif_pclmul crct10dif_common
Oct 22 01:26:10 TEST kernel: crc32c_intel drm isci e1000e libsas ahci
scsi_transport_sas libahci ptp i2c_core libata pps_core ntb
Oct 22 01:26:10 TEST kernel: CPU: 16 PID: 17375 Comm: bro Tainted: G
OE ------------ 3.10.0-327.13.1.el7.x86_64 #1
Oct 22 01:26:10 TEST kernel: Hardware name: Supermicro
X9DRL-3F/iF/X9DRL-3F/iF, BIOS 3.2 09/22/2015
Oct 22 01:26:10 TEST kernel: task: ffff880fdad73980 ti: ffff8800287b8000
task.ti: ffff8800287b8000
Oct 22 01:26:10 TEST kernel: RIP: 0010:[<ffffffff81301909>]
[<ffffffff81301909>] __write_lock_failed+0x9/0x20
Oct 22 01:26:10 TEST kernel: RSP: 0018:ffff8800287bbe88 EFLAGS: 00000297
Oct 22 01:26:10 TEST kernel: RAX: 0000000000000000 RBX: 0000000092e79345
RCX: 0000000000000000
Oct 22 01:26:10 TEST kernel: RDX: 0000000000000000 RSI: ffff88104aabb000
RDI: ffffffffa03c6324
Oct 22 01:26:10 TEST kernel: RBP: ffff8800287bbe88 R08: 0000000000017600
R09: 0000000000000080
Oct 22 01:26:10 TEST kernel: R10: 0000000000000000 R11: 000000000000001b
R12: 00007f21f432e000
Oct 22 01:26:10 TEST kernel: R13: 0000000000000032 R14: ffff8800000000a8
R15: 0000000000000000
Oct 22 01:26:10 TEST kernel: FS: 00007f21f4e56880(0000)
GS:ffff88085fd00000(0000) knlGS:0000000000000000
Oct 22 01:26:10 TEST kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Oct 22 01:26:10 TEST kernel: CR2: 00007f21f432eee0 CR3: 0000000648323000
CR4: 00000000000407e0
Oct 22 01:26:10 TEST kernel: DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
Oct 22 01:26:10 TEST kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0
DR7: 0000000000000400
Oct 22 01:26:10 TEST kernel: Stack:
Oct 22 01:26:10 TEST kernel: ffff8800287bbe98 ffffffff8163ce37
ffff8800287bbeb8 ffffffffa03afe1b
Oct 22 01:26:10 TEST kernel: ffff88104aabb000 ffff88101a631800
ffff8800287bbee8 ffffffffa03b6903
Oct 22 01:26:10 TEST kernel: 0000000000000003 0000000000000300
0000000000000000 ffffffff81a25e00
Oct 22 01:26:10 TEST kernel: Call Trace:
Oct 22 01:26:10 TEST kernel: [<ffffffff8163ce37>] _raw_write_lock+0x17/0x20
Oct 22 01:26:10 TEST kernel: [<ffffffffa03afe1b>] ring_proc_add+0x1b/0xb0
[pf_ring]
Oct 22 01:26:10 TEST kernel: [<ffffffffa03b6903>] ring_create+0x2e3/0x4a0
[pf_ring]
Oct 22 01:26:10 TEST kernel: [<ffffffff815103e0>] __sock_create+0x110/0x260
Oct 22 01:26:10 TEST kernel: [<ffffffff81511ab1>] SyS_socket+0x61/0xf0
Oct 22 01:26:10 TEST kernel: [<ffffffff8163d948>] ? page_fault+0x28/0x30
Oct 22 01:26:10 TEST kernel: [<ffffffff81645ec9>]
system_call_fastpath+0x16/0x1b
Oct 22 01:26:10 TEST kernel: Code: 66 90 48 89 01 31 c0 66 66 90 c3 b8 f2
ff ff ff 66 66 90 c3 90 90 90 90 90 90 90 90 90 90 90 90 90 90 55 48 89 e5
f0 ff 07 f3 90 <83> 3f 01 75 f9 f0 ff 0f 75 f1 5d c3 66 66 2e 0f 1f 84 00
00 00
Info in /proc:
PF_RING Version : 6.4.1 ((detached:eded7ce625e60620639050575b1fba
0aa0412374)
Total rings : 16
Standard (non ZC) Options
Ring slots : 65534
Slot version : 16
Capture TX : No [RX only]
IP Defragment : No
Socket Mode : Standard
Total plugins : 0
Cluster Fragment Queue : 0
Cluster Fragment Discard : 0
cpu and memory info:
Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz * 2
total used free shared buff/cache
available
Mem: 62G 33G 25G 784M 4.0G
27G
Swap: 31G 0B 31G
Any insight would be helpful.