Mailing List Archive

repeated Kernel oops need help to debug
Hi All,
We have a problem that has been ongoing for more than a month.

We have several servers running XCP-ng, and we are facing kernel oopses
that crash the server.

My skills are not enough to debug the issue, so I need someone to point me
in the right direction.
The issue is not hardware related:
it has occurred on servers with different processors, NICs, and even
different kernel versions (all under 4.19).

The stack trace looks like this:

[2399526.430672] ALERT: BUG: unable to handle kernel NULL pointer
dereference at 0000000000000004
[2399526.430695] INFO: PGD 447268067 P4D 447268067 PUD 44775f067 PMD 0
[2399526.430710] WARN: Oops: 0000 [#1] SMP NOPTI
[2399526.430720] WARN: CPU: 1 PID: 17 Comm: ksoftirqd/1 Not tainted
4.19.108 #1
[2399526.430728] WARN: Hardware name: HP ProLiant SL230s Gen8 /, BIOS
P75 05/24/2019
[2399526.430745] WARN: RIP: e030:pfifo_fast_dequeue+0xc9/0x140
[2399526.430753] WARN: Code: 50 28 48 8b 4f 58 f7 da 65 01 51 04 48 8b 57
50 65 48 03 15 11 64 99 7e 8b 88 cc 00 00 00 be 01 00 00 00 48 03 88 d0 00
00 00 <66> 83 79 04 00 74 04 0f b7 71 06 8b 48 28 01 72 08 48 01 0a f0 ff
[2399526.430773] WARN: RSP: e02b:ffffc900400c3de0 EFLAGS: 00010246
[2399526.430780] WARN: RAX: ffff88842087b900 RBX: 0000000000000001 RCX:
0000000000000000
[2399526.430789] WARN: RDX: ffffe8fffee60a1c RSI: 0000000000000001 RDI:
ffff8883de0b9c00
[2399526.430801] WARN: RBP: 0000000000000000 R08: 0000000000000000 R09:
0000000000000020
[2399526.430811] WARN: R10: 0000000000000000 R11: ffff8883de0b9d40 R12:
0000000000000001
[2399526.430823] WARN: R13: ffff8883db210a00 R14: 0000000000000002 R15:
ffff8883de0b9c00
[2399526.430852] WARN: FS: 00007ffac43fe700(0000)
GS:ffff888451240000(0000) knlGS:0000000000000000
[2399526.430868] WARN: CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[2399526.430879] WARN: CR2: 0000000000000004 CR3: 000000044ad58000 CR4:
0000000000040660
[2399526.430899] WARN: Call Trace:
[2399526.430914] WARN: __qdisc_run+0xa2/0x4f0
[2399526.430928] WARN: ? __switch_to_asm+0x41/0x70
[2399526.430940] WARN: net_tx_action+0x148/0x230
[2399526.430949] WARN: __do_softirq+0xd1/0x28c
[2399526.430966] WARN: run_ksoftirqd+0x26/0x40
[2399526.430980] WARN: smpboot_thread_fn+0x10e/0x160
[2399526.430993] WARN: kthread+0xf8/0x130
[2399526.431004] WARN: ? sort_range+0x20/0x20
[2399526.431010] WARN: ? kthread_bind+0x10/0x10
[2399526.431017] WARN: ret_from_fork+0x35/0x40
[2399526.431027] WARN: Modules linked in: act_police cls_basic
sch_ingress sch_tbf tun rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 nfs
lockd grace fscache bnx2fc cnic uio fcoe libfcoe libfc scsi_transport_fc
openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp
stp llc ipt_REJECT nf_reject_ipv4 dm_multipath xt_tcpudp xt_multiport
xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter
sunrpc hid_generic sb_edac intel_powerclamp crct10dif_pclmul crc32_pclmul
ghash_clmulni_intel pcbc dm_mod aesni_intel aes_x86_64 crypto_simd cryptd
glue_helper intel_rapl_perf psmouse lpc_ich usbhid hid sg hpilo ipmi_si
ipmi_devintf ipmi_msghandler acpi_power_meter ip_tables x_tables raid1
md_mod sd_mod serio_raw uhci_hcd ahci libahci igb libata ehci_pci ehci_hcd
bnx2x mdio libcrc32c mpt3sas
[2399526.431154] WARN: raid_class scsi_transport_sas scsi_dh_rdac
scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod ipv6 crc_ccitt
[2399526.431177] WARN: CR2: 0000000000000004
[2399526.431189] WARN: ---[ end trace 32a268c3653eb10c ]---
[2399526.431201] WARN: RIP: e030:pfifo_fast_dequeue+0xc9/0x140
[2399526.431212] WARN: Code: 50 28 48 8b 4f 58 f7 da 65 01 51 04 48 8b 57
50 65 48 03 15 11 64 99 7e 8b 88 cc 00 00 00 be 01 00 00 00 48 03 88 d0 00
00 00 <66> 83 79 04 00 74 04 0f b7 71 06 8b 48 28 01 72 08 48 01 0a f0 ff
[2399526.431238] WARN: RSP: e02b:ffffc900400c3de0 EFLAGS: 00010246
[2399526.431247] WARN: RAX: ffff88842087b900 RBX: 0000000000000001 RCX:
0000000000000000
[2399526.431260] WARN: RDX: ffffe8fffee60a1c RSI: 0000000000000001 RDI:
ffff8883de0b9c00
[2399526.431270] WARN: RBP: 0000000000000000 R08: 0000000000000000 R09:
0000000000000020
[2399526.431280] WARN: R10: 0000000000000000 R11: ffff8883de0b9d40 R12:
0000000000000001
[2399526.431289] WARN: R13: ffff8883db210a00 R14: 0000000000000002 R15:
ffff8883de0b9c00
[2399526.431307] WARN: FS: 00007ffac43fe700(0000)
GS:ffff888451240000(0000) knlGS:0000000000000000
[2399526.431319] WARN: CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[2399526.431331] WARN: CR2: 0000000000000004 CR3: 000000044ad58000 CR4:
0000000000040660
[2399526.431355] EMERG: Kernel panic - not syncing: Fatal exception in
interrupt
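A side note on reading the dump (my annotation, not part of the original report): the faulting bytes in the Code: line, <66> 83 79 04 00, decode to cmpw $0x0,0x4(%rcx), a 2-byte read at %rcx + 4. With RCX = 0 in the register dump, that fault address is 4, which matches the reported CR2 — a NULL struct pointer whose field at offset 4 is being read. A quick sanity check of that arithmetic:

```shell
#!/bin/sh
# Values copied from the register dump above; the faulting instruction
# reads 16 bits at %rcx + 4, so the fault address should equal RCX + 4.
rcx=$((0x0000000000000000))   # RCX from the dump (a NULL struct pointer)
disp=$((0x4))                 # displacement in cmpw $0x0,0x4(%rcx)
cr2=$((0x0000000000000004))   # fault address reported in CR2
if [ $((rcx + disp)) -eq "$cr2" ]; then
    echo "CR2 == RCX + 4: NULL pointer dereference at offset 4"
fi
```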


The Xen crash analyzer generates many other files as well:
dmesg.kexec.log
dom0.log ( for each dom )
dom0.structures.log ( for each dom )
....
lspci-tv.out
lspci-vv.out
lspci-vvxxxx.out
readelf-Wl.out
readelf-Wn.out
time-v.out
xen.log
xen.pcpu0.stack.log ( for each pcpu )
...
xen-crashdump-analyser.log

The trace can also be seen in the xen.log file:
Call Trace:
[ffffffff810014aa] xen_hypercall_kexec_op+0xa/0x20
ffffffff81071f85 panic+0x111/0x27c
ffffffff81027a7f oops_end+0xcf/0xd0
ffffffff8105da63 no_context+0x1b3/0x3c0
ffffffff816c0223 inet_gro_receive+0x213/0x2b0
ffffffff8105e32a __do_page_fault+0xaa/0x4f0
ffffffff8162cd44 netif_receive_skb_internal+0x34/0xe0
ffffffff81800f6e page_fault+0x1e/0x30
ffffffff81663ac9 pfifo_fast_dequeue+0xc9/0x140
ffffffff81663f38 __qdisc_run+0xa8/0x4e0
ffffffff816290c8 net_tx_action+0x148/0x220
ffffffff81a000d1 __softirqentry_text_start+0xd1/0x28c
ffffffff81077ff6 run_ksoftirqd+0x26/0x40
ffffffff8109763e smpboot_thread_fn+0x10e/0x160
ffffffff81093b68 kthread+0xf8/0x130
ffffffff81097530 smpboot_thread_fn+0/0x160
ffffffff81093a70 kthread+0/0x130
ffffffff81800215 ret_from_fork+0x35/0x40

I used a tool to map the trace to the source code where the issue occurred:
./decode_stacktrace.sh /usr/lib/debug/lib/modules/4.19.108/vmlinux
/usr/lib/debug/lib/modules/4.19.108/ < ./trace2 > out3

and this is the output:

[ffffffff810014aa] xen_hypercall_kexec_op (arch/x86/kernel/.tmp_head_64.o:?)
ffffffff81071f85 panic (/usr/src/debug/kernel-4.19.19/kernel/panic.c:209)
ffffffff81027a7f oops_end
(/usr/src/debug/kernel-4.19.19/arch/x86/kernel/dumpstack.c:352)
ffffffff8105da63 no_context
(/usr/src/debug/kernel-4.19.19/arch/x86/mm/fault.c:808)
ffffffff816c0223 inet_gro_receive
(/usr/src/debug/kernel-4.19.19/include/linux/skbuff.h:2350
/usr/src/debug/kernel-4.19.19/net/ipv4/af_inet.c:1495)
ffffffff8105e32a __do_page_fault
(/usr/src/debug/kernel-4.19.19/arch/x86/mm/fault.c:1435)
ffffffff8162cd44 netif_receive_skb_internal
(/usr/src/debug/kernel-4.19.19/net/core/dev.c:5152)
ffffffff81800f6e page_fault
(/usr/src/debug////////kernel-4.19.19/arch/x86/entry/entry_64.S:1204)
ffffffff81663ac9 pfifo_fast_dequeue
(/usr/src/debug/kernel-4.19.19/include/net/sch_generic.h:723
/usr/src/debug/kernel-4.19.19/include/net/sch_generic.h:740
/usr/src/debug/kernel-4.19.19/include/net/sch_generic.h:747
/usr/src/debug/kernel-4.19.19/net/sched/sch_generic.c:677)
ffffffff81663f38 __qdisc_run
(/usr/src/debug/kernel-4.19.19/net/sched/sch_generic.c:283
/usr/src/debug/kernel-4.19.19/net/sched/sch_generic.c:385
/usr/src/debug/kernel-4.19.19/net/sched/sch_generic.c:403)
ffffffff816290c8 net_tx_action
(/usr/src/debug/kernel-4.19.19/include/linux/seqlock.h:235
/usr/src/debug/kernel-4.19.19/include/linux/seqlock.h:388
/usr/src/debug/kernel-4.19.19/include/net/sch_generic.h:145
/usr/src/debug/kernel-4.19.19/include/net/pkt_sched.h:121
/usr/src/debug/kernel-4.19.19/net/core/dev.c:4595)
ffffffff81a000d1 __softirqentry_text_start
(/usr/src/debug/kernel-4.19.19/kernel/softirq.c:292
/usr/src/debug/kernel-4.19.19/include/linux/jump_label.h:138
/usr/src/debug/kernel-4.19.19/include/trace/events/irq.h:142
/usr/src/debug/kernel-4.19.19/kernel/softirq.c:293)
ffffffff81077ff6 run_ksoftirqd
(/usr/src/debug/kernel-4.19.19/arch/x86/include/asm/paravirt.h:799
/usr/src/debug/kernel-4.19.19/kernel/softirq.c:654)
ffffffff8109763e smpboot_thread_fn
(/usr/src/debug/kernel-4.19.19/kernel/smpboot.c:164)
ffffffff81093b68 kthread
(/usr/src/debug/kernel-4.19.19/kernel/kthread.c:246)
ffffffff81097530 smpboot_thread_fn+0/0x160
ffffffff81093a70 kthread+0/0x130
ffffffff81800215 ret_from_fork
(/usr/src/debug////////kernel-4.19.19/arch/x86/entry/entry_64.S:421)

Based on that, the issue occurred when calling:

    _bstats_update(&bstats->bstats, bytes, packets);

(see https://elixir.bootlin.com/linux/v4.19.128/C/ident/_bstats_update)

That's as far as I can get; I'm not sure how to debug further to find the
root cause and fix it.
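If the matching vmlinux for decode_stacktrace.sh is not at hand, the single faulting frame can also be resolved by hand: pull the symbol+offset out of the RIP line and feed it to gdb later. A small sketch (the sample line is copied from the trace above; the vmlinux path in the comment is the one used earlier in this thread and may differ on your system):

```shell
#!/bin/sh
# Extract "symbol+0xoffset" from the RIP line of the oops.
rip_line='[2399526.430745] WARN: RIP: e030:pfifo_fast_dequeue+0xc9/0x140'
sym_off=$(printf '%s\n' "$rip_line" \
    | sed -n 's/.*RIP: [^:]*:\([A-Za-z0-9_.]*+0x[0-9a-f]*\)\/.*/\1/p')
echo "$sym_off"   # prints: pfifo_fast_dequeue+0xc9
# With debug symbols installed this can then be resolved to a source line:
#   gdb -batch -ex "list *($sym_off)" \
#       /usr/lib/debug/lib/modules/4.19.108/vmlinux
```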
Re: repeated Kernel oops need help to debug [ In reply to ]
We have had crashes on one machine. I don't have details because I don't
have broadband right now.

Some things that stuck out to me:
the inability to sync makes me wonder if it is a filesystem issue.
Also, did you install the netdata package? Some of the errors seem to be
related to statistics. Our crashes happened after installing netdata, but
they are only occurring on one machine.

Hypothesis: netdata needs to read and write the filesystem more, which has
exposed filesystem corruption.

What fs is it? extX, or LVM/extX, or something else?

Read out the fs metadata and fs check configuration with:
tune2fs -l /dev/sdaX
Replace X with 1, 2, 3 ... whatever partitions you have.

Then triple-check these parameters, as I am far from fast internet at the
moment. I have done the following tons of times on normal Linux machines,
but not so much with a Xen kernel. Is the Dom0 VM fs the same as the
underlying kernel's? Set the fs check interval and mount count with:

tune2fs -i 1d /dev/sdaX
tune2fs -c 1 /dev/sdaX

e2fsck -c -c -C0 -D /dev/sdaX
can be used to check for both read and write, report progress to stdout,
and optimize the directory layout. But this would require booting from an
install ISO or live media.


On Sun, Jul 26, 2020 at 9:51 AM moftah moftah <mofta7y@gmail.com> wrote:

> [...]
Re: repeated Kernel oops need help to debug [ In reply to ]
Hi Rob,

Thanks for looking into it

>Inability to sync would make me wonder if the filesystem issue.
I rebooted one server to do an fsck and it was fine.
Also, the OS is installed on mirrored RAID (software RAID using mdraid), so
I don't think it is a data corruption issue.

The hypervisor and dom0 are running from an mdraid mirror of 2x 200 GB SSDs.

netdata is not installed. Although, when we installed netdata on some
servers in the past, we also had some kernel panics, so we removed it and
that resolved the issue.

I'm still not sure what the issue is.

On Sun, Jul 26, 2020 at 1:04 PM Rob Townley <rob.townley@gmail.com> wrote:

> [...]
Re: repeated Kernel oops need help to debug [ In reply to ]
On 26.07.20 17:47, moftah moftah wrote:
> [...]

I wonder whether you are missing all fixes for commit 021a17ed796b
which went into kernel 4.18. It needs the following fixes on top:

d518d2ed8640 (went into 5.4), 90b2be27bb0e (went into 5.5).

From the backtrace I really doubt this is a Xen problem, BTW. Maybe
running under Xen makes the problem more likely due to different
timing.
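As a quick sanity check, one could map each of those commits to the release it first landed in and flag what a plain 4.19.y tree would be missing. The sketch below is illustrative only (the release mapping is taken from this mail; it deliberately ignores any fixes the stable maintainers may already have backported):

```python
# Map each commit from the mail to the mainline release it first appeared in,
# then flag the ones a given kernel version would lack without backports.
FIXES = {
    "021a17ed796b": (4, 18),   # base commit (went into 4.18)
    "d518d2ed8640": (5, 4),    # fix on top (went into 5.4)
    "90b2be27bb0e": (5, 5),    # fix on top (went into 5.5)
}

def missing_fixes(version: str):
    """Return the commits first released after the given kernel version."""
    major, minor = (int(x) for x in version.split(".")[:2])
    return [commit for commit, first_in in FIXES.items()
            if first_in > (major, minor)]

print(missing_fixes("4.19.108"))   # both follow-up fixes absent from 4.19
```

So a stock 4.19.108 tree would carry the base commit but, absent backports, neither of the two follow-up fixes.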


Juergen
Re: repeated Kernel oops need help to debug [ In reply to ]
Hi Juergen,

This seems very related to my issue,
but I wonder why the fix was not backported to all versions: why was it only
added to the 4.18 and 5.4 branches while the 4.19 branch was skipped?

I could try to apply the patch manually (if it applies) to the 4.19 branch and
test it locally.

On Mon, Jul 27, 2020 at 3:27 AM Jürgen Groß <jgross@suse.com> wrote:

> On 26.07.20 17:47, moftah moftah wrote:
> > Hi All,
> > We have a problem that is ongoing for more than 1 month
> >
> > We have several servers running xcp-ng and we are facing kernel oops
> > that crash the server
> >
> > My skill is not enough to debug the issue So need someone to point me to
> > the right direction
> > the issue is not hardware related
> > it occurred on servers that are of different processor , nic and even
> > kernel version (all under 4.19)
> >
> > the stack trace looks like this
> >
> > [full oops trace snipped]
>
> I wonder whether you are missing all fixes for commit 021a17ed796b
> which went into kernel 4.18. It needs following fixes on top:
>
> d518d2ed8640 (went into 5.4), 90b2be27bb0e (went into 5.5).
>
> From the backtrace I really doubt this is a Xen problem, BTW. Maybe
> running under Xen makes the problem more likely due to different
> timing.
>
>
> Juergen
>
Re: repeated Kernel oops need help to debug [ In reply to ]
This is a bit weird.
I decided to debug the issue further before patching the kernel,
so I went to one server and changed the qdisc on all network interfaces (not
guest interfaces) to fifo instead of pfifo_fast,
and to my shock I got the same panic in pfifo_fast_dequeue!

Another thing I noticed: whenever the issue occurs on any server, I
check the CPU stack of the CPU that had the panic,
and there are always two stacks: one for dom0 and one for the hypervisor.

The hypervisor stack on the panicking CPU always looks like this:
ffff832027be7d20: ffff82d08021831d kexec_crash+0x4d/0x50
ffff832027be7d28: 00000000fffffffe
ffff832027be7d30: ffff82d080218c9d do_kexec_op_internal+0x44d/0x710
ffff832027be7d38: 0000000000040660
ffff832027be7d40: 0000000000000000
ffff832027be7d48: 0000000000000000
ffff832027be7d50: 000000000000000c
ffff832027be7d58: 000000000000000c
ffff832027be7d60: ffff83202780f000
ffff832027be7d68: ffff832027be7da8 .+64
ffff832027be7d70: 000000000000000c
ffff832027be7d78: ffff82d080266750 vga_noop_puts+0/0x10
ffff832027be7d80: ffff82d080249f3a do_console_io+0x41a/0x460
ffff832027be7d88: ffffc90000000001
ffff832027be7d90: ffff83202780fa24
ffff832027be7d98: 000000000000e033

Could the issue be on the hypervisor side, with the dom0 kernel panic
message just being misleading?


On Wed, Jul 29, 2020 at 11:44 AM moftah moftah <mofta7y@gmail.com> wrote:

> Hi Juergen,
>
> This seems very related to my issue
> but I wonder why the fix was not backtracked to all versions why i was
> only added to 4.18 and 5.4 branches and ignored the 4.19 branch
>
> I could try to add the patch manually (if it works ) in 4.19 branch and
> test it locally
>
> On Mon, Jul 27, 2020 at 3:27 AM Jürgen Groß <jgross@suse.com> wrote:
>
>> On 26.07.20 17:47, moftah moftah wrote:
>> > Hi All,
>> > We have a problem that is ongoing for more than 1 month
>> >
>> > We have several servers running xcp-ng and we are facing kernel oops
>> > that crash the server
>> >
>> > My skill is not enough to debug the issue So need someone to point me
>> to
>> > the right direction
>> > the issue is not hardware related
>> > it occurred on servers that are of different processor , nic and even
>> > kernel version (all under 4.19)
>> >
>> > the stack trace looks like this
>> >
>> > [full oops trace snipped]
>>
>> I wonder whether you are missing all fixes for commit 021a17ed796b
>> which went into kernel 4.18. It needs following fixes on top:
>>
>> d518d2ed8640 (went into 5.4), 90b2be27bb0e (went into 5.5).
>>
>> From the backtrace I really doubt this is a Xen problem, BTW. Maybe
>> running under Xen makes the problem more likely due to different
>> timing.
>>
>>
>> Juergen
>>
>
Re: repeated Kernel oops need help to debug [ In reply to ]
On 06.08.20 17:16, moftah moftah wrote:
> this is a bit weird
> I decided to debug the issue before patching the kernel more
> so i went to one server and change the qdisc on all network interfaces
> (not guest interfaces ) to fifo instead of pfifo fast
> and to my shock i got the same panic on pfifo_fast_dequeue !!!!
>
> another thing i noticed that when ever the issue occur on any server i
> check the cpu stack of the cpu that had the panic
> and there is 2 stacks always one for dom0 and the other one for hypervisor
>
> the hypervisor stack always has this on the panicking cpu
>   [hypervisor stack snipped]
>
> Could the issue be the in the hypervisor side and the dom0 kernel panic
> message is just misleading ?

No, a dom0 panic will end up in the hypervisor, which will then try to
trigger kexec to take a dump (if configured).


Juergen
Re: repeated Kernel oops need help to debug [ In reply to ]
Hi Juergen,

I think commit 021a17ed796b is most likely causing the issue, as you explained.
I tried to backport the fixes from 5.4 and 5.5 to 4.19, but that was beyond
my level (the code changed between those versions and I can no longer
clearly see where to apply the fixes).

So as a workaround I reverted 021a17ed796b in 4.19 and
compiled a new kernel.
The new kernel is much more stable: the issue still occurs, but I
would say the frequency is about 10% of what it was before
reverting 021a17ed796b.

Maybe if someone can backport the fixes from 5.4 and 5.5 to 4.19, it will fix
the issue completely (I still get the oops, but less frequently than before).


Thanks

On Thu, Aug 6, 2020 at 11:22 AM Jürgen Groß <jgross@suse.com> wrote:

> On 06.08.20 17:16, moftah moftah wrote:
> > this is a bit weird
> > I decided to debug the issue before patching the kernel more
> > so i went to one server and change the qdisc on all network interfaces
> > (not guest interfaces ) to fifo instead of pfifo fast
> > and to my shock i got the same panic on pfifo_fast_dequeue !!!!
> >
> > another thing i noticed that when ever the issue occur on any server i
> > check the cpu stack of the cpu that had the panic
> > and there is 2 stacks always one for dom0 and the other one for
> hypervisor
> >
> > the hypervisor stack always has this on the panicking cpu
> > [hypervisor stack snipped]
> >
> > Could the issue be the in the hypervisor side and the dom0 kernel panic
> > message is just misleading ?
>
> No, a dom0 panic will end up in the hypervisor which will then try to
> trigger kexec for taking a dump (if configured).
>
>
> Juergen
>
Re: repeated Kernel oops need help to debug [ In reply to ]
On 24.08.20 20:31, moftah moftah wrote:
> Hi Jurgon,
>
> I think the commit 021a17ed796b is mostly causing the issue as you explained
> I tried to port back the fixes in 5.4 and 5.5 back to 4.19 but that was
> out of my level (the code changed between those versions and I no longer
> clearly see where to apply the fixes)
>
> So the other workaround i did is that i reverted 021a17ed796b in 4.19
> and comibled new kernel
> the new kernel is much more stable although the issue still occur but i
> would say the frequency of the occurring is 10% of  what it was before
> reverting 021a17ed796b
>
> maybe if someone can port the fixed in 5.4 and 5.5 back to 4.19 it will
> fix the issue 100% (I still have the oops but the frequency is less than
> before)

You should ask the authors of the two fixup patches to backport them.


Juergen