Mailing List Archive

Kernel warning on driver-domain
Hi,

I am getting semi-regular warnings in dmesg on a driver domain about block devices that it shares with other domains.

The dmesg/kernel trace is at the bottom of the email.
All the traces show the same call trace, and most of the numbers look similar.
It's always a different domain and xvda device that is listed.

Versions (Dom0 and the driver domain are kept at the same versions):
Kernel: 5.10.52
Xen: 4.14.2

Apart from these warnings, I do not see any issues. The domains are running stably and there are no messages in their logs.

In this particular case, the domain is practically idle; xvda2 holds the swap partition, which isn't actually in use.

The previous cases were similar: only the dmesg message on the driver domain, and nothing in any of the other domains.
A scrub had been running on a different pool, but it finished at 17:19:35, nearly 30 minutes before this warning.

I searched for these warnings, but apart from a few possible matches on much older kernels, I couldn't find anything related.

====
[Fri Sep 3 17:48:01 2021] ------------[ cut here ]------------
[Fri Sep 3 17:48:01 2021] WARNING: CPU: 0 PID: 27163 at kernel/kthread.c:83 to_kthread+0x6/0x10
[Fri Sep 3 17:48:01 2021] Modules linked in: rpcsec_gss_krb5 target_core_iblock iscsi_target_mod target_core_mod xt_recent ipt_REJECT nf_reject_ipv4 xt_multiport xt_tcpudp xt_conntrack xt_hashlimit xt_addrtype xt_mark iptable_mangle iptable_nat iptable_raw nfnetlink_log xt_NFLOG nf_log_ipv4 nf_log_common xt_LOG nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv4 nfnetlink iptable_filter ip_tables x_tables dm_queue_length zfs(PO) zunicode(PO) zzstd(O) zlua(O) zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel mpt3sas aesni_intel scsi_transport_sas
[Fri Sep 3 17:48:01 2021] CPU: 0 PID: 27163 Comm: 18.xvda2-0 Tainted: P W O 5.10.52-gentoo-generic #1
[Fri Sep 3 17:48:01 2021] RIP: e030:to_kthread+0x6/0x10
[Fri Sep 3 17:48:01 2021] Code: 99 22 e3 81 e8 6f 73 3c 00 48 98 c3 e8 82 24 f9 ff 66 90 c3 e8 5a 24 f9 ff 66 90 c3 e8 8e 18 02 00 66 90 c3 f6 47 26 20 75 02 <0f> 0b 48 8b 87 a8 04 00 00 c3 e8 eb ff ff ff 48 8b 40 18 c3 65 48
[Fri Sep 3 17:48:01 2021] RSP: e02b:ffffc90020267b90 EFLAGS: 00010046
[Fri Sep 3 17:48:01 2021] RAX: ffff889886e2a2c0 RBX: ffffc90020267c10 RCX: ffff889886f00000
[Fri Sep 3 17:48:01 2021] RDX: 0000000000000000 RSI: ffffc90020267c10 RDI: ffff889140a38ec0
[Fri Sep 3 17:48:01 2021] RBP: ffff889140a38ec0 R08: 000188e909deff76 R09: 0000000000000040
[Fri Sep 3 17:48:01 2021] R10: 0000000000000004 R11: 0000000000000001 R12: 0000000000000000
[Fri Sep 3 17:48:01 2021] R13: 0000000000000000 R14: 0000000000000000 R15: ffff889140a38f68
[Fri Sep 3 17:48:01 2021] FS: 0000000000000000(0000) GS:ffff889886e00000(0000) knlGS:0000000000000000
[Fri Sep 3 17:48:01 2021] CS: 10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Sep 3 17:48:01 2021] CR2: 000055e9dec50978 CR3: 000000011ed0c000 CR4: 0000000000050660
[Fri Sep 3 17:48:01 2021] Call Trace:
[Fri Sep 3 17:48:01 2021] kthread_is_per_cpu+0x5/0x16
[Fri Sep 3 17:48:01 2021] can_migrate_task+0x65/0x17a
[Fri Sep 3 17:48:01 2021] load_balance+0x3e8/0x83e
[Fri Sep 3 17:48:01 2021] newidle_balance+0x1d0/0x2a7
[Fri Sep 3 17:48:01 2021] pick_next_task_fair+0x196/0x1f7
[Fri Sep 3 17:48:01 2021] __schedule+0x1ab/0x515
[Fri Sep 3 17:48:01 2021] ? _raw_spin_unlock_irqrestore+0xd/0xe
[Fri Sep 3 17:48:01 2021] ? __mod_timer+0x21f/0x246
[Fri Sep 3 17:48:01 2021] schedule+0x73/0x99
[Fri Sep 3 17:48:01 2021] schedule_timeout+0x9e/0xd7
[Fri Sep 3 17:48:01 2021] ? __next_timer_interrupt+0xe3/0xe3
[Fri Sep 3 17:48:01 2021] xen_blkif_schedule+0x251/0xb5e
[Fri Sep 3 17:48:01 2021] ? __wake_up_locked_sync_key+0x15/0x15
[Fri Sep 3 17:48:01 2021] ? __schedule+0x4f2/0x515
[Fri Sep 3 17:48:01 2021] ? arch_local_irq_disable+0x5/0x8
[Fri Sep 3 17:48:01 2021] ? arch_local_irq_save+0x11/0x17
[Fri Sep 3 17:48:01 2021] ? xen_blkif_be_int+0x22/0x22
[Fri Sep 3 17:48:01 2021] kthread+0xea/0xef
[Fri Sep 3 17:48:01 2021] ? kthread_mod_delayed_work+0xb8/0xb8
[Fri Sep 3 17:48:01 2021] ret_from_fork+0x22/0x30
[Fri Sep 3 17:48:01 2021] ---[ end trace 9280d21445c4f71b ]---
====
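Since each warning names a different domain and device in its "Comm:" line, the traces can be tallied per source. The following is only a minimal sketch: it assumes the backend thread name has the layout "<domid>.<device>-<ring>" (as in "18.xvda2-0" above), and the function name is made up for illustration.

```python
import re
from collections import Counter

# Assumed thread-name layout "<domid>.<device>-<ring>", e.g. "18.xvda2-0".
COMM_RE = re.compile(r"Comm: (\d+)\.(\w+)-(\d+)\b")

def blkback_warning_sources(dmesg_text):
    """Count (domid, device) pairs named in 'Comm:' lines of the given dmesg text."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = COMM_RE.search(line)
        if match:
            domid, device, _ring = match.groups()
            counts[(int(domid), device)] += 1
    return counts

# The "Comm:" line from the trace above.
sample = ("[Fri Sep 3 17:48:01 2021] CPU: 0 PID: 27163 Comm: 18.xvda2-0 "
          "Tainted: P W O 5.10.52-gentoo-generic #1")
print(blkback_warning_sources(sample))
```

Feeding it the full dmesg output would show whether particular domains or devices recur more often than others.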
Re: Kernel warning on driver-domain
On 04.09.21 08:45, J. Roeleveld wrote:
> Hi,
>
> I am getting semi-regular warnings in the dmesg about block devices shared to other Domains from a driver-domain.
>
> The dmesg/kernel trace is at the bottom of the email.
> All the traces show the same "call trace" and most of the numbers looks similar.
> It's always a different domain and xvda-device that is listed.
>
> Versions (Dom0 and driver domain are kept at same versions)
> Kernel: 5.10.52
> Xen: 4.14.2

You are missing kernel patch 3a7956e25e1d7b3c148569e78895e1f3178122a9
which is in 5.10.62.
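
For anyone comparing their own kernel against that release, a minimal sketch of the version check follows. It assumes the release string starts with major.minor.patch, as in "5.10.52-gentoo-generic", and it only speaks for the 5.10.y series; the helper names are made up for illustration.

```python
import re

FIXED_IN = (5, 10, 62)  # first 5.10.y release carrying the patch, per this message

def kernel_version_tuple(release):
    """Extract (major, minor, patch) from a release string like '5.10.52-gentoo-generic'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())

def has_fix(release):
    """True if a 5.10.y release is at or past the release that contains the patch."""
    version = kernel_version_tuple(release)
    if version[:2] != FIXED_IN[:2]:
        raise ValueError("only the 5.10.y series is covered by this check")
    return version >= FIXED_IN

print(has_fix("5.10.52-gentoo-generic"))
print(has_fix("5.10.62-gentoo-generic"))
```

On a live system the argument could come from platform.release().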


Juergen
Re: Kernel warning on driver-domain (Worse, boot-failure using kernel 5.10.62)
On Sunday, September 5, 2021 2:32:20 PM CEST Juergen Gross wrote:
> On 04.09.21 08:45, J. Roeleveld wrote:
> > Hi,
> >
> > I am getting semi-regular warnings in the dmesg about block devices shared
> > to other Domains from a driver-domain.
> >
> > The dmesg/kernel trace is at the bottom of the email.
> > All the traces show the same "call trace" and most of the numbers looks
> > similar. It's always a different domain and xvda-device that is listed.
> >
> > Versions (Dom0 and driver domain are kept at same versions)
> > Kernel: 5.10.52
> > Xen: 4.14.2
>
> You are missing kernel patch 3a7956e25e1d7b3c148569e78895e1f3178122a9
> which is in 5.10.62.

Thank you for this.
I just tried booting with 5.10.62, but it fails with the error below. I found
a report of a similar issue, though not an exact match (that report is for an
e1000e NIC, and I'm using mpt3sas), at:
https://lkml.org/lkml/2021/8/26/500

However, I'm not sure whether this is actually the issue I am encountering. I
am currently back on 5.10.52, as that seems stable for now.

For this particular driver domain, I pass two HBAs (LSI SAS3008) to it using
PCI passthrough. The relevant line in the config:
pci = ['81:00.0,permissive=1','82:00.0,permissive=1']

I added "permissive" some time ago because of warnings/errors (I can't fully
remember which), as per:
https://wiki.xenproject.org/wiki/Xen_PCI_Passthrough#PV_guests_and_PCI_quirks
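
To make the structure of that pci line explicit, here is a minimal sketch that splits each entry into its BDF address and its options. It assumes the line is a simple one-line list of quoted strings, as above, and is not a general parser for xl config files; the function name is made up for illustration.

```python
import ast

def parse_pci_line(line):
    """Parse "pci = ['BDF,key=value', ...]" into a list of (bdf, options) pairs."""
    _, _, value = line.partition("=")          # split at the first "=" only
    entries = ast.literal_eval(value.strip())  # the list literal of quoted strings
    devices = []
    for entry in entries:
        bdf, *opts = entry.split(",")
        options = dict(opt.split("=", 1) for opt in opts)
        devices.append((bdf, options))
    return devices

config_line = "pci = ['81:00.0,permissive=1','82:00.0,permissive=1']"
for bdf, options in parse_pci_line(config_line):
    print(bdf, options)
```

Both HBAs come out with the same single option, permissive=1.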

Does anyone have any ideas or suggestions?

====
BUG: unable to handle page fault for address: ffffc9000c62900c
#PF: supervisor write access in kernel mode
#PF: error_code(0x0003) - permissions violation
PGD 1793aa4067 P4D 1793aa4067 PUD 1793aa3067 PMD 1000e1067 PTE 80100000fbe4e075
Oops: 0003 [#1] SMP NOPTI
CPU: 3 PID: 713 Comm: udevd Tainted: P O 5.10.62-gentoo-generic #1
RIP: e030:__pci_enable_msix_range+0x104/0x477
Code: 01 89 d6 89 54 24 08 c1 e6 04 e8 e4 9e bd ff 48 85 c0 49 89 c7 0f 84 2c 03 00 00 8b 54 24 08 48 8d 48 0c be 01 00 00 00 31 c0 <89> 31 ff c0 48 83 c1 10 39 c2 7f f4 48 89 ef e8 4d f8 ff ff 4d 85
RSP: e02b:ffffc9000dd3bab0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9000c62900c
RDX: 0000000000000060 RSI: 0000000000000001 RDI: 0000000000000000
RBP: ffff888100b89000 R08: 0000000000000000 R09: 00000000fbe4e5ff
R10: 00000000000fbe4e R11: 00000000000fbe4e R12: 0000000000000004
R13: 0000000000000000 R14: ffffc9000dd3bbd0 R15: ffffc9000c629000
FS: 00007fcf52a2c740(0000) GS:ffff889886f80000(0000) knlGS:0000000000000000
CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8000007df270 CR3: 00000001040c6000 CR4: 0000000000050660
Call Trace:
pci_alloc_irq_vectors_affinity+0x6f/0xe8
mpt3sas_base_map_resources+0x4e9/0x7c2 [mpt3sas]
mpt3sas_base_attach+0x113/0x17d3 [mpt3sas]
_scsih_probe+0x753/0x850 [mpt3sas]
pci_device_probe+0xc6/0x135
really_probe+0x144/0x326
driver_probe_device+0x63/0x92
device_driver_attach+0x37/0x50
__driver_attach+0x92/0x9a
? device_driver_attach+0x50/0x50
bus_for_each_dev+0x6e/0xa4
bus_add_driver+0x103/0x1b4
driver_register+0x99/0xd2
? 0xffffffffa01fb000
_mpt3sas_init+0x1a7/0x1000 [mpt3sas]
do_one_initcall+0x72/0x16c
? kmem_cache_alloc_trace+0xdb/0x102
do_init_module+0x56/0x1f3
__do_sys_finit_module+0x94/0xbb
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fcf52b878d9
Code: 48 8d 3d da 60 0c 00 0f 05 eb a4 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 77 25 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffef026c478 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 000055f2b189e260 RCX: 00007fcf52b878d9
RDX: 0000000000000000 RSI: 000055f2b188b720 RDI: 0000000000000009
RBP: 0000000000020000 R08: 0000000000000000 R09: 000055f2b18a0140
R10: 0000000000000009 R11: 0000000000000246 R12: 000055f2b188b720
R13: 0000000000000000 R14: 000055f2b18a1320 R15: 0000000000000000
Modules linked in: zfs(PO+) zunicode(PO) zzstd(O) zlua(O) zcommon(PO) znvpair(PO) zavl(PO) icp(PO) spl(O) crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel mpt3sas(+) scsi_transport_sas aesni_intel
CR2: ffffc9000c62900c
---[ end trace 96215648c76c40ac ]---
RIP: e030:__pci_enable_msix_range+0x104/0x477
Code: 01 89 d6 89 54 24 08 c1 e6 04 e8 e4 9e bd ff 48 85 c0 49 89 c7 0f 84 2c 03 00 00 8b 54 24 08 48 8d 48 0c be 01 00 00 00 31 c0 <89> 31 ff c0 48 83 c1 10 39 c2 7f f4 48 89 ef e8 4d f8 ff ff 4d 85
RSP: e02b:ffffc9000dd3bab0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffc9000c62900c
RDX: 0000000000000060 RSI: 0000000000000001 RDI: 0000000000000000
RBP: ffff888100b89000 R08: 0000000000000000 R09: 00000000fbe4e5ff
R10: 00000000000fbe4e R11: 00000000000fbe4e R12: 0000000000000004
R13: 0000000000000000 R14: ffffc9000dd3bbd0 R15: ffffc9000c629000
FS: 00007fcf52a2c740(0000) GS:ffff889886f80000(0000) knlGS:0000000000000000
CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff8000007df270 CR3: 00000001040c6000 CR4: 0000000000050660
udevd[679]: worker [713] terminated by signal 9 (Killed)
udevd[679]: worker [713] failed while handling '/devices/pci-0/pci0000:00/0000:00:00.0'
====
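
As an aid to reading the oops above, the page-fault error code and the PTE value can be decoded bit by bit. This is a minimal sketch using the architectural x86-64 bit positions; the function names are made up for illustration, and the sketch only describes the mapping, it does not identify the cause.

```python
# x86 page-fault error code bits (architectural definitions).
PF_BITS = {
    0: "present (protection violation)",
    1: "write",
    2: "user mode",
    3: "reserved bit set",
    4: "instruction fetch",
}

# Low x86-64 PTE flag bits, plus NX at bit 63.
PTE_BITS = {0: "P", 1: "RW", 2: "US", 3: "PWT", 4: "PCD", 5: "A", 6: "D", 63: "NX"}

def decode_pf_error(code):
    """Return the names of the bits set in a page-fault error code."""
    return [name for bit, name in PF_BITS.items() if code & (1 << bit)]

def decode_pte(pte):
    """Return the names of the flag bits set in a page table entry."""
    return [name for bit, name in PTE_BITS.items() if pte & (1 << bit)]

print(decode_pf_error(0x0003))         # error_code from the "#PF:" line
print(decode_pte(0x80100000fbe4e075))  # PTE value from the page-table walk line
```

The error code decodes to a write to a present page, with the user-mode bit clear, and the PTE has the RW bit clear, i.e. the page is mapped read-only. That is consistent with the "supervisor write access" and "permissions violation" lines in the oops.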