Mailing List Archive

Need help debugging kernel oops ( BUG: unable to handle page fault for address ? )
Hi All,

I have had a few oopses in the past week already and am trying to find out what the likely
cause is and, more importantly, how to resolve this.

I have tried Google, but have not found anything useful yet. Several results show issues during
boot (not the case here, as it runs successfully for nearly a week) or relate to 2.6 kernel versions.

I have attached the dmesg output from the first time I noticed the "oops" below this email.
(Actually, there were several in a row.)
I was unable to capture last night's output, as the server was frozen by the time I got to it,
which means I am unable to fully confirm whether the pattern was the same. The last message I
could still read was nearly identical to the ones I saw the first time.

I noticed the following section which looks interesting:
===
[317321.524229] BUG: unable to handle page fault for address: ffff888510ebd0e0
[317321.524307] #PF: supervisor write access in kernel mode
[317321.524368] #PF: error_code(0x0003) - permissions violation
===
But I have no idea if this is a cause or a result of the earlier trace messages in the output.
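
For readers trying to interpret the two "#PF" lines quoted above: the error code is a bit field defined by the x86 architecture, so it can be decoded mechanically. The following is a minimal sketch in Python; the function name decode_pf_error_code and the wording of the descriptions are illustrative choices, not kernel identifiers. Only the bit positions are taken from the architecture definition.

```python
from typing import List

# x86 page-fault error code bits (hardware-defined bit positions).
# Each entry: (mask, meaning when set, meaning when clear or None).
PF_BITS = [
    (0x01, "protection violation (page was present)", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "supervisor (kernel) mode"),
    (0x08, "reserved bit set in a page-table entry", None),
    (0x10, "instruction fetch", None),
    (0x20, "protection-key violation", None),
]


def decode_pf_error_code(code: int) -> List[str]:
    """Return human-readable descriptions for a page-fault error code."""
    descriptions = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            descriptions.append(if_set)
        elif if_clear is not None:
            descriptions.append(if_clear)
    return descriptions


if __name__ == "__main__":
    # 0x0003 is the value reported in the oops quoted above.
    for description in decode_pf_error_code(0x0003):
        print(description)
```

For 0x0003 this reports a write, from kernel mode, to a page that is mapped but not writable, which matches the "supervisor write access" and "permissions violation" text the kernel printed. It does not by itself say whether the fault is a cause or a consequence of the earlier warnings.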

I found a new BIOS and Firmware version available for the mainboard, which I am planning on
applying this week.

The kernel is "tainted" because of the use of ZFS. No other out-of-tree modules are installed.
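
As a small aid for reading the "Tainted: P O" field in the dmesg output: each letter is a taint flag. The sketch below decodes a few of them in Python. The table is deliberately partial and the names TAINT_FLAGS and decode_taint are hypothetical helpers for illustration; the kernel documentation has the full list.

```python
from typing import List

# Partial table of kernel taint letters; only a few common flags are listed.
TAINT_FLAGS = {
    "P": "a proprietary (non-GPL-compatible) module was loaded",
    "O": "an out-of-tree module was loaded",
    "W": "a kernel warning (WARN) was issued earlier",
    "D": "the kernel has oopsed earlier",
}


def decode_taint(flags: str) -> List[str]:
    """Describe each taint letter in a string such as 'P O'."""
    return [
        f"{letter}: {TAINT_FLAGS.get(letter, 'not in this partial table')}"
        for letter in flags
        if not letter.isspace()
    ]


if __name__ == "__main__":
    for line in decode_taint("P O"):
        print(line)
```

Both letters are consistent with ZFS being the only taint source here: it is built out of tree and its license is not GPL-compatible.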

My distro: Gentoo
Kernel version: 5.4.38
ZFS version: 0.8.3
XEN version: 4.12.2

If more info is needed to analyse this, please let me know.

Additionally, if anyone has/knows good resources (online preferred, but hardcopy will be fine as
well) I can use to analyse/understand these kernel messages I would definitely appreciate it.

Many thanks in advance,

Joost Roeleveld

DMESG:

[317321.523586] ------------[ cut here ]------------
[317321.523600] WARNING: CPU: 1 PID: 25465 at arch/x86/xen/multicalls.c:102 xen_mc_flush+0x194/0x1c0
[317321.523601] ------------[ cut here ]------------
[317321.523603] Modules linked in:
[317321.523614] WARNING: CPU: 3 PID: 2162 at arch/x86/xen/multicalls.c:102 xen_mc_flush+0x194/0x1c0
[317321.523615] iscsi_tcp libiscsi_tcp libiscsi
[317321.523618] Modules linked in:
[317321.523619] scsi_transport_iscsi nfsd
[317321.523622] iscsi_tcp
[317321.523623] auth_rpcgss nfs_acl lockd
[317321.523626] libiscsi_tcp
[317321.523627] grace sunrpc br_netfilter xt_physdev xen_acpi_processor
[317321.523631] libiscsi
[317321.523632] xen_pciback xen_netback xen_blkback xen_gntalloc xen_gntdev
[317321.523636] scsi_transport_iscsi
[317321.523637] xen_evtchn xenfs xen_privcmd bridge 8021q
[317321.523640] nfsd
[317321.523642] garp mrp stp llc
[317321.523644] auth_rpcgss
[317321.523646] bonding intel_rapl_msr iTCO_wdt
[317321.523648] nfs_acl
[317321.523650] iTCO_vendor_support intel_rapl_common sb_edac intel_powerclamp
[317321.523653] lockd
[317321.523654] crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel intel_rapl_perf
[317321.523658] grace
[317321.523659] pcspkr i2c_i801 ast
[317321.523662] sunrpc
[317321.523663] drm_vram_helper ttm i2c_algo_bit drm_kms_helper
[317321.523666] br_netfilter
[317321.523667] drm lpc_ich mei_me
[317321.523670] xt_physdev
[317321.523671] mei ipmi_ssif ixgbe mdio
[317321.523674] xen_acpi_processor
[317321.523675] ptp ioatdma pps_core dca ipmi_si
[317321.523679] xen_pciback
[317321.523680] ipmi_devintf ipmi_msghandler acpi_power_meter binfmt_misc
[317321.523683] xen_netback
[317321.523684] tun
[317321.523686] xen_blkback
[317321.523687] zfs(PO) zunicode(PO)
[317321.523689] xen_gntalloc
[317321.523690] zavl(PO) icp(PO)
[317321.523692] xen_gntdev
[317321.523694] zcommon(PO) znvpair(PO) spl(O)
[317321.523696] xen_evtchn
[317321.523698] zlua(PO)
[317321.523700] xenfs xen_privcmd bridge 8021q
[317321.523707] CPU: 1 PID: 25465 Comm: zfs_stats_cache Tainted: P O 5.4.38-gentoo-host #1
[317321.523708] garp
[317321.523710] Hardware name: Supermicro Super Server/X10DRi-T4+, BIOS 3.1 06/08/2018
[317321.523711] mrp
[317321.523715] RIP: e030:xen_mc_flush+0x194/0x1c0
[317321.523716] stp llc bonding
[317321.523721] Code: 05 00 10 00 81 e8 ec 13 be 00 48 89 c1 48 89 45 18 48 c1 e9 3f 48 89 ce e9 03 ff ff ff 48 c7 45 18 ea ff ff ff be 01 00 00 00 <0f> 0b 8b 55 00 48 c7 c7 a0 8a fb 81 31 db 65 8b 0d e7 0a ff 7e e8
[317321.523722] intel_rapl_msr iTCO_wdt iTCO_vendor_support
[317321.523726] RSP: e02b:ffffc9000a00bbe8 EFLAGS: 00010002
[317321.523727] intel_rapl_common sb_edac
[317321.523730] intel_powerclamp crct10dif_pclmul crc32_pclmul
[317321.523736] RAX: ffff888686655858 RBX: 0000777f80000000 RCX: ffff888686655858
[317321.523737] crc32c_intel ghash_clmulni_intel
[317321.523741] RDX: 0000000000000001 RSI: 000000000000000d RDI: ffff888686655310
[317321.523742] intel_rapl_perf pcspkr i2c_i801 ast
Need help debugging kernel oops ( BUG: unable to handle page fault for address ? )
Hi All,

I have had a few oopses in the past week already and am trying to find out what the likely
cause is and, more importantly, how to resolve this.

I have tried Google, but have not found anything useful yet. Several results show issues during
boot (not the case here, as it runs successfully for nearly a week) or relate to 2.6 kernel versions.

I have attached the dmesg output from the first time I noticed the "oops". (Actually, there were
several in a row.)
I was unable to capture last night's output, as the server was frozen by the time I got to it,
which means I am unable to fully confirm whether the pattern was the same. The last message I
could still read was nearly identical to the ones I saw the first time.

I also attached a normal dmesg output, taken after boot and all VMs finished starting.

I noticed the following section which looks interesting:
===
[317321.524229] BUG: unable to handle page fault for address: ffff888510ebd0e0
[317321.524307] #PF: supervisor write access in kernel mode
[317321.524368] #PF: error_code(0x0003) - permissions violation
===
But I have no idea if this is a cause or a result of the earlier trace messages in the output.

I found a new BIOS and Firmware version available for the mainboard, which I am planning on
applying this week.

The kernel is "tainted" because of the use of ZFS. No other out-of-tree modules are installed.

My distro: Gentoo
Kernel version: 5.4.38
ZFS version: 0.8.3
XEN version: 4.12.2

If more info is needed to analyse this, please let me know.

Additionally, if anyone has/knows good resources (online preferred, but hardcopy will be fine as
well) I can use to analyse/understand these kernel messages I would definitely appreciate it.

Many thanks in advance,

Joost Roeleveld
Re: Need help debugging kernel oops ( BUG: unable to handle page fault for address ? )
On 03.08.20 08:27, J. Roeleveld wrote:
> Hi All,
>
> I have had a few oopses in the past week already and am trying to find
> out what the likely cause is and, more importantly, how to resolve this.

Could you please apply the attached patch to your kernel? This is
for an issue I've found recently and I have the vague feeling it could
repair the issue you are seeing.

I and others have hit the very same problem as you did in very rare
cases, but we were never able to reproduce it at will. So if you are
hitting it on a regular basis it would help a lot to verify whether
my patch is helping to solve this issue, too.


Juergen
Re: Need help debugging kernel oops ( BUG: unable to handle page fault for address ? )
On Tuesday, 4 August 2020 06:24:44 CEST Jürgen Groß wrote:
> On 03.08.20 08:27, J. Roeleveld wrote:
> > Hi All,
> >
> > I have had a few oopses in the past week already and am trying to find
> > out what the likely cause is and, more importantly, how to resolve this.
>
> Could you please apply the attached patch to your kernel? This is
> for an issue I've found recently and I have the vague feeling it could
> repair the issue you are seeing.
>
> I and others have hit the very same problem as you did in very rare
> cases, but we were never able to reproduce it at will. So if you are
> hitting it on a regular basis it would help a lot to verify whether
> my patch is helping to solve this issue, too.
>
>
> Juergen

I am hitting it, but not predictably, which is rather annoying.
I am currently doing my best to force this and documenting what I am trying.
When I do manage to force the issue, I will definitely try the patch.

Looking at the patch, I am wondering if setting "CONFIG_PREEMPT" might not
also prevent this?

Or, more likely, the "server" option is "CONFIG_PREEMPT_NONE", which might
prevent this particular situation altogether?

I currently have "CONFIG_PREEMPT_VOLUNTARY" set in my config.

Is there any documentation about which of these three is actually recommended
for the Xen Domain0?
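
For anyone comparing setups, here is a minimal Python sketch for checking which of the three preemption models a kernel was built with. It assumes the config is available as text, for example from /proc/config.gz (only present when the kernel was built with CONFIG_IKCONFIG_PROC) or from the .config in the kernel source tree. The function names preempt_model and read_running_config are hypothetical helpers, not an existing tool.

```python
import gzip
from typing import Optional

# The three mutually exclusive preemption model options discussed above.
PREEMPT_OPTIONS = (
    "CONFIG_PREEMPT_NONE",
    "CONFIG_PREEMPT_VOLUNTARY",
    "CONFIG_PREEMPT",
)


def preempt_model(config_text: str) -> Optional[str]:
    """Return the preemption option set to 'y' in a kernel config, if any."""
    for line in config_text.splitlines():
        line = line.strip()
        for option in PREEMPT_OPTIONS:
            # Exact match, so CONFIG_PREEMPT_COUNT=y and similar are ignored.
            if line == f"{option}=y":
                return option
    return None


def read_running_config(path: str = "/proc/config.gz") -> str:
    """Read the running kernel's config; requires CONFIG_IKCONFIG_PROC."""
    with gzip.open(path, "rt") as handle:
        return handle.read()
```

Calling preempt_model(read_running_config()) on the Domain0 should return "CONFIG_PREEMPT_VOLUNTARY" for the configuration described above. This only reports the setting; it does not say which one is recommended for Xen.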

--
Joost