Mailing List Archive

page faults with linux 6.8.1 on Arch Linux
System is an HP z840 running Arch Linux with the xen 4.18.1 packages
from AUR.

On upgrading the kernel to 6.8.1, PV and PVH domUs experience memory
issues on boot:

[    5.782063] BUG: Bad page state in process swapper/0 pfn:0a3c9
[    5.785678] BUG: Bad page state in process dbus-broker pfn:0a3c8
[    5.938143] BUG: Bad page state in process swapper/0 pfn:0a3c7
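A quick shell pipeline can tally how often each process trips this (sample lines from this report are inlined for illustration; on a live domU, feed it `dmesg` output instead):

```shell
# Tally "Bad page state" hits per process from a kernel log.
# LOG holds sample lines here; replace with e.g. LOG="$(dmesg)" on a real system.
LOG=$(cat <<'EOF'
[    5.782063] BUG: Bad page state in process swapper/0 pfn:0a3c9
[    5.785678] BUG: Bad page state in process dbus-broker pfn:0a3c8
[    5.938143] BUG: Bad page state in process swapper/0 pfn:0a3c7
EOF
)
# Extract the "in process NAME" fragment, then count occurrences of each.
printf '%s\n' "$LOG" | grep -o 'in process [^ ]*' | sort | uniq -c | sort -rn
```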

Each BUG comes with an oops in dmesg that has this to say about it:

[   99.567117] BUG: Bad page state in process swapper/0 pfn:0a6e4
[   99.567185] page:00000000ff5b37eb refcount:0 mapcount:0
mapping:0000000000000000 index:0x0 pfn:0xa6e4
[   99.567194] flags: 0x1ffff0000000000(node=0|zone=1|lastcpupid=0xffff)
[   99.567202] page_type: 0xffffffff()
[   99.567210] raw: 01ffff0000000000 dead000000000040 ffff8a9208e92000
0000000000000000
[   99.567215] raw: 0000000000000000 0000000000000001 00000000ffffffff
0000000000000000
[   99.567219] page dumped because: page_pool leak
[   99.567223] Modules linked in: intel_rapl_msr intel_rapl_common
crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic gf128mul
ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel
crypto_simd cryptd pcspkr cfg80211 rfkill fuse loop dm_mod nfnetlink
ip_tables x_tables xfs libcrc32c crc32c_generic crc32c_intel
xen_kbdfront xen_netfront xenfs xen_privcmd xen_fbfront xen_blkfront
[   99.567334] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G B             
6.8.1-arch1-1 #1 52f97d9bb37be6168651745a1a9f8f7240d21ce5
[   99.567341] Call Trace:
[   99.567345]  <IRQ>
[   99.567350]  dump_stack_lvl+0x47/0x60
[   99.567362]  bad_page+0x71/0x100
[   99.567372]  free_unref_page_prepare+0x236/0x390
[   99.567379]  free_unref_page+0x34/0x180
[   99.567385]  __pskb_pull_tail+0x3ff/0x4a0
[   99.567395]  xennet_poll+0x909/0xa40 [xen_netfront
12c02fdcf84c692965d9cd6ca5a6ff0a530b4ce9]
[   99.567411]  ? _raw_spin_unlock_irqrestore+0xe/0x40
[   99.567423]  __napi_poll+0x2b/0x1b0
[   99.567432]  net_rx_action+0x2b5/0x370
[   99.567439]  ? handle_irq_desc+0x41/0x60
[   99.567449]  __do_softirq+0xcc/0x2c8
[   99.567456]  __irq_exit_rcu+0xa3/0xc0
[   99.567464]  sysvec_xen_hvm_callback+0x72/0x90
[   99.567471]  </IRQ>
[   99.567473]  <TASK>
[   99.567475]  asm_sysvec_xen_hvm_callback+0x1a/0x20
[   99.567480] RIP: 0010:pv_native_safe_halt+0xf/0x20
[   99.567487] Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90
90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 66 90 0f 00 2d e3 13 27 00 fb
f4 <c3> cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90
[   99.567492] RSP: 0018:ffffffffbd803df0 EFLAGS: 00000246
[   99.567497] RAX: 0000000000004000 RBX: ffff8a92012acc64 RCX:
00000204f877f747
[   99.567501] RDX: ffff8a927dc00000 RSI: ffff8a92012acc00 RDI:
0000000000000001
[   99.567504] RBP: ffff8a92012acc64 R08: ffffffffbd94dca0 R09:
0000000000000001
[   99.567507] R10: 0000000000000018 R11: ffff8a927dc331a4 R12:
ffffffffbd94dca0
[   99.567510] R13: ffffffffbd94dd20 R14: 0000000000000001 R15:
0000000000000000
[   99.567516]  acpi_safe_halt+0x15/0x30
[   99.567523]  acpi_idle_do_entry+0x2f/0x50
[   99.567530]  acpi_idle_enter+0x7f/0xd0
[   99.567555]  cpuidle_enter_state+0x84/0x440
[   99.567562]  cpuidle_enter+0x2d/0x40
[   99.567570]  do_idle+0x1d8/0x230
[   99.567574]  cpu_startup_entry+0x2a/0x30
[   99.567578]  rest_init+0xca/0xd0
[   99.567585]  arch_call_rest_init+0xe/0x30
[   99.567591]  start_kernel+0x704/0xa90
[   99.567596]  x86_64_start_reservations+0x18/0x30
[   99.567604]  x86_64_start_kernel+0x96/0xa0
[   99.567610]  secondary_startup_64_no_verify+0x184/0x18b
[   99.567620]  </TASK>

The full dmesg is here if you want to take a look at it:
https://gist.github.com/refutationalist/4f8ab7c22ffc3c9fa6b88953a27eb3a5


HVMs do not appear to have this problem. Moving back to the LTS
kernel (6.6.22) makes the problem go away. Since 6.8.1 is the first
6.8-series kernel I've run, I suspect a change introduced in 6.8 is
responsible. For now I'll stick with the LTS kernel, but there
seems to be some breaking change here.
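For anyone else staying on LTS until this shakes out, one way to hold the mainline kernel back on Arch is via pacman.conf (a sketch; assumes linux-lts from the official repos is already installed and your bootloader entry points at it):

```
# /etc/pacman.conf (sketch): skip mainline kernel updates while on linux-lts
[options]
IgnorePkg = linux linux-headers
```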

Any hints?   Thanks!

-Sam
Re: page faults with linux 6.8.1 on Arch Linux
Just to confirm, I also have plenty of these errors and had to downgrade
the kernel to get rid of them.

Best regards,
Charles Ferreira Gonçalves




On Wed, Mar 20, 2024 at 2:38 AM Sam Mulvey <sam@vis.nu> wrote:

> [snip: full original message quoted above]