On 15.02.2021 12:02, Håkon Alstadheim wrote:
> I've recently been having total network stalls on some domUs. dmesg on the
> domU shows a number of lines like:
>
> Feb 15 11:12:38 gt kernel: net eth0: rx->offset: 0, size: -1
Update, for the record.
One piece of info was missing from my previous report: I had been running
the backend with voluntary preemption enabled in the dom0 kernel. My set-up
is basically an organically grown fuzzing machine for Xen, and I have
several ill-considered settings turned on. Anyway, changing the kernel
config to what is seen below, and upgrading to Linux kernel 5.10.17,
fixes my issue. The changelog for linux-5.10.17 pointed me to the main
culprit, a deadlock in xen-netback, but even with that fix in, I still
had issues. I figured that turning off preemption would further lessen
the risk of constipation on the backend, and that seems to be true.
---- Linux config that works: ----
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
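As a side note, a quick way to confirm which preemption model a given kernel build actually ended up with is to grep its config. A minimal sketch (the `preempt_model` helper name is mine, and reading /proc/config.gz assumes CONFIG_IKCONFIG_PROC is enabled in that kernel):

```shell
# Print the preemption-model lines from a kernel config file.
# Works on a build tree's .config, or on /proc/config.gz (decompressed)
# when CONFIG_IKCONFIG_PROC is enabled.
preempt_model() {
    grep -E '^(# )?CONFIG_PREEMPT(_NONE|_VOLUNTARY)?[ =]' "$1"
}

# e.g.: preempt_model /usr/src/linux/.config
# or:   zcat /proc/config.gz > /tmp/config && preempt_model /tmp/config
```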
[ Extra info from previous mail: ]
> Feb 15 11:12:38 gt kernel: net eth0: rx->offset: 0, size: -1
> Feb 15 11:12:38 gt kernel: net eth0: rx->offset: 0, size: -1
> Feb 15 11:12:39 gt kernel: net eth0: rx->offset: 0, size: -1
> Feb 15 11:12:39 gt kernel: net eth0: rx->offset: 0, size: -1
> Feb 15 11:12:40 gt kernel: net eth0: rx->offset: 0, size: -1
> Feb 15 11:12:42 gt kernel: net eth0: rx->offset: 0, size: -1
> Feb 15 11:12:45 gt kernel: net eth0: rx->offset: 0, size: -1
> Feb 15 11:12:52 gt kernel: net eth0: rx->offset: 0, size: -1
> Feb 15 11:13:05 gt kernel: net eth0: rx->offset: 0, size: -1
>
> On occasion, with the longer stalls (~ 5 minutes) I get:
>
> Feb 15 09:29:04 gt kernel: net_ratelimit: 5 callbacks suppressed
>
> I have tried this on Xen 4.14.0, 4.14.1 and 4.14.2-pre, with various
> guest kernels ranging from linux-4.19.170 to the early 5.10.x kernels.
> Newer 5.10 kernels give me some other error, to do with interrupts:
> it seems interrupt vectors point to La-La-Land, or else they are routed
> to the wrong CPU. I'm fairly certain I did not have this issue running
> Xen-4.14-staging with the earliest linux-5.10.x, but that had other
> issues. File-system corruption cost me a week around Christmas with the
> whole system down :-( . It did allow me to learn how to use Bacula from
> a Grml rescue CD without a catalog database :-) .
>
> The stalls happen under load (net or CPU; I don't know which matters
> more). I can reliably reproduce them if I run a lot of compilations and
> network fetches in the domU while simultaneously launching Firefox
> and Thunderbird. I have /home mounted via NFS from the dom0, so there is
> lots of traffic when Thunderbird and Firefox launch.
>
> On occasion the stalls are caught by the kernel and I get a stack
> trace, but I guess those are consequences of the network stall,
> incidental to the real issue. For example:
>
> Feb 15 09:09:38 gt kernel: status: r
> Feb 15 09:09:38 gt kernel: net_ratelimit: 5 callbacks suppressed
> Feb 15 09:09:38 gt kernel: net eth0: rx->offset: 0, size: -1
> Feb 15 09:09:38 gt root[45567]: ACPI event unhandled: jack/lineout
> LINEOUT unplug
> Feb 15 09:09:38 gt root[45570]: ACPI event unhandled: jack/videoout
> VIDEOOUT unplug
> Feb 15 09:09:44 gt kernel: net eth0: rx->offset: 0, size: -1
> Feb 15 09:09:57 gt kernel: net eth0: rx->offset: 0, size: -1
> Feb 15 09:10:01 gt CROND[45682]: (root) CMD (/usr/lib/sa/sa1 1 1)
> Feb 15 09:10:23 gt kernel: net eth0: rx->offset: 0, size: -1
> Feb 15 09:11:17 gt kernel: net eth0: rx->offset: 0, size: -1
> Feb 15 09:11:58 gt kernel: INFO: task IndexedDB #3:45442 blocked for
> more than 122 seconds.
> Feb 15 09:11:58 gt kernel: Not tainted 5.4.80-gentoo-r1-x86_64 #1
> Feb 15 09:11:58 gt kernel: "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Feb 15 09:11:58 gt kernel: IndexedDB #3 D 0 45442 3451 0x00000000
> Feb 15 09:11:58 gt kernel: Call Trace:
> Feb 15 09:11:58 gt kernel: __schedule+0x2a3/0x7a0
> Feb 15 09:11:58 gt kernel: ? nfs_pageio_complete+0xa8/0xf0
> Feb 15 09:11:58 gt kernel: schedule+0x34/0xa0
> Feb 15 09:11:58 gt kernel: io_schedule+0x3c/0x60
> Feb 15 09:11:58 gt kernel: wait_on_page_bit_common+0x125/0x330
> Feb 15 09:11:58 gt kernel: ?
> trace_event_raw_event_file_check_and_advance_wb_err+0xf0/0xf0
> Feb 15 09:11:58 gt kernel: __filemap_fdatawait_range+0x7b/0xe0
> Feb 15 09:11:58 gt kernel: file_write_and_wait_range+0x67/0x90
> Feb 15 09:11:58 gt kernel: nfs_file_fsync+0x83/0x190
> Feb 15 09:11:58 gt kernel: __x64_sys_fsync+0x2f/0x60
> Feb 15 09:11:58 gt kernel: do_syscall_64+0x51/0x130
> Feb 15 09:11:58 gt kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Feb 15 09:11:58 gt kernel: RIP: 0033:0x7f4db9580e1b
> Feb 15 09:11:58 gt kernel: Code: Bad RIP value.
> Feb 15 09:11:58 gt kernel: RSP: 002b:00007f4d9b4b4d50 EFLAGS: 00000293
> ORIG_RAX: 000000000000004a
> Feb 15 09:11:58 gt kernel: RAX: ffffffffffffffda RBX: 00007f4d9f2abd28
> RCX: 00007f4db9580e1b
> Feb 15 09:11:58 gt kernel: RDX: 0000000000000002 RSI: 0000000000000002
> RDI: 0000000000000072
> Feb 15 09:11:58 gt kernel: RBP: 0000000000000002 R08: 0000000000000000
> R09: 00007f4d9b4b4d70
> Feb 15 09:11:58 gt kernel: R10: 0000000000000000 R11: 0000000000000293
> R12: 00000000000001f5
> Feb 15 09:11:59 gt kernel: R13: 00007f4d9f2abc70 R14: 0000000000000000
> R15: 00007f4da63774e0
> ---------
>
> My xl info just now:
>
> xl info
> host : gentoo
> release : 5.4.97-gentoo-x86_64
> version : #1 SMP Wed Feb 10 16:43:41 CET 2021
> machine : x86_64
> nr_cpus : 12
> max_cpu_id : 11
> nr_nodes : 2
> cores_per_socket : 6
> threads_per_core : 1
> cpu_mhz : 2399.981
> hw_caps :
> bfebfbff:77fef3ff:2c100800:00000021:00000001:000037ab:00000000:00000100
> virt_caps : pv hvm hvm_directio pv_directio hap shadow
> iommu_hap_pt_share
> total_memory : 130953
> free_memory : 1551
> sharing_freed_memory : 0
> sharing_used_memory : 0
> outstanding_claims : 0
> free_cpus : 0
> xen_major : 4
> xen_minor : 14
> xen_extra : .2-pre
> xen_version : 4.14.2-pre
> xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32
> hvm-3.0-x86_32p hvm-3.0-x86_64
> xen_scheduler : credit2
> xen_pagesize : 4096
> platform_params : virt_start=0xffff800000000000
> xen_changeset :
> xen_commandline : xen.cfg xen-marker-51 console_timestamps=date
> iommu=1 com1=115200,8n1 console=com1 conswitch=lx
> cpufreq=xen:performance,verbose smt=0 maxcpus=12 core_parking=power
> nmi=dom0 gnttab_max_frames=512 gnttab_max_maptrack_frames=1024
> vcpu_migration_delay=2000 tickle_one_idle_cpu=1 spec-ctrl=no-xen
> sched=credit2 timer_slop=5000 max_cstate=2 dom0_mem=16G,max:16G
> dom0_max_vcpus=8 ept=exec_sp=1
> cc_compiler : gcc (Gentoo 9.3.0-r2 p4) 9.3.0
> cc_compile_by : hakon
> cc_compile_domain : alstadheim.priv.no
> cc_compile_date : Sat Feb 13 22:07:40 CET 2021
> build_id : d3fb26987b749da48c2549b12ba9ea4a
> xend_config_format : 4
> 0:root@gentoo xen-consoles #
>
>
> P.S.: I know I should do something about my DMARC set-up, so I can have
> a separate, unprotected "From:" address for posting to mailing lists.
> Pointers to a how-to appreciated.
>
> ---
>
> Håkon