I have recently been getting total network stalls on some domUs. dmesg on the domU
shows a number of lines like:
Feb 15 11:12:38 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 11:12:38 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 11:12:38 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 11:12:39 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 11:12:39 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 11:12:40 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 11:12:42 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 11:12:45 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 11:12:52 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 11:13:05 gt kernel: net eth0: rx->offset: 0, size: -1
On occasion, with the longer stalls (~5 minutes), I get:
Feb 15 09:29:04 gt kernel: net_ratelimit: 5 callbacks suppressed
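In case anyone wants to look at how these messages are spaced in time, here is a rough sketch in Python that pulls the timestamps out of syslog-style lines like the ones above and prints the gaps between them. The function name rx_error_gaps and the default year are my own choices for illustration (syslog lines carry no year), and the regular expression only assumes the line layout shown in this mail.

```python
import re
from datetime import datetime

# Matches syslog-style lines such as:
#   Feb 15 11:12:38 gt kernel: net eth0: rx->offset: 0, size: -1
# The layout is assumed from the lines quoted in this mail.
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+net\s+\S+:\s+rx->offset:"
)


def rx_error_gaps(lines, year=2021):
    """Return the gaps, in whole seconds, between successive rx->offset messages.

    `year` is an assumption: syslog timestamps do not include one.
    """
    stamps = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            stamps.append(
                datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            )
    # Pair each timestamp with its successor and take the difference.
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    import sys

    # Usage: python rx_gaps.py < /var/log/messages
    print(rx_error_gaps(sys.stdin))
```

Fed the ten lines quoted above, this gives gaps of 0, 0, 1, 0, 1, 2, 3, 7 and 13 seconds, so the messages get further apart as the stall goes on.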
I have tried this on Xen 4.14.0, 4.14.1 and 4.14.2-pre, with various
guest kernels ranging from linux-4.19.170 to the early 5.10.x kernels.
Newer 5.10 kernels give me a different error, to do with interrupts:
it seems interrupt vectors point to La-La-Land, or else they are routed to
the wrong CPU. I'm fairly certain I did not have this issue running
Xen-4.14-staging with the earliest linux-5.10.x, but that had other
issues. File-system corruption cost me a week around Christmas with the
whole system down :-( . It did let me learn how to use bacula from a
grml rescue CD without a catalog database :-) .
The stalls happen under load (net or CPU, I don't know which matters
more). I can reliably reproduce them if I run a lot of compilations and
network fetches in the domU while simultaneously launching firefox and
thunderbird. I have home mounted over NFS from the dom0, so there is a
lot of traffic when thunderbird and firefox launch.
On occasion the stalls are caught by the kernel and I get a
stack trace, but I guess those are consequences of the network stall and
incidental to the real issue. For example:
Feb 15 09:09:38 gt kernel: status: r
Feb 15 09:09:38 gt kernel: net_ratelimit: 5 callbacks suppressed
Feb 15 09:09:38 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 09:09:38 gt root[45567]: ACPI event unhandled: jack/lineout LINEOUT unplug
Feb 15 09:09:38 gt root[45570]: ACPI event unhandled: jack/videoout VIDEOOUT unplug
Feb 15 09:09:44 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 09:09:57 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 09:10:01 gt CROND[45682]: (root) CMD (/usr/lib/sa/sa1 1 1)
Feb 15 09:10:23 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 09:11:17 gt kernel: net eth0: rx->offset: 0, size: -1
Feb 15 09:11:58 gt kernel: INFO: task IndexedDB #3:45442 blocked for more than 122 seconds.
Feb 15 09:11:58 gt kernel: Not tainted 5.4.80-gentoo-r1-x86_64 #1
Feb 15 09:11:58 gt kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 15 09:11:58 gt kernel: IndexedDB #3 D 0 45442 3451 0x00000000
Feb 15 09:11:58 gt kernel: Call Trace:
Feb 15 09:11:58 gt kernel: __schedule+0x2a3/0x7a0
Feb 15 09:11:58 gt kernel: ? nfs_pageio_complete+0xa8/0xf0
Feb 15 09:11:58 gt kernel: schedule+0x34/0xa0
Feb 15 09:11:58 gt kernel: io_schedule+0x3c/0x60
Feb 15 09:11:58 gt kernel: wait_on_page_bit_common+0x125/0x330
Feb 15 09:11:58 gt kernel: ? trace_event_raw_event_file_check_and_advance_wb_err+0xf0/0xf0
Feb 15 09:11:58 gt kernel: __filemap_fdatawait_range+0x7b/0xe0
Feb 15 09:11:58 gt kernel: file_write_and_wait_range+0x67/0x90
Feb 15 09:11:58 gt kernel: nfs_file_fsync+0x83/0x190
Feb 15 09:11:58 gt kernel: __x64_sys_fsync+0x2f/0x60
Feb 15 09:11:58 gt kernel: do_syscall_64+0x51/0x130
Feb 15 09:11:58 gt kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Feb 15 09:11:58 gt kernel: RIP: 0033:0x7f4db9580e1b
Feb 15 09:11:58 gt kernel: Code: Bad RIP value.
Feb 15 09:11:58 gt kernel: RSP: 002b:00007f4d9b4b4d50 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
Feb 15 09:11:58 gt kernel: RAX: ffffffffffffffda RBX: 00007f4d9f2abd28 RCX: 00007f4db9580e1b
Feb 15 09:11:58 gt kernel: RDX: 0000000000000002 RSI: 0000000000000002 RDI: 0000000000000072
Feb 15 09:11:58 gt kernel: RBP: 0000000000000002 R08: 0000000000000000 R09: 00007f4d9b4b4d70
Feb 15 09:11:58 gt kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000001f5
Feb 15 09:11:59 gt kernel: R13: 00007f4d9f2abc70 R14: 0000000000000000 R15: 00007f4da63774e0
---------
My xl info just now:
xl info
host : gentoo
release : 5.4.97-gentoo-x86_64
version : #1 SMP Wed Feb 10 16:43:41 CET 2021
machine : x86_64
nr_cpus : 12
max_cpu_id : 11
nr_nodes : 2
cores_per_socket : 6
threads_per_core : 1
cpu_mhz : 2399.981
hw_caps : bfebfbff:77fef3ff:2c100800:00000021:00000001:000037ab:00000000:00000100
virt_caps : pv hvm hvm_directio pv_directio hap shadow iommu_hap_pt_share
total_memory : 130953
free_memory : 1551
sharing_freed_memory : 0
sharing_used_memory : 0
outstanding_claims : 0
free_cpus : 0
xen_major : 4
xen_minor : 14
xen_extra : .2-pre
xen_version : 4.14.2-pre
xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler : credit2
xen_pagesize : 4096
platform_params : virt_start=0xffff800000000000
xen_changeset :
xen_commandline : xen.cfg xen-marker-51 console_timestamps=date
iommu=1 com1=115200,8n1 console=com1 conswitch=lx
cpufreq=xen:performance,verbose smt=0 maxcpus=12 core_parking=power
nmi=dom0 gnttab_max_frames=512 gnttab_max_maptrack_frames=1024
vcpu_migration_delay=2000 tickle_one_idle_cpu=1 spec-ctrl=no-xen
sched=credit2 timer_slop=5000 max_cstate=2 dom0_mem=16G,max:16G
dom0_max_vcpus=8 ept=exec_sp=1
cc_compiler : gcc (Gentoo 9.3.0-r2 p4) 9.3.0
cc_compile_by : hakon
cc_compile_domain : alstadheim.priv.no
cc_compile_date : Sat Feb 13 22:07:40 CET 2021
build_id : d3fb26987b749da48c2549b12ba9ea4a
xend_config_format : 4
0:root@gentoo xen-consoles #
P.S.: I know I should do something about my DMARC set-up, so that I can
have a separate, unprotected "From:" address for posting to mailing lists.
Pointers to a how-to would be appreciated.
---
Håkon