Xen 4.12 DomU hang / freeze / stall under high network/disk load
Dear Xen Team:

Since upgrading to Xen 4.12, I'm experiencing an ongoing problem with
stalled guests. I had previously thought I was the only one with this
problem, and had reported this to my distro's virtualization team (
see: https://lists.opensuse.org/opensuse-virtual/2019-12/msg00000.html
and https://lists.opensuse.org/opensuse-virtual/2019-12/msg00003.html
for thread heads ), but although they tried to help, we all kind of
concluded (wrongly) that I must just have had a bad guest.

I finally recreated a new host and a new guest, clean, from scratch,
thinking that would solve the problem, and it didn't. That led me to
search again, and I now see that another individual (who doubtless
thought HE was the only one) has reported this issue to your list (
https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00015.html
et al).

So I'm sending this report here to let you all know that this problem
is no longer limited to one person; it is reproducible and is, in my
opinion, severe. I hope that someone here can help, or point us in the
right direction. Here we go:

Problem: Xen DomU guests randomly stall under high network/disk
loads. Dom0 is not affected. Randomly means anywhere between 1 hour
and 14 days after guest boot - the time seems to shorten with (or
perhaps the problem is triggered by) increased network (and possibly
disk) activity.

Symptoms on the DomU Guest:
1. Guest machine performs normally until the moment of failure. No
abnormal log/console entries exist.
2. At the moment of failure, the guest's network goes offline. No
abnormal log/console entries are written at that moment.
3. Processes which were trying to connect to the network start to
consume increasing amounts of CPU.
4. Load average of the guest starts to increase, continuing upward
without apparent bound.
5. If a high-priority bash shell is left logged in on the guest hvc0
console, some commands might still be runnable; most are not.
6. If the guest console is not logged in, the console is frozen and
doesn't even echo characters.
7. Some guests will output messages on the console like this:
kernel: [164084.912966] NMI watchdog: BUG: soft lockup - CPU#16 stuck for 67s!
8. On some others, I will also see output like:
BUG: workqueue lockup - pool cpus=20 node=0 flags=0x0 nice=-20 stuck for 70s!
9. Sometimes there is no output at all on the console.

Symptoms on the Dom0 Host:
1. The Dom0 Host is unaffected. The only indication that anything is
happening on the host is a pair of log entries in /var/log/messages:
vif vif-6-0 vif6.0: Guest Rx stalled
br0: port 2(vif6.0) entered disabled state
2. Other guests are not affected. (Although other guests may also
stall at other random times, stalls on one guest do not seem to affect
other guests directly.)

Circumstances when the problem first occurred:
1. All hosts and guests were previously on Xen 4.9.4 (OpenSuse 42.3,
Linux 4.4.180, Xen 4.9.4)
2. I upgraded one physical host to Xen 4.12.1 (OpenSuse 15.1, Linux
4.12.14, Xen 4.12.1).
3. The guest(s) on that host started malfunctioning at that point.

Immediate steps taken while the guest was stalled, which did not help:
1. Tried to use high-priority shell on guest console to kill high-CPU
processes; they were unkillable.
2. Tried to use guest console to stop and restart network; commands
were unresponsive.
3. Tried to use guest console to shutdown/init 0. This caused console
to be terminated, but guest would not otherwise shutdown.
4. Tried to use host xl interface to unplug/replug network bridges.
This appeared to work from host side, but guest was unaffected.

One thing which I accidentally discovered that *did* help:
1. Tried sending xl trigger nmi from the host to the guest.
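
(For reference, the command is of the form "xl trigger <domain> nmi";
with my guest config below, that was:)

xl trigger guest1 nmi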

When I trigger the stalled guest with an NMI, I get its attention.
The guest will print the following on the console:

Uhhuh. NMI received for unknown reason 00 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue

In some cases (pattern not yet known), the guest will then immediately
come back online: The network will come back online, and all
processes will slowly stop consuming CPU, and things will return to
normal. Existing network connections were obviously terminated, but
new connections are accepted. In that case, it's like the guest just
magically comes back to life.

When this works, the host log shows:
vif vif-6-0 vif6.0: Guest Rx ready
br0: port 2(vif6.0) entered blocking state
br0: port 2(vif6.0) entered forwarding state

And all seems well... as if the guest had never stalled.

However, this is not reliable. In most cases, the guest will print
those messages, but the processes will NOT recover, and the network
will come back impaired, or not at all. When that happens, repeated
NMIs do not help: If the guest doesn't recover the first time, it
doesn't recover at all.

The *only* reliable way to recover is to destroy the guest completely,
and recreate it. This is a hard destroy: the guest cannot shut
itself down. The guest will then run fine... until the next stall.
But of course a hard-destroy can't be a healthy thing for a guest
machine, and that's really not a solution.

Long-term mitigation steps which were tried and did not help:
1. Thought this was an SSH bug (since sshd processes were consuming
high CPU), installed latest OpenSSH.
2. Thought maybe a PV problem, tried under HVM instead of PV.
3. Noted a problem with grant frames, applied the recommended fix for
that, my config now looks like:
# xen-diag gnttab_query_size 0 # Domain-0
domid=0: nr_frames=1, max_nr_frames=64
# xen-diag gnttab_query_size 1 # Xenstore
domid=1: nr_frames=4, max_nr_frames=4
# xen-diag gnttab_query_size 6 # My guest
domid=6: nr_frames=17, max_nr_frames=256
4. Thought maybe a kernel module might be at issue, reviewed the
module list with the OpenSuse team, pruned modules.
5. Thought this might be a kernel mismatch, was referred to a new
kernel by the OpenSuse team (Linux 4.12.13 for OpenSuse 42.3). That
changed some of the console output behavior and logging, but did not
solve the problem.
6. Thought this might be a general OS mismatch, tried upgrading the
guest to the same OS/Xen versions as the host (OpenSuse 15.1/Linux
4.12.14/Xen 4.12.1). In this configuration, no console or log output
is generated on the guest at all, it just stalls.
7. Assumed (incorrectly, it now turns out) that something was just
"wrong" with my guest, tried a fresh load of the host, and a fresh
guest. I thought that would solve it, but to my sadness, it did not.

Which means that this is now a reproducible bug.

Steps to reproduce:
1. Get a server. I'm using a Dell PowerEdge R720, but this has
happened on several different Dell models. My current server has two
16-core CPUs, and 128GB of RAM.
2. Load Xen 4.12.1 (OpenSuse 15.1/Xen 4.12.1) on the server. Boot it
up in Xen Dom0/host mode.
3. Create a new guest machine, also with 4.12.1.
4. Fire up the guest.
5. Put a lot of data on the guest (my guest has 3 TB of files and data).
6. Plug a crossover cable into your server, and plug the other end
into some other Linux machine.
7. From that other machine, start pounding the guest. An rsync of the
entire data partition is a great way to trigger this. If I run
several outbound rsyncs together, I can crash my guest in under 48
hours. If I run 4 or 5, I can often crash the guest in just 2 hours.
If you don't want to damage your SSDs on your other machine, here's my
current command (my host is 192.168.1.10, and my guest is
192.168.1.11, so I plug in some other machine and make it, say,
192.168.1.12), and then run:

nohup ssh 192.168.1.11 tar cf - --one-file-system /a | cat > /dev/null &

Where /a is my directory full of user data. 4-6 of these running
simultaneously will bring the guest to its knees in short order.
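
Here's a sketch of how I launch a batch of them from the load machine
(192.168.1.12 in my example above; adjust the count to taste):

for i in 1 2 3 4 5 6; do
  nohup ssh 192.168.1.11 tar cf - --one-file-system /a | cat > /dev/null &
done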

On my most recent test, I did the NMI trigger thing, and found this in
the guest's /var/log/messages after sending the trigger (I've removed
tagging and timestamping for clarity):

Uhhuh. NMI received for unknown reason 00 on CPU 0.
Do you have a strange powersaving mode enabled?
Dazed and confused, but trying to continue
clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc'
as unstable because the skew is too large:
clocksource: 'xen' wd_now: 58842b687eb3c wd_last: 55aa97ff29565 mask:
ffffffffffffffff
clocksource: 'tsc' cs_now: 58d3355ea9a87e cs_last: 585cca21d4f074
mask: ffffffffffffffff
tsc: Marking TSC unstable due to clocksource watchdog
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=-20 stuck for 50117s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
pending: clocksource_watchdog_work, vmstat_shepherd, cache_reap
workqueue mm_percpu_wq: flags=0x8
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
pending: vmstat_update
workqueue writeback: flags=0x4e
pwq 52: cpus=0-25 flags=0x4 nice=0 active=1/256
in-flight: 28593:wb_workfn
workqueue kblockd: flags=0x18
pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=2/256
pending: blk_mq_run_work_fn, blk_mq_timeout_work
pool 52: cpus=0-25 flags=0x4 nice=0 hung=0s workers=3 idle: 32044 18125

That led me to search around, and I tripped over this:
https://wiki.debian.org/Xen/Clocksource , which describes a guest
hanging with the message "clocksource/0: Time went backwards".
Although I did not see this message, and this is not directly on point
with OpenSuse (since our /proc structure doesn't include some of the
switches mentioned), I did notice clocksource references in the logs
(see above), and that led me back to:
https://doc.opensuse.org/documentation/leap/virtualization/html/book.virt/cha-xen-manage.html,
and specifically the tsc_mode setting. I have no idea if it's
relevant, but since I'm out of ideas and have nothing better to try,
I have now booted my guest into tsc_mode=1 and am stress testing it to
see if it fares any better this way.
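
(Concretely, that means this line in the guest's xl config; per the
documentation above, tsc_mode=1 is the same as "always_emulate":)

tsc_mode="always_emulate"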

Administrivia:
OS: OpenSuse 15.1
Linux: 4.12.14-lp151.28.36
Xen: 4.12.1
Dom0 boot parameters: dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin
gnttab_max_frames=256
Xen guest config:

name="guest1"
description="guest1"
memory=90112
maxmem=90112
vcpus=26
cpus="4-31"
on_poweroff="destroy"
on_reboot="restart"
on_crash="restart"
on_watchdog="restart"
localtime=0
keymap="en-us"
type="pv"
kernel="/usr/lib/grub2/x86_64-xen/grub.xen"
extra="elevator=noop"
disk=[
'/xen/guest1/guest1.root,raw,xvda1,w',
'/xen/guest1/guest1.swap,raw,xvda2,w',
'/xen/guest1/guest1.xa,raw,xvda3,w',
]
vif=[
'rate=100Mb/s,mac=00:16:3f:49:4a:41,bridge=br0',
]
vfb=['type=vnc,vncunused=1']

I had originally thought that I was the only person with this problem,
and that's why I thought a fresh guest would fix it - the problem
followed me around different servers, so that made sense. Over the
past weeks I've set up a fresh guest on my fresh host, and, just on a
whim, did the above stress testing on it... sadly, it only lasted for
36 hours. On the older Xen 4.9, I *never* encountered problems, and
nothing changed other than OS/Xen versions when I did the upgrades to
the new versions.

Since I can now reproduce the problem on different hardware and
setups, I thought I'd start my searches over again. To my relief, I
found that, just in the past few weeks, another person has now
reported what seems to be the same problem, only he reported it to
this list (whereas I had sent my report to OpenSuse.) In his message,
referenced above, he states that the problem is limited to Xen 4.12
and 4.13, and that rolling back to Xen 4.11 solves the problem.

If that's right, there seems to be a *significant* problem somewhere,
and it's clearly no longer just one instance.

I am looking for a way to file a Xen bug report for this; so far, I
haven't found it, but I will keep looking.

Meanwhile, I'm hoping that these details and history spark something
for some of you here. Do any of you have any ideas on this? Any
thoughts, guidance, musings, etc., anything at all would be
appreciated.

Again, thank you all for your patience and help, I am very grateful!

Glen

Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load
On 2/13/20 1:07 PM, Glen wrote:
> Dear Xen Team:
>
> Since upgrading to Xen 4.12, I'm experiencing an ongoing problem with
> stalled guests. I had previously thought I was the only one with this
> problem, and had reported this to my distro's virtualization team (
> see: https://lists.opensuse.org/opensuse-virtual/2019-12/msg00000.html
> and https://lists.opensuse.org/opensuse-virtual/2019-12/msg00003.html
> for thread heads ), but although they tried to help, we all kind of
> concluded (wrongly) that I must just have had a bad guest.
>
> I finally recreated a new host and a new guest, clean, from scratch,
> thinking that would solve the problem, and it didn't. That led me to
> search again, and I now see that another individual (who doubtless
> thought HE was the only one) has reported this issue to your list (
> https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00015.html
> et al).

I'm not one of the Xen developers, but I wouldn't necessarily assume the same root cause from the information you've provided so far.

<snip>
> Circumstances when the problem first occurred:
> 1. All hosts and guests were previously on Xen 4.9.4 (OpenSuse 42.3,
> Linux 4.4.180, Xen 4.9.4)
> 2. I upgraded one physical host to Xen 4.12.1 (OpenSuse 15.1, Linux
> 4.12.14, Xen 4.12.1).
> 3. The guest(s) on that host started malfunctioning at that point.

If you can, try Xen 4.9.4 with Linux 4.12.14 (or Xen 4.12.1 with Linux 4.4.180.)

That will help isolate the issue to either Xen or the Linux kernel.

<snip>

> 4. Tried to use host xl interface to unplug/replug network bridges.
> This appeared to work from host side, but guest was unaffected.

Do you mean 'xl network-detach' and 'xl network-attach'? If not, please give example commands.
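
That is, something like this (domain name and device index are only
examples):

xl network-detach guest1 0
xl network-attach guest1 bridge=br0 mac=00:16:3f:49:4a:41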

<snip>
>
> Steps to reproduce:
> 1. Get a server. I'm using a Dell PowerEdge R720, but this has
> happened on several different Dell models. My current server has two
> 16-core CPUs, and 128GB of RAM.

What CPUs are these? Can you dump the information from one of the cpus in /proc/cpuinfo so we can see what microcode version you have, in the highly
unlikely case this information is pertinent?

> 2. Load Xen 4.12.1 (OpenSuse 15.1/Xen 4.12.1) on the server. Boot it
> up in Xen Dom0/host mode.
What about attaching the output of 'xl dmesg' - both the initial boot messages and anything that comes from running the specific domU?

> 7. From that other machine, start pounding the guest. An rsync of the
> entire data partition is a great way to trigger this. If I run
> several outbound rsyncs together, I can crash my guest in under 48
> hours. If I run 4 or 5, I can often crash the guest in just 2 hours.
> If you don't want to damage your SSDs on your other machine, here's my
> current command (my host is 192.168.1.10, and my guest is
> 192.168.1.11, so I plug in some other machine and make it, say,
> 192.168.1.12), and then run:
>
> nohup ssh 192.168.1.11 tar cf - --one-file-system /a | cat > /dev/null &

How about trying iperf or iperf3 with either only transmit or receive? iperf is specifically designed to use maximal bandwidth and doesn't use disk.

http://fasterdata.es.net/performance-testing/network-troubleshooting-tools/throughput-tool-comparision/
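
A minimal sketch, using the guest IP from your description (stream
count and duration are arbitrary):

# on the domU:
iperf3 -s
# on the load machine; -P 8 runs 8 parallel streams, and -R reverses
# direction so transmit and receive can be tested separately:
iperf3 -c 192.168.1.11 -P 8 -t 3600
iperf3 -c 192.168.1.11 -P 8 -t 3600 -R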

For independently load-testing disk, you can try dd or fio, while being cognizant of the disk cache. To avoid actual disk I/O I think you should be
able to use a ram based disk in the dom0 instead of a physical disk. However, I wouldn't bother if you can reproduce with network only, until the
network issue has been fixed.
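
For example, a sketch with fio (file path and sizes are placeholders;
--direct=1 bypasses the page cache):

fio --name=stress --filename=/a/fio.test --size=4G --rw=randread \
    --bs=4k --ioengine=libaio --iodepth=32 --direct=1 \
    --runtime=600 --time_based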

> Administrivia:
> OS: OpenSuse 15.1
> Linux: 4.12.14-lp151.28.36
> Xen: 4.12.1
> Dom0 boot parameters: dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin
> gnttab_max_frames=256
> Xen guest config:
>
> name="guest1"
> description="guest1"
> memory=90112
> maxmem=90112
> vcpus=26

This is fairly large.

Have you tried both fewer cpus and less memory? If you can reproduce with iperf, which probably will reproduce more quickly, can you reproduce with
memory=2048 and vcpus=1 or vcpus=2 for example? FYI the domU might not boot at all with vcpus=1 with some kernel versions.

But I would try that only if none of the network changes show a difference.

> cpus="4-31"
> on_poweroff="destroy"
> on_reboot="restart"
> on_crash="restart"
> on_watchdog="restart"
> localtime=0
> keymap="en-us"
> type="pv"
> kernel="/usr/lib/grub2/x86_64-xen/grub.xen"
> extra="elevator=noop"
> disk=[
> '/xen/guest1/guest1.root,raw,xvda1,w',
> '/xen/guest1/guest1.swap,raw,xvda2,w',
> '/xen/guest1/guest1.xa,raw,xvda3,w',
> ]
> vif=[
> 'rate=100Mb/s,mac=00:16:3f:49:4a:41,bridge=br0',

You probably want to try removing the vif rate limit. Using rate=... I got soft lockups on the dom0 many kernel versions ago. I don't know what happens
if the soft lockups in the dom0 have been fixed - perhaps another problem remains in the domU.

If removing "rate" fixes it, switch to rate limiting with another method - possibly 'tc' but there might be something better available now using BPF.

Also, have you tried at all looking at or changing the offload settings in the dom0 and/or domU with "/sbin/ethtool -k/-K <device>" ? I don't think
this is actually the issue. But it's been a source of problems historically and it's easy to try.
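
For example, to inspect and then disable a couple of the offloads
(device name varies - the vif in the dom0, eth0 or similar in the domU):

ethtool -k vif6.0
ethtool -K vif6.0 tso off gso off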

> I am looking for a way to file a Xen bug report for this; so far, I
> haven't found it, but I will keep looking.

https://wiki.xen.org/wiki/Reporting_Bugs_against_Xen_Project

But try some more data collection and debugging first, ideally by changing one thing at a time.

> Meanwhile, I'm hoping that these details and history spark something
> for some of you here. Do any of you have any ideas on this? Any
> thoughts, guidance, musings, etc., anything at all would be
> appreciated.

x-ref https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html

I don't get the impression you've tried using sysrq already, since you did not mention it by name. If you have tried sysrq, it would be helpful if you
could go back through your original email and add examples of all of the commands you've run.

For PV, to send the sysrq you can try 'xl sysrq <domU> <key>' or 'ctrl-o <key>' on the virtual serial console. Neither will probably work for HVM. I
can't figure out how to send a break on the virtual serial console for HVM right now. You can also use /proc/sysrq-trigger in the domU to send a key if
the domU minimally responds.

When the domU locks up, you *might* get interesting information from the 'x' and 'l' sysrq commands within the domU. You may need to enable that
functionality with 'sysctl -w kernel.sysrq=1'.
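
Putting that together, something like (domain name is an example):

# inside the domU, ahead of time:
sysctl -w kernel.sysrq=1
# from the dom0, once the domU hangs:
xl sysrq guest1 x
xl sysrq guest1 l
# or from inside the domU, if it still responds at all:
echo l > /proc/sysrq-trigger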

I'm not sure the 'l' command works for PV at all. 'l' works for HVM.

If you can send a sysrq when the domU is not locked up, but can't send one when it's locked up, that's also potentially interesting.

There's a lot of debug information available from the Xen hypervisor too, but I'm not 100% sure which of that is interesting and some of it is fairly
intrusive to collect.

--Sarah


Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load
Hi Sarah!

On Thu, Feb 13, 2020 at 4:28 PM Sarah Newman <srn@prgmr.com> wrote:
> I'm not one of the Xen developers,

Thank you so much for responding, and for the many clarifications and
pointers! I am very grateful for your time. Everything is very
well-taken and very much appreciated.

> I wouldn't necessarily assume the same root cause from the information you've provided so far.

Okay, understood. This has been plaguing me since I first upgraded my
first host to 4.12, and I totally concede that I could well be
"grasping at straws." This is a production host for a large client,
and it's to the point where alarms wake me up every 4-5 nights now, so
I am... very interested in finding a solution here. (I now have
nightmares about my phone ringing, even when it isn't. Ugh.)

> > 2. I upgraded one physical host to Xen 4.12.1 (OpenSuse 15.1, Linux
> > 4.12.14, Xen 4.12.1).
> > 3. The guest(s) on that host started malfunctioning at that point.
> If you can, try Xen 4.9.4 with Linux 4.12.14 (or Xen 4.12.1 with Linux 4.4.180.)
> That will help isolate the issue to either Xen or the Linux kernel.

Understood. Tomorrow (Pacific time) I had already planned to change
the physical host (via fresh reload) to OpenSuse 15.0 (in between
these two releases) which has Linux 4.12.14 and Xen 4.10.4 (so
pre-4.12). I'll do that as an interim step on your plan, and I will
report how that goes. If it fails as well, I will take the physical
host back to 42.3 (Xen 4.9.4) and install the 4.12 kernel there, and
report.

> > 4. Tried to use host xl interface to unplug/replug network bridges.
> > This appeared to work from host side, but guest was unaffected.
> Do you mean 'xl network-detach' and 'xl network-attach'? If not, please give example commands.

I tried both xl network-detach followed by a network-attach (feeding
back in the parameters from my guest machine.)

I also tried using brctl to remove the VIF from the bridge and re-add it, as in:
brctl delif br0 vif6.0 / brctl addif br0 vif6.0

Neither had any effect on the guest in trouble.

> > 1. Get a server. I'm using a Dell PowerEdge R720, but this has
> What CPUs are these? Can you dump the information from one of the cpus in /proc/cpuinfo so we can see what microcode version you have, in the highly unlikely case this information is pertinent?

My pleasure. This problem has happened on several different physical
hosts, I will dump for two of them:

My testing server:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
stepping : 2
microcode : 0x43
cpu MHz : 2596.990
cache size : 20480 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu de tsc msr pae mce cx8 apic sep mca cmov pat
clflush acpi mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc
rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 fma
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor
lahf_lm abm ssbd ibrs ibpb stibp fsgsbase bmi1 avx2 bmi2 erms xsaveopt
md_clear
bugs : null_seg cpu_meltdown spectre_v1 spectre_v2
spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5193.98
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

The production machine where this first occurred (and continues to occur):

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2450 0 @ 2.10GHz
stepping : 7
microcode : 0x710
cpu MHz : 2100.014
cache size : 20480 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu de tsc msr pae mce cx8 apic sep mca cmov pat
clflush acpi mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc
rep_good nopl nonstop_tsc cpuid pni pclmulqdq monitor est ssse3 cx16
sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm xsaveopt
bugs : null_seg cpu_meltdown spectre_v1 spectre_v2
spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 4200.02
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

> What about attaching the output of 'xl dmesg' - both the initial boot messages and anything that comes from running the specific domU?

I'll post just from my test machine, skipping the pretty ASCII art -
let me know if that's not right, or if you want to see a second
machine as well.

(XEN) Xen version 4.12.1_06-lp151.2.9 (abuild@suse.de) (gcc (SUSE
Linux) 7.4.1 20190905 [gcc-7-branch revision 275407]) debug=n Fri Dec
6 16:56:43 UTC 2019
(XEN) Latest ChangeSet:
(XEN) Bootloader: GRUB2 2.02
(XEN) Command line: dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin
gnttab_max_frames=256 vga=gfx-1024x768x16
(XEN) Xen image load base address: 0
(XEN) Video information:
(XEN) VGA is graphics mode 1024x768, 16 bpp
(XEN) VBE/DDC methods: V2; EDID transfer time: 1 seconds
(XEN) Disc information:
(XEN) Found 2 MBR signatures
(XEN) Found 2 EDD information structures
(XEN) Xen-e820 RAM map:
(XEN) 0000000000000000 - 000000000009c000 (usable)
(XEN) 000000000009c000 - 00000000000a0000 (reserved)
(XEN) 00000000000e0000 - 0000000000100000 (reserved)
(XEN) 0000000000100000 - 000000007a289000 (usable)
(XEN) 000000007a289000 - 000000007af0b000 (reserved)
(XEN) 000000007af0b000 - 000000007b93b000 (ACPI NVS)
(XEN) 000000007b93b000 - 000000007bab8000 (ACPI data)
(XEN) 000000007bab8000 - 000000007bae9000 (usable)
(XEN) 000000007bae9000 - 000000007baff000 (ACPI data)
(XEN) 000000007baff000 - 000000007bb00000 (usable)
(XEN) 000000007bb00000 - 0000000090000000 (reserved)
(XEN) 00000000feda8000 - 00000000fedac000 (reserved)
(XEN) 00000000ff310000 - 0000000100000000 (reserved)
(XEN) 0000000100000000 - 0000001880000000 (usable)
(XEN) New Xen image base address: 0x79c00000
(XEN) ACPI: RSDP 000FE320, 0024 (r2 DELL )
(XEN) ACPI: XSDT 7BAB60E8, 00BC (r1 DELL PE_SC3 0 1000013)
(XEN) ACPI: FACP 7BAB2000, 00F4 (r4 DELL PE_SC3 0 DELL 1)
(XEN) ACPI: DSDT 7BA9C000, EACD (r2 DELL PE_SC3 3 DELL 1)
(XEN) ACPI: FACS 7B8F3000, 0040
(XEN) ACPI: MCEJ 7BAB5000, 0130 (r1 INTEL 1 INTL 100000D)
(XEN) ACPI: WD__ 7BAB4000, 0134 (r1 DELL PE_SC3 1 DELL 1)
(XEN) ACPI: SLIC 7BAB3000, 0024 (r1 DELL PE_SC3 1 DELL 1)
(XEN) ACPI: HPET 7BAB1000, 0038 (r1 DELL PE_SC3 1 DELL 1)
(XEN) ACPI: APIC 7BAB0000, 0AFC (r2 DELL PE_SC3 0 DELL 1)
(XEN) ACPI: MCFG 7BAAF000, 003C (r1 DELL PE_SC3 1 DELL 1)
(XEN) ACPI: MSCT 7BAAE000, 0090 (r1 DELL PE_SC3 1 DELL 1)
(XEN) ACPI: SLIT 7BAAD000, 006C (r1 DELL PE_SC3 1 DELL 1)
(XEN) ACPI: SRAT 7BAAB000, 1130 (r3 DELL PE_SC3 1 DELL 1)
(XEN) ACPI: SSDT 7B959000, 1424A9 (r2 DELL PE_SC3 4000 INTL 20121114)
(XEN) ACPI: SSDT 7B956000, 217F (r2 DELL PE_SC3 2 INTL 20121114)
(XEN) ACPI: SSDT 7B955000, 006E (r2 DELL PE_SC3 2 INTL 20121114)
(XEN) ACPI: PRAD 7B954000, 0132 (r2 DELL PE_SC3 2 INTL 20121114)
(XEN) ACPI: DMAR 7BAFE000, 00F8 (r1 DELL PE_SC3 1 DELL 1)
(XEN) ACPI: HEST 7BAFD000, 017C (r1 DELL PE_SC3 2 DELL 1)
(XEN) ACPI: BERT 7BAFC000, 0030 (r1 DELL PE_SC3 2 DELL 1)
(XEN) ACPI: ERST 7BAFB000, 0230 (r1 DELL PE_SC3 2 DELL 1)
(XEN) ACPI: EINJ 7BAFA000, 0150 (r1 DELL PE_SC3 2 DELL 1)
(XEN) System RAM: 98210MB (100567388kB)
(XEN) Domain heap initialised DMA width 32 bits
(XEN) ACPI: 32/64X FACS address mismatch in FADT -
7b8f3000/0000000000000000, using 32
(XEN) IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
(XEN) IOAPIC[1]: apic_id 9, version 32, address 0xfec01000, GSI 24-47
(XEN) IOAPIC[2]: apic_id 10, version 32, address 0xfec40000, GSI 48-71
(XEN) Enabling APIC mode: Phys. Using 3 I/O APICs
(XEN) Not enabling x2APIC (upon firmware request)
(XEN) xstate: size: 0x340 and states: 0x7
(XEN) CMCI: threshold 0x2 too large for CPU0 bank 17, using 0x1
(XEN) CMCI: threshold 0x2 too large for CPU0 bank 18, using 0x1
(XEN) CMCI: threshold 0x2 too large for CPU0 bank 19, using 0x1
(XEN) Speculative mitigation facilities:
(XEN) Hardware features: IBRS/IBPB STIBP L1D_FLUSH SSBD MD_CLEAR
(XEN) Compiled-in support: INDIRECT_THUNK SHADOW_PAGING
(XEN) Xen settings: BTI-Thunk RETPOLINE, SPEC_CTRL: IBRS- SSBD-,
Other: IBPB L1D_FLUSH VERW
(XEN) L1TF: believed vulnerable, maxphysaddr L1D 46, CPUID 46, Safe
address 300000000000
(XEN) Support for HVM VMs: MSR_SPEC_CTRL RSB EAGER_FPU MD_CLEAR
(XEN) Support for PV VMs: MSR_SPEC_CTRL RSB EAGER_FPU MD_CLEAR
(XEN) XPTI (64-bit PV only): Dom0 enabled, DomU enabled (with PCID)
(XEN) PV L1TF shadowing: Dom0 disabled, DomU enabled
(XEN) Using scheduler: SMP Credit Scheduler rev2 (credit2)
(XEN) Initializing Credit2 scheduler
(XEN) Platform timer is 14.318MHz HPET
(XEN) Detected 2596.991 MHz processor.
(XEN) Initing memory sharing.
(XEN) Intel VT-d iommu 0 supported page sizes: 4kB, 2MB, 1GB.
(XEN) Intel VT-d iommu 1 supported page sizes: 4kB, 2MB, 1GB.
(XEN) Intel VT-d Snoop Control enabled.
(XEN) Intel VT-d Dom0 DMA Passthrough not enabled.
(XEN) Intel VT-d Queued Invalidation enabled.
(XEN) Intel VT-d Interrupt Remapping enabled.
(XEN) Intel VT-d Posted Interrupt not enabled.
(XEN) Intel VT-d Shared EPT tables enabled.
(XEN) I/O virtualisation enabled
(XEN) - Dom0 mode: Relaxed
(XEN) Interrupt remapping enabled
(XEN) Enabled directed EOI with ioapic_ack_old on!
(XEN) ENABLING IO-APIC IRQs
(XEN) Allocated console ring of 64 KiB.
(XEN) VMX: Supported advanced features:
(XEN) - APIC MMIO access virtualisation
(XEN) - APIC TPR shadow
(XEN) - Extended Page Tables (EPT)
(XEN) - Virtual-Processor Identifiers (VPID)
(XEN) - Virtual NMI
(XEN) - MSR direct-access bitmap
(XEN) - Unrestricted Guest
(XEN) - APIC Register Virtualization
(XEN) - Virtual Interrupt Delivery
(XEN) - Posted Interrupt Processing
(XEN) - VMCS shadowing
(XEN) - VM Functions
(XEN) HVM: ASIDs enabled.
(XEN) VMX: Disabling executable EPT superpages due to CVE-2018-12207
(XEN) HVM: VMX enabled
(XEN) HVM: Hardware Assisted Paging (HAP) detected
(XEN) HVM: HAP page sizes: 4kB, 2MB, 1GB
(XEN) CMCI: threshold 0x2 too large for CPU16 bank 17, using 0x1
(XEN) CMCI: threshold 0x2 too large for CPU16 bank 18, using 0x1
(XEN) CMCI: threshold 0x2 too large for CPU16 bank 19, using 0x1
(XEN) Brought up 32 CPUs
(XEN) mtrr: your CPUs had inconsistent variable MTRR settings
(XEN) Dom0 has maximum 840 PIRQs
(XEN) Xen kernel: 64-bit, lsb, compat32
(XEN) Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x30b3000
(XEN) PHYSICAL MEMORY ARRANGEMENT:
(XEN) Dom0 alloc.: 0000001840000000->0000001844000000 (1029560
pages to be allocated)
(XEN) Init. ramdisk: 000000187f5b8000->000000187ffffe20
(XEN) VIRTUAL MEMORY ARRANGEMENT:
(XEN) Loaded kernel: ffffffff81000000->ffffffff830b3000
(XEN) Init. ramdisk: 0000000000000000->0000000000000000
(XEN) Phys-Mach map: 0000008000000000->0000008000800000
(XEN) Start info: ffffffff830b3000->ffffffff830b34b4
(XEN) Xenstore ring: 0000000000000000->0000000000000000
(XEN) Console ring: 0000000000000000->0000000000000000
(XEN) Page tables: ffffffff830b4000->ffffffff830d1000
(XEN) Boot stack: ffffffff830d1000->ffffffff830d2000
(XEN) TOTAL: ffffffff80000000->ffffffff83400000
(XEN) ENTRY ADDRESS: ffffffff824f5180
(XEN) Dom0 has maximum 4 VCPUs
(XEN) Initial low memory virq threshold set at 0x4000 pages.
(XEN) Scrubbing Free RAM in background
(XEN) Std. Loglevel: Errors and warnings
(XEN) Guest Loglevel: Nothing (Rate-limited: Errors and warnings)
(XEN) ***************************************************
(XEN) Booted on L1TF-vulnerable hardware with SMT/Hyperthreading
(XEN) enabled. Please assess your configuration and choose an
(XEN) explicit 'smt=<bool>' setting. See XSA-273.
(XEN) ***************************************************
(XEN) Booted on MLPDS/MFBDS-vulnerable hardware with SMT/Hyperthreading
(XEN) enabled. Mitigations will not be fully effective. Please
(XEN) choose an explicit smt=<bool> setting. See XSA-297.
(XEN) ***************************************************
(XEN) 3... 2... 1...
(XEN) Xen is relinquishing VGA console.
(XEN) *** Serial input to DOM0 (type 'CTRL-a' three times to switch input)
(XEN) Freed 500kB init memory
(XEN) TSC marked as reliable, warp = 0 (count=2)
(XEN) dom1: mode=0,ofs=0x914112976,khz=2596991,inc=1

> > nohup ssh 192.168.1.11 tar cf - --one-file-system /a | cat > /dev/null &
> How about trying iperf or iperf3 with either only transmit or receive? iperf is specifically designed to use maximal bandwidth and doesn't use disk.
> http://fasterdata.es.net/performance-testing/network-troubleshooting-tools/throughput-tool-comparision/

Noted, thank you! I'll look at those tools now, and try them.

> For independently load-testing disk, you can try dd or fio, while being cognizant of the disk cache. To avoid actual disk I/O I think you should be
> able to use a ram based disk in the dom0 instead of a physical disk. However, I wouldn't bother if you can reproduce with network only, until the
> network issue has been fixed.

Acknowledged, thanks.

> > maxmem=90112
> > vcpus=26
> This is fairly large.
> Have you tried both fewer cpus and less memory? If you can reproduce with iperf, which probably will reproduce more quickly, can you reproduce with
> memory=2048 and vcpus=1 or vcpus=2 for example? FYI the domU might not boot at all with vcpus=1 with some kernel versions.

I... have not.... and please pardon my ignorance here, but my guest
machine runs a lot of different things for our client, and definitely
needs the RAM (and I *think* needs the CPUs, although I confess that
I'm not sure how vcpus translate to available compute power.) I can
try the smaller numbers, but have not because to me it's off-point,
since my guest requires the larger number of resources we've
traditionally allocated.

> But I would try that only if none of the network changes show a difference.

Okay, understood.

> > 'rate=100Mb/s,mac=00:16:3f:49:4a:41,bridge=br0',
> You probably want to try removing the vif rate limit. Using rate=... I got soft lockups on the dom0 many kernel versions ago. I don't know what happens
> if the soft lockups in the dom0 have been fixed - perhaps another problem remains in the domU.
> If removing "rate" fixes it, switch to rate limiting with another method - possibly 'tc' but there might be something better available now using BPF.

Okay, will attempt that as a subsequent step.

> Also, have you tried at all looking at or changing the offload settings in the dom0 and/or domU with "/sbin/ethtool -k/-K <device>" ? I don't think
> this is actually the issue. But it's been a source of problems historically and it's easy to try.

In the past, I ran with, e.g.:

ethtool -K em1 rx off tx off sg off tso off ufo off gso off gro off lro off

on both the host and the guest. The problems did occur after the
upgrade even with those settings. I then stopped using them
(commented them out) on both host and guest - it made no material
difference that I could see - the guest still crashed, and had roughly
the same performance, either way.

> > I am looking for a way to file a Xen bug report for this; so far, I
> > haven't found it, but I will keep looking.
> https://wiki.xen.org/wiki/Reporting_Bugs_against_Xen_Project

Thank you!

> But try some more data collection and debugging first, ideally by changing one thing at a time.

Understood, and will do.

> > thoughts, guidance, musings, etc., anything at all would be
> > appreciated.
> x-ref https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html
> I don't get the impression you've tried using sysrq already, since you did not mention it by name. If you have tried sysrq, it would be helpful if you
> could go back through your original email and add examples of all of the commands you've run.

I have not. The closest I got to this was the xl trigger nmi command,
which just (sometimes) brings the guest back to life. Thank you for
that pointer!

> For PV, to send the sysrq you can try 'xl sysrq <domU> <key>' or 'ctrl-o <key>' on the virtual serial console. Neither will probably work for HVM. I

Right. We prefer PV, I tried HVM as a test, it failed, so we've stayed with PV.

> When the domU locks up, you *might* get interesting information from the 'x' and 'l' sysrq commands within the domU. You may need to enable that
> functionality with 'sysctl -w kernel.sysrq=1'.
> I'm not sure the 'l' command works for PV at all. 'l' works for HVM.

Okay, done. I've enabled it on my ailing production guest and my test
guest, and will try those two commands on the next stall.

> If you can send a sysrq when the domU is not locked up, but can't send one when it's locked up, that's also potentially interesting.

Okay noted. I'm going to try it on the stalled guest first, and then
I'll try again immediately after I reboot said guest, and report.

So, going back to your "ideally by changing one thing at a time"
comment, here's kind of how I'm proceeding:

1. Just prior to sending my original email, I had booted the guest
with tsc_mode="always_emulate", and I am currently stress-testing it
with a large number of those tar jobs. I'd already done this/started
them, so I'm going to let them run overnight (I'm on Pacific time).
I'd be surprised if this solves it... and I'm not going to wait past
tomorrow morning because I feel like downgrading Xen is a more
productive approach (see below), but who knows, it will be interesting
to see if the machine survives the night at least.

2. I've armed the sysrq on all machines, and will try that if either
guest crashes, and will capture and post any output I can get from
them.

3. The next thing I'm going to try - tomorrow morning - is taking Xen
down to 4.10.4 via an OpenSuse 15.0 install on my test Dom0, I'm then
going to boot the guest (in its default config without tsc_mode
overridden) and see if it runs reliably. If it does, I'll report
that.

IF it does, I'm going to transfer my production guest to this host in
the hope that it becomes stable, but I can still continue testing
beyond that if it's desired here, by using the former production
machine as a test bed (since it has the same problems.)

Next steps after that, as I understand what you've said:

4. Take the physical host back to Xen 4.9.4, with the old default
4.4.180 kernel, test and report. (I'd expect this to work, as this is
the "old, pre-trouble" setup, but who knows.)
5. Take that host up to the 4.12 kernel with the old Xen 4.9.4, test and report.
6. Remove the rate limit, test and report.

Let me know if that's not right, or you'd like to see anything done differently.

And of course, my challenge here is simply that these stalls don't
happen immediately. Under heavy load, they usually take just hours,
but might take days. Given what I've seen, I don't personally think I
could call anything "solved" unless the guest survived under elevated
load for at least seven days. So I will be performing each of these
steps, but it may take several days or more to report on each one.
But like I said, this is for a large client, and there is of course a
sense of... wanting to get this solved quickly... so I will do the
steps you've suggested and report on each one.

In the meantime, THANK YOU for your response, and if you or anyone
else has any other thoughts, please do send them to me!

Glen

Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load
On 2/13/20 6:26 PM, Glen wrote:

> I tried both xl network-detach followed by a network-attach (feeding
> back in the parameters from my guest machine.)

OK. Were you able to check if the network device went away in the domU? It should have, but you won't see anything in dmesg necessarily.

> (XEN) Using scheduler: SMP Credit Scheduler rev2 (credit2)
> (XEN) Initializing Credit2 scheduler

This is something that changed by default from xen 4.11 to 4.12:

https://xenproject.org/2019/04/02/whats-new-in-xen-4-12/

You could try the old scheduler:

https://xenbits.xen.org/docs/unstable/features/sched_credit.html

I am skeptical this is the problem, but you could try the old one.
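
That's a hypervisor boot parameter, so a sketch for your setup would be
appending sched=credit to the Xen command line (variable name per your
distro's grub defaults) and rebooting the dom0:

GRUB_CMDLINE_XEN_DEFAULT="dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin gnttab_max_frames=256 sched=credit"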

>>> maxmem=90112
>>> vcpus=26
>> This is fairly large.
>> Have you tried both fewer cpus and less memory? If you can reproduce with iperf, which probably will reproduce more quickly, can you reproduce with
>> memory=2048 and vcpus=1 or vcpus=2 for example? FYI the domU might not boot at all with vcpus=1 with some kernel versions.
>
> I... have not.... and please pardon my ignorance here, but my guest
> machine runs a lot of different things for our client, and definitely
> needs the RAM (and I *think* needs the CPUs, although I confess that
> I'm not sure how vcpus translate to available compute power.) I can
> try the smaller numbers, but have not because to me it's off-point,
> since my guest requires the larger number of resources we've
> traditionally allocated.

Then set up another VM for testing?

Anything about your setup that's out of the ordinary is a reasonable place to start looking for problems. It may not solve your immediate issue but if
it means a developer can reproduce, that gives you a chance of the bug actually getting fixed.

> So, going back to your "ideally by changing one thing at a time"
> comment, here's kind of how I'm proceeding:

I'd recommend you start by attempting to reproduce the problem as fast as possible, with the setup as-is, before changing anything. 4 days is too long
to have any certainty.

BTW, if it's the domU network load - you would probably reproduce fastest by running testing between 2 domUs on the same dom0, if you can.

--Sarah

Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load
On 2/13/20 7:06 PM, Sarah Newman wrote:

> I'd recommend you start by attempting to reproduce the problem as fast as possible, with the setup as-is, before changing anything. 4 days is too long
> to have any certainty.
>
> BTW, if it's the domU network load - you would probably reproduce fastest by running testing between 2 domUs on the same dom0, if you can.

Another thing I do is look through the commit logs in the Linux and Xen branches, compared to my current version, for any related commits that might be
fixes. For Linux I'm not sure, but maybe both the 4.9 and 4.14 LTS git commit histories; for Xen you would look at the staging-4.12 branch.
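
A sketch of what I mean (tag/branch names as published on
xenbits.xen.org and kernel.org; adjust to your exact versions):

# Xen: fixes that landed on the stable branch after 4.12.1
git clone https://xenbits.xen.org/git-http/xen.git && cd xen
git log --oneline RELEASE-4.12.1..origin/staging-4.12

# Linux: netfront/netback history, run inside a stable kernel checkout
git log --oneline v4.12.. -- drivers/net/xen-netfront.c drivers/net/xen-netback/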

This approach assumes you're already able to reproduce the problem, so that you can retest after applying the patch.

--Sarah

Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load
On Thu, Feb 13, 2020 at 7:06 PM Sarah Newman <srn@prgmr.com> wrote:
> > I tried both xl network-detach followed by a network-attach (feeding
> > back in the parameters from my guest machine.)
> OK. Were you able to check if the network device went away in the domU? It should have, but you won't see anything in dmesg necessarily.

Alas no. The guest was unresponsive when I did this, and console
functionality was very limited (as in, "sync" might have worked, but
nothing else did.) I can only report that the commands didn't seem to
help, or fix the problem, or have any visible impact on the guest.

> You could try the old scheduler:
> https://xenbits.xen.org/docs/unstable/features/sched_credit.html
> I am skeptical this is the problem, but you could try the old one.

Okay, noted, and added to my list.

> Anything about your setup that's out of the ordinary is a reasonable place to start looking for problems. It may not solve your immediate issue but if
> it means a developer can reproduce, that gives you a chance of the bug actually getting fixed.

Absolutely, and that's what I want. I obviously want to solve my
immediate problem and just get my setup to be stable, even if that
means running on older Xen... but I take Xen seriously, and even if I
get my situation stabilized, I will still work on this as long as
anyone here wants to listen to me. :-)

> I'd recommend you start by attempting to reproduce the problem as fast as possible, with the setup as-is, before changing anything. 4 days is too long
> to have any certainty.

Right. In this case, a failure becomes good (for debugging).

I've got 20 simultaneous tar processes running against my guest right
now (which is way more than I've ever needed or attempted - because
I'm grumpy and trying to do exactly what you say - make the guest
crash as fast as possible so I can eliminate possibilities), with that
tsc_mode="always_emulate" setting. It's survived that for 10 hours so
far, which is far more than I expected. I can't imagine that *that*
might solve this, but... I'll continue to watch it as long as I can,
and report either way in the morning to see how it does after 24
hours.

I can leave the host alone and test against that setting more, to see
if I can crash the guest without it faster (again) at the higher load.
I feel like downgrading to Xen 4.10 will probably fix *my* problem,
but mask *the* problem, and I really want both fixed. :-)

> BTW, if it's the domU network load - you would probably reproduce fastest by running testing between 2 domUs on the same dom0, if you can.

Not under my current setup, no. This is a huge guest and it (almost)
maxes the host. I've got a pair of servers committed to this already,
just for testing. But even if I can solve my immediate issue, I'll
still have another pair (the current production group) of servers I
can mess with, and I'll have more flexibility to change things then.

THANK YOU THANK YOU!
Glen

Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load
On Thu, Feb 13, 2020 at 7:20 PM Sarah Newman <srn@prgmr.com> wrote:
> Another thing I do is look through the commit logs in the Linux and Xen branches, compared to my current version, for any related commits that might be
> fixes. For Linux I'm not sure, but maybe both the 4.9 and 4.14 LTS git commit histories; for Xen you would look at the staging-4.12 branch.

That I can do...

> This approach assumes you're already able to reproduce the problem, so that you can retest after applying the patch.

... but knowing how to apply a patch in this case is something I'll
have to figure out. I mean, I know how to use patch and apply patches
to source code, I'm just not familiar enough with the kernel and Xen
at this moment to know exactly how to do that in this context. That's
why I'm starting with OS-provided packages and distros, it's easier
and within my experience. I know how to build userland software
packages from source, but Xen (and the kernel) are... larger than
anything I've done before... so I'd need to get up to speed a bit
before being able to function effectively on that deep a level.

Happy to go there if I have to, though!

Glen

Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load
On Thu, Feb 13, 2020 at 10:11 PM Glen <glenbarney@gmail.com> wrote:

> Dear Xen Team:
>
> Since upgrading to Xen 4.12, I'm experiencing an ongoing problem with
> stalled guests. I had previously thought I was the only one with this
> problem, and had reported this to my distro's virtualization team (
> see: https://lists.opensuse.org/opensuse-virtual/2019-12/msg00000.html
> and https://lists.opensuse.org/opensuse-virtual/2019-12/msg00003.html
> for thread heads ), but although they tried to help, we all kind of
> concluded (wrongly) that I must just have had a bad guest.
>
> I finally recreated a new host and a new guest, clean, from scratch,
> thinking that would solve the problem, and it didn't. That led me to
> search again, and I now see that another individual (who doubtless
> thought HE was the only one) has reported this issue to your list (
> https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00015.html
> et al).
>
> So I'm sending this report here to let you all know that this problem
> is now no longer limited to one person, and it is reproducible, and
> is, in my opinion, severe. I hope that someone here can help, or
> point us in the right direction. Here we go:
>
> Problem: Xen DomU guests randomly stall under high network/disk
> loads. Dom0 is not affected. Randomly means anywhere between 1 hour
> and 14 days after guest boot - the time seems to shorten with (or
> perhaps the problem is triggered by) increased network (and possibly
> disk) activity.
>
> Symptoms on the DomU Guest:
> 1. Guest machine performs normally until the moment of failure. No
> abnormal log/console entries exist.
> 2. At the moment of failure, the guest's network goes offline. No
> abnormal log/console entries are written at that moment.
> 3. Processes which were trying to connect to the network start to
> consume increasing amounts of CPU.
> 4. Load average of the guest starts to increase, continuing upward
> without apparent bound.
> 5. If a high-priority bash shell is left logged in on the guest hvc0
> console, some commands might still be runnable; most are not.
> 6. If the guest console is not logged in, the console is frozen and
> doesn't even echo characters.
> 7. Some guests will output messages on the console like this:
> kernel: [164084.912966] NMI watchdog: BUG: soft lockup - CPU#16 stuck for
> 67s!
> 8. On some others, I will also see output like:
> BUG: workqueue lockup - pool cpus=20 node=0 flags=0x0 nice=-20 stuck for
> 70s!
> 9. Sometimes there is no output at all on the console.
>
> Symptoms on the Dom0 Host:
> 1. The Dom0 Host is unaffected. The only indication anything is
> happening on the host are two log entries in /var/log/messages:
> vif vif-6-0 vif6.0: Guest Rx stalled
> br0: port 2(vif6.0) entered disabled state
> 2. Other guests are not affected (Although other guests too may stall
> at other random times, stalls on one guest do not seem to affect other
> guests directly.)
>
> Circumstances when the problem first occurred:
> 1. All hosts and guests were previously on Xen 4.9.4 (OpenSuse 42.3,
> Linux 4.4.180, Xen 4.9.4)
> 2. I upgraded one physical host to Xen 4.12.1 (OpenSuse 15.1, Linux
> 4.12.14, Xen 4.12.1).
> 3. The guest(s) on that host started malfunctioning at that point.
>
> Immediate steps taken while the guest was stalled, which did not help:
> 1. Tried to use high-priority shell on guest console to kill high-CPU
> processes; they were unkillable.
> 2. Tried to use guest console to stop and restart network; commands
> were unresponsive.
> 3. Tried to use guest console to shutdown/init 0. This caused console
> to be terminated, but guest would not otherwise shutdown.
> 4. Tried to use host xl interface to unplug/replug network bridges.
> This appeared to work from host side, but guest was unaffected.
>
> One thing which I accidentally discovered that *did* help:
> 1. Tried sending xl trigger nmi from the host to the guest.
>
> When I trigger the stalled guest with an NMI, I get its attention.
> The guest will print the following on the console:
>
> Uhhuh. NMI received for unknown reason 00 on CPU 0.
> Do you have a strange power saving mode enabled?
> Dazed and confused, but trying to continue
>
> In some cases (pattern not yet known), the guest will then immediately
> come back online: The network will come back online, and all
> processes will slowly stop consuming CPU, and things will return to
> normal. Existing network connections were obviously terminated, but
> new connections are accepted. In that case, it's like the guest just
> magically comes back to life.
>
> When this works, the host log shows:
> vif vif-6-0 vif6.0: Guest Rx ready
> br0: port 2(vif6.0) entered blocking state
> br0: port 2(vif6.0) entered forwarding state
>
> And all seems well... as if the guest had never stalled.
>
> However, this is not reliable. In most cases, the guest will print
> those messages, but the processes will NOT recover, and the network
> will come back impaired, or not at all. When that happens, repeated
> NMIs do not help: If the guest doesn't recover the first time, it
> doesn't recover at all.
>
> The *only* reliable way to recover is to destroy the guest completely,
> and recreate it. This is a hard destroy: the guest cannot shut
> itself down. The guest will then run fine... until the next stall.
> But of course a hard-destroy can't be a healthy thing for a guest
> machine, and that's really not a solution.
>
> Long-term mitigation steps which were tried which did not help.
> 1. Thought this was an SSH bug (since sshd processes were consuming
> high CPU), installed latest OpenSSH.
> 2. Though maybe a PV problem, tried under HVM instead of PV.
> 3. Noted a problem with grant frames, applied the recommended fix for
> that, my config now looks like:
> # xen-diag gnttab_query_size 0 # Domain-0
> domid=0: nr_frames=1, max_nr_frames=64
> # xen-diag gnttab_query_size 1 # Xenstore
> domid=1: nr_frames=4, max_nr_frames=4
> # xen-diag gnttab_query_size 6 # My guest
> domid=6: nr_frames=17, max_nr_frames=256
> 4. Thought maybe a kernel module might be at issue, reviewed list with
> OpenSuse team, pruned modules.
> 5. Thought this might be a kernel mismatch, was referred to a new
> kernel by OpenSuse team (Linux 4.12.13 for OpenSuse 42.3). That
> changed some of the console output behavior and logging, but did not
> solve the problem.
> 6. Thought this might be a general OS mismatch, tried upgrading the
> guest to the same OS/Xen versions as the host (OpenSuse 15.1/Linux
> 4.12.14/Xen 4.12.1). In this configuration, no console or log output
> is generated on the guest at all, it just stalls.
> 7. Assumed (incorrectly, it now turns out) that something was just
> "wrong" with my guest, tried a fresh load of host, and a fresh guest.
> I thought that would solve it, but to my sadness, it did not.
>
> Which means that this is now a reproducible bug.
>
> Steps to reproduce:
> 1. Get a server. I'm using a Dell PowerEdge R720, but this has
> happened on several different Dell models. My current server has two
> 16-core CPUs, and 128GB of RAM.
> 2. Load Xen 4.12.1 (OpenSuse 15.1/Xen 4.12.1) on the server. Boot it
> up in Xen Dom0/host mode.
> 3. Create a new guest machine, also with 4.12.1.
> 4. Fire up the guest.
> 5. Put a lot of data on the guest (my guest has 3 TB of files and data).
> 6. Plug a crossover cable into your server, and plug the other end
> into some other Linux machine.
> 7. From that other machine, start pounding the guest. An rsync of the
> entire data partition is a great way to trigger this. If I run
> several outbound rsyncs together, I can crash my guest in under 48
> hours. If I run 4 or 5, I can often crash the guest in just 2 hours.
> If you don't want to damage your SSDs on your other machine, here's my
> current command (my host is 192.168.1.10, and my guest is
> 192.168.1.11, so I plug in some other machine and make it, say,
> 192.168.1.12, and then run:
>
> nohup ssh 192.168.1.11 tar cf - --one-file-system /a | cat > /dev/null &
>
> Where /a is my directory full of user data. 4-6 of these running
> simultaneously will bring the guest to its knees in short order.
>
> On my most recent test, I did the NMI trigger thing, and found this in
> the guest's /var/log/messages after sending the trigger (I've removed
> tagging and timestamping for clarity:)
>
> Uhhuh. NMI received for unknown reason 00 on CPU 0.
> Do you have a strange powersaving mode enabled?
> Dazed and confused, but trying to continue
> clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc'
> as unstable because the skew is too large:
> clocksource: 'xen' wd_now: 58842b687eb3c wd_last: 55aa97ff29565 mask:
> ffffffffffffffff
> clocksource: 'tsc' cs_now: 58d3355ea9a87e cs_last: 585cca21d4f074
> mask: ffffffffffffffff
> tsc: Marking TSC unstable due to clocksource watchdog
> BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=-20 stuck for
> 50117s!
> Showing busy workqueues and worker pools:
> workqueue events: flags=0x0
> pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
> pending: clocksource_watchdog_work, vmstat_shepherd, cache_reap
> workqueue mm_percpu_wq: flags=0x8
> pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
> pending: vmstat_update
> workqueue writeback: flags=0x4e
> pwq 52: cpus=0-25 flags=0x4 nice=0 active=1/256
> in-flight: 28593:wb_workfn
> workqueue kblockd: flags=0x18
> pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=2/256
> pending: blk_mq_run_work_fn, blk_mq_timeout_work
> pool 52: cpus=0-25 flags=0x4 nice=0 hung=0s workers=3 idle: 32044 18125
>
> That led me to search around, and I tripped over this:
> https://wiki.debian.org/Xen/Clocksource , which describes a guest
> hanging with the message "clocksource/0: Time went backwards/"
> Although I did not see this message, and this is not directly on point
> with OpenSuse (since our /proc structure doesn't include some of the
> switches mentioned), I did notice clocksource references in the logs
> (see above), and that led me back to:
>
> https://doc.opensuse.org/documentation/leap/virtualization/html/book.virt/cha-xen-manage.html,
> and specifically the tsc_mode setting. I have no idea if it's
> relevant, but since I'm out of ideas and have nothing better to try,
> I have now booted my guest into tsc_mode=1 and am stress testing it to
> see if it fares any better this way.
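>
> (For reference, that's a single line in the guest config - per the xl
> docs, numeric mode 1 is "always_emulate", so the string form should be
> equivalent, though double-check against your xl version:
>
> tsc_mode="always_emulate"
>
> or simply tsc_mode=1.)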
>
> Administrivia:
> OS: OpenSuse 15.1
> Linux: 4.12.14-lp151.28.36
> Xen: 4.12.1
> Dom0 boot parameters: dom0_mem=4G dom0_max_vcpus=4 dom0_vcpus_pin
> gnttab_max_frames=256
> Xen guest config:
>
> name="guest1"
> description="guest1"
> memory=90112
> maxmem=90112
> vcpus=26
> cpus="4-31"
> on_poweroff="destroy"
> on_reboot="restart"
> on_crash="restart"
> on_watchdog="restart"
> localtime=0
> keymap="en-us"
> type="pv"
> kernel="/usr/lib/grub2/x86_64-xen/grub.xen"
> extra="elevator=noop"
> disk=[
> '/xen/guest1/guest1.root,raw,xvda1,w',
> '/xen/guest1/guest1.swap,raw,xvda2,w',
> '/xen/guest1/guest1.xa,raw,xvda3,w',
> ]
> vif=[
> 'rate=100Mb/s,mac=00:16:3f:49:4a:41,bridge=br0',
> ]
> vfb=['type=vnc,vncunused=1']
>
> I had originally thought that I was the only person with this problem,
> and that's why I thought a fresh guest would fix it - the problem
> followed me around different servers, so that made sense. Over the
> past weeks I've set up a fresh guest on my fresh host, and, just on a
> whim, did the above stress testing on it... sadly, it only lasted for
> 36 hours. On the older Xen 4.9, I *never* encountered problems, and
> nothing changed other than OS/Xen versions when I did the upgrades to
> the new versions.
>
> Since I can now reproduce the problem on different hardware and
> setups, I thought I'd start my searches over again. To my relief, I
> found that, just in the past few weeks, another person has now
> reported what seems to be the same problem, only he reported it to
> this list (whereas I had sent my report to OpenSuse.) In his message,
> referenced above, he states that the problem is limited to Xen 4.12
> and 4.13, and that rolling back to Xen 4.11 solves the problem.
>
> If that's right, there seems to be a *significant* problem somewhere,
> and it's clearly no longer just one instance.
>
> I am still looking for the proper way to file a bug report with the
> Xen project; so far, I haven't found it, but I will keep looking.
>
> Meanwhile, I'm hoping that these details and history spark something
> for some of you here. Do any of you have any ideas on this? Any
> thoughts, guidance, musings, etc., anything at all would be
> appreciated.
>
> Again, thank you all for your patience and help, I am very grateful!
>
> Glen
>
> _______________________________________________
> Xen-users mailing list
> Xen-users@lists.xenproject.org
> https://lists.xenproject.org/mailman/listinfo/xen-users



Hello Glen, thanks for your report.

The symptoms seem similar:
- xen 4.12
- 2 cpu
- high load
- dom0 is ok, domU stalls

I've just upgraded one of my machines to xen 4.12 (again) with
sched=credit, I'll report back if it helps.

Thanks,
Tomas
Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load [ In reply to ]
On Fri, Feb 14, 2020 at 12:07 AM Tomas Mozes <hydrapolic@gmail.com> wrote:
> Hello Glen, thanks for your report.

Thank you!

> The symptoms seem similar:
> - xen 4.12
> - 2 cpu
> - high load
> - dom0 is ok, domU stalls

Yes, same here.

> I've just upgraded one of my machines to xen 4.12 (again) with sched=credit, I'll report back if it helps.

Thanks. Right now I'm focused on tsc_mode, since it's what I was
working on before Sarah's responses yesterday.

My guest machine survived 24 hours under a very high load test using
tsc_mode="always_emulate" in the guest machine config. I realize that
24 hours is hardly conclusive, but given Sarah's suggestion to try to
eliminate things quickly, I've now switched the guest to what I think
is the opposite tsc_mode="native", and I'm trying it there for a
little while to see if it crashes. (Neither of these are the default
- the default seems to be a hybrid of the two - but I will return and
test that once more if I can.)
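
(Mechanically, each flip is just an edit-and-restart cycle on my side -
a sketch, with my own hypothetical config path:

xl shutdown guest1
# edit tsc_mode="native" or "always_emulate" in the config, then:
xl create /etc/xen/guest1.cfg

Nothing fancier than that.)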

In addition to my normal test methods, I have at Sarah's suggestion
thrown in an iperf3 at maximum speed, continuous repeat from the host
to the guest, just to see if it helps stall the machine faster. It's
pushing data at 14GBps right now, so we'll see.
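
For anyone wanting to copy the iperf3 part, it's nothing fancy - an
iperf3 server on the guest, and a restart loop on the host (addresses
as in my earlier example):

iperf3 -s                                            # on the guest (192.168.1.11)
while true; do iperf3 -c 192.168.1.11 -t 60; done    # on the host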

I will report on whatever happens.

Thank you for your response!

Glen

_______________________________________________
Xen-users mailing list
Xen-users@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-users
Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load [ In reply to ]
On 2/14/20 9:00 AM, Glen wrote:
> On Fri, Feb 14, 2020 at 12:07 AM Tomas Mozes <hydrapolic@gmail.com> wrote:
>> Hello Glen, thanks for your report.
>
> Thank you!
>
>> The symptoms seem similar:
>> - xen 4.12
>> - 2 cpu
>> - high load
>> - dom0 is ok, domU stalls
>
> Yes, same here.
>
>> I've just upgraded one of my machines to xen 4.12 (again) with sched=credit, I'll report back if it helps.
>
> Thanks. Right now I'm focused on tsc_mode, since it's what I was
> working on before Sarah's responses yesterday.
>
> My guest machine survived 24 hours under a very high load test using
> tsc_mode="always_emulate" in the guest machine config. I realize that
> 24 hours is hardly conclusive, but given Sarah's suggestion to try to
> eliminate things quickly,

But how long does it take to reproduce under the original conditions?

If you want confidence that something staying up for 24 hours is a fix, you want a test that takes much less than 24 hours to reproduce the failure
repeatedly under the original conditions.

If it takes 24 hours on average to reproduce, I would say you want to run for at least 48 hours, if not significantly longer, to have any confidence
in a fix.

You could use some probability theory to put some better numbers here.
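
For example: if stalls arrive at random with a mean time-to-failure of
24 hours (a rough exponential model), an unfixed setup survives t hours
with probability about exp(-t/24) - roughly 13.5% for 48 hours, and
well under 1% for a full week. So a week of survival is strong evidence
of a fix, while a single day tells you very little.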

> I've now switched the guest to what I think
> is the opposite tsc_mode="native", and I'm trying it there for a
> little while to see if it crashes. (Neither of these are the default
> - the default seems to be a hybrid of the two - but I will return and
> test that once more if I can.)

If you haven't read the entirety of

https://xenbits.xen.org/docs/unstable/man/xen-tscmode.7.html

you probably want to then.

>
> In addition to my normal test methods, I have at Sarah's suggestion
> thrown in an iperf3 at maximum speed, continuous repeat from the host
> to the guest, just to see if it helps stall the machine faster. It's
> pushing data at 14GBps right now, so we'll see.

If it's the vif rate limit which is the issue - that affects data outbound from the guest, not inbound to the guest. It's easy enough to confirm the
direction by checking the interface tx/rx counters.
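
E.g., in the guest, something like:

ip -s link show dev eth0
# or the raw counters:
cat /sys/class/net/eth0/statistics/rx_bytes
cat /sys/class/net/eth0/statistics/tx_bytes

(eth0 being whatever the guest interface is actually called.)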

I was thinking that if it was network load, you might be able to reproduce within a few minutes using iperf, which would make subsequent testing much
easier.

--Sarah

_______________________________________________
Xen-users mailing list
Xen-users@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-users
Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load [ In reply to ]
Hi Sarah -

Thank you for your email!

On Fri, Feb 14, 2020 at 10:36 AM Sarah Newman <srn@prgmr.com> wrote:
> On 2/14/20 9:00 AM, Glen wrote:
> > On Fri, Feb 14, 2020 at 12:07 AM Tomas Mozes <hydrapolic@gmail.com> wrote:
> >> The symptoms seem similar:
> >> - xen 4.12
> >> - 2 cpu
> >> - high load
> >> - dom0 is ok, domU stalls
> > Yes, same here.
> > My guest machine survived 24 hours under a very high load test using
> > tsc_mode="always_emulate" in the guest machine config. I realize that
> > 24 hours is hardly conclusive, but given Sarah's suggestion to try to
> > eliminate things quickly,
> But how long does it take to reproduce under the original conditions?
> If you want confidence that something staying up for 24 hours is a fix, you want a test that takes much less than 24 hours to reproduce the failure
> repeatedly under the original conditions.
> If it takes 24 hours on average to reproduce, I would say you want to run for at least 48 hours, if not significantly longer, to have any confidence
> in a fix.

You're entirely correct - and you've put your finger on the problem,
which is that I don't know how long it takes to reproduce.

Since upgrading to Xen 4.12, *most* of my guests have no problem. The
guests which do have a problem are my busiest guests, by which I mean
the guests that are the most highly-used in terms of web traffic, FTP
and rsync traffic, email list messages, and the like.

If I do nothing, and just let all my guests run normally, my less-used
guests never have a problem. My high-use guests will invariably
stall, and will do so on average after about 4 days of use. However,
I've seen them stall as quickly as 36 hours, and I've seen them last
as long as 14 days.

If I take a guest and stress test it, (meaning, I make a copy of my
busy production guest, and put it up elsewhere, on an internal test
network, so it's not getting any outside traffic at all, and I start
4-6 of those rsync jobs I mentioned previously), I've seen the guest
stall in as little as 90 minutes, and last as long as 48 hours before
stalling.

So in my case, what I would want would be to find a solution under
which one of my stress-test guests lasts for a full 7 days minimum,
which would hopefully imply that a normal guest would last "forever"
(for some value of "forever").

I'm going to add your iperf3 thing to my ongoing stress testing mix to
see if it helps things go faster. Today, I was able to run it against
a guest for two hours, and it did not make the guest crash. So I'll
keep trying.

> If you haven't read the entirety of
> https://xenbits.xen.org/docs/unstable/man/xen-tscmode.7.html
> you probably want to then.

Done, thank you! And, again with the admission that I'm down to
grasping at straws here, the reason I'm looking at this at all is
because I'm trying *anything* I can to restore stability here. The
guests make comments in their logs about "clocksource", and I have no
idea if it's relevant or not, but when on that page I see phrases like
"emulated means ... apps will always run correctly", I feel like that
was an avenue worth checking.

Right now, however, because it required a physical data center visit,
I have proceeded to try to bracket this further. Today I downgraded
Xen on my test host to 4.10, and am now stress-testing the guest
again. I will have to let it run for 7 days to meet my arbitrary
standard, unless it crashes sooner, so I now have to sit and wait for
this, which is frustrating. I'll throw in iperf3 as well, but unless
it crashes all I can do is wait.

That's the problem here overall: There is no *pattern* to when a
guest stalls. No apparent spike in load. No "length of time" before
it goes. It's literally random... it just, literally.... stalls...
like a jogger dying while sitting on the couch.

> I was thinking that if it was network load, you might be able to reproduce within a few minutes using iperf, which would make subsequent testing much
> easier.

I will do as you say, and hope that it helps!

Additional thoughts/comments/guidance welcome and wanted if it occurs
to you or anyone!

I will report back as soon as anything happens, or 7 days of uptime
under stress pass.

THANK YOU!
Glen

_______________________________________________
Xen-users mailing list
Xen-users@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-users
Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load [ In reply to ]
On 2/14/20 4:21 PM, Glen wrote:
> Additional thoughts/comments/guidance welcome and wanted if it occurs
> to you or anyone!

IIRC, you'd mentioned you were hosting on OpenSUSE Leap 15.1?

If that's the case, I'll simply chime in with some heuristic hand-waving ...

Here, hosts running on openSUSE had for a long while suffered frequent, but unfortunately still intermittent, instabilities. Not dissimilar to some of the behaviors you've enjoyed.

When I switched to openSUSE Xen 4.13.0_04 packages with KernelStable (atm, 5.5.3-25.gd654690), Guests of all 'flavors' became *much* better behaved.

Did I ever find a specific cause? No. But, for me, the combo's been a reliable in-production solution (of course, NOW i've jinxed it!). The issues that remain are mainly old-hardware-specific quirks that upgrades generally seem to make vanish.

Yes, I know others are certainly running problem-free with distro-release Xen+Kernel, so take that ^^ all with necessary grains of salt!

_______________________________________________
Xen-users mailing list
Xen-users@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-users
Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load [ In reply to ]
On 2/14/20 4:21 PM, Glen wrote:

> Done, thank you! And, again with the admission that I'm down to
> grasping at straws here, the reason I'm looking at this at all is
> because I'm trying *anything* I can to restore stability here. The
> guests make comments in their logs about "clocksource", and I have no
> idea if it's relevant or not, but when on that page I see phrases like
> "emulated means ... apps will always run correctly", I feel like that
> was an avenue worth checking.

I would personally guess it just means that something didn't get to run for a long time. It might be worth using xl list / xl vcpu-list <domain> when
it's hung to see if it's running or blocked, and whether CPU time is still going up or not.
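
Roughly, from the dom0, a few seconds apart:

xl list
xl vcpu-list <domain>

and compare the state flags and the Time(s) columns between runs.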


>
> Right now, however, because it required a physical data center visit,
> I have proceeded to try to bracket this further. Today I downgraded
> Xen on my test host to 4.10, and am now stress-testing the guest
> again. I will have to let it run for 7 days to meet my arbitrary
> standard, unless it crashes sooner, so I now have to sit and wait for
> this, which is frustrating. I'll throw in iperf3 as well, but unless
> it crashes all I can do is wait.

Well, that gets you security support through December 2020.

>
> Additional thoughts/comments/guidance welcome and wanted if it occurs
> to you or anyone!

I've gotten very useful data from debug builds of both Linux and Xen. They will massively slow down your system, and you don't want to run them in
production.

--Sarah

_______________________________________________
Xen-users mailing list
Xen-users@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-users
Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load [ In reply to ]
Hello! Thank you for your email!

On Fri, Feb 14, 2020 at 5:54 PM PGNet Dev <pgnet.dev@gmail.com> wrote:
> IIRC, you'd mentioned you were hosting on OpenSUSE Leap 15.1?

That's what we upgraded to, and that's when this problem started, yes.
To be specific, I upgraded the Dom0 host to 15.1. The guest was still
at 42.3 (older version, huh) and started having issues at that point -
upgrading the guest to 15.1 did not solve it.

> If that's the case, I'll simply chime in with some heuristic hand-waving ...

Yes please! Any comments at all welcome and wanted!

> Here, hosts running on openSUSE had for a long while suffered frequent, but unfortunately still intermittent, instabilities. Not dissimilar to some of the behaviors you've enjoyed.
> When I switched to openSUSE Xen 4.13.0_04 packages with KernelStable (atm, 5.5.3-25.gd654690), Guests of all 'flavors' became *much* better behaved.

VERY interesting! I will add this to my "to be tested" list.

> Did I ever find a specific cause? No. But, for me, the combo's been a reliable in-production solution (of course, NOW i've jinxed it!).

HA! :-)

> The issues that remain are mainly old-hardware-specific quirks that upgrades generally seem to make vanish.
> Yes, I know others are certainly running problem-free with distro-release Xen+Kernel, so take that ^^ all with necessary grains of salt!

I was very worried late last year when it seemed I was the only one
having this problem. Seeing Tomas' report this past month was
counter-intuitively encouraging, because even though as Sarah rightly
points out we can't be sure it's the same thing, at least I'm not the
only one with a "problem of this type" anymore - so my hope is that
this will (logically) get more attention now since it's reproducible.

Thank you for this info!

Glen

_______________________________________________
Xen-users mailing list
Xen-users@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-users
Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load [ In reply to ]
Tomas -

In your previous report, before I showed up, you wrote:
> I've tried Xen 4.12 and the latest staging Xen 4.13, both behave the same. Doesn't matter if kernel 4.14 or 5.4 is used.

Now PGNet Dev has said:

On Fri, Feb 14, 2020 at 5:54 PM PGNet Dev <pgnet.dev@gmail.com> wrote:
> When I switched to openSUSE Xen 4.13.0_04 packages with KernelStable (atm, 5.5.3-25.gd654690), Guests of all 'flavors' became *much* better behaved.

They used a later kernel than you cited - I'm wondering about the
relationship between their "Xen 4.13.0_04" and your "latest staging
Xen 4.13".

Any thoughts or insight there?

Thanks!
Glen

_______________________________________________
Xen-users mailing list
Xen-users@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-users
Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load [ In reply to ]
On 2/14/20 6:54 PM, Glen wrote:
> That's what we upgraded to, and that's when this problem started, yes.
> To be specific, I upgraded the Dom0 host to 15.1. The guest was still
> at 42.3 (older version, huh) and started having issues at that point -
> upgrading the guest to 15.1 did not solve it.

just fyi, if not already old-news:

opensuse's up-to-date pkgs,

https://build.opensuse.org/project/show/Kernel:stable
https://build.opensuse.org/project/show/Virtualization

my own, that i 'monkey' with (these DO run on my own office/home systems; to date, reliably)

https://build.opensuse.org/project/show/home:pgnd:Kernel:stable
https://build.opensuse.org/project/show/home:pgnd:Virtualization:Xen
https://build.opensuse.org/project/show/home:pgnd:Virtualization:qemu

actual *production* pkgs for openSUSE servers are _based_ on results from that^, but are finalized, built & distributed only locally



_______________________________________________
Xen-users mailing list
Xen-users@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-users
Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load [ In reply to ]
Hi Sarah -

On Fri, Feb 14, 2020 at 6:22 PM Sarah Newman <srn@prgmr.com> wrote:
> I would personally guess it just means that something didn't get to run for a long time. It might be worth using xl list / xl vcpu-list <domain> when
> it's hung to see if it's running or blocked and how many cpu times are going up or not.

Okay, good. I'm adding that to my list of other things to pull the
next time a guest freezes. Thank you for this!

> Well, that gets you security support through December 2020.

With zero sarcasm intended, all I really want is for it to get me the
ability to sleep through the night. :-) I absolutely plan to keep
pushing on this as long as I can even if I get stability on 4.10 - so
I hope that by December 2020 we/they will have figured this out and
fixed it. I'll take what I can get! :-)

> I've gotten very useful data from debug builds of both Linux and Xen. They will massively slow down your system, and you don't want to run them in
> production.
> --Sarah

That also might be beyond my ability but I will try.

So here's where I am at right now. Loosely speaking (and using Xen
version numbers since I'm on the Xen list) what I have is a 4.9
production guest, and a 4.9 hot backup guest, and a 4.12 test guest.
All are running on separate, Dell-based, 4.12 hosts.

I don't have the luxury of stress-testing the production guest, but I
don't need to: It stalls every 3-5 days. When it stalls, it's either
during the day, in which my "test time" is limited, or it's during the
middle of the night, in which case my cognitive ability is limited.
:-) The next time it stalls, I'm going to do the sysrq things and
the xl list/vcpu-list and other stuff similar to it, and try to
capture it all. I'll then post it here.
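
(My crib sheet for that moment, so I don't fumble it at 3am - the
domain name is mine, adjust as needed:

xl list
xl vcpu-list guest1
xl sysrq guest1 l    # ask the guest kernel for CPU backtraces
xl sysrq guest1 t    # ask it for a task-state dump

...assuming the sysrq route still responds while it's wedged.)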

As a kind of informal fallback, today I downgraded the hot backup
guest's host to 4.9 again. The downgrade went fine; that guest is now
running on 4.9 all the way, so it "should" never stall again. As a
hot backup, its traffic is much lower, so it's only stalled once;
this was really just about seeing whether I could successfully
downgrade production if I needed to.

Today I also downgraded the test host to 4.10 (my only convenient
option <4.12). I then launched the guest and started the stress
testing. This next statement is interesting but otherwise useless:
Under the same amount of stress testing, the guest's CPU load average
is about half what it was (hovering around 2-3, whereas before it was
hovering around 6). This is for entertainment value only.

If Tomas' experience applies to me, this should mean that my test
guest will not stall anymore. I am going to let it run for 7 days
under this load, and report back either at that time, or sooner if it
stalls.

At that time, I'll also have a map for how to proceed. If the guest survives,
I'm going to roll my client forward to this configuration, because
they need to be on the new OS for a number of other reasons. So if
this proves to be "stable enough", we'll go forward.

If this guest does NOT survive, I'm going to downgrade the current
production host back to 4.9, putting us back to the place we were
before the trouble started.

Either way, I'll then be left with a pair of machines that are broken
(whichever pair my client "leaves behind"), and then I can start much
more aggressively testing everything you've asked me to - because my
client will be happy, and I will be able to (literally) sleep at
night.

In the meantime, if anything else happens, I will report, and if you
or anyone has other thoughts, please tell me. I definitely want to
get this resolved and fixed for everyone, to the extent I can help - I
just have to deal with the paying client first before I can turn back
to this completely.

(Test guest load average just touched 1.94 before coming back to 2.36
- no idea what this means but I have hope!)

Thank you thank you!

Glen

_______________________________________________
Xen-users mailing list
Xen-users@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-users
Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load [ In reply to ]
Hi!

On Fri, Feb 14, 2020 at 7:07 PM PGNet Dev <pgnet.dev@gmail.com> wrote:
> On 2/14/20 6:54 PM, Glen wrote:
> > That's what we upgraded to, and that's when this problem started, yes.
> > To be specific, I upgraded the Dom0 host to 15.1. The guest was still
> > at 42.3 (older version, huh) and started having issues at that point -
> > upgrading the guest to 15.1 did not solve it.
> just fyi, if not already old-news:
> opensuse's up-to-date pkgs,
> https://build.opensuse.org/project/show/Kernel:stable
> https://build.opensuse.org/project/show/Virtualization
> my own, that i 'monkey' with (these DO run on my own office/home systems; to date, reliably)
> https://build.opensuse.org/project/show/home:pgnd:Kernel:stable
> https://build.opensuse.org/project/show/home:pgnd:Virtualization:Xen
> https://build.opensuse.org/project/show/home:pgnd:Virtualization:qemu
> actual *production* pkgs for openSUSE servers are _based_ on results from that^, but are finalized, built & distributed only locally

Not "old news" at all ; incredibly helpful pointers I would otherwise
have had to search for and might have missed! THANK YOU for this!

Saving!

Glen

_______________________________________________
Xen-users mailing list
Xen-users@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-users
Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load [ In reply to ]
On Sat, Feb 15, 2020 at 3:58 AM Glen <glenbarney@gmail.com> wrote:

> Tomas -
>
> In your previous report, before I showed up, you wrote:
> > I've tried Xen 4.12 and the latest staging Xen 4.13, both behave the
> same. Doesn't matter if kernel 4.14 or 5.4 is used.
>
> Now PGNet Dev has said:
>
> On Fri, Feb 14, 2020 at 5:54 PM PGNet Dev <pgnet.dev@gmail.com> wrote:
> > When I switched to openSUSE Xen 4.13.0_04 packages with KernelStable
> (atm, 5.5.3-25.gd654690), Guests of all 'flavors' became *much* better
> behaved.
>
> They used a later kernel than you cited - I'm wondering about the
> relationship between their "Xen 4.13.0_04" and your "latest staging
> Xen 4.13".
>
> Any thoughts or insight there?
>
> Thanks!
> Glen
>


Glen,
I've used release 4.13.0 with all patches from the staging-4.13 branch:
https://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/staging-4.13

IIRC it was commit
https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=721f2c323ca55c77857c93e7275b4a93a0e15e1f
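
If anyone wants to build the same tree, roughly - clone xenbits, switch
branches, and build as usual:

git clone https://xenbits.xen.org/git-http/xen.git
cd xen
git checkout staging-4.13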

I'd prefer to stick to kernel 5.4 as it's the latest LTS. I don't really
know what "much better behaved" means. With xen 4.11 I have no issues with
Xen and anything newer is causing stalls, so in my case it's either all
good or bad on selected busy hosts :)

Tomas
Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load [ In reply to ]
On Fri, Feb 14, 2020 at 10:26 PM Tomas Mozes <hydrapolic@gmail.com> wrote:
>> They used a later kernel than you cited - I'm wondering about the
>> relationship between their "Xen 4.13.0_04" and your "latest staging
>> Xen 4.13".
> Glen,
> I've used release 4.13.0 with all patches from the staging-4.13 branch:
> https://xenbits.xen.org/gitweb/?p=xen.git;a=shortlog;h=refs/heads/staging-4.13
> IIRC it was commit https://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=721f2c323ca55c77857c93e7275b4a93a0e15e1f
> I'd prefer to stick to kernel 5.4 as it's the latest LTS. I don't really know what "much better behaved" means. With xen 4.11 I have no issues with Xen and anything newer is causing stalls, so in my case it's either all good or bad on selected busy hosts :)

Okay, Tomas, thank you so much!

My test guest has survived one day of stress testing under Xen 4.10 so
far, I'll report again soon.

Glen

_______________________________________________
Xen-users mailing list
Xen-users@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-users
Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load [ In reply to ]
Anecdotal informational item for entertainment value only:

I had said previously that I had loaded about 2.5TB of data onto my
guest machine, and was testing by trying to rsync (and then tar'ing)
all that data from the test guest to an external source. I was
running multiple jobs simultaneously to try to simulate the heavy load
that was clearly causing my production machines to stall.

Under 4.12, my guests would crumple after 24-36 hours of this type of loading.

Under 4.10, my guest has been up now for 68 hours and - after about 64
hours, the above transfers *completed*. This has never happened. No
guest under 4.12 has ever survived 4+ simultaneous transfers of 2.5TB
of data since I encountered this problem. They would all stall well
before the transfers could complete. In contrast, under 4.10, my same
guest ran at (subjectively) about half the load average, and the
transfers all completed, in their entirety, without any stalls.

I have now restarted the testing, at 12 simultaneous transfers instead
of 4, plus Sarah's iperf3 suggestion thrown into the mix as well.
After 30 minutes, my guest is still showing a noticeably lower load
average than it did under 4.12 with just 4 simultaneous transfers. I
will report on how this goes.

I understand that none of this is particularly objective data. I'm
sending it only in the hopes that it sparks something for someone
while we wait. If this guest continues to survive high stress testing
under 4.10 - and I'm starting to have hope that it will - I'm moving
my client over to it this weekend. Then, starting next week, I'll be
able to do the directed, specific testing Sarah and others suggested,
without any business-side time pressure.

But it seems clear to me that there really *is* a problem in 4.12,
where guests seem (still subjectively so far) to stall, and certainly
run slower, at a higher load (for some value of "slower" TBD), than
they do under earlier Xen versions. I agree that we need more data,
and I'm going to get it - and I hope that any others out there
experiencing this will chime in (I know Tomas is working this too)
because we need to bracket this well enough to file a useful bug
report so developers can get this fixed.

More to follow - thank you all for your ongoing attention and support.

Glen

_______________________________________________
Xen-users mailing list
Xen-users@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-users
Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load [ In reply to ]
On 2/17/20 9:37 AM, Glen wrote:
> Anecdotal informational item for entertainment value only:
>
> I had said previously that I had loaded about 2.5TB of data onto my
> guest machine, and was testing by trying to rsync (and then tar'ing)
> all that data from the test guest to an external source. I was
> running multiple jobs simultaneously to try to simulate the heavy load
> that was clearly causing my production machines to stall
>
> Under 4.12, my guests would crumple after 24-36 hours of this type of loading.
>
> Under 4.10, my guest has been up now for 68 hours and - after about 64
> hours, the above transfers *completed*. This has never happened. No
> guest under 4.12 has ever survived 4+ simultaneous transfers of 2.5TB
> of data since I encountered this problem. They would all stall well
> before the transfers could complete. In contrast, under 4.10, my same
> guest ran at (subjectively) about half the load average, and the
> transfers all completed, in their entirety, without any stalls.

4.10 + which kernel version, just to confirm?

_______________________________________________
Xen-users mailing list
Xen-users@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-users
Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load [ In reply to ]
Hi Sarah!

On Mon, Feb 17, 2020 at 9:58 AM Sarah Newman <srn@prgmr.com> wrote:
> On 2/17/20 9:37 AM, Glen wrote:
> > Under 4.12, my guests would crumple after 24-36 hours of this type of loading.
> > Under 4.10, my guest has been up now for 68 hours and - after about 64
> > hours, the above transfers *completed*. This has never happened. No
> > guest under 4.12 has ever survived 4+ simultaneous transfers of 2.5TB
> > of data since I encountered this problem. They would all stall well
> > before the transfers could complete. In contrast, under 4.10, my same
> > guest ran at (subjectively) about half the load average, and the
> > transfers all completed, in their entirety, without any stalls.

> 4.10 + which kernel version, just to confirm?

Host:
OpenSuse 15.0
Linux 4.12.14-lp150.12.82-default x86_64
Xen version 4.10.4_06-lp150.2.25

Guest: (unchanged)
OpenSuse 15.1
Linux 4.12.14-lp151.28.36-default x86_64

:-)

Glen

_______________________________________________
Xen-users mailing list
Xen-users@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-users
Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load [ In reply to ]
On Monday, February 17, 2020, Glen <glenbarney@gmail.com> wrote:
> Hi Sarah!
>
> On Mon, Feb 17, 2020 at 9:58 AM Sarah Newman <srn@prgmr.com> wrote:
>> On 2/17/20 9:37 AM, Glen wrote:
>> > Under 4.12, my guests would crumple after 24-36 hours of this type of loading.
>> > Under 4.10, my guest has been up now for 68 hours and - after about 64
>> > hours, the above transfers *completed*. This has never happened. No
>> > guest under 4.12 has ever survived 4+ simultaneous transfers of 2.5TB
>> > of data since I encountered this problem. They would all stall well
>> > before the transfers could complete. In contrast, under 4.10, my same
>> > guest ran at (subjectively) about half the load average, and the
>> > transfers all completed, in their entirety, without any stalls.
>
>> 4.10 + which kernel version, just to confirm?
>
> Host:
> OpenSuse 15.0
> Linux 4.12.14-lp150.12.82-default x86_64
> Xen version 4.10.4_06-lp150.2.25
>
> Guest: (unchanged)
> OpenSuse 15.1
> Linux 4.12.14-lp151.28.36-default x86_64
>
> :-)
>
> Glen
>
> _______________________________________________
> Xen-users mailing list
> Xen-users@lists.xenproject.org
> https://lists.xenproject.org/mailman/listinfo/xen-users

Just a quick note - no stall after switching to credit scheduler on xen
4.12 after 3 days.

Tomas
Re: Xen 4.12 DomU hang / freeze / stall under high network/disk load [ In reply to ]
On 2/17/20 10:33 AM, Tomas Mozes wrote:

> Just a quick note - no stall after switching to credit scheduler on xen
> 4.12 after 3 days.

That's great news. By 4.12 do you mean release 4.12.1, 4.12.2, or something else?

I'm assuming when "PGNet Dev" reported 4.12 being bad and 4.13 being good, they were using the default scheduler of credit 2.

It's worth asking on xen-devel if there's a known bug in the credit 2 scheduler that's been fixed. It looks like there were some significant changes
to the scheduling code in between Xen 4.12 and Xen 4.13, and if one was a fix I'm not sure it would have been recognized as being so.
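
For anyone who wants to compare schedulers directly, the choice is made on the Xen boot line (sched=credit vs the credit2 default in 4.12+). On an
openSUSE-style grub2 setup that's roughly the following - paths may differ on your distro:

# in /etc/default/grub:
GRUB_CMDLINE_XEN_DEFAULT="... sched=credit"
# then regenerate and reboot:
grub2-mkconfig -o /boot/grub2/grub.cfg

and "xl dmesg | grep -i scheduler" should confirm which one is active after boot.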

--Sarah

_______________________________________________
Xen-users mailing list
Xen-users@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-users
