Mailing List Archive

xen domU stall on 4.12.1
Hello,
I've tried upgrading one of my long-running Xen dom0 machines from Xen
4.11.3 to 4.12.1. It worked fine for several days, but then one of the
domUs failed its monitoring checks and became unreachable via ssh. The
monitoring shows the load growing linearly (memory consumption grows too,
as new monitoring processes are spawned and never finish) and the machine
is simply stuck. This has happened 3 times during the last 3 weeks. The
dom0 itself is fine; it's always the same domU that gets stuck.

Xen was upgraded on 19.12.2019; the first lockup happened on 26.12.2019,
then on 28.12.2019 and 5.1.2020.

The domU kernel log is full of these messages:
Jan 5 13:19:20 kernel: [680493.141103] INFO: rcu_sched detected stalls on CPUs/tasks:
Jan 5 13:19:20 kernel: [680493.141107] (detected by 12, t=147012 jiffies, g=72555998, c=72555997, q=89937)
Jan 5 13:19:20 kernel: [680493.141112] All QSes seen, last rcu_sched kthread activity 147012 (4975178416-4975031404), jiffies_till_next_fqs=3, root ->qsmask 0x0
Jan 5 13:19:20 kernel: [680493.141114] php-fpm R running task 14024 17581 2249 0x00000000
Jan 5 13:19:20 kernel: [680493.141120] Call Trace:
Jan 5 13:19:20 kernel: [680493.141124] <IRQ>
Jan 5 13:19:20 kernel: [680493.141131] sched_show_task.cold+0xb4/0xcb
Jan 5 13:19:20 kernel: [680493.141135] rcu_check_callbacks.cold+0x36d/0x3ba
Jan 5 13:19:20 kernel: [680493.141138] update_process_times+0x24/0x60
Jan 5 13:19:20 kernel: [680493.141143] tick_sched_handle+0x30/0x50
Jan 5 13:19:20 kernel: [680493.141145] tick_sched_timer+0x30/0x70
Jan 5 13:19:20 kernel: [680493.141147] ? tick_sched_do_timer+0x40/0x40
Jan 5 13:19:20 kernel: [680493.141149] __hrtimer_run_queues+0xbc/0x1f0
Jan 5 13:19:20 kernel: [680493.141153] hrtimer_interrupt+0xa0/0x1d0
Jan 5 13:19:20 kernel: [680493.141158] xen_timer_interrupt+0x1e/0x30
Jan 5 13:19:20 kernel: [680493.141162] __handle_irq_event_percpu+0x3d/0x160
Jan 5 13:19:20 kernel: [680493.141164] handle_irq_event_percpu+0x1c/0x60
Jan 5 13:19:20 kernel: [680493.141168] handle_percpu_irq+0x32/0x50
Jan 5 13:19:20 kernel: [680493.141171] generic_handle_irq+0x1f/0x30
Jan 5 13:19:20 kernel: [680493.141175] __evtchn_fifo_handle_events+0x13f/0x150
Jan 5 13:19:20 kernel: [680493.141181] __xen_evtchn_do_upcall+0x53/0x90
Jan 5 13:19:20 kernel: [680493.141186] xen_evtchn_do_upcall+0x22/0x40
Jan 5 13:19:20 kernel: [680493.141191] xen_hvm_callback_vector+0x85/0x90
Jan 5 13:19:20 kernel: [680493.141192] </IRQ>
Jan 5 13:19:20 kernel: [680493.141194] RIP: 0033:0x56398dc8a959
Jan 5 13:19:20 kernel: [680493.141195] RSP: 002b:00007ffdd588d3d0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff0c
Jan 5 13:19:20 kernel: [680493.141197] RAX: 0000000000000060 RBX: 00007f6ea3aa02e0 RCX: 0000000000000000
Jan 5 13:19:20 kernel: [680493.141198] RDX: 00007f6ea3aa02a0 RSI: 00007ffdd588d3d8 RDI: 00007ffdd588d3e0
Jan 5 13:19:20 kernel: [680493.141199] RBP: 00007f6ea3a9b5b0 R08: 00007f6ea49be770 R09: 00007f6ea483cdc0
Jan 5 13:19:20 kernel: [680493.141200] R10: 00007f6eae520a40 R11: 00007f6eae4933c0 R12: 00005639902892a0
Jan 5 13:19:20 kernel: [680493.141201] R13: 0000000000000000 R14: 00007f6eae41e930 R15: 00007f6ea7077138
Jan 5 13:19:20 kernel: [680493.141204] rcu_sched kthread starved for 147012 jiffies! g72555998 c72555997 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x200 ->cpu=3
Jan 5 13:19:20 kernel: [680493.141205] rcu_sched R15016 8 2 0x80000000
Jan 5 13:19:20 kernel: [680493.141210] Call Trace:
Jan 5 13:19:20 kernel: [680493.141215] ? __schedule+0x24e/0x710
Jan 5 13:19:20 kernel: [680493.141216] schedule+0x2d/0x80
Jan 5 13:19:20 kernel: [680493.141219] schedule_timeout+0x16c/0x340
Jan 5 13:19:20 kernel: [680493.141221] ? call_timer_fn+0x130/0x130
Jan 5 13:19:20 kernel: [680493.141222] rcu_gp_kthread+0x486/0xd60
Jan 5 13:19:20 kernel: [680493.141224] kthread+0xfd/0x130
Jan 5 13:19:20 kernel: [680493.141226] ? force_qs_rnp+0x170/0x170
Jan 5 13:19:20 kernel: [680493.141227] ? __kthread_parkme+0x90/0x90
Jan 5 13:19:20 kernel: [680493.141228] ret_from_fork+0x35/0x40
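
For reference, these are the kinds of commands I can run from dom0 to capture
more state the next time the guest wedges (a rough sketch; the debug-key output
goes to the hypervisor console ring and the exact keys/format may differ
between Xen versions, and "machine" is simply the domU name from the config
below):

# See whether the guest's vCPUs are blocked, offline or still marked running
xl vcpu-list machine

# Ask the hypervisor to dump its run queues (r) and domain/vCPU info (q),
# then read the result back from the console ring
xl debug-keys r
xl debug-keys q
xl dmesg | tail -n 200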

The Xen dom0 is running kernel 4.14.158 with these Xen command-line options:
GRUB_CMDLINE_XEN="dom0_mem=4G gnttab_max_frames=256 ucode=scan loglvl=all
guest_loglvl=all console_to_ring console_timestamps=date conring_size=1m
smt=true iommu=no-intremap"

Xen-domU config:
name = "machine"
kernel = "kernel-4.14.159-gentoo-xen"
memory = 10000
vcpus = 16
vif = [ '' ]
disk = [
'...root,raw,xvda,rw',
'...opt,raw,xvdc,rw',
'...home,raw,xvdb,rw',
'...tmp,raw,xvdd,rw',
'...var,raw,xvde,rw',
]
extra = "root=/dev/xvda net.ifnames=0 console=ttyS0 console=ttyS0,38400n8"
type = "hvm"
sdl = 0
vnc = 0
serial='pty'
xen_platform_pci=1
max_grant_frames = 256

I've had issues like this in the past with the grant frames (basically this
issue https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=880554), maybe some
other value needs to be raised too?
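
For reference, a rough way to check from dom0 whether the guest is anywhere
near its grant frame limit (assuming the 'g' debug key, which dumps grant
table usage to the hypervisor console; the output format varies between Xen
versions):

# Dump grant table usage for all domains, then read it back
xl debug-keys g
xl dmesg | tail -n 100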

Thanks,
Tomas
Re: xen domU stall on 4.12.1
On Tue, Jan 7, 2020 at 8:29 AM Tomas Mozes <hydrapolic@gmail.com> wrote:

> [...]

Reproduced on another HP machine. The previous one was an HP ProLiant DL360
G7 with 2x Intel Xeon E5620 @ 2.40GHz; the other is an HP ProLiant DL360p
Gen8 with 2x Intel Xeon CPU E5-2630 @ 2.30GHz.

The strange thing is that this does not happen on our testing machine (which
of course has a lower load) - that's a Supermicro X10DRW with 1x Intel Xeon
CPU E5-2620 v3 @ 2.40GHz.
Re: xen domU stall on 4.12.1
On Mon, Jan 27, 2020 at 2:42 PM Tomas Mozes <hydrapolic@gmail.com> wrote:

> [...]


Just an update: I've tried Xen 4.12 and the latest staging Xen 4.13, and both
behave the same, regardless of whether kernel 4.14 or 5.4 is used. As soon as
the Xen version is reverted to 4.11, everything works just fine; nothing else
needs to be changed.

I've tried adding "mitigations=off" to the kernel options and "spec-ctrl=false
xpti=false pv-l1tf=false tsx=true" to the Xen options, but that didn't help
either.
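
For reference, a rough way to double-check what the hypervisor actually did
with those options (Xen prints a mitigation summary at boot; the exact wording
differs between versions):

# Show the speculative mitigation summary from the hypervisor boot log
xl dmesg | grep -i -A 10 "speculative"

# Confirm the command line the hypervisor was booted with
xl info | grep xen_commandline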

Thanks,
Tomas
Re: xen domU stall on 4.12.1
On Thu, Feb 6, 2020 at 8:06 AM Tomas Mozes <hydrapolic@gmail.com> wrote:

> [...]


As reported in
https://lists.xenproject.org/archives/html/xen-devel/2020-01/msg00361.html
and
https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00042.html,
switching back to the credit1 scheduler seems to make it work again. I've
migrated 6 machines to Xen 4.12 with the sched=credit Xen option and haven't
observed a hang for more than a week now.
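
For reference, this is roughly what the change looks like on these hosts (a
sketch; the existing Xen options stay as shown earlier, and the cpupool
listing is just one way to confirm the active scheduler after the reboot -
its column layout may differ between versions):

# /etc/default/grub - append sched=credit to the existing Xen options
GRUB_CMDLINE_XEN="dom0_mem=4G gnttab_max_frames=256 ... sched=credit"

# After rebooting into the new hypervisor, the default cpupool should
# report "credit" instead of "credit2" in its Sched column
xl cpupool-list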

Thanks,
Tomas
Re: xen domU stall on 4.12.1
On Sun, Feb 23, 2020 at 11:12 PM Tomas Mozes <hydrapolic@gmail.com> wrote:
> As reported in https://lists.xenproject.org/archives/html/xen-devel/2020-01/msg00361.html and https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00042.html, switching back to the credit1 scheduler seems to make it work again. I've migrated 6 machines to Xen 4.12 with the sched=credit Xen option and haven't observed a hang for more than a week now.

My experience is the same. I have migrated all 16 of my physical
hosts back to openSUSE 15.1 with Xen 4.12.1 and sched=credit. All
guests are now running perfectly, without any issues at all. Over
this past week I performed directed stress-testing against several of
my guests, and they all survived without any problems. I've now
completed my migration to the new guests, and everyone is happy.

I'm now going to bring one of the previously-live guests, on its own
host, back to credit2 so I can crash it and try to capture debugging
output for xen-devel as requested. But sched=credit is definitely
what we needed to solve this problem! Thank you all for helping us
get there!

Glen

Re: xen domU stall on 4.12.1
On Mon, Feb 24, 2020 at 4:55 PM Glen <glenbarney@gmail.com> wrote:

> [...]

Thank you too for your report. I hope we'll find the reason why credit2
misbehaves.

Tomas
Re: xen domU stall on 4.12.1
On Mon, Feb 24, 2020 at 6:02 PM Tomas Mozes <hydrapolic@gmail.com> wrote:

> [...]

I just tested Xen 4.12.3, but a domU hung again with credit2. It works rock
solid with credit1.

Tomas
Re: xen domU stall on 4.12.1
Tomas -

On Tue, Jun 2, 2020 at 7:43 PM Tomas Mozes <hydrapolic@gmail.com> wrote:
>> On Mon, Feb 24, 2020 at 4:55 PM Glen <glenbarney@gmail.com> wrote:
>>> [...]
> I just tested Xen 4.12.3, but a domU hung again with credit2. It works rock solid with credit1.

I have several hosts back on credit2 with no problems so far. But the
bulk of my production hosts are still on credit1, and they do seem to
run "better" subjectively (looking at responsiveness and load
averages), though by "subjectively" I mean that I have no real data
to back up this feeling.

I was hoping one of my domUs on credit2 would crash so I could grab
debugging info for the development team - I hope you are/were able to
grab and submit data on that crash?

Glen
Re: xen domU stall on 4.12.1
On Wed, Jun 3, 2020 at 5:30 PM Glen <glenbarney@gmail.com> wrote:

> [...]
> I was hoping one of my domUs on credit2 would crash so I could grab
> debugging info for the development team - I hope you are/were able to
> grab and submit data on that crash?
>
> Glen
>

Unfortunately no - it was one of my production hosts, so I wanted to get it
back working as quickly as possible.

Tomas