Hello,
I would like to apologize in advance if I am sending this to the wrong
people.
We have a strange situation going on with a couple of our servers: we've
been experiencing issues with the combination of Debian + Xen + Samsung
NVMe.
Problem:
It all began with
https://serverfault.com/questions/1006366/samsung-nvme-disappears-when-server-on-average-to-high-load
Our situation is close to the one described there, with some differences.
*Now it can be reproduced.*
* OS: 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1
* CPUS: Intel(R) Xeon(R) CPU E5-1650 v4
* NVMe: Samsung MZ1LB1T9HALS-00007
* xen_version : 4.11.4-pre
* Server: Supermicro Super Server/X10SRW-F, BIOS 3.2
We've gathered some more information: it happens only when Xen is loaded.
The command that breaks everything is the one below, and it breaks things
fast; in this configuration it needs only about 20 seconds to hang the
whole system. I am attaching the call trace that occurs during the hang.
date; echo; fio --filename=/dev/nvme0n1 --direct=1 --rw=randread --bs=4k
--ioengine=libaio --iodepth=256 --runtime=345600 --numjobs=10
--time_based --group_reporting --name=iops-test-job --readonly
--output=fio_log.randread4k.log; date
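In case someone wants to reproduce this, a simple way to catch the call
trace before the machine hangs is to stream the kernel log from a second
terminal or the serial console (the log file name below is just an
example):
# stream kernel messages while fio is running
dmesg --follow | tee dmesg_during_fio.log
# or, equivalently, via the journal:
# journalctl -kf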
I have now run the test on one of the nodes booted /without/ Xen. Keep in
mind that all the servers are provisioned with Ansible and are configured
identically.
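For completeness, a rough sketch of how a Xen boot can be told apart from
a bare-metal boot on these nodes:
# should print "xen" when the kernel runs under the Xen hypervisor;
# the file is normally absent on a bare-metal boot
cat /sys/hypervisor/type
# on the dom0, the toolstack also reports the hypervisor version
xl info | grep xen_version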
What has been tried so far:
Setting the kernel option nvme_core.default_ps_max_latency_us to 5500/200,
as read in
https://wiki.archlinux.org/index.php/Solid_state_drive/NVMe#Samsung_drive_errors_on_Linux_4.10
and
https://askubuntu.com/questions/905710/ext4-fs-error-after-ubuntu-17-04-upgrade
Setting the kernel option nvme_core.force_apst=1, thus trying to force
APST, since the controller reports it as unsupported:
(nvme id-ctrl /dev/nvme0n1 | grep apst
apsta : 0)
* First try - no success.
* Forcing APST to Y - no success.
(A rough sketch of how the options were applied follows right after this
list.)
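For reference, this is roughly how such options are applied on a Debian
dom0; a sketch only, assuming the standard /etc/default/grub plus
update-grub route, with the values shown just as an illustration of what
was tried:
# /etc/default/grub on the dom0
GRUB_CMDLINE_LINUX="nvme_core.default_ps_max_latency_us=5500 nvme_core.force_apst=1"
# regenerate the grub config and reboot
update-grub
# after the reboot, the active values can be checked with:
# cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
# cat /sys/module/nvme_core/parameters/force_apst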
I have somewhat "overheated" on the subject by now and could possibly be
missing something important. Let me know if you need any more information.
NB: We began testing this cluster because it was showing really slow
disk-related operations (on the NVMe). For comparison, the other cluster
(the one mentioned on Serverfault) never showed any performance issues.
Best Regards,
--
Stanislav Ivanov
System Administrator
–––––––––––––––––––––––––
Abilix Soft LTD.
Support: +359 700 911 44
https://abscloud.eu