Hello,
I would like to apologize in advance if I am sending this to the wrong
people.
We have a strange situation going on with a couple of our servers: we've
been experiencing issues with the combination of Debian + Xen + Samsung
NVMe.
Problem:
It all began with
https://serverfault.com/questions/1006366/samsung-nvme-disappears-when-server-on-average-to-high-load
Our situation is close to the one described there, with some differences.
*Now it can be reproduced.*
* OS: 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1
* CPUS: Intel(R) Xeon(R) CPU E5-1650 v4
* NVMe: Samsung MZ1LB1T9HALS-00007
* xen_version : 4.11.4-pre
* Server: Supermicro Super Server/X10SRW-F, BIOS 3.2
We've gathered some more information: it happens only when Xen is loaded.
The command that breaks everything is the one below, and it breaks things
fast; in this configuration it needs only about 20 seconds to hang the
whole system. I am attaching the call trace that occurs during the hang.
date; echo; fio --filename=/dev/nvme0n1 --direct=1 --rw=randread --bs=4k
--ioengine=libaio --iodepth=256 --runtime=345600 --numjobs=10
--time_based --group_reporting --name=iops-test-job --readonly
--output=fio_log.randread4k.log; date
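In case someone wants to reproduce this, a simple way to catch the call
trace before the machine hangs is to stream the kernel log from a second
terminal or the serial console (the log file name below is just an
example):
# stream kernel messages while fio is running
dmesg --follow | tee dmesg_during_fio.log
# or, equivalently, via the journal:
# journalctl -kf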
I have now run the test on one of the nodes booted /without/ Xen. Keep in
mind that all the servers are provisioned with Ansible and are configured
identically.
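For completeness, a rough sketch of how a Xen boot can be told apart from
a bare-metal boot on these nodes:
# should print "xen" when the kernel runs under the Xen hypervisor;
# the file is normally absent on a bare-metal boot
cat /sys/hypervisor/type
# on the dom0, the toolstack also reports the hypervisor version
xl info | grep xen_version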
What has been tried so far:
Setting the kernel option nvme_core.default_ps_max_latency_us to 5500/200,
as read in
https://wiki.archlinux.org/index.php/Solid_state_drive/NVMe#Samsung_drive_errors_on_Linux_4.10
and
https://askubuntu.com/questions/905710/ext4-fs-error-after-ubuntu-17-04-upgrade
Setting the kernel option nvme_core.force_apst=1, thus trying to force
APST, since the controller reports it as unsupported:
(nvme id-ctrl /dev/nvme0n1 | grep apst
apsta : 0)
* First try - no success.
* Forcing APST to Y - no success.
(A rough sketch of how the options were applied follows right after this
list.)
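For reference, this is roughly how such options are applied on a Debian
dom0; a sketch only, assuming the standard /etc/default/grub plus
update-grub route, with the values shown just as an illustration of what
was tried:
# /etc/default/grub on the dom0
GRUB_CMDLINE_LINUX="nvme_core.default_ps_max_latency_us=5500 nvme_core.force_apst=1"
# regenerate the grub config and reboot
update-grub
# after the reboot, the active values can be checked with:
# cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
# cat /sys/module/nvme_core/parameters/force_apst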
I have somewhat "overheated" on the subject by now and could possibly be
missing something important. Let me know if you need any more information.
NB: We began testing this cluster because it was showing really slow
disk-related operations (on the NVMe). For comparison, the other cluster
(the one mentioned on Serverfault) never showed any performance issues.
Best Regards,
--
Stanislav Ivanov
System Administrator
–––––––––––––––––––––––––
Abilix Soft LTD.
Support: +359 700 911 44
https://abscloud.eu