
PCI passthrough HVM guest not working after domU restart
I am passing an NVIDIA A10 GPU through to a domU running in HVM mode.
nvidia-smi detects the GPU fine.

After I restart the VM, the GPU is no longer detected by nvidia-smi.

lspci output still includes the device even when nvidia-smi does not detect it.

Any ideas where I should start debugging this? What output should I include
here to assist? Are there options I should try on the host/dom0?

I've tried resetting a few things on the host (no difference):
echo 1 > /sys/bus/pci/devices/0000\:06\:00.0/remove
echo 1 > /sys/bus/pci/rescan
echo "0" > "/sys/bus/pci/devices/0000:06:00.0/power"
echo "1" > /sys/bus/pci/devices/0000\:06\:00.0/reset

I've tried Xen 4.17 and also 4.18.3; same behavior with each.

root@gpu:~# lspci | grep NVIDIA
00:05.0 3D controller: NVIDIA Corporation GA102GL [A10] (rev a1)
00:06.0 3D controller: NVIDIA Corporation GA102GL [A10] (rev a1)

domU dmesg:
[ 4.798048] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 525.105.17 Tue Mar 28 18:02:59 UTC 2023
[ 4.801411] cirrus 0000:00:03.0: [drm] drm_plane_enable_fb_damage_clips() not called
[ 4.976379] Console: switching to colour frame buffer device 128x48
[ 5.122610] cirrus 0000:00:03.0: [drm] fb0: cirrusdrmfb frame buffer device
[ 5.247082] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 525.105.17 Tue Mar 28 22:18:37 UTC 2023
[ 5.656447] [drm] [nvidia-drm] [GPU ID 0x00000005] Loading driver
[ 5.656451] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:05.0 on minor 1
[ 5.656570] [drm] [nvidia-drm] [GPU ID 0x00000006] Loading driver
[ 5.656572] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:06.0 on minor 2
[ 94.161919] nvidia 0000:00:05.0: firmware: direct-loading firmware nvidia/525.105.17/gsp_tu10x.bin
[ 94.328842] NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x62:0xffff:2351)
[ 94.329981] NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0
[ 94.348750] nvidia 0000:00:05.0: firmware: direct-loading firmware nvidia/525.105.17/gsp_tu10x.bin
[ 94.749557] NVRM: GPU 0000:00:05.0: RmInitAdapter failed! (0x62:0xffff:2351)
[ 94.750732] NVRM: GPU 0000:00:05.0: rm_init_adapter failed, device minor number 0

Thanks in advance for any guidance.
- Peter
Re: PCI passthrough HVM guest not working after domU restart
On 26 Jun 2023 03:05, Peter Gimmemail wrote:
> I am passing an NVIDIA A10 GPU through to a domU running in HVM mode.
> nvidia-smi detects the GPU fine.
>
> After I restart the VM, the GPU is no longer detected by nvidia-smi.
> [snip]

I'll first ask a few generic questions, then give my experience.
Note that I'm only an average user, nothing more ;)

Did you try "poweroff" followed by "xl create" instead of rebooting the domU?
I remember Xen 4.11 had a bug like that when passing through adapters listed
in the config file (the workaround was "on_reboot=destroy").
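
(For example, something like this in the domU config file; just a sketch,
adjust the pci= line to your actual setup:)

# pass the GPU through to the HVM guest
pci = [ '06:00.0' ]
# with that old bug, don't let xl reboot the guest in place:
# destroy it instead and start it again with "xl create"
on_reboot = "destroy"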

How do you pass the GPU through? With what options?
Is the vgaarb (VGA arbiter) stuff kicking in (check dom0/domU dmesg)?
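
(For instance, a quick check in both dom0 and the domU:)

# any VGA arbiter activity around the passed-through GPU?
dmesg | grep -i vgaarb
# in dom0: which driver currently holds the device
# (typically xen-pciback/pciback while it is assignable)
lspci -k -s 06:00.0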

Does your card have FLReset?
root@dom0 # lspci -vvv -s 06:00.0 | grep -i flr
(run that as root; a regular user account doesn't get that info)
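
(In that output, "FLReset+" on the DevCap line means the function supports
FLR, "FLReset-" means it doesn't. You can also check whether the kernel
exposes a reset method at all:)

# the reset node only exists if the kernel knows a way to reset this device
ls /sys/bus/pci/devices/0000:06:00.0/reset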

Do you get error messages when you "xl create" the domU (like
libxl__device_pci_reset)?
Do "xl dmesg" and the logs in /var/log/xen tell you anything useful?

Now for my experience; maybe it will help.

With my AMD GPU (no FLReset), I have the same problem: *sometimes* a
domU (re)boot does not pick up the GPU, usually following a domU crash.
I solve this by running the same commands as you do, but sometimes I
have to run them *several times* before it works again.

So:
echo 1 > /sys/bus/pci/devices/0000\:06\:00.0/remove
echo 1 > /sys/bus/pci/devices/0000\:06\:00.1/remove
echo 1 > /sys/bus/pci/rescan
(the .1 is the sound function, but I don't think you have one)

I've never used "power" and "reset".
BTW, shouldn't you run "rescan" *after* trying "power"/"reset"?

Sometimes the above commands are not enough, and I have to remove the
devices from the PCI assignable pool and then re-add them.

xl pci-assignable-remove 06:00.0
xl pci-assignable-remove 06:00.1
(run sysfs commands again)
xl pci-assignable-add 06:00.0
xl pci-assignable-add 06:00.1
(run sysfs commands again)
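
(Between those steps, "xl pci-assignable-list" shows whether the devices
are actually in or out of the pool:)

xl pci-assignable-list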

Sometimes I have to run pci-assignable-remove a few times before the
devices get out of the pool.
Sometimes I cannot even remove them from the pool (xl refuses), but even
then the GPU eventually gets passed through again.

Sometimes a few rounds of "start/stop the domU", "run the sysfs commands"
and "remove/add to the PCI pool" are needed.

What's strange in my case is that there's no "exact" solution to what
looks like the same problem; as you can see, I used "sometimes" a lot.
So, nothing scientific really, just trial and error!
It feels like I have no clue and am just "pushing all the buttons till it
works", but it has always worked like that ^^ I've never had to reboot
dom0 to re-assign the buggy GPU (yeah, I should read the logs and compare ...).
But as usual, YMMV!