Mailing List Archive

pci-passthrough loses msi-x interrupts ability after domain destroy
Hi Xen-devel,

I'm using PCI pass-through to map a PCIe (Intel i210) controller into
an HVM domain. The system uses xen-pciback to hide the appropriate PCI
device from Dom0.

When the HVM domain is created after a hypervisor cold boot, it can
access and use the PCIe controller without problems.

However, if the HVM domain is destroyed and then restarted, it can no
longer use the pass-through PCI device. The PCI device is seen and can
be mapped, but its interrupts are no longer delivered to the HVM domain
(this is visible under a Linux guest as /proc/interrupts counters that
remain at 0). The behavior is the same on a Windows 10 guest.

A few interesting hints I noticed:

- On Dom0, comparing the 'lspci -vv' output for that PCIe device between
the "working" and the "muted interrupts" states shows a difference in
the MSI-X caps:

- Capabilities: [70] MSI-X: Enable- Count=5 Masked- <-- IRQs will work if domain started
+ Capabilities: [70] MSI-X: Enable- Count=5 Masked+ <-- IRQs won't work if domain started
                                            ^^^^^^^

- When the HVM OS is Linux, rmmod'ing the i210 (igb) driver from
inside that domain before destroying the domain provides a way to
keep the device working during the next destroy/create cycle: in the
lspci view above, the MSI-X caps will not appear as 'Masked+' if the
driver was unloaded prior to destroy.

- However, if the domain was destroyed without that precaution, I
found no way to bring it back to a working state.
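
A minimal sketch of how the MSI-X line from the 'lspci -vv' output above could be checked automatically between destroy/create cycles. The function name parse_msix_state is hypothetical, and the regular expression assumes the line layout shown in this report; other lspci versions may print it differently.

```python
import re

# Matches e.g. "Capabilities: [70] MSI-X: Enable- Count=5 Masked+"
MSIX_RE = re.compile(
    r"MSI-X:\s+Enable(?P<enable>[+-])\s+Count=(?P<count>\d+)\s+Masked(?P<masked>[+-])"
)


def parse_msix_state(lspci_output):
    """Return (enabled, count, masked) for the first MSI-X line, or None."""
    match = MSIX_RE.search(lspci_output)
    if match is None:
        return None
    return (
        match.group("enable") == "+",
        int(match.group("count")),
        match.group("masked") == "+",
    )


sample = "Capabilities: [70] MSI-X: Enable- Count=5 Masked+"
print(parse_msix_state(sample))  # (False, 5, True)
```

The text to parse could come from running 'lspci -vv -s 07:00.0' in Dom0 after the guest has been destroyed; a masked value of True corresponds to the "muted interrupts" state described above.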

I tried a few methods without success:

- Removing / rescanning the device from the PCI bus in Dom0.
- echo 1 >reset in the device's PCI sysfs
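
For reference, the two attempts listed above correspond to writes to standard Linux PCI sysfs attributes. The sketch below only restates them in Python; the helper names and the sysfs_root parameter are hypothetical and exist only to make the paths explicit. The 'reset' attribute is present only when the kernel knows a reset method for the device, and all writes require root.

```python
from pathlib import Path


def write_sysfs(path, value):
    """Write a short string to a sysfs attribute."""
    Path(path).write_text(value)


def reset_function(bdf, sysfs_root="/sys"):
    """Same effect as `echo 1 > reset` in the device's sysfs directory."""
    write_sysfs(Path(sysfs_root) / "bus/pci/devices" / bdf / "reset", "1")


def remove_and_rescan(bdf, sysfs_root="/sys"):
    """Remove the device from the PCI bus, then ask the kernel to rescan."""
    pci = Path(sysfs_root) / "bus/pci"
    write_sysfs(pci / "devices" / bdf / "remove", "1")
    write_sysfs(pci / "rescan", "1")
```

With the device from this report, the calls would be reset_function("0000:07:00.0") and remove_and_rescan("0000:07:00.0"). As noted above, neither cleared the Masked+ state here.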

Am I missing something, or is there something I can try to
troubleshoot this? Any hint will be helpful.

Best,
Jerome



Setup uses the following:

- Xen 4.8.1
- Linux 4.8 [ xen-pciback.hide=(07:00.0) ]
- iommu is enabled on Core i5-5350U

- The domain config file:

---snip---
builder = 'hvm'
memory = 4096
vcpus = 2
name = "LiveCD"

disk = [ 'file:/data/ubuntu.iso,xvdc:cdrom,r', 'format=raw, vdev=hdb, access=rw, backendtype=qdisk, target=/dev/sda5' ]
boot = "c"
acpi = 1
device_model_version = "qemu-xen"
sdl = 0
vnc = 1
vnclisten = '10.0.0.1:0'

# i210 pass-through
pci = ['07:00.0']

usb = 1
usbdevice = ['tablet']
-----snap------

- xl dmesg (loglvl=debug):

(XEN) Xen version 4.8.1 (@) (gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3) debug=n Tue Sep 19 17:22:36 UTC 2017
(XEN) Latest ChangeSet:
(XEN) Bootloader: GRUB 2.00
(XEN) Command line: loglvl=debug dom0_mem=4096M,max:4096M dom0_max_vcpus=2
(XEN) Video information:
(XEN) VGA is text mode 80x25, font 8x16
(XEN) VBE/DDC methods: V2; EDID transfer time: 1 seconds
(XEN) Disc information:
(XEN) Found 2 MBR signatures
(XEN) Found 3 EDD information structures
(XEN) Xen-e820 RAM map:
(XEN) 0000000000000000 - 000000000009d800 (usable)
(XEN) 000000000009d800 - 00000000000a0000 (reserved)
(XEN) 00000000000e0000 - 0000000000100000 (reserved)
(XEN) 0000000000100000 - 00000000d80bb000 (usable)
(XEN) 00000000d80bb000 - 00000000d83f9000 (reserved)
(XEN) 00000000d83f9000 - 00000000dc364000 (usable)
(XEN) 00000000dc364000 - 00000000dc3c4000 (reserved)
(XEN) 00000000dc3c4000 - 00000000dc5b4000 (usable)
(XEN) 00000000dc5b4000 - 00000000dcd39000 (ACPI NVS)
(XEN) 00000000dcd39000 - 00000000dcfff000 (reserved)
(XEN) 00000000dcfff000 - 00000000dd000000 (usable)
(XEN) 00000000dd800000 - 00000000e0000000 (reserved)
(XEN) 00000000f8000000 - 00000000fc000000 (reserved)
(XEN) 00000000fec00000 - 00000000fec01000 (reserved)
(XEN) 00000000fed00000 - 00000000fed04000 (reserved)
(XEN) 00000000fed1c000 - 00000000fed20000 (reserved)
(XEN) 00000000fee00000 - 00000000fee01000 (reserved)
(XEN) 00000000ff000000 - 0000000100000000 (reserved)
(XEN) 0000000100000000 - 000000041e000000 (usable)
(XEN) ACPI: RSDP 000F0580, 0024 (r2 ALASKA)
(XEN) ACPI: XSDT DCCFA090, 00A4 (r1 ALASKA A M I 1072009 AMI 10013)
(XEN) ACPI: FACP DCD10478, 010C (r5 ALASKA A M I 1072009 AMI 10013)
(XEN) ACPI: DSDT DCCFA1D0, 162A8 (r2 ALASKA A M I 1072009 INTL 20120913)
(XEN) ACPI: FACS DCD37F80, 0040
(XEN) ACPI: APIC DCD10588, 0084 (r3 ALASKA A M I 1072009 AMI 10013)
(XEN) ACPI: FPDT DCD10610, 0044 (r1 ALASKA A M I 1072009 AMI 10013)
(XEN) ACPI: FIDT DCD10658, 009C (r1 ALASKA A M I 1072009 AMI 10013)
(XEN) ACPI: MCFG DCD106F8, 003C (r1 ALASKA A M I 1072009 MSFT 97)
(XEN) ACPI: HPET DCD10738, 0038 (r1 ALASKA A M I 1072009 AMI. 5)
(XEN) ACPI: SSDT DCD10770, 0315 (r1 SataRe SataTabl 1000 INTL 20120913)
(XEN) ACPI: UEFI DCD10A88, 0042 (r1 0 0)
(XEN) ACPI: SSDT DCD10AD0, 08F4 (r2 Ther_R Ther_Rvp 1000 INTL 20120913)
(XEN) ACPI: ASF! DCD113C8, 00A0 (r32 INTEL HCG 1 TFSM F4240)
(XEN) ACPI: TCPA DCD11468, 0032 (r2 ALASKA NAPAASF 1 MSFT 1000013)
(XEN) ACPI: SSDT DCD114A0, 0518 (r2 PmRef Cpu0Ist 3000 INTL 20120913)
(XEN) ACPI: SSDT DCD119B8, 0B74 (r2 CpuRef CpuSsdt 3000 INTL 20120913)
(XEN) ACPI: SSDT DCD12530, 5CF6 (r2 SaSsdt SaSsdt 3000 INTL 20120913)
(XEN) ACPI: DMAR DCD18228, 00F8 (r1 INTEL BDW 1 INTL 1)
(XEN) ACPI: CSRT DCD18320, 00C4 (r1 INTL BDW-ULT 1 INTL 20100528)
(XEN) System RAM: 16289MB (16680656kB)
(XEN) No NUMA configuration found
(XEN) Faking a node at 0000000000000000-000000041e000000
(XEN) Domain heap initialised
(XEN) CPU Vendor: Intel, Family 6 (0x6), Model 61 (0x3d), Stepping 4 (raw 000306d4)
(XEN) found SMP MP-table at 000fd8e0
(XEN) DMI 2.8 present.
(XEN) Using APIC driver default
(XEN) ACPI: PM-Timer IO Port: 0x1808 (32 bits)
(XEN) ACPI: v5 SLEEP INFO: control[0:0], status[0:0]
(XEN) ACPI: SLEEP INFO: pm1x_cnt[1:1804,1:0], pm1x_evt[1:1800,1:0]
(XEN) ACPI: 32/64X FACS address mismatch in FADT - dcd37f80/0000000000000000, using 32
(XEN) ACPI: wakeup_vec[dcd37f8c], vec_size[20]
(XEN) ACPI: Local APIC address 0xfee00000
(XEN) ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x03] lapic_id[0x01] enabled)
(XEN) ACPI: LAPIC (acpi_id[0x04] lapic_id[0x03] enabled)
(XEN) ACPI: LAPIC_NMI (acpi_id[0x01] dfl res lint[0x44])
(XEN) ACPI: NMI not connected to LINT 1!
(XEN) ACPI: LAPIC_NMI (acpi_id[0x02] dfl dfl lint[0])
(XEN) ACPI: NMI not connected to LINT 1!
(XEN) ACPI: LAPIC_NMI (acpi_id[0x03] low dfl lint[0xc3])
(XEN) ACPI: NMI not connected to LINT 1!
(XEN) ACPI: LAPIC_NMI (acpi_id[0x04] dfl res lint[0x8])
(XEN) ACPI: NMI not connected to LINT 1!
(XEN) ACPI: IOAPIC (id[0x02] address[0xfec00000] gsi_base[0])
(XEN) IOAPIC[0]: apic_id 2, version 32, address 0xfec00000, GSI 0-39
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
(XEN) ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
(XEN) ACPI: IRQ0 used by override.
(XEN) ACPI: IRQ2 used by override.
(XEN) ACPI: IRQ9 used by override.
(XEN) Enabling APIC mode: Flat. Using 1 I/O APICs
(XEN) ACPI: HPET id: 0x8086a701 base: 0xfed00000
(XEN) ERST table was not found
(XEN) Using ACPI (MADT) for SMP configuration information
(XEN) SMP: Allowing 4 CPUs (0 hotplug CPUs)
(XEN) IRQ limits: 40 GSI, 744 MSI/MSI-X
(XEN) Not enabling x2APIC (upon firmware request)
(XEN) xstate: size: 0x340 and states: 0x7
(XEN) Thermal monitoring handled by SMI
(XEN) Intel machine check reporting enabled
(XEN) Using scheduler: SMP Credit Scheduler (credit)
(XEN) Platform timer is 14.318MHz HPET
(XEN) Detected 1795.844 MHz processor.
(XEN) Initing memory sharing.
(XEN) alt table ffff82d0802bef60 -> ffff82d0802c06a0
(XEN) spurious 8259A interrupt: IRQ7.
(XEN) PCI: MCFG configuration 0: base f8000000 segment 0000 buses 00 - 3f
(XEN) PCI: MCFG area at f8000000 reserved in E820
(XEN) PCI: Using MCFG for segment 0000 bus 00-3f
(XEN) Intel VT-d iommu 0 supported page sizes: 4kB, 2MB, 1GB.
(XEN) Intel VT-d iommu 1 supported page sizes: 4kB, 2MB, 1GB.
(XEN) Intel VT-d Snoop Control not enabled.
(XEN) Intel VT-d Dom0 DMA Passthrough not enabled.
(XEN) Intel VT-d Queued Invalidation enabled.
(XEN) Intel VT-d Interrupt Remapping enabled.
(XEN) Intel VT-d Posted Interrupt not enabled.
(XEN) Intel VT-d Shared EPT tables enabled.
(XEN) I/O virtualisation enabled
(XEN) - Dom0 mode: Relaxed
(XEN) Interrupt remapping enabled
(XEN) nr_sockets: 1
(XEN) Enabled directed EOI with ioapic_ack_old on!
(XEN) ENABLING IO-APIC IRQs
(XEN) -> Using old ACK method
(XEN) ..TIMER: vector=0xF0 apic1=0 pin1=2 apic2=0 pin2=0
(XEN) TSC deadline timer enabled
(XEN) Allocated console ring of 32 KiB.
(XEN) mwait-idle: MWAIT substates: 0x11142120
(XEN) mwait-idle: v0.4.1 model 0x3d
(XEN) mwait-idle: lapic_timer_reliable_states 0xffffffff
(XEN) mwait-idle: max C-state count of 8 reached
(XEN) VMX: Supported advanced features:
(XEN) - APIC MMIO access virtualisation
(XEN) - APIC TPR shadow
(XEN) - Extended Page Tables (EPT)
(XEN) - Virtual-Processor Identifiers (VPID)
(XEN) - Virtual NMI
(XEN) - MSR direct-access bitmap
(XEN) - Unrestricted Guest
(XEN) - VMCS shadowing
(XEN) - VM Functions
(XEN) - Virtualisation Exceptions
(XEN) HVM: ASIDs enabled.
(XEN) HVM: VMX enabled
(XEN) HVM: Hardware Assisted Paging (HAP) detected
(XEN) HVM: HAP page sizes: 4kB, 2MB, 1GB
(XEN) [VT-D]INTR-REMAP: Request device [0000:f0:1f.0] fault index 0, iommu reg = ffff82c000203000
(XEN) [VT-D]INTR-REMAP: reason 25 - Blocked a compatibility format interrupt request
(XEN) mwait-idle: max C-state count of 8 reached
(XEN) mwait-idle: max C-state count of 8 reached
(XEN) mwait-idle: max C-state count of 8 reached
(XEN) Brought up 4 CPUs
(XEN) build-id: e119032a1c69cee07ab82491a5eab6892747eac4
(XEN) ACPI sleep modes: S3
(XEN) VPMU: disabled
(XEN) mcheck_poll: Machine check polling timer started.
(XEN) Dom0 has maximum 424 PIRQs
(XEN) NX (Execute Disable) protection active
(XEN) *** LOADING DOMAIN 0 ***
(XEN) Xen kernel: 64-bit, lsb, compat32
(XEN) Dom0 kernel: 64-bit, PAE, lsb, paddr 0x1000000 -> 0x1c63000
(XEN) PHYSICAL MEMORY ARRANGEMENT:
(XEN) Dom0 alloc.: 000000040e000000->0000000410000000 (1040384 pages to be allocated)
(XEN) VIRTUAL MEMORY ARRANGEMENT:
(XEN) Loaded kernel: ffffffff81000000->ffffffff81c63000
(XEN) Init. ramdisk: 0000000000000000->0000000000000000
(XEN) Phys-Mach map: 0000008000000000->0000008000800000
(XEN) Start info: ffffffff81c63000->ffffffff81c634b4
(XEN) Page tables: ffffffff81c64000->ffffffff81c77000
(XEN) Boot stack: ffffffff81c77000->ffffffff81c78000
(XEN) TOTAL: ffffffff80000000->ffffffff82000000
(XEN) ENTRY ADDRESS: ffffffff8189c180
(XEN) Dom0 has maximum 2 VCPUs
(XEN) Bogus DMIBAR 0xfed18001 on 0000:00:00.0
(XEN) Scrubbing Free RAM on 1 nodes using 2 CPUs
(XEN) ..................................................................done.
(XEN) Initial low memory virq threshold set at 0x4000 pages.
(XEN) Std. Loglevel: All
(XEN) Guest Loglevel: Nothing (Rate-limited: Errors and warnings)
(XEN) Xen is relinquishing VGA console.
(XEN) *** Serial input -> DOM0 (type 'CTRL-a' three times to switch input to Xen)
(XEN) Freed 316kB init memory
(XEN) Bogus DMIBAR 0xfed18001 on 0000:00:00.0
(XEN) PCI add device 0000:00:00.0
(XEN) PCI add device 0000:00:02.0
(XEN) PCI add device 0000:00:03.0
(XEN) PCI add device 0000:00:14.0
(XEN) PCI add device 0000:00:16.0
(XEN) PCI add device 0000:00:19.0
(XEN) PCI add device 0000:00:1b.0
(XEN) PCI add device 0000:00:1c.0
(XEN) PCI add device 0000:00:1c.1
(XEN) PCI add device 0000:00:1c.2
(XEN) PCI add device 0000:00:1c.3
(XEN) PCI add device 0000:00:1d.0
(XEN) PCI add device 0000:00:1f.0
(XEN) PCI add device 0000:00:1f.2
(XEN) PCI add device 0000:00:1f.3
(XEN) PCI add device 0000:01:00.0
(XEN) PCI add device 0000:02:01.0
(XEN) PCI add device 0000:02:02.0
(XEN) PCI add device 0000:02:03.0
(XEN) PCI add device 0000:04:00.0
(XEN) PCI add device 0000:05:00.0
(XEN) PCI add device 0000:06:00.0
(XEN) PCI add device 0000:07:00.0


- Device's lspci -vv:

07:00.0 Class 0200: Device 8086:1537 (rev 03)
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 18
Region 0: Memory at f5c00000 (32-bit, non-prefetchable) [disabled] [size=512K]
Region 2: I/O ports at c000 [disabled] [size=32]
Region 3: Memory at f5c80000 (32-bit, non-prefetchable) [disabled] [size=16K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Address: 0000000000000000 Data: 0000
Masking: 00000000 Pending: 00000000
Capabilities: [70] MSI-X: Enable- Count=5 Masked+
Vector table: BAR=3 offset=00000000
PBA: BAR=3 offset=00002000
Capabilities: [a0] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <2us, L1 <16us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [140 v1] Device Serial Number 00-50-d2-ff-ff-10-34-b6
Capabilities: [1a0 v1] Transaction Processing Hints
Device specific mode supported
Steering table in TPH capability structure
Kernel driver in use: pciback

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
Re: pci-passthrough loses msi-x interrupts ability after domain destroy
On Wed, Sep 20, 2017 at 03:50:35PM -0400, Jérôme Oufella wrote:
> Hi Xen-devel,
>
> I'm using PCI pass-through to map a PCIe (intel i210) controller into
> a HVM domain. The system uses xen-pciback to hide the appropriate PCI
> device from Dom0.
>
> When creating the HVM domain after an hypervisor cold boot, the HVM
> domain can access and use the PCIe controller without problem.
>
> However, if the HVM domain is destroyed then restarted, it won't be
> able to use the pass-through PCI device anymore. The PCI device is
> seen and can be mapped, however, the interrupts will not be passed to
> the HVM domain anymore (this is visible under a Linux guest as
> /proc/interrupts counters remain 0). The behavior on a Windows10 guest
> is the same.
>
> A few interesting hints I noticed:
>
> - On Dom0, 'lspci -vv' on that PCIe device between the "working" and
> the "muted interrupts" states, I noted a difference between the
> MSI-X caps:
>
> - Capabilities: [70] MSI-X: Enable- Count=5 Masked- <-- IRQs will work if domain started
> + Capabilities: [70] MSI-X: Enable- Count=5 Masked+ <-- IRQs won't work if domain started
> ^^^^^^^

IMHO it seems that either your device is not able to perform a reset
successfully, or Linux is not correctly performing such reset. I don't
think there's a lot that can be done from the Xen side.

Thanks, Roger.

Re: pci-passthrough loses msi-x interrupts ability after domain destroy
>>> On 20.09.17 at 21:50, <jerome.oufella@savoirfairelinux.com> wrote:
> - On Dom0, 'lspci -vv' on that PCIe device between the "working" and
> the "muted interrupts" states, I noted a difference between the
> MSI-X caps:
>
> - Capabilities: [70] MSI-X: Enable- Count=5 Masked- <-- IRQs will work if domain started
> + Capabilities: [70] MSI-X: Enable- Count=5 Masked+ <-- IRQs won't work if domain started
>                                             ^^^^^^^

And did you verify that the OS actually makes an attempt to clear
this mask-all flag? If such an attempt doesn't have the intended
effect, finding the problem location in the code and sending a
fix can't be that difficult. If otoh the guest doesn't do this, then
we'd need to figure out whether we leave the device in a wrong
state after de-assigning it from the original guest instance (albeit,
as Roger said, the reset the device is supposed to go through
would be expected to clear it). I can certainly see an OS not
necessarily expecting the bit to be set when first gaining control
of the device. For this, look at the lspci output for the device in
Dom0 between shutting down and then restarting the guest.
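
As a rough illustration of the check suggested above, the mask-all flag can also be read directly from the device's config space instead of from lspci. This is a sketch under assumptions: the function names are hypothetical, the register offsets and bit masks are the standard PCI ones (capability ID 0x11 for MSI-X, Message Control at capability offset + 2), and reading beyond the first 64 bytes of the sysfs 'config' file requires root.

```python
import struct

PCI_STATUS = 0x06            # 16-bit status register
PCI_STATUS_CAP_LIST = 0x10   # capability list present
PCI_CAPABILITY_LIST = 0x34   # pointer to the first capability
PCI_CAP_ID_MSIX = 0x11
MSIX_FLAGS_ENABLE = 0x8000   # Message Control bit 15
MSIX_FLAGS_MASKALL = 0x4000  # Message Control bit 14 (lspci "Masked")
MSIX_FLAGS_QSIZE = 0x07FF    # table size minus one


def find_capability(config, cap_id):
    """Walk the capability list; return the capability offset or None."""
    status = struct.unpack_from("<H", config, PCI_STATUS)[0]
    if not status & PCI_STATUS_CAP_LIST:
        return None
    pos = config[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()
    # Stop on a null pointer, a truncated read, or a looping list.
    while pos >= 0x40 and pos + 4 <= len(config) and pos not in seen:
        seen.add(pos)
        if config[pos] == cap_id:
            return pos
        pos = config[pos + 1] & 0xFC
    return None


def msix_state(config):
    """Decode MSI-X Message Control, or None if the capability is not found."""
    pos = find_capability(config, PCI_CAP_ID_MSIX)
    if pos is None:
        return None
    control = struct.unpack_from("<H", config, pos + 2)[0]
    return {
        "offset": pos,
        "enabled": bool(control & MSIX_FLAGS_ENABLE),
        "mask_all": bool(control & MSIX_FLAGS_MASKALL),
        "table_size": (control & MSIX_FLAGS_QSIZE) + 1,
    }


def read_config(bdf, sysfs_root="/sys"):
    """Read the raw config space of a device from sysfs."""
    with open(f"{sysfs_root}/bus/pci/devices/{bdf}/config", "rb") as f:
        return f.read()
```

Calling msix_state(read_config("0000:07:00.0")) in Dom0 between shutting down and restarting the guest would show whether mask_all is still set after the device has been de-assigned. Note that None is also returned when the read was truncated to 64 bytes, so it does not by itself prove the capability is absent.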

Jan


Re: pci-passthrough loses msi-x interrupts ability after domain destroy
Thursday, September 21, 2017, 10:39:52 AM, you wrote:

> On Wed, Sep 20, 2017 at 03:50:35PM -0400, Jérôme Oufella wrote:
>> Hi Xen-devel,
>>
>> I'm using PCI pass-through to map a PCIe (intel i210) controller into
>> a HVM domain. The system uses xen-pciback to hide the appropriate PCI
>> device from Dom0.
>>
>> When creating the HVM domain after an hypervisor cold boot, the HVM
>> domain can access and use the PCIe controller without problem.
>>
>> However, if the HVM domain is destroyed then restarted, it won't be
>> able to use the pass-through PCI device anymore. The PCI device is
>> seen and can be mapped, however, the interrupts will not be passed to
>> the HVM domain anymore (this is visible under a Linux guest as
>> /proc/interrupts counters remain 0). The behavior on a Windows10 guest
>> is the same.
>>
>> A few interesting hints I noticed:
>>
>> - On Dom0, 'lspci -vv' on that PCIe device between the "working" and
>> the "muted interrupts" states, I noted a difference between the
>> MSI-X caps:
>>
>> - Capabilities: [70] MSI-X: Enable- Count=5 Masked- <-- IRQs will work if domain started
>> + Capabilities: [70] MSI-X: Enable- Count=5 Masked+ <-- IRQs won't work if domain started
>> ^^^^^^^

> IMHO it seems that either your device is not able to perform a reset
> successfully, or Linux is not correctly performing such reset. I don't
> think there's a lot that can be done from the Xen side.

Unfortunately, for a lot of PCI devices the simple reset performed by default isn't
enough, and almost none of them support a real PCI FLR.

In the distant past Konrad made a patchset that implemented a bus reset and a
reset of config space. (It piggybacked on the already existing libxl mechanism of
trying to write to a sysfs "do_flr" attribute, which triggers pciback to perform
the bus reset and the rewrite of config space for the device.)

I have used that patchset ever since for my PCI passthrough needs and it works
pretty well. I can shut down and restart VMs with PCI devices passed through (also
AMD Radeon graphics cards).

Jérôme:
Although your mileage may vary, you could try the attached patch. It's
for a > 4.9 Linux kernel's pciback, although it would probably also apply with
minimal adjustments (mostly non-matching line numbers) to a somewhat earlier kernel.

If it works for you as well, perhaps it deserves a mention on the Xen wiki in the
PCI passthrough section.

Roger:
I follow your PVH (dom0) patches only superficially. From my understanding, they
will result in Xen being more involved in the handling of PCI devices?
If that's correct, will this also impact the resetting logic, or will most of it
stay in the dom0 kernel/pciback?

--
Sander

> Thanks, Roger.
Re: pci-passthrough loses msi-x interrupts ability after domain destroy
On Thu, Sep 21, 2017 at 1:27 PM, Sander Eikelenboom
<linux@eikelenboom.it> wrote:
>
> On Thu, September 21, 2017, 10:39:52 AM, Roger Pau Monné wrote:
>
>> On Wed, Sep 20, 2017 at 03:50:35PM -0400, Jérôme Oufella wrote:
>>>
>>> I'm using PCI pass-through to map a PCIe (intel i210) controller into
>>> a HVM domain. The system uses xen-pciback to hide the appropriate PCI
>>> device from Dom0.
>>>
>>> When creating the HVM domain after an hypervisor cold boot, the HVM
>>> domain can access and use the PCIe controller without problem.
>>>
>>> However, if the HVM domain is destroyed then restarted, it won't be
>>> able to use the pass-through PCI device anymore. The PCI device is
>>> seen and can be mapped, however, the interrupts will not be passed to
>>> the HVM domain anymore (this is visible under a Linux guest as
>>> /proc/interrupts counters remain 0). The behavior on a Windows10 guest
>>> is the same.
>>>
>>> A few interesting hints I noticed:
>>>
>>> - On Dom0, 'lspci -vv' on that PCIe device between the "working" and
>>> the "muted interrupts" states, I noted a difference between the
>>> MSI-X caps:
>>>
>>> - Capabilities: [70] MSI-X: Enable- Count=5 Masked- <-- IRQs will work if domain started
>>> + Capabilities: [70] MSI-X: Enable- Count=5 Masked+ <-- IRQs won't work if domain started
>>> ^^^^^^^
>
>> IMHO it seems that either your device is not able to perform a reset
>> successfully, or Linux is not correctly performing such reset. I don't
>> think there's a lot that can be done from the Xen side.
>
> Unfortunately, for a lot of PCI devices the simple reset performed by default isn't
> enough, and almost none of them support a real PCI FLR.
>
> In the distant past Konrad made a patchset that implemented a bus reset and a
> reset of config space. (It piggybacked on the already existing libxl mechanism of
> trying to write to a sysfs "do_flr" attribute, which triggers pciback to perform
> the bus reset and the rewrite of config space for the device.)
>
> I have used that patchset ever since for my PCI passthrough needs and it works
> pretty well. I can shut down and restart VMs with PCI devices passed through (also
> AMD Radeon graphics cards).

Just to confirm the utility of that piece of work: OpenXT also uses an
extended version of that same patch to perform device reset for
passthrough.

I've attached a copy of that OpenXT patch to this message and it can
also be obtained from our git repository:
https://github.com/OpenXT/xenclient-oe/blob/f8d3b282a87231d9ae717b13d506e8e7e28c78c4/recipes-kernel/linux/4.9/patches/thorough-reset-interface-to-pciback-s-sysfs.patch
This version creates a sysfs node named "reset_device" and the OpenXT
libxl toolstack is patched to use that node instead of "do_flr".

Konrad's original work encountered pushback on upstream acceptance at
the time it was developed. I'm not sure I've found where that
discussion ended. Is there any prospect of a more comprehensive reset
mechanism being accepted into xen-pciback, or elsewhere in the kernel?

As noted in the original LKML threads, vfio has similar relevant pci
device reset retry logic. (Thanks to Rich Persaud for this pointer:)
http://elixir.free-electrons.com/linux/v4.14-rc1/source/drivers/vfio/pci/vfio_pci.c#L1353

libvirt also performs similar reset logic, using a direct low-level
interface to config space. (Thanks to Marek for this pointer; libvirt
is used by Qubes.)
https://github.com/libvirt/libvirt/blob/master/src/util/virpci.c#L929
I think this indicates that it would be possible to extend libxl to
do something similar, but that seems less satisfactory than
performing the work in a kernel-provided implementation.
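
To make the "direct low-level interface to config space" idea concrete, here is a hedged sketch of a secondary bus reset driven from user space by toggling the Secondary Bus Reset bit in the parent bridge's Bridge Control register. It is not libvirt's code. The function name and the delay values are assumptions, and a real implementation would also have to save and restore the config space of every device behind the bridge and make sure none of them is in use.

```python
import struct
import time

PCI_BRIDGE_CONTROL = 0x3E        # 16-bit Bridge Control register
PCI_BRIDGE_CTL_BUS_RESET = 0x40  # Secondary Bus Reset bit


def secondary_bus_reset(bridge_config_path, assert_s=0.1, settle_s=0.1):
    """Pulse the Secondary Bus Reset bit of a PCI-to-PCI bridge.

    bridge_config_path is the sysfs 'config' file of the bridge above the
    device. Returns the Bridge Control value read before the reset.
    """
    with open(bridge_config_path, "r+b", buffering=0) as cfg:
        cfg.seek(PCI_BRIDGE_CONTROL)
        (ctl,) = struct.unpack("<H", cfg.read(2))

        # Assert reset on the secondary bus.
        cfg.seek(PCI_BRIDGE_CONTROL)
        cfg.write(struct.pack("<H", ctl | PCI_BRIDGE_CTL_BUS_RESET))
        time.sleep(assert_s)

        # Deassert reset and give the devices time to come back.
        cfg.seek(PCI_BRIDGE_CONTROL)
        cfg.write(struct.pack("<H", ctl & ~PCI_BRIDGE_CTL_BUS_RESET & 0xFFFF))
        time.sleep(settle_s)
    return ctl
```

Every device behind the bridge is reset by this, which is why the approaches discussed in this thread first check what else sits on the same bus.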

Is there a way forward to providing this functionality within Xen
software or Linux?

Christopher
--

openxt.org
Re: pci-passthrough loses msi-x interrupts ability after domain destroy
Hi,

On Thu, Sep 21, 2017 at 07:09:12PM -0700, Christopher Clark wrote:
> On Thu, Sep 21, 2017 at 1:27 PM, Sander Eikelenboom
> <linux@eikelenboom.it> wrote:
> >
> > On Thu, September 21, 2017, 10:39:52 AM, Roger Pau Monné wrote:
> >
> >> On Wed, Sep 20, 2017 at 03:50:35PM -0400, Jérôme Oufella wrote:
> >>>
> >>> I'm using PCI pass-through to map a PCIe (intel i210) controller into
> >>> a HVM domain. The system uses xen-pciback to hide the appropriate PCI
> >>> device from Dom0.
> >>>
> >>> When creating the HVM domain after an hypervisor cold boot, the HVM
> >>> domain can access and use the PCIe controller without problem.
> >>>
> >>> However, if the HVM domain is destroyed then restarted, it won't be
> >>> able to use the pass-through PCI device anymore. The PCI device is
> >>> seen and can be mapped, however, the interrupts will not be passed to
> >>> the HVM domain anymore (this is visible under a Linux guest as
> >>> /proc/interrupts counters remain 0). The behavior on a Windows10 guest
> >>> is the same.
> >>>
> >>> A few interesting hints I noticed:
> >>>
> >>> - On Dom0, 'lspci -vv' on that PCIe device between the "working" and
> >>> the "muted interrupts" states, I noted a difference between the
> >>> MSI-X caps:
> >>>
> >>> - Capabilities: [70] MSI-X: Enable- Count=5 Masked- <-- IRQs will work if domain started
> >>> + Capabilities: [70] MSI-X: Enable- Count=5 Masked+ <-- IRQs won't work if domain started
> >>> ^^^^^^^
> >
> >> IMHO it seems that either your device is not able to perform a reset
> >> successfully, or Linux is not correctly performing such reset. I don't
> >> think there's a lot that can be done from the Xen side.
> >
> > Unfortunately, for a lot of PCI devices the simple reset performed by default isn't
> > enough, and almost none of them support a real PCI FLR.
> >
> > In the distant past Konrad made a patchset that implemented a bus reset and a
> > reset of config space. (It piggybacked on the already existing libxl mechanism of
> > trying to write to a sysfs "do_flr" attribute, which triggers pciback to perform
> > the bus reset and the rewrite of config space for the device.)
> >
> > I have used that patchset ever since for my PCI passthrough needs and it works
> > pretty well. I can shut down and restart VMs with PCI devices passed through (also
> > AMD Radeon graphics cards).
>
> Just to confirm the utility of that piece of work: OpenXT also uses an
> extended version of that same patch to perform device reset for
> passthrough.
>
> I've attached a copy of that OpenXT patch to this message and it can
> also be obtained from our git repository:
> https://github.com/OpenXT/xenclient-oe/blob/f8d3b282a87231d9ae717b13d506e8e7e28c78c4/recipes-kernel/linux/4.9/patches/thorough-reset-interface-to-pciback-s-sysfs.patch
> This version creates a sysfs node named "reset_device" and the OpenXT
> libxl toolstack is patched to use that node instead of "do_flr".
>
> Konrad's original work encountered pushback on upstream acceptance at
> the time it was developed. I'm not sure I've found where that
> discussion ended. Is there any prospect of a more comprehensive reset
> mechanism being accepted into xen-pciback, or elsewhere in the kernel?
>
> As noted in the original LKML threads, vfio has similar relevant pci
> device reset retry logic. (Thanks to Rich Persaud for this pointer:)
> http://elixir.free-electrons.com/linux/v4.14-rc1/source/drivers/vfio/pci/vfio_pci.c#L1353
>
> libvirt also performs similar reset logic, using a direct low level
> interface to config space (Thanks to Marek for this pointer, libvirt
> is used by Qubes:)
> https://github.com/libvirt/libvirt/blob/master/src/util/virpci.c#L929
> I think this indicates that it would be possible to extend libxl to
> do something similar, but that seems less satisfactory compared to
> performing the work in a kernel-provided implementation.
>
> Is there a way forward to providing this functionality within Xen
> software or Linux?
>

Adding Konrad to CC ..


-- Pasi

> Christopher
> --
>
> openxt.org

> From d686351d8ea4a1ea1d755d0a10f6f14d1c870911 Mon Sep 17 00:00:00 2001
> From: Kyle Temkin <ktemkin@binghamton.edu>
> Date: Wed, 8 Apr 2015 00:58:24 -0400
> Subject: [PATCH] Add thorough reset interface to pciback's sysfs.
>
> --------------------------------------------------------------------------------
> SHORT DESCRIPTION:
> --------------------------------------------------------------------------------
> Adds an interface that allows "more thorough" resets to be performed
> on devices which don't support Function Level Resets (FLRs). This
> interface should allow the toolstack to ensure that a PCI device is in a
> known state prior to passing it through to a VM.
>
> --------------------------------------------------------------------------------
> LONG DESCRIPTION:
> --------------------------------------------------------------------------------
>
> From Konrad Rzeszutek Wilk's original post to xen-devel and the LKML:
>
> The life-cycle of a PCI device in Xen pciback is complex
> and is constrained by the PCI generic locking mechanism.
>
> It starts with the device being bound to us - for which
> we do a device function reset (done via SysFS,
> so the PCI lock is held).
>
> If the device is unbound from us - we also do a function
> reset (also done via SysFS so the PCI lock is held).
>
> If the device is un-assigned from a guest - we do a function
> reset (no PCI lock).
>
> All on the individual PCI function level (so bus:device:function).
>
> Unfortunately a function reset is not adequate for certain
> PCIe devices. The reset for an individual PCI function "means
> the device must support FLR (PCIe or AF), PM reset on D3hot->D0
> transition, device specific reset, or be a singleton device on a bus
> for a secondary bus reset. FLR does not have widespread support,
> PM reset is not very reliable, and bus topology is dictated by the
> system and device design. We need to provide a means for a user to
> induce a bus reset in cases where the existing mechanisms are not
> available or not reliable." (Alex Williamson, 'vfio-pci: PCI hot reset
> interface' commit 8b27ee60bfd6bbb84d2df28fa706c5c5081066ca).
>
> As such, to do a slot or a bus reset we need another mechanism.
> This is not exposed via SysFS as there is no good way of exposing
> a bus topology there.
>
> This is due to the complexity - we MUST know that the different
> functions off a PCIe device are not in use by other drivers, or
> if they are in use (say one of them is assigned to a guest
> and the other is idle) - it is still OK to reset the slot
> (assuming both of them are owned by Xen pciback).
>
> This patch does that by doing a slot or bus reset (if
> slot reset is not supported) if all of the functions of a PCIe
> device belong to Xen PCIback. We do not care if the device is
> in-use as we depend on the toolstack to be aware of this -
> however if it is we will WARN the user.
>
> Due to the complexity with the PCI lock we cannot do
> the reset when a device is bound ('echo $BDF > bind')
> or when unbound ('echo $BDF > unbind') as the pci_[slot|bus]_reset
> functions also take the same lock, resulting in a deadlock.
>
> Putting the reset function in a workqueue or thread
> won't work either - as we have to do the reset function
> outside the 'unbind' context (it holds the PCI lock).
> But once you 'unbind' a device the device is no longer
> under the ownership of Xen pciback and the pci_set_drvdata
> has been reset so we cannot use a thread for this.
>
> Instead of doing all this complex dance, we depend on the toolstack
> doing the right thing. As such implement [... a SysFS attribute]
> which [... the toolstack] uses when a device is detached or attached
> from/to a guest. It bypasses the need to worry about the PCI lock.
>
> To not inadvertently do a bus reset that would affect devices that
> are in use by other drivers (other than Xen pciback) prior
> to the reset we check that all of the devices under the bridge
> are owned by Xen pciback. If they are not we do not do
> the bus (or slot) reset.
>
> We also warn the user if the device is in use - but still
> continue with the reset. This should not happen as the toolstack
> also does the check.
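
The ownership rule described above can be modelled in a few lines of userspace C. This is only a sketch of the rule as the description states it, not the patch code: `struct model_dev` and `bus_reset_allowed()` are hypothetical names invented for illustration. Note that the rewritten code further down in the patch appears to be stricter than this description, since `safe_to_sbr_device()` also refuses the slot/bus reset when any pciback device is in use.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one device sitting under the bridge to be reset. */
struct model_dev {
	const char *sbdf;        /* e.g. "0000:01:00.0" */
	bool owned_by_pciback;   /* bound to xen-pciback in dom0 */
	bool assigned_to_guest;  /* currently passed through to a domain */
};

/*
 * Returns true when a slot/bus reset may go ahead according to the
 * description: every device under the bridge must be owned by pciback.
 * Devices that are in use do not block the reset; they are only counted
 * (via *in_use, if non-NULL) so that the caller can warn about them.
 */
static bool bus_reset_allowed(const struct model_dev *devs, size_t n,
			      size_t *in_use)
{
	bool all_owned = true;
	size_t busy = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!devs[i].owned_by_pciback)
			all_owned = false;
		else if (devs[i].assigned_to_guest)
			busy++;
	}

	if (in_use)
		*in_use = busy;
	return all_owned;
}
```

A caller would warn when the in-use count is non-zero and skip the slot/bus reset entirely when the function returns false.
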
>
> --
>
> Our version of the patch has been modified to use a less confusing
> sysfs name. The original name ('do_flr') is inappropriate, as it
> implies a function level reset; it is entirely possible that the patch
> code will use a bus-level reset when appropriate.
>
> The new sysfs entry is located at:
>
> /sys/bus/pci/drivers/pciback/reset_device
>
> and can be activated by writing a domain:bus:device:function device
> identifier into the sysfs file. As an example:
>
> echo "0000:01:00.0" > /sys/bus/pci/drivers/pciback/reset_device
>
> would reset the device matching the D:BDF descriptor above.
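
For a toolstack that wants to do the same thing without spawning a shell, the write could look roughly like the following. This is a minimal userspace sketch, assuming the sysfs path quoted above; `format_sbdf()` and `request_pciback_reset()` are hypothetical helper names and are not part of libxl or of the patch.

```c
#include <errno.h>
#include <stdio.h>

/* Path as given in the patch description; only present with this patch applied. */
#define PCIBACK_RESET_NODE "/sys/bus/pci/drivers/pciback/reset_device"

/* Hypothetical helper: format domain:bus:device.function, e.g. "0000:01:00.0". */
static int format_sbdf(char *buf, size_t len, unsigned int domain,
		       unsigned int bus, unsigned int dev, unsigned int func)
{
	int n;

	if (domain > 0xffff || bus > 0xff || dev > 0x1f || func > 0x7)
		return -EINVAL;

	n = snprintf(buf, len, "%04x:%02x:%02x.%x", domain, bus, dev, func);
	if (n < 0 || (size_t)n >= len)
		return -ENOSPC;
	return 0;
}

/*
 * Hypothetical helper: write the identifier to the sysfs node.
 * Returns 0 on success or a negative errno value. Because stdio buffers
 * the write, a rejection by the kernel typically shows up at fclose().
 */
static int request_pciback_reset(const char *node, const char *sbdf)
{
	FILE *f = fopen(node, "w");
	int rc = 0;

	if (!f)
		return -errno;
	if (fputs(sbdf, f) == EOF)
		rc = -EIO;
	if (fclose(f) == EOF && rc == 0)
		rc = -errno;
	return rc;
}
```

A caller would format the identifier with `format_sbdf(buf, sizeof(buf), 0, 1, 0, 0)` and pass the result to `request_pciback_reset(PCIBACK_RESET_NODE, buf)`, treating any non-zero return as "device not reset".
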
>
> --------------------------------------------------------------------------------
> CHANGELOG:
> --------------------------------------------------------------------------------
> This is a port of a patch that likely had many authors, including:
> -Konrad Rzeszutek Wilk
> -Alex Williamson
> -Ross Philipson <rphilipson@ainfosec.com>
> Ported to OpenXT by: Kyle J. Temkin <temkink@ainfosec.com>, 4/8/15
> Rewrite by: Kyle J. Temkin <temkink@ainfosec.com>, 4/10/15
>
> --------------------------------------------------------------------------------
> DEPENDENCIES
> --------------------------------------------------------------------------------
> This patch requires ONE of the following:
> -A relatively modern linux kernel (3.18+) as a base; which provides
> the PCI functions used; or
> -Our PCI reset backports patch (backport-pci-reset-functionality.patch),
> which backports the relevant functionality to 3.11.
>
> To take advantage of this patch, the utilized toolstack should be
> changed to:
> -Use the provided "reset_device" property, rather than the PCI
> device's sysfs "reset" entry. This enables resets beyond a FLR to be
> used.
> -Ensure that all functions of a given device are passed through
> together. This allows us to use some of the more thorough resetting
> techniques, when possible.
>
> --------------------------------------------------------------------------------
> REMOVAL
> --------------------------------------------------------------------------------
> This patch provides a service which is necessary for proper passthrough
> of many PCI cards: a generalized ability to reset PCI devices, without
> requiring that the device support FLR or power-management based resets.
>
> This patch will be necessary until either the Linux PCI subsystem or Xen
> PCIback drivers are modified to provide this support; or until cards
> without proper FLR support are no longer supported.
>
> --------------------------------------------------------------------------------
> UPSTREAM PLAN
> --------------------------------------------------------------------------------
>
> This code is taken from a patch which was originally proposed and
> rejected from upstream on the LKML and xen-devel. An upstream
> implementation of the functionality of this patch is still necessary;
> and can and should be implemented.
>
> This patch will hopefully be replaced with an upstream version when
> community consensus has produced a single "blessed" method of
> accomplishing its functionality.
>
> --------------------------------------------------------------------------------
> PATCHES
> --------------------------------------------------------------------------------
> ---
> drivers/xen/xen-pciback/pci_stub.c | 338 ++++++++++++++++++++++++++++++++++---
> 1 file changed, 312 insertions(+), 26 deletions(-)
>
> Index: linux-4.9.40/drivers/xen/xen-pciback/pci_stub.c
> ===================================================================
> --- linux-4.9.40.orig/drivers/xen/xen-pciback/pci_stub.c
> +++ linux-4.9.40/drivers/xen/xen-pciback/pci_stub.c
> @@ -102,10 +102,9 @@ static void pcistub_device_release(struc
>
> xen_unregister_device_domain_owner(dev);
>
> - /* Call the reset function which does not take lock as this
> - * is called from "unbind" which takes a device_lock mutex.
> - */
> - __pci_reset_function_locked(dev);
> +
> + /* Reset is done by the toolstack by using 'reset_device' on the
> + * SysFS. */
> if (pci_load_and_free_saved_state(dev, &dev_data->pci_saved_state))
> dev_info(&dev->dev, "Could not reload PCI state\n");
> else
> @@ -125,9 +124,6 @@ static void pcistub_device_release(struc
> err);
> }
>
> - /* Disable the device */
> - xen_pcibk_reset_device(dev);
> -
> kfree(dev_data);
> pci_set_drvdata(dev, NULL);
>
> @@ -224,6 +220,271 @@ struct pci_dev *pcistub_get_pci_dev_by_s
> return found_dev;
> }
>
> +
> +/**
> + * Returns true iff the given device supports PCIe FLRs.
> + */
> +static bool __device_supports_pcie_flr(struct pci_dev *dev)
> +{
> + u32 cap;
> +
> + /*
> + * Read the device's capabilities. Note that this can be used even on legacy
> + * PCI devices (and not just on PCIe devices)-- it indicates that no capabilities
> + * are supported if the device is legacy PCI by setting cap to 0.
> + */
> + pcie_capability_read_dword(dev, PCI_EXP_DEVCAP, &cap);
> +
> + /* Return true iff the device advertises supporting an FLR. */
> + return (cap & PCI_EXP_DEVCAP_FLR);
> +}
> +
> +
> +/**
> + * Returns true iff the given device supports PCI Advanced Functionality (AF) FLRs.
> + */
> +static bool __device_supports_pci_af_flr(struct pci_dev *dev)
> +{
> + int pos;
> + u8 capability_flags;
> +
> + /* First, try to find the location of the PCI Advanced Functionality capability byte. */
> + pos = pci_find_capability(dev, PCI_CAP_ID_AF);
> +
> + /*
> + * If we weren't able to find the capability byte, this device doesn't support
> + * the Advanced Functionality extensions, and thus won't support AF FLR.
> + */
> + if (!pos)
> + return false;
> +
> + /* Read the capabilities advertised in the AF capability byte. */
> + pci_read_config_byte(dev, pos + PCI_AF_CAP, &capability_flags);
> +
> + /*
> + * If the device does support AF, it will advertise FLR support via the
> + * PCI_AF_CAP_FLR bit. We'll also check for the Transactions Pending (TP)
> + * mechanism, as the kernel requires this extension to issue an AF FLR.
> + * (Internally, the PCI reset code needs to be able to wait for all
> + * pending transactions to complete prior to issuing the AF FLR.)
> + */
> + return (capability_flags & PCI_AF_CAP_TP) && (capability_flags & PCI_AF_CAP_FLR);
> +}
> +
> +
> +/**
> + * Returns true iff the given device advertises supporting function-
> + * level-reset (FLR).
> + */
> +static bool device_supports_flr(struct pci_dev *dev)
> +{
> + return __device_supports_pci_af_flr(dev) || __device_supports_pcie_flr(dev);
> +}
> +
> +
> +/**
> + * Returns true iff the given device is located in a slot that
> + * supports hotplugging slot resets.
> + */
> +static bool device_supports_slot_reset(struct pci_dev *dev)
> +{
> + return !pci_probe_reset_slot(dev->slot);
> +}
> +
> +
> +/**
> + * Returns true iff the given device is located on a bus that
> + * we can reset. Note that root bridges are excluded, as this
> + * would cause more than just an SBR.
> + */
> +static bool device_supports_bus_reset(struct pci_dev *dev)
> +{
> + return !pci_is_root_bus(dev->bus) && !pci_probe_reset_bus(dev->bus);
> +}
> +
> +
> +/**
> + * Out argument for the __safe_to_sbr_device_callback function.
> + */
> +struct safe_to_sbr_arguments {
> +
> + //Stores the most recently encountered PCI device that does
> + //not belong to pciback. As used below, this is the result of a
> + //search for a non-pciback device on a bus; we stop upon finding
> + //the first non-pciback device.
> + struct pci_dev *last_non_pciback_device;
> +
> + //Stores the number of pciback devices that appear to be in use
> + //on the bus in question.
> + int use_count;
> +
> +};
> +
> +
> +/**
> + * A callback function which determines if a given PCI device is owned by pciback,
> + * and whether the given device is in use. Used by safe_to_sbr_device.
> + *
> + * @param dev The PCI device to be checked.
> + * @param data An out argument of type struct safe_to_sbr_arguments.
> + * Updated to indicate the result of the search. See the struct's definition
> + * for more details.
> + *
> + */
> +static int __safe_to_sbr_device_callback(struct pci_dev *dev, void *data)
> +{
> +
> + struct pcistub_device *psdev;
> +
> + bool device_owned_by_pciback = false;
> + struct safe_to_sbr_arguments *arg = data;
> +
> + unsigned long flags;
> +
> + //Ensure that we have exclusive access to the list of PCI devices,
> + //so we can traverse it.
> + spin_lock_irqsave(&pcistub_devices_lock, flags);
> +
> + //Iterate over all PCI devices owned by the pci stub.
> + list_for_each_entry(psdev, &pcistub_devices, dev_list) {
> +
> + //If the given device is owned by pciback...
> + if (psdev->dev == dev) {
> +
> + //mark it as a pciback device.
> + device_owned_by_pciback = true;
> +
> + //If we have a physical device associated with the pciback device,
> + //mark this device as in-use.
> + if (psdev->pdev)
> + arg->use_count++;
> +
> + //Stop searching; we've found the PCIback device associated with this one.
> + break;
> + }
> + }
> +
> + //Release the PCI device lock...
> + spin_unlock_irqrestore(&pcistub_devices_lock, flags);
> +
> + //... and report if we've found a device that's not owned by pciback.
> + dev_dbg(&dev->dev, "%s\n", device_owned_by_pciback ? "is owned by pciback, and can be reset if not in use."
> + : "not owned by pciback, and thus cannot be reset.");
> +
> + //If we've found a device that's not owned by pciback, update our data
> + //argument so it points to the most recent unowned device. (We check
> + //this like a flag, later: if it's never set, no one owns the device!)
> + if (!device_owned_by_pciback)
> + arg->last_non_pciback_device = dev;
> +
> + //If we've found a device that's not owned by pciback, return nonzero--
> + //this indicates that pci_walk_bus should cease its walk.
> + return !device_owned_by_pciback;
> +}
> +
> +
> +/**
> + * Returns true iff it should be safe to issue a secondary bus reset
> + * to the device; that is, if an SBR can be issued without disrupting
> + * other devices.
> + */
> +static bool safe_to_sbr_device(struct pci_dev *dev)
> +{
> + struct safe_to_sbr_arguments walk_result = { .last_non_pciback_device = NULL, .use_count = 0 };
> +
> + //Walk the PCI bus, checking whether every device on it is owned by pciback.
> + pci_walk_bus(dev->bus, __safe_to_sbr_device_callback, &walk_result);
> +
> + //If the device is in use, emit a debug message.
> + if(walk_result.use_count > 0)
> + dev_dbg(&dev->dev, "is in use; currently not safe to SBR device.\n");
> +
> + //Return true iff we did not pick up any other devices
> + //that were either in use, or not owned by pciback.
> + return (walk_result.last_non_pciback_device == NULL) && (walk_result.use_count == 0);
> +}
> +
> +
> +/**
> + * Attempt a raw reset of the provided PCI device-- via any
> + * method available to us. This method prefers the gentlest
> + * possible reset method-- currently an FLR, which many
> + * PCIe devices should support.
> + *
> + * @param dev The pci device to be reset.
> + * @return Zero on success, or the error code generated by the reset method on failure.
> + */
> +static int __pcistub_raw_device_reset(struct pci_dev *dev)
> +{
> + //Determine if bus resetting techniques (SBR, slot resets)
> + //are safe, and thus should be allowed.
> + int allow_bus_reset = safe_to_sbr_device(dev);
> +
> + //If FLRs are supported; we'll try to let the linux kernel
> + //manually reset the device.
> + if(device_supports_flr(dev)) {
> + dev_dbg(&dev->dev, "Resetting device using an FLR.");
> + return pci_reset_function(dev);
> + }
> +
> + //Next, we'll try the next gentlest: a hotplugging reset
> + //of the PCI slot.
> + if(allow_bus_reset && device_supports_slot_reset(dev)) {
> + dev_dbg(&dev->dev, "Resetting device using a slot reset.");
> + return pci_try_reset_slot(dev->slot);
> + }
> +
> + //Finally, we'll try the most drastic: resetting the parent
> + //PCI bus-- which we can only do conditionally.
> + if(allow_bus_reset && device_supports_bus_reset(dev)) {
> + dev_dbg(&dev->dev, "Resetting device using an SBR.");
> + return pci_try_reset_bus(dev->bus);
> + }
> +
> + //If we weren't able to reset the device by any of our known-good methods,
> + //fall back to the linux kernel's reset function. Unfortunately, this considers a
> + //power management reset to be a valid reset; though this doesn't work for many devices--
> + //especially GPUs.
> + dev_err(&dev->dev, "No reset methods available for %s. Falling back to kernel reset.", pci_name(dev));
> + pci_reset_function(dev);
> +
> + //Return an error code, indicating that we likely did not reset the device correctly.
> + return -ENOTTY;
> +}
> +
> +
> +/**
> + * Resets the target (pciback-owned) PCI device. Primarily intended
> + * for use by the toolstack, so it can ensure a consistent PCI device
> + * state on VM startup.
> + *
> + * @param dev The device to be reset.
> + * @return Zero on success, or a negated error code on failure.
> + */
> +static int pcistub_reset_pci_dev(struct pci_dev *dev)
> +{
> + int rc;
> +
> + if (!dev)
> + return -EINVAL;
> +
> + /*
> + * Takes the PCI lock. OK to do it as we are never called
> + * from 'unbind' state and don't deadlock.
> + */
> + rc =__pcistub_raw_device_reset(dev);
> + pci_restore_state(dev);
> +
> + /* This disables the device. */
> + xen_pcibk_reset_device(dev);
> +
> + /* And clean up our emulated fields. */
> + xen_pcibk_config_reset_dev(dev);
> + return rc;
> +}
> +
> +
> +
> struct pci_dev *pcistub_get_pci_dev(struct xen_pcibk_device *pdev,
> struct pci_dev *dev)
> {
> @@ -279,11 +540,13 @@ void pcistub_put_pci_dev(struct pci_dev
> * pcistub and xen_pcibk when AER is in processing
> */
> down_write(&pcistub_sem);
> - /* Cleanup our device
> - * (so it's ready for the next domain)
> - */
> device_lock_assert(&dev->dev);
> - __pci_reset_function_locked(dev);
> + /*
> + * Reset is up to the toolstack.
> + * The toolstack has to call 'reset_device' before
> + * providing the PCI device to a guest (see pcistub_sysfs_reset_device).
> + */
> + //__pci_reset_function_locked(dev);
>
> dev_data = pci_get_drvdata(dev);
> ret = pci_load_saved_state(dev, dev_data->pci_saved_state);
> @@ -1460,6 +1723,41 @@ static ssize_t restrictive_add(struct de
> }
> static DRIVER_ATTR(restrictive, S_IWUSR, NULL, restrictive_add);
>
> +/**
> + * Handles the "reset_device" sysfs attribute. This is the primary reset interface
> + * utilized by the toolstack.
> + */
> +static ssize_t pcistub_sysfs_reset_device(struct device_driver *drv, const char *buf, size_t count)
> +{
> + int domain, bus, slot, func, err;
> + struct pcistub_device *psdev;
> +
> + //Attempt to convert the user's string to a BDF/slot.
> + err = str_to_slot(buf, &domain, &bus, &slot, &func);
> + if (err)
> + return -ENODEV;
> +
> + //... and then use that slot to find the pciback device.
> + psdev = pcistub_device_find(domain, bus, slot, func);
> +
> + //If we have a device, attempt to reset it using our internal reset path.
> + if (psdev) {
> + err = pcistub_reset_pci_dev(psdev->dev);
> + pcistub_device_put(psdev);
> +
> + //If we were not able to reset the device, return the relevant error code.
> + if(err)
> + err = -ENODEV;
> + }
> + //Otherwise, indicate that there's no such device.
> + else {
> + err = -ENODEV;
> + }
> +
> + return err ? err : count;
> +
> +}
> +static DRIVER_ATTR(reset_device, S_IWUSR, NULL, pcistub_sysfs_reset_device);
>
> static void pcistub_exit(void)
> {
> @@ -1476,6 +1774,8 @@ static void pcistub_exit(void)
> &driver_attr_irq_handlers);
> driver_remove_file(&xen_pcibk_pci_driver.driver,
> &driver_attr_irq_handler_state);
> + driver_remove_file(&xen_pcibk_pci_driver.driver,
> + &driver_attr_reset_device);
> pci_unregister_driver(&xen_pcibk_pci_driver);
> }
>
> @@ -1572,6 +1872,9 @@ static int __init pcistub_init(void)
> if (!err)
> err = driver_create_file(&xen_pcibk_pci_driver.driver,
> &driver_attr_irq_handler_state);
> + if (!err)
> + err = driver_create_file(&xen_pcibk_pci_driver.driver,
> + &driver_attr_reset_device);
> if (err)
> pcistub_exit();
>



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
Re: pci-passthrough loses msi-x interrupts ability after domain destroy
On 22/09/17 04:09, Christopher Clark wrote:
> On Thu, Sep 21, 2017 at 1:27 PM, Sander Eikelenboom
> <linux@eikelenboom.it> wrote:
>>
>> On Thu, September 21, 2017, 10:39:52 AM, Roger Pau Monné wrote:
>>
>>> IMHO it seems that either your device is not able to perform a reset
>>> successfully, or Linux is not correctly performing such reset. I don't
>>> think there's a lot that can be done from the Xen side.
>>
>> Unfortunately for a lot of pci-devices a simple reset as performed by default isn't enough,
>> but also almost none support a real pci FLR.
>>
>> In the distant past Konrad has made a patchset that implemented a bus reset and
>> resetting config space. (It piggy backed on already existing libxl mechanism of
>> trying to call on a sysfs "do_flr" attribute which triggers pciback to perform
>> the busreset and rewrite of config space for the device.)
>>
>> I use that patchset ever since for my pci-passthrough needs and it works pretty
>> well. I can shutdown and restart VM's with pci devices passed through (also AMD
>> Radeon graphic cards).
>
> Just to confirm the utility of that piece of work: OpenXT also uses an
> extended version of that same patch to perform device reset for
> passthrough.
>
> I've attached a copy of that OpenXT patch to this message and it can
> also be obtained from our git repository:
> https://github.com/OpenXT/xenclient-oe/blob/f8d3b282a87231d9ae717b13d506e8e7e28c78c4/recipes-kernel/linux/4.9/patches/thorough-reset-interface-to-pciback-s-sysfs.patch
> This version creates a sysfs node named "reset_device" and the OpenXT
> libxl toolstack is patched to use that node instead of "do_flr".

Nice to hear there are more users of this patch. On #xen on IRC there were from time to time
also users who tried pci-passthrough and ran into this issue (and probably abandoned the idea,
since having to restart your host before being able to use your passed-through device again
defeats much of the use case).

> Konrad's original work encountered pushback on upstream acceptance at
> the time it was developed. I'm not sure I've found where that
> discussion ended. Is there any prospect of a more comprehensive reset
> mechanism being accepted into xen-pciback, or elsewhere in the kernel?

Yeah it was nacked by David Vrabel and the discussion somewhat bled to death.
From what i remember the main issue was with the naming, since it doesn't do a FLR,
the sysfs hook shouldn't be called "do_flr".

Some other perhaps minor issues i can think of are:
- No way to exempt pci-devices from this new way of resetting them.
Perhaps there could be pci devices/topologies where this way of
resetting causes more problems than it solves and could cause a
regression. Unfortunately auto detecting what works doesn't seem to
be possible. On the other hand (though only with my n=10) i haven't encountered
such a device yet.

- The communication path between libxl and the kernel via sysfs.
I think the preference was for either:
a) having it use a more commonly used Xen communication channel or
b) having it all self-contained in pci-back. (From my memory and the openxt patch description
there could be some locking issue when trying to implement it this way,
but the vfio guys had that solved for their reset implementation, if i
understood correctly from one of the comments in their source code (patches by Alex Williamson
if i remember correctly).)

- Not an issue back then when the patch was made, but as the question earlier to Roger,
the hypervisor seems to grow more interference with pci devices with the PVH dom0 work.
If and how does that relate to pci-back and pci-passthrough and (the location of) resetting mechanisms ?


So i think David's NACK was mostly for the patchset having some hackish cosmetics.

On the upside one can conclude that this patchset is now pretty well tested over the years ;)

Since David has left, perhaps Juergen/Boris/Konrad could express their views (again) ?
(CC'ed them as well)

> As noted in the original LKML threads, vfio has similar relevant pci
> device reset retry logic. (Thanks to Rich Persaud for this pointer:)
> http://elixir.free-electrons.com/linux/v4.14-rc1/source/drivers/vfio/pci/vfio_pci.c#L1353
>
> libvirt also performs similar reset logic, using a direct low level
> interface to config space (Thanks to Marek for this pointer, libvirt
> is used by Qubes:)
> https://github.com/libvirt/libvirt/blob/master/src/util/virpci.c#L929
> I thinks this indicates that it would be possible to extend libxl to
> do something similar, but that seems less satisfactory compared to
> performing the work in a kernel-provided implementation.
>
> Is there a way forward to providing this functionality within Xen
> software or Linux?
>
> Christopher
> --
>
> openxt.org
>


Re: pci-passthrough loses msi-x interrupts ability after domain destroy
On Thu, Sep 21, 2017 at 10:27:01PM +0200, Sander Eikelenboom wrote:
> Roger:
> I follow your PVH (dom0) patches shallowly; from my understanding they will result
> in Xen having more interference with the handling of PCI devices ?

Yes, that's correct.

> If that's correct will this also impact the resetting logic, or will most stay
> in the dom0 kernel/pciback ?

It's not clear, IMHO it would be better to handle the reset logic in
Xen itself, but I haven't looked into it to know whether that's
something feasible or not.

Roger.

Re: pci-passthrough loses msi-x interrupts ability after domain destroy
On Fri, Sep 22, 2017 at 09:35:40AM +0200, Sander Eikelenboom wrote:
> So i think David's NACK was mostly for the patchset having some hackish cosmetics.

He didn't like 'do_flr', which made sense as the patchset did not do FLR. It did a bus reset
for more than one device (if those devices were assigned to pciback).

>
> On the upside one can conclude that this patchset is now pretty well tested over the years ;)
>
> Since David has left, perhaps Jurgen/Boris/Konrad could express their views (again) ?
> (CC'ed them as well)

I've asked Govinda (CC-ed) to refresh the patchset against the latest kernel and
repost it and see where it goes.


Re: pci-passthrough loses msi-x interrupts ability after domain destroy
On Fri, Sep 22, 2017 at 09:35:40AM +0200, Sander Eikelenboom wrote:
> - Not an issue back then when the patch was made, but as the question earlier to Roger,
> the hypervisor seems to grow more interference with pci devices with the PVH dom0 work.
> If and how does that relate to pci-back and pci-passthrough and (the location of) resetting mechanisms ?

pci-back will still be required in order to do pci-passthrough to PV
guests. The aim is to use the new pci-emulation layer to perform
pci-passthrough to both HVM and PVH guests, deprecating the
passthrough code inside of QEMU.

I would prefer to have the reset mechanism inside of Xen, but I'm not
sure how feasible it is to put this code inside of Xen because I
haven't looked at it yet.

Roger.

Re: pci-passthrough loses msi-x interrupts ability after domain destroy
> [snip]
>
> > So i think David's NACK was mostly for the patchset having some hackish
> > cosmetics.
>
> He didn't like 'do_flr' which made sense as the patchset did not do FLR.
> It made a bus-reset for more than one device (if those devices were
> assigned to pciback).
>

When I first wrote this, FLR was not yet implemented in the kernel so I
was actually adding FLR support; thus the name do_flr. So that is just
cargo from years ago :)



>
> >
> > On the upside one can conclude that this patchset is now pretty well tested over the years ;)
> >
> > Since David has left, perhaps Jurgen/Boris/Konrad could express their views (again) ?
> > (CC'ed them as well)
>
> I've asked Govinda (CC-ed) to refresh the patchset against the latest kernel and
> repost it and see where it goes.
>
> >
> > > As noted in the original LKML threads, vfio has similar relevant pci
> > > device reset retry logic. (Thanks to Rich Persaud for this pointer:)
> > > http://elixir.free-electrons.com/linux/v4.14-rc1/source/drivers/vfio/pci/vfio_pci.c#L1353
> > >
> > > libvirt also performs similar reset logic, using a direct low level
> > > interface to config space (Thanks to Marek for this pointer, libvirt
> > > is used by Qubes:)
> > > https://github.com/libvirt/libvirt/blob/master/src/util/virpci.c#L929
> > > I think this indicates that it would be possible to extend libxl to
> > > do something similar, but that seems less satisfactory compared to
> > > performing the work in a kernel-provided implementation.
> > >
> > > Is there a way forward to providing this functionality within Xen
> > > software or Linux?
> > >
> > > Christopher
> > > --
> > >
> > > openxt.org
> > >
> >
>
>



--
Ross Philipson
Re: pci-passthrough loses msi-x interrupts ability after domain destroy [ In reply to ]
Hi Jan et al.,

----- On Sep 21, 2017, at 9:12 AM, Jan Beulich JBeulich@suse.com wrote:

> And did you verify that the OS actually makes an attempt to clear
> this mask-all flag? If such an attempt doesn't have the intended
> effect, finding the problem location in the code and sending a
> fix can't be that difficult. If otoh the guest doesn't do this, then
> we'd need to figure out whether we leave the device in a wrong
> state after de-assigning it from the original guest instance (albeit,
> as Roger said, the reset the device is supposed to go through
> would be expected to clear it).

> I can certainly see an OS not
> necessarily expecting the bit to be set when first gaining control
> of the device. For this, look at the lspci output for the device in
> Dom0 between shutting down and then restarting the guest.
>
> Jan

Thanks for the good hints and reset patches.

I tried the various device reset patches posted on this discussion
(do_flr, Christopher's "more thorough" reset_device) but without luck.

After the reset, I noticed that lspci shows the device's Masked
state has been cleared, but a newly started guest still fails
to receive any interrupts.
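
To compare the Masked state before and after a reset without reading
the lspci output by eye, something like the following could be used.
This is only a minimal sketch: it assumes the MSI-X line has the same
shape as the one quoted earlier in this thread ("MSI-X: Enable- Count=5
Masked+"), and the function name parse_msix_caps is made up for
illustration.

```python
import re

# Matches the MSI-X capability line as printed by 'lspci -vv', e.g.
#   Capabilities: [70] MSI-X: Enable- Count=5 Masked+
# The exact layout is assumed from the output quoted in this thread.
MSIX_RE = re.compile(r"MSI-X:\s+Enable([+-])\s+Count=(\d+)\s+Masked([+-])")


def parse_msix_caps(lspci_text):
    """Return the MSI-X flags found in lspci output, or None if absent.

    Hypothetical helper: 'enabled' and 'masked' are True when lspci
    prints '+' after the corresponding flag.
    """
    match = MSIX_RE.search(lspci_text)
    if match is None:
        return None
    enable, count, masked = match.groups()
    return {
        "enabled": enable == "+",
        "count": int(count),
        "masked": masked == "+",
    }
```

In Dom0 the input would typically be the text printed by
'lspci -vv -s 07:00.0' (the device address is taken from the
xen-pciback.hide setting mentioned at the start of the thread). Note
that, per the observation above, a cleared Masked flag alone does not
guarantee that interrupts reach the guest.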

However, I just found a sequence that makes the device survive a guest
destruction: issuing a remove_slot, unbind, bind, new_slot on that
specific device in pciback's sysfs directory, then re-creating the
domain. This lets interrupts flow again in both Windows and
Linux domains, without involving an additional reset step.
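
The sequence can be sketched as below. This is a rough illustration
under stated assumptions, not a tested recipe: the pciback driver
directory is assumed to be /sys/bus/pci/drivers/pciback, the device is
assumed to be 0000:07:00.0 (domain 0000 is a guess, the rest comes from
the xen-pciback.hide setting), and the helper name rebind_to_pciback is
made up. The order of the writes simply follows the description above.

```python
from pathlib import Path

# Assumed location of the xen-pciback driver directory in Dom0 sysfs.
PCIBACK_DIR = Path("/sys/bus/pci/drivers/pciback")

# Order of sysfs nodes as described in the message above.
STEPS = ("remove_slot", "unbind", "bind", "new_slot")


def rebind_to_pciback(bdf, driver_dir=PCIBACK_DIR):
    """Write the device address to each pciback sysfs node in turn.

    Hypothetical helper. 'bdf' is the full PCI address, for example
    "0000:07:00.0". Must run as root in Dom0 while no domain is using
    the device. Returns the tuple of node names that were written.
    """
    for node in STEPS:
        # Each node takes the PCI address of the device as plain text.
        (Path(driver_dir) / node).write_text(bdf)
    return STEPS


# Example (run in Dom0 after 'xl destroy', before re-creating the domain):
#   rebind_to_pciback("0000:07:00.0")
```

The driver directory is a parameter so the function can be exercised
against a scratch directory before pointing it at the real sysfs tree.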

It's a quick fix for me, but I'm wondering which code path differs
between the two methods, and how whatever makes the difference could
become part of the standard reset path.

Regards,
Jerome

Re: pci-passthrough loses msi-x interrupts ability after domain destroy [ In reply to ]
>>> On 25.09.17 at 18:10, <jerome.oufella@savoirfairelinux.com> wrote:
> I tried the various device reset patches posted on this discussion
> (do_flr, Christopher's "more thorough" reset_device) but without luck.
>
> After reset, I could notice that lspci shows the device's Masked
> state has been cleared but a newly started guest will still fail
> to receive any interrupts.

And that's even with the most recent Xen and qemu (there were some
changes by Roger here a few weeks ago)? If so, I'm afraid this will
all remain guesswork unless someone (you?) who is able to
see the issue also debugs it.

Jan


Re: pci-passthrough loses msi-x interrupts ability after domain destroy [ In reply to ]
Hi,

On Fri, Sep 22, 2017 at 03:25:00PM -0400, Konrad Rzeszutek Wilk wrote:
> On Fri, Sep 22, 2017 at 09:35:40AM +0200, Sander Eikelenboom wrote:
> > On 22/09/17 04:09, Christopher Clark wrote:
> > > On Thu, Sep 21, 2017 at 1:27 PM, Sander Eikelenboom
> > > <linux@eikelenboom.it> wrote:
> > >>
> > >> On Thu, September 21, 2017, 10:39:52 AM, Roger Pau Monné wrote:
> > >>
> > >>> On Wed, Sep 20, 2017 at 03:50:35PM -0400, Jérôme Oufella wrote:
> > >>>>
> > >>>> I'm using PCI pass-through to map a PCIe (intel i210) controller into
> > >>>> a HVM domain. The system uses xen-pciback to hide the appropriate PCI
> > >>>> device from Dom0.
> > >>>>
> > >>>> When creating the HVM domain after a hypervisor cold boot, the HVM
> > >>>> domain can access and use the PCIe controller without problem.
> > >>>>
> > >>>> However, if the HVM domain is destroyed then restarted, it won't be
> > >>>> able to use the pass-through PCI device anymore. The PCI device is
> > >>>> seen and can be mapped, however, the interrupts will not be passed to
> > >>>> the HVM domain anymore (this is visible under a Linux guest as
> > >>>> /proc/interrupts counters remain 0). The behavior on a Windows10 guest
> > >>>> is the same.
> > >>>>
> > >>>> A few interesting hints I noticed:
> > >>>>
> > >>>> - On Dom0, 'lspci -vv' on that PCIe device between the "working" and
> > >>>> the "muted interrupts" states, I noted a difference between the
> > >>>> MSI-X caps:
> > >>>>
> > >>>> - Capabilities: [70] MSI-X: Enable- Count=5 Masked- <-- IRQs will work if domain started
> > >>>> + Capabilities: [70] MSI-X: Enable- Count=5 Masked+ <-- IRQs won't work if domain started
> > >>>> ^^^^^^^
> > >>
> > >>> IMHO it seems that either your device is not able to perform a reset
> > >>> successfully, or Linux is not correctly performing such reset. I don't
> > >>> think there's a lot that can be done from the Xen side.
> > >>
> > >> Unfortunately for a lot of pci-devices a simple reset as performed by default isn't enough,
> > >> but also almost none support a real pci FLR.
> > >>
> > >> In the distant past Konrad made a patchset that implemented a bus reset and
> > >> resetting of config space. (It piggybacked on the already existing libxl mechanism of
> > >> trying to call on a sysfs "do_flr" attribute, which triggers pciback to perform
> > >> the bus reset and rewrite of config space for the device.)
> > >>
> > >> I have used that patchset ever since for my pci-passthrough needs and it works pretty
> > >> well. I can shut down and restart VMs with pci devices passed through (also AMD
> > >> Radeon graphics cards).
> > >
> > > Just to confirm the utility of that piece of work: OpenXT also uses an
> > > extended version of that same patch to perform device reset for
> > > passthrough.
> > >
> > > I've attached a copy of that OpenXT patch to this message and it can
> > > also be obtained from our git repository:
> > > https://github.com/OpenXT/xenclient-oe/blob/f8d3b282a87231d9ae717b13d506e8e7e28c78c4/recipes-kernel/linux/4.9/patches/thorough-reset-interface-to-pciback-s-sysfs.patch
> > > This version creates a sysfs node named "reset_device" and the OpenXT
> > > libxl toolstack is patched to use that node instead of "do_flr".
> >
> > Nice to hear there are more users of this patch. On #xen on IRC there were from time to time
> > also users who tried pci-passthrough and ran into this issue (and probably abandoned the idea,
> > since having to restart your host before being able to use your passed-through device again
> > defeats much of the use case).
> >
> > > Konrad's original work encountered pushback on upstream acceptance at
> > > the time it was developed. I'm not sure I've found where that
> > > discussion ended. Is there any prospect of a more comprehensive reset
> > > mechanism being accepted into xen-pciback, or elsewhere in the kernel?
> >
> > Yeah it was nacked by David Vrabel and the discussion somewhat bled to death.
> > From what I remember the main issue was with the naming: since it doesn't do a FLR,
> > the sysfs hook shouldn't be called "do_flr".
> >
> > Some other perhaps minor issues i can think of are:
> > - No way to exempt pci-devices from this new way of resetting them.
> > Perhaps there could be pci devices/topologies where this way of
> > resetting causes more problems than it solves and could cause a
> > regression. Unfortunately auto detecting what works doesn't seem to
> > be possible. On the other hand (though only with my n=10) i haven't encountered
> > such a device yet.
> >
> > - The communication path between libxl and the kernel via sysfs.
> > I think the preference was for a:
> > a) having it use a more commonly used Xen communication channel or
> > b) having it all self-contained in pci-back. (From my memory and the openxt patch description
> > there could be some locking issue when trying to implement it this way,
> > but the vfio guys had that solved for their reset implementation, if i
> > recall correctly from one of the comments in their source code; patches by Alex Williamson
> > if i remember correctly.)
> >
> > - Not an issue back then when the patch was made, but as the question earlier to Roger,
> > the hypervisor seems to grow more interference with pci devices with the PVH dom0 work.
> > If and how does that relate to pci-back and pci-passthrough and (the location of) resetting mechanisms?
> >
> >
> > So i think David's NACK was mostly for the patchset having some hackish cosmetics.
>
> He didn't like 'do_flr' which made sense as the patchset did not do FLR. It made a bus-reset
> for more than one device (if those devices were assigned to pciback).
>
> >
> > On the upside one can conclude that this patchset is now pretty well tested over the years ;)
> >
> > Since David has left, perhaps Jurgen/Boris/Konrad could express their views (again) ?
> > (CC'ed them as well)
>
> I've asked Govinda (CC-ed) to refresh the patchset against the latest kernel and
> repost it and see where it goes.
>

Nice. Looking forward to seeing the refreshed patchset hit the mailing list! :)


Thanks,

-- Pasi

> >
> > > As noted in the original LKML threads, vfio has similar relevant pci
> > > device reset retry logic. (Thanks to Rich Persaud for this pointer:)
> > > http://elixir.free-electrons.com/linux/v4.14-rc1/source/drivers/vfio/pci/vfio_pci.c#L1353
> > >
> > > libvirt also performs similar reset logic, using a direct low level
> > > interface to config space (Thanks to Marek for this pointer, libvirt
> > > is used by Qubes:)
> > > https://github.com/libvirt/libvirt/blob/master/src/util/virpci.c#L929
> > > I think this indicates that it would be possible to extend libxl to
> > > do something similar, but that seems less satisfactory compared to
> > > performing the work in a kernel-provided implementation.
> > >
> > > Is there a way forward to providing this functionality within Xen
> > > software or Linux?
> > >
> > > Christopher
> > > --
> > >
> > > openxt.org
> > >
> >
