Mailing List Archive

[RFC 03/20] vfio: Add vfio_[un]register_device()
With /dev/vfio/devices introduced, now a vfio device driver has three
options to expose its device to userspace:

a) only legacy group interface, for devices which haven't been moved to
iommufd (e.g. platform devices, sw mdev, etc.);

b) both legacy group interface and new device-centric interface, for
devices which supports iommufd but also wants to keep backward
compatibility (e.g. pci devices in this RFC);

c) only new device-centric interface, for new devices which don't carry
backward compatibility burden (e.g. hw mdev/subdev with pasid);

This patch introduces vfio_[un]register_device() helpers for the device
drivers to specify the device exposure policy to vfio core. Hence the
existing vfio_[un]register_group_dev() become the wrapper of the new
helper functions. The new device-centric interface is described as
'nongroup' to differentiate from existing 'group' stuff.

TBD: this patch needs to rebase on top of below series from Christoph in
next version.

"cleanup vfio iommu_group creation"

Legacy userspace continues to follow the legacy group interface.

Newer userspace can first try the new device-centric interface if the
device is present under /dev/vfio/devices. Otherwise fall back to the
group interface.

One open about how to organize the device nodes under /dev/vfio/devices/.
This RFC adopts a simple policy by keeping a flat layout with mixed devname
from all kinds of devices. The prerequisite of this model is that devnames
from different bus types are unique formats:

/dev/vfio/devices/0000:00:14.2 (pci)
/dev/vfio/devices/PNP0103:00 (platform)
/dev/vfio/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 (mdev)

One alternative option is to arrange device nodes in sub-directories based
on the device type. But doing so also adds one trouble to userspace. The
current vfio uAPI is designed to have the user query device type via
VFIO_DEVICE_GET_INFO after opening the device. With this option the user
instead needs to figure out the device type before opening the device, to
identify the sub-directory. Another tricky thing is that "pdev. vs. mdev"
and "pci vs. platform vs. ccw,..." are orthogonal categorizations. Need
more thoughts on whether both or just one category should be used to define
the sub-directories.

Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
---
drivers/vfio/vfio.c | 137 +++++++++++++++++++++++++++++++++++++++----
include/linux/vfio.h | 9 +++
2 files changed, 134 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index 84436d7abedd..1e87b25962f1 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -51,6 +51,7 @@ static struct vfio {
struct cdev device_cdev;
dev_t device_devt;
struct mutex device_lock;
+ struct list_head device_list;
struct idr device_idr;
} vfio;

@@ -757,7 +758,7 @@ void vfio_init_group_dev(struct vfio_device *device, struct device *dev,
}
EXPORT_SYMBOL_GPL(vfio_init_group_dev);

-int vfio_register_group_dev(struct vfio_device *device)
+static int __vfio_register_group_dev(struct vfio_device *device)
{
struct vfio_device *existing_device;
struct iommu_group *iommu_group;
@@ -794,8 +795,13 @@ int vfio_register_group_dev(struct vfio_device *device)
/* Our reference on group is moved to the device */
device->group = group;

- /* Refcounting can't start until the driver calls register */
- refcount_set(&device->refcount, 1);
+ /*
+ * Refcounting can't start until the driver call register. Don’t
+ * start twice when the device is exposed in both group and nongroup
+ * interfaces.
+ */
+ if (!refcount_read(&device->refcount))
+ refcount_set(&device->refcount, 1);

mutex_lock(&group->device_lock);
list_add(&device->group_next, &group->device_list);
@@ -804,7 +810,78 @@ int vfio_register_group_dev(struct vfio_device *device)

return 0;
}
-EXPORT_SYMBOL_GPL(vfio_register_group_dev);
+
+static int __vfio_register_nongroup_dev(struct vfio_device *device)
+{
+ struct vfio_device *existing_device;
+ struct device *dev;
+ int ret = 0, minor;
+
+ mutex_lock(&vfio.device_lock);
+ list_for_each_entry(existing_device, &vfio.device_list, vfio_next) {
+ if (existing_device == device) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+ }
+
+ minor = idr_alloc(&vfio.device_idr, device, 0, MINORMASK + 1, GFP_KERNEL);
+ pr_debug("%s - mnior: %d\n", __func__, minor);
+ if (minor < 0) {
+ ret = minor;
+ goto out_unlock;
+ }
+
+ dev = device_create(vfio.device_class, NULL,
+ MKDEV(MAJOR(vfio.device_devt), minor),
+ device, "%s", dev_name(device->dev));
+ if (IS_ERR(dev)) {
+ idr_remove(&vfio.device_idr, minor);
+ ret = PTR_ERR(dev);
+ goto out_unlock;
+ }
+
+ /*
+ * Refcounting can't start until the driver call register. Don’t
+ * start twice when the device is exposed in both group and nongroup
+ * interfaces.
+ */
+ if (!refcount_read(&device->refcount))
+ refcount_set(&device->refcount, 1);
+
+ device->minor = minor;
+ list_add(&device->vfio_next, &vfio.device_list);
+ dev_info(device->dev, "Creates Device interface successfully!\n");
+out_unlock:
+ mutex_unlock(&vfio.device_lock);
+ return ret;
+}
+
+int vfio_register_device(struct vfio_device *device, u32 flags)
+{
+ int ret = -EINVAL;
+
+ device->minor = -1;
+ device->group = NULL;
+ atomic_set(&device->opened, 0);
+
+ if (flags & ~(VFIO_DEVNODE_GROUP | VFIO_DEVNODE_NONGROUP))
+ return ret;
+
+ if (flags & VFIO_DEVNODE_GROUP) {
+ ret = __vfio_register_group_dev(device);
+ if (ret)
+ return ret;
+ }
+
+ if (flags & VFIO_DEVNODE_NONGROUP) {
+ ret = __vfio_register_nongroup_dev(device);
+ if (ret && device->group)
+ vfio_unregister_device(device);
+ }
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vfio_register_device);

/**
* Get a reference to the vfio_device for a device. Even if the
@@ -861,13 +938,14 @@ static struct vfio_device *vfio_device_get_from_name(struct vfio_group *group,
/*
* Decrement the device reference count and wait for the device to be
* removed. Open file descriptors for the device... */
-void vfio_unregister_group_dev(struct vfio_device *device)
+void vfio_unregister_device(struct vfio_device *device)
{
struct vfio_group *group = device->group;
struct vfio_unbound_dev *unbound;
unsigned int i = 0;
bool interrupted = false;
long rc;
+ int minor = device->minor;

/*
* When the device is removed from the group, the group suddenly
@@ -878,14 +956,20 @@ void vfio_unregister_group_dev(struct vfio_device *device)
* solve this, we track such devices on the unbound_list to bridge
* the gap until they're fully unbound.
*/
- unbound = kzalloc(sizeof(*unbound), GFP_KERNEL);
- if (unbound) {
- unbound->dev = device->dev;
- mutex_lock(&group->unbound_lock);
- list_add(&unbound->unbound_next, &group->unbound_list);
- mutex_unlock(&group->unbound_lock);
+ if (group) {
+ /*
+ * If caller hasn't called vfio_register_group_dev(), this
+ * branch is not necessary.
+ */
+ unbound = kzalloc(sizeof(*unbound), GFP_KERNEL);
+ if (unbound) {
+ unbound->dev = device->dev;
+ mutex_lock(&group->unbound_lock);
+ list_add(&unbound->unbound_next, &group->unbound_list);
+ mutex_unlock(&group->unbound_lock);
+ }
+ WARN_ON(!unbound);
}
- WARN_ON(!unbound);

vfio_device_put(device);
rc = try_wait_for_completion(&device->comp);
@@ -910,6 +994,21 @@ void vfio_unregister_group_dev(struct vfio_device *device)
}
}

+ /* nongroup interface related cleanup */
+ if (minor >= 0) {
+ mutex_lock(&vfio.device_lock);
+ list_del(&device->vfio_next);
+ device->minor = -1;
+ device_destroy(vfio.device_class,
+ MKDEV(MAJOR(vfio.device_devt), minor));
+ idr_remove(&vfio.device_idr, minor);
+ mutex_unlock(&vfio.device_lock);
+ }
+
+ /* No need go further if no group. */
+ if (!group)
+ return;
+
mutex_lock(&group->device_lock);
list_del(&device->group_next);
group->dev_counter--;
@@ -935,6 +1034,18 @@ void vfio_unregister_group_dev(struct vfio_device *device)
/* Matches the get in vfio_register_group_dev() */
vfio_group_put(group);
}
+EXPORT_SYMBOL_GPL(vfio_unregister_device);
+
+int vfio_register_group_dev(struct vfio_device *device)
+{
+ return vfio_register_device(device, VFIO_DEVNODE_GROUP);
+}
+EXPORT_SYMBOL_GPL(vfio_register_group_dev);
+
+void vfio_unregister_group_dev(struct vfio_device *device)
+{
+ vfio_unregister_device(device);
+}
EXPORT_SYMBOL_GPL(vfio_unregister_group_dev);

/**
@@ -2447,6 +2558,7 @@ static int vfio_init_device_class(void)

mutex_init(&vfio.device_lock);
idr_init(&vfio.device_idr);
+ INIT_LIST_HEAD(&vfio.device_list);

/* /dev/vfio/devices/$DEVICE */
vfio.device_class = class_create(THIS_MODULE, "vfio-device");
@@ -2542,6 +2654,7 @@ static int __init vfio_init(void)
static void __exit vfio_cleanup(void)
{
WARN_ON(!list_empty(&vfio.group_list));
+ WARN_ON(!list_empty(&vfio.device_list));

#ifdef CONFIG_VFIO_NOIOMMU
vfio_unregister_iommu_driver(&vfio_noiommu_ops);
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 4a5f3f99eab2..9448b751b663 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -26,6 +26,7 @@ struct vfio_device {
struct list_head group_next;
int minor;
atomic_t opened;
+ struct list_head vfio_next;
};

/**
@@ -73,6 +74,14 @@ enum vfio_iommu_notify_type {
VFIO_IOMMU_CONTAINER_CLOSE = 0,
};

+/* The device can be opened via VFIO_GROUP_GET_DEVICE_FD */
+#define VFIO_DEVNODE_GROUP BIT(0)
+/* The device can be opened via /dev/sys/devices/${DEVICE} */
+#define VFIO_DEVNODE_NONGROUP BIT(1)
+
+extern int vfio_register_device(struct vfio_device *device, u32 flags);
+extern void vfio_unregister_device(struct vfio_device *device);
+
/**
* struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
*/
--
2.25.1
Re: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
On Sun, Sep 19, 2021 at 02:38:31PM +0800, Liu Yi L wrote:
> With /dev/vfio/devices introduced, now a vfio device driver has three
> options to expose its device to userspace:
>
> a) only legacy group interface, for devices which haven't been moved to
> iommufd (e.g. platform devices, sw mdev, etc.);
>
> b) both legacy group interface and new device-centric interface, for
> devices which supports iommufd but also wants to keep backward
> compatibility (e.g. pci devices in this RFC);
>
> c) only new device-centric interface, for new devices which don't carry
> backward compatibility burden (e.g. hw mdev/subdev with pasid);

We shouldn't have 'b'? Where does it come from?

> This patch introduces vfio_[un]register_device() helpers for the device
> drivers to specify the device exposure policy to vfio core. Hence the
> existing vfio_[un]register_group_dev() become the wrapper of the new
> helper functions. The new device-centric interface is described as
> 'nongroup' to differentiate from existing 'group' stuff.

Detect what the driver supports based on the ops it declares. There
should be a function provided through the ops for the driver to bind
to the iommufd.

> One open about how to organize the device nodes under /dev/vfio/devices/.
> This RFC adopts a simple policy by keeping a flat layout with mixed devname
> from all kinds of devices. The prerequisite of this model is that devnames
> from different bus types are unique formats:

This isn't reliable, the devname should just be vfio0, vfio1, etc

The userspace can learn the correct major/minor by inspecting the
sysfs.

This whole concept should disappear into the prior patch that adds the
struct device in the first place, and I think most of the code here
can be deleted once the struct device is used properly.

Jason
RE: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 12:01 AM
>
> On Sun, Sep 19, 2021 at 02:38:31PM +0800, Liu Yi L wrote:
> > With /dev/vfio/devices introduced, now a vfio device driver has three
> > options to expose its device to userspace:
> >
> > a) only legacy group interface, for devices which haven't been moved to
> > iommufd (e.g. platform devices, sw mdev, etc.);
> >
> > b) both legacy group interface and new device-centric interface, for
> > devices which supports iommufd but also wants to keep backward
> > compatibility (e.g. pci devices in this RFC);
> >
> > c) only new device-centric interface, for new devices which don't carry
> > backward compatibility burden (e.g. hw mdev/subdev with pasid);
>
> We shouldn't have 'b'? Where does it come from?

a vfio-pci device can be opened via the existing group interface. if no b) it
means legacy vfio userspace can never use vfio-pci device any more
once the latter is moved to iommufd.

Thanks
Kevin
Re: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
On Tue, Sep 21, 2021 at 11:10:15PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 12:01 AM
> >
> > On Sun, Sep 19, 2021 at 02:38:31PM +0800, Liu Yi L wrote:
> > > With /dev/vfio/devices introduced, now a vfio device driver has three
> > > options to expose its device to userspace:
> > >
> > > a) only legacy group interface, for devices which haven't been moved to
> > > iommufd (e.g. platform devices, sw mdev, etc.);
> > >
> > > b) both legacy group interface and new device-centric interface, for
> > > devices which supports iommufd but also wants to keep backward
> > > compatibility (e.g. pci devices in this RFC);
> > >
> > > c) only new device-centric interface, for new devices which don't carry
> > > backward compatibility burden (e.g. hw mdev/subdev with pasid);
> >
> > We shouldn't have 'b'? Where does it come from?
>
> a vfio-pci device can be opened via the existing group interface. if no b) it
> means legacy vfio userspace can never use vfio-pci device any more
> once the latter is moved to iommufd.

Sorry, I think I ment a, which I guess you will say is SW mdev devices

But even so, I think the way forward here is to still always expose
the device /dev/vfio/devices/X and some devices may not allow iommufd
usage initially.

Providing an ioctl to bind to a normal VFIO container or group might
allow a reasonable fallback in userspace..

Jason
RE: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 12:01 AM
>
> > One open about how to organize the device nodes under
> /dev/vfio/devices/.
> > This RFC adopts a simple policy by keeping a flat layout with mixed
> devname
> > from all kinds of devices. The prerequisite of this model is that devnames
> > from different bus types are unique formats:
>
> This isn't reliable, the devname should just be vfio0, vfio1, etc
>
> The userspace can learn the correct major/minor by inspecting the
> sysfs.
>
> This whole concept should disappear into the prior patch that adds the
> struct device in the first place, and I think most of the code here
> can be deleted once the struct device is used properly.
>

Can you help elaborate above flow? This is one area where we need
more guidance.

When Qemu accepts an option "-device vfio-pci,host=DDDD:BB:DD.F",
how does Qemu identify which vifo0/1/... is associated with the specified
DDDD:BB:DD.F?

Thanks
Kevin
RE: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:54 AM
>
> On Tue, Sep 21, 2021 at 11:10:15PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 12:01 AM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:31PM +0800, Liu Yi L wrote:
> > > > With /dev/vfio/devices introduced, now a vfio device driver has three
> > > > options to expose its device to userspace:
> > > >
> > > > a) only legacy group interface, for devices which haven't been moved
> to
> > > > iommufd (e.g. platform devices, sw mdev, etc.);
> > > >
> > > > b) both legacy group interface and new device-centric interface, for
> > > > devices which supports iommufd but also wants to keep backward
> > > > compatibility (e.g. pci devices in this RFC);
> > > >
> > > > c) only new device-centric interface, for new devices which don't carry
> > > > backward compatibility burden (e.g. hw mdev/subdev with pasid);
> > >
> > > We shouldn't have 'b'? Where does it come from?
> >
> > a vfio-pci device can be opened via the existing group interface. if no b) it
> > means legacy vfio userspace can never use vfio-pci device any more
> > once the latter is moved to iommufd.
>
> Sorry, I think I ment a, which I guess you will say is SW mdev devices

We listed a) here in case we don't want to move all vfio device types to
use iommufd in one breath. It's supposed to be a type valid only in this
transition phase. In the end only b) and c) are valid.

>
> But even so, I think the way forward here is to still always expose
> the device /dev/vfio/devices/X and some devices may not allow iommufd
> usage initially.
>
> Providing an ioctl to bind to a normal VFIO container or group might
> allow a reasonable fallback in userspace..
>

but doesn't a new ioctl still imply breaking existing vfio userspace?

Thanks
Kevin
Re: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
On Wed, Sep 22, 2021 at 12:54:02AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Wednesday, September 22, 2021 12:01 AM
> >
> > > One open about how to organize the device nodes under
> > /dev/vfio/devices/.
> > > This RFC adopts a simple policy by keeping a flat layout with mixed
> > devname
> > > from all kinds of devices. The prerequisite of this model is that devnames
> > > from different bus types are unique formats:
> >
> > This isn't reliable, the devname should just be vfio0, vfio1, etc
> >
> > The userspace can learn the correct major/minor by inspecting the
> > sysfs.
> >
> > This whole concept should disappear into the prior patch that adds the
> > struct device in the first place, and I think most of the code here
> > can be deleted once the struct device is used properly.
> >
>
> Can you help elaborate above flow? This is one area where we need
> more guidance.
>
> When Qemu accepts an option "-device vfio-pci,host=DDDD:BB:DD.F",
> how does Qemu identify which vifo0/1/... is associated with the specified
> DDDD:BB:DD.F?

When done properly in the kernel the file:

/sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev

Will contain the major:minor of the VFIO device.

Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
that the major:minor matches.

in the above pattern "pci" and "DDDD:BB:DD.FF" are the arguments passed
to qemu.

You can look at this for some general over engineered code to handle
opening from a sysfs handle like above:

https://github.com/linux-rdma/rdma-core/blob/master/util/open_cdev.c

Jason
RE: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 9:00 AM
>
> On Wed, Sep 22, 2021 at 12:54:02AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 12:01 AM
> > >
> > > > One open about how to organize the device nodes under
> > > /dev/vfio/devices/.
> > > > This RFC adopts a simple policy by keeping a flat layout with mixed
> > > devname
> > > > from all kinds of devices. The prerequisite of this model is that
> devnames
> > > > from different bus types are unique formats:
> > >
> > > This isn't reliable, the devname should just be vfio0, vfio1, etc
> > >
> > > The userspace can learn the correct major/minor by inspecting the
> > > sysfs.
> > >
> > > This whole concept should disappear into the prior patch that adds the
> > > struct device in the first place, and I think most of the code here
> > > can be deleted once the struct device is used properly.
> > >
> >
> > Can you help elaborate above flow? This is one area where we need
> > more guidance.
> >
> > When Qemu accepts an option "-device vfio-pci,host=DDDD:BB:DD.F",
> > how does Qemu identify which vifo0/1/... is associated with the specified
> > DDDD:BB:DD.F?
>
> When done properly in the kernel the file:
>
> /sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev
>
> Will contain the major:minor of the VFIO device.
>
> Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
> that the major:minor matches.

ah, that's the trick.

>
> in the above pattern "pci" and "DDDD:BB:DD.FF" are the arguments passed
> to qemu.
>
> You can look at this for some general over engineered code to handle
> opening from a sysfs handle like above:
>
> https://github.com/linux-rdma/rdma-core/blob/master/util/open_cdev.c
>

will check. Thanks for suggestion.
RE: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:54 AM
>
> On Tue, Sep 21, 2021 at 11:10:15PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 12:01 AM
> > >
> > > On Sun, Sep 19, 2021 at 02:38:31PM +0800, Liu Yi L wrote:
> > > > With /dev/vfio/devices introduced, now a vfio device driver has three
> > > > options to expose its device to userspace:
> > > >
> > > > a) only legacy group interface, for devices which haven't been moved
> to
> > > > iommufd (e.g. platform devices, sw mdev, etc.);
> > > >
> > > > b) both legacy group interface and new device-centric interface, for
> > > > devices which supports iommufd but also wants to keep backward
> > > > compatibility (e.g. pci devices in this RFC);
> > > >
> > > > c) only new device-centric interface, for new devices which don't carry
> > > > backward compatibility burden (e.g. hw mdev/subdev with pasid);
> > >
> > > We shouldn't have 'b'? Where does it come from?
> >
> > a vfio-pci device can be opened via the existing group interface. if no b) it
> > means legacy vfio userspace can never use vfio-pci device any more
> > once the latter is moved to iommufd.
>
> Sorry, I think I ment a, which I guess you will say is SW mdev devices
>
> But even so, I think the way forward here is to still always expose
> the device /dev/vfio/devices/X and some devices may not allow iommufd
> usage initially.

After another thought this should work. Following your comments in
other places, we'll move the handling of BIND_IOMMUFD to vfio core
which then invoke .bind_iommufd() from the driver. For devices which
don't allow iommufd now, the callback is null thus an error is returned.

This leaves the userspace in a try-and-fail mode. It first opens the device
fd and iommufd, and then try to connect the two together. If failed then
fallback to the legacy group interface.

Then we don't need a) at all. and we can even avoid introducing new
vfio_[un]register_device() at this point. Just leverage existing
vfio_[un]register_group_dev() to cover b). new helpers can be introduced
later when c) is supported.

>
> Providing an ioctl to bind to a normal VFIO container or group might
> allow a reasonable fallback in userspace..
>

I didn't get this point though. An error in binding already allows the
user to fall back to the group path. Why do we need introduce another
ioctl to explicitly bind to container via the nongroup interface?

Thanks
Kevin
Re: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
On Wed, Sep 22, 2021 at 09:23:34AM +0000, Tian, Kevin wrote:

> > Providing an ioctl to bind to a normal VFIO container or group might
> > allow a reasonable fallback in userspace..
>
> I didn't get this point though. An error in binding already allows the
> user to fall back to the group path. Why do we need introduce another
> ioctl to explicitly bind to container via the nongroup interface?

New userspace still needs a fallback path if it hits the 'try and
fail'. Keeping the device FD open and just using a different ioctl to
bind to a container/group FD, which new userspace can then obtain as a
fallback, might be OK.

Hard to see without going through the qemu parts, so maybe just keep
it in mind

Jason
RE: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, September 22, 2021 8:23 PM
>
> On Wed, Sep 22, 2021 at 09:23:34AM +0000, Tian, Kevin wrote:
>
> > > Providing an ioctl to bind to a normal VFIO container or group might
> > > allow a reasonable fallback in userspace..
> >
> > I didn't get this point though. An error in binding already allows the
> > user to fall back to the group path. Why do we need introduce another
> > ioctl to explicitly bind to container via the nongroup interface?
>
> New userspace still needs a fallback path if it hits the 'try and
> fail'. Keeping the device FD open and just using a different ioctl to
> bind to a container/group FD, which new userspace can then obtain as a
> fallback, might be OK.
>
> Hard to see without going through the qemu parts, so maybe just keep
> it in mind
>

sure. will figure it out when working on the qemu part.
Re: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
On Wed, 22 Sep 2021 09:22:52 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Sep 22, 2021 at 09:23:34AM +0000, Tian, Kevin wrote:
>
> > > Providing an ioctl to bind to a normal VFIO container or group might
> > > allow a reasonable fallback in userspace..
> >
> > I didn't get this point though. An error in binding already allows the
> > user to fall back to the group path. Why do we need introduce another
> > ioctl to explicitly bind to container via the nongroup interface?
>
> New userspace still needs a fallback path if it hits the 'try and
> fail'. Keeping the device FD open and just using a different ioctl to
> bind to a container/group FD, which new userspace can then obtain as a
> fallback, might be OK.
>
> Hard to see without going through the qemu parts, so maybe just keep
> it in mind

If we assume that the container/group/device interface is essentially
deprecated once we have iommufd, it doesn't make a lot of sense to me
to tack on a container/device interface just so userspace can avoid
reverting to the fully legacy interface.

But why would we create vfio device interface files at all if they
can't work? I'm not really on board with creating a try-and-fail
interface for a mechanism that cannot work for a given device. The
existence of the device interface should indicate that it's supported.
Thanks,

Alex
RE: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, September 23, 2021 4:11 AM
>
> On Wed, 22 Sep 2021 09:22:52 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Wed, Sep 22, 2021 at 09:23:34AM +0000, Tian, Kevin wrote:
> >
> > > > Providing an ioctl to bind to a normal VFIO container or group might
> > > > allow a reasonable fallback in userspace..
> > >
> > > I didn't get this point though. An error in binding already allows the
> > > user to fall back to the group path. Why do we need introduce another
> > > ioctl to explicitly bind to container via the nongroup interface?
> >
> > New userspace still needs a fallback path if it hits the 'try and
> > fail'. Keeping the device FD open and just using a different ioctl to
> > bind to a container/group FD, which new userspace can then obtain as a
> > fallback, might be OK.
> >
> > Hard to see without going through the qemu parts, so maybe just keep
> > it in mind
>
> If we assume that the container/group/device interface is essentially
> deprecated once we have iommufd, it doesn't make a lot of sense to me
> to tack on a container/device interface just so userspace can avoid
> reverting to the fully legacy interface.
>
> But why would we create vfio device interface files at all if they
> can't work? I'm not really on board with creating a try-and-fail
> interface for a mechanism that cannot work for a given device. The
> existence of the device interface should indicate that it's supported.
> Thanks,
>

Now it's a try-and-fail model even for devices which support iommufd.
Per Jason's suggestion, a device is always opened with a parked fops
which supports only bind. Binding serves as the contract for handling
exclusive ownership on a device and switching to normal fops if
succeed. So the user has to try-and-fail in case multiple threads attempt
to open a same device. Device which doesn't support iommufd is not
different, except binding request 100% fails (due to missing .bind_iommufd
in kernel driver).

Thanks
Kevin
Re: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
On Wed, 22 Sep 2021 22:34:42 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Thursday, September 23, 2021 4:11 AM
> >
> > On Wed, 22 Sep 2021 09:22:52 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > > On Wed, Sep 22, 2021 at 09:23:34AM +0000, Tian, Kevin wrote:
> > >
> > > > > Providing an ioctl to bind to a normal VFIO container or group might
> > > > > allow a reasonable fallback in userspace..
> > > >
> > > > I didn't get this point though. An error in binding already allows the
> > > > user to fall back to the group path. Why do we need introduce another
> > > > ioctl to explicitly bind to container via the nongroup interface?
> > >
> > > New userspace still needs a fallback path if it hits the 'try and
> > > fail'. Keeping the device FD open and just using a different ioctl to
> > > bind to a container/group FD, which new userspace can then obtain as a
> > > fallback, might be OK.
> > >
> > > Hard to see without going through the qemu parts, so maybe just keep
> > > it in mind
> >
> > If we assume that the container/group/device interface is essentially
> > deprecated once we have iommufd, it doesn't make a lot of sense to me
> > to tack on a container/device interface just so userspace can avoid
> > reverting to the fully legacy interface.
> >
> > But why would we create vfio device interface files at all if they
> > can't work? I'm not really on board with creating a try-and-fail
> > interface for a mechanism that cannot work for a given device. The
> > existence of the device interface should indicate that it's supported.
> > Thanks,
> >
>
> Now it's a try-and-fail model even for devices which support iommufd.
> Per Jason's suggestion, a device is always opened with a parked fops
> which supports only bind. Binding serves as the contract for handling
> exclusive ownership on a device and switching to normal fops if
> succeed. So the user has to try-and-fail in case multiple threads attempt
> to open a same device. Device which doesn't support iommufd is not
> different, except binding request 100% fails (due to missing .bind_iommufd
> in kernel driver).

That's a rather important difference. I don't really see how that's
comparable to the mutually exclusive nature of the legacy vs device
interface. We're not going to present a vfio device interface for SW
mdevs that can't participate in iommufd, right? Thanks,

Alex
RE: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Thursday, September 23, 2021 6:45 AM
>
> On Wed, 22 Sep 2021 22:34:42 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
> > > From: Alex Williamson <alex.williamson@redhat.com>
> > > Sent: Thursday, September 23, 2021 4:11 AM
> > >
> > > On Wed, 22 Sep 2021 09:22:52 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > > On Wed, Sep 22, 2021 at 09:23:34AM +0000, Tian, Kevin wrote:
> > > >
> > > > > > Providing an ioctl to bind to a normal VFIO container or group might
> > > > > > allow a reasonable fallback in userspace..
> > > > >
> > > > > I didn't get this point though. An error in binding already allows the
> > > > > user to fall back to the group path. Why do we need introduce
> another
> > > > > ioctl to explicitly bind to container via the nongroup interface?
> > > >
> > > > New userspace still needs a fallback path if it hits the 'try and
> > > > fail'. Keeping the device FD open and just using a different ioctl to
> > > > bind to a container/group FD, which new userspace can then obtain as
> a
> > > > fallback, might be OK.
> > > >
> > > > Hard to see without going through the qemu parts, so maybe just keep
> > > > it in mind
> > >
> > > If we assume that the container/group/device interface is essentially
> > > deprecated once we have iommufd, it doesn't make a lot of sense to me
> > > to tack on a container/device interface just so userspace can avoid
> > > reverting to the fully legacy interface.
> > >
> > > But why would we create vfio device interface files at all if they
> > > can't work? I'm not really on board with creating a try-and-fail
> > > interface for a mechanism that cannot work for a given device. The
> > > existence of the device interface should indicate that it's supported.
> > > Thanks,
> > >
> >
> > Now it's a try-and-fail model even for devices which support iommufd.
> > Per Jason's suggestion, a device is always opened with a parked fops
> > which supports only bind. Binding serves as the contract for handling
> > exclusive ownership on a device and switching to normal fops if
> > succeed. So the user has to try-and-fail in case multiple threads attempt
> > to open a same device. Device which doesn't support iommufd is not
> > different, except binding request 100% fails (due to missing .bind_iommufd
> > in kernel driver).
>
> That's a rather important difference. I don't really see how that's
> comparable to the mutually exclusive nature of the legacy vs device

I didn't get the 'comparable' part. Can you elaborate?

> interface. We're not going to present a vfio device interface for SW
> mdevs that can't participate in iommufd, right? Thanks,
>

Did you see any problem if exposing sw mdev now? Following above
explanation the try-and-fail model should still work...

btw I realized another related piece regarding to the new layout that
Jason suggested, which have sys device node include a link to the vfio
devnode:

/sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev

This for sure requires specific vfio driver support to get the link established.
if we only do it for vfio-pci in the start, then for other devices which don't
support iommufd there is no way for the user to identify the corresponding
vfio devnode even it's still exposed. Then try-and-fail model may not even
been reached for those devices.

Thanks
Kevin
Re: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
On Wed, Sep 22, 2021 at 11:45:33PM +0000, Tian, Kevin wrote:
> > From: Alex Williamson <alex.williamson@redhat.com>

> btw I realized another related piece regarding to the new layout that
> Jason suggested, which have sys device node include a link to the vfio
> devnode:
>
> /sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev
>
> This for sure requires specific vfio driver support to get the link
> established.

It doesn't. Just set the parent device of the vfio_device's struct
device to the physical struct device that vfio is already tracking -
ie the struct device providing the IOMMU. The driver core takes care
of everything else.

Jason
Re: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
On Wed, Sep 22, 2021 at 02:10:36PM -0600, Alex Williamson wrote:

> But why would we create vfio device interface files at all if they
> can't work? I'm not really on board with creating a try-and-fail
> interface for a mechanism that cannot work for a given device. The
> existence of the device interface should indicate that it's supported.

I'm a little worried about adding a struct device to vfio_device and
then making it optional.. That is a really weird situation.

I suppose you could create the sysfs presence in the struct device but
not create a cdev.

However, if we ever want to use the device fd for something else, like
querying the device driver capabilities or mode, (ie clean the
driver_api thing wrongly placed in mdev sysfs for instance), we are
blocked as the uAPI will be cdev == must support iommufd..

Jason
RE: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, September 23, 2021 7:52 AM
>
> On Wed, Sep 22, 2021 at 11:45:33PM +0000, Tian, Kevin wrote:
> > > From: Alex Williamson <alex.williamson@redhat.com>
>
> > btw I realized another related piece regarding to the new layout that
> > Jason suggested, which have sys device node include a link to the vfio
> > devnode:
> >
> > /sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev
> >
> > This for sure requires specific vfio driver support to get the link
> > established.
>
> It doesn't. Just set the parent device of the vfio_device's struct
> device to the physical struct device that vfio is already tracking -
> ie the struct device providing the IOMMU. The driver core takes care
> of everything else.
>

Thanks for correction. So it's still the same try-and-fail model for both
devices which support iommufd and which do not.
Re: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
Hi,

On 9/22/21 3:00 AM, Jason Gunthorpe wrote:
> On Wed, Sep 22, 2021 at 12:54:02AM +0000, Tian, Kevin wrote:
>>> From: Jason Gunthorpe <jgg@nvidia.com>
>>> Sent: Wednesday, September 22, 2021 12:01 AM
>>>
>>>> One open about how to organize the device nodes under
>>> /dev/vfio/devices/.
>>>> This RFC adopts a simple policy by keeping a flat layout with mixed
>>> devname
>>>> from all kinds of devices. The prerequisite of this model is that devnames
>>>> from different bus types are unique formats:
>>> This isn't reliable, the devname should just be vfio0, vfio1, etc
>>>
>>> The userspace can learn the correct major/minor by inspecting the
>>> sysfs.
>>>
>>> This whole concept should disappear into the prior patch that adds the
>>> struct device in the first place, and I think most of the code here
>>> can be deleted once the struct device is used properly.
>>>
>> Can you help elaborate above flow? This is one area where we need
>> more guidance.
>>
>> When Qemu accepts an option "-device vfio-pci,host=DDDD:BB:DD.F",
>> how does Qemu identify which vifo0/1/... is associated with the specified
>> DDDD:BB:DD.F?
> When done properly in the kernel the file:
>
> /sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev
>
> Will contain the major:minor of the VFIO device.
>
> Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
> that the major:minor matches.
>
> in the above pattern "pci" and "DDDD:BB:DD.FF" are the arguments passed
> to qemu.
I guess this would be the same for platform devices, for instance
/sys/bus/platform/devices/AMDI8001:01/vfio/vfioX/dev, right?

Thanks

Eric
>
> You can look at this for some general over engineered code to handle
> opening from a sysfs handle like above:
>
> https://github.com/linux-rdma/rdma-core/blob/master/util/open_cdev.c
>
> Jason
>
Re: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
On Thu, Sep 23, 2021 at 09:25:27AM +0200, Eric Auger wrote:
> Hi,
>
> On 9/22/21 3:00 AM, Jason Gunthorpe wrote:
> > On Wed, Sep 22, 2021 at 12:54:02AM +0000, Tian, Kevin wrote:
> >>> From: Jason Gunthorpe <jgg@nvidia.com>
> >>> Sent: Wednesday, September 22, 2021 12:01 AM
> >>>
> >>>> One open about how to organize the device nodes under
> >>> /dev/vfio/devices/.
> >>>> This RFC adopts a simple policy by keeping a flat layout with mixed
> >>> devname
> >>>> from all kinds of devices. The prerequisite of this model is that devnames
> >>>> from different bus types are unique formats:
> >>> This isn't reliable, the devname should just be vfio0, vfio1, etc
> >>>
> >>> The userspace can learn the correct major/minor by inspecting the
> >>> sysfs.
> >>>
> >>> This whole concept should disappear into the prior patch that adds the
> >>> struct device in the first place, and I think most of the code here
> >>> can be deleted once the struct device is used properly.
> >>>
> >> Can you help elaborate above flow? This is one area where we need
> >> more guidance.
> >>
> >> When Qemu accepts an option "-device vfio-pci,host=DDDD:BB:DD.F",
> >> how does Qemu identify which vifo0/1/... is associated with the specified
> >> DDDD:BB:DD.F?
> > When done properly in the kernel the file:
> >
> > /sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev
> >
> > Will contain the major:minor of the VFIO device.
> >
> > Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
> > that the major:minor matches.
> >
> > in the above pattern "pci" and "DDDD:BB:DD.FF" are the arguments passed
> > to qemu.
> I guess this would be the same for platform devices, for instance
> /sys/bus/platform/devices/AMDI8001:01/vfio/vfioX/dev, right?

Yes, it is the general driver core pattern for creating cdevs below a
parent device

Jason
Re: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
On Sun, Sep 19, 2021 at 02:38:31PM +0800, Liu Yi L wrote:
> With /dev/vfio/devices introduced, now a vfio device driver has three
> options to expose its device to userspace:
>
> a) only legacy group interface, for devices which haven't been moved to
> iommufd (e.g. platform devices, sw mdev, etc.);
>
> b) both legacy group interface and new device-centric interface, for
> devices which supports iommufd but also wants to keep backward
> compatibility (e.g. pci devices in this RFC);
>
> c) only new device-centric interface, for new devices which don't carry
> backward compatibility burden (e.g. hw mdev/subdev with pasid);
>
> This patch introduces vfio_[un]register_device() helpers for the device
> drivers to specify the device exposure policy to vfio core. Hence the
> existing vfio_[un]register_group_dev() become the wrapper of the new
> helper functions. The new device-centric interface is described as
> 'nongroup' to differentiate from existing 'group' stuff.
>
> TBD: this patch needs to rebase on top of below series from Christoph in
> next version.
>
> "cleanup vfio iommu_group creation"
>
> Legacy userspace continues to follow the legacy group interface.
>
> Newer userspace can first try the new device-centric interface if the
> device is present under /dev/vfio/devices. Otherwise fall back to the
> group interface.
>
> One open about how to organize the device nodes under /dev/vfio/devices/.
> This RFC adopts a simple policy by keeping a flat layout with mixed devname
> from all kinds of devices. The prerequisite of this model is that devnames
> from different bus types are unique formats:
>
> /dev/vfio/devices/0000:00:14.2 (pci)
> /dev/vfio/devices/PNP0103:00 (platform)
> /dev/vfio/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 (mdev)

Oof. I really don't think this is a good idea. Ensuring that a
format is "unique" in the sense that it can't collide with any of the
other formats, for *every* value of the parameters on both sides is
actually pretty complicated in general.

I think per-type sub-directories would be helpful here, Jason's
suggestion of just sequential numbers would work as well.

> One alternative option is to arrange device nodes in sub-directories based
> on the device type. But doing so also adds one trouble to userspace. The
> current vfio uAPI is designed to have the user query device type via
> VFIO_DEVICE_GET_INFO after opening the device. With this option the user
> instead needs to figure out the device type before opening the device, to
> identify the sub-directory.

Wouldn't this be up to the operator / configuration, rather than the
actual software though? I would assume that typically the VFIO
program would be pointed at a specific vfio device node file to use,
e.g.
my-vfio-prog -d /dev/vfio/pci/0000:0a:03.1

Or more generally, if you're expecting userspace to know a name in a
uniqu pattern, they can equally well know a "type/name" pair.

> Another tricky thing is that "pdev. vs. mdev"
> and "pci vs. platform vs. ccw,..." are orthogonal categorizations. Need
> more thoughts on whether both or just one category should be used to define
> the sub-directories.
>
> Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
> ---
> drivers/vfio/vfio.c | 137 +++++++++++++++++++++++++++++++++++++++----
> include/linux/vfio.h | 9 +++
> 2 files changed, 134 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
> index 84436d7abedd..1e87b25962f1 100644
> --- a/drivers/vfio/vfio.c
> +++ b/drivers/vfio/vfio.c
> @@ -51,6 +51,7 @@ static struct vfio {
> struct cdev device_cdev;
> dev_t device_devt;
> struct mutex device_lock;
> + struct list_head device_list;
> struct idr device_idr;
> } vfio;
>
> @@ -757,7 +758,7 @@ void vfio_init_group_dev(struct vfio_device *device, struct device *dev,
> }
> EXPORT_SYMBOL_GPL(vfio_init_group_dev);
>
> -int vfio_register_group_dev(struct vfio_device *device)
> +static int __vfio_register_group_dev(struct vfio_device *device)
> {
> struct vfio_device *existing_device;
> struct iommu_group *iommu_group;
> @@ -794,8 +795,13 @@ int vfio_register_group_dev(struct vfio_device *device)
> /* Our reference on group is moved to the device */
> device->group = group;
>
> - /* Refcounting can't start until the driver calls register */
> - refcount_set(&device->refcount, 1);
> + /*
> + * Refcounting can't start until the driver call register. Don’t
> + * start twice when the device is exposed in both group and nongroup
> + * interfaces.
> + */
> + if (!refcount_read(&device->refcount))

Is there a possible race here with something getting in and
incrementing the refcount between the read and set?

> + refcount_set(&device->refcount, 1);
>
> mutex_lock(&group->device_lock);
> list_add(&device->group_next, &group->device_list);
> @@ -804,7 +810,78 @@ int vfio_register_group_dev(struct vfio_device *device)
>
> return 0;
> }
> -EXPORT_SYMBOL_GPL(vfio_register_group_dev);
> +
> +static int __vfio_register_nongroup_dev(struct vfio_device *device)
> +{
> + struct vfio_device *existing_device;
> + struct device *dev;
> + int ret = 0, minor;
> +
> + mutex_lock(&vfio.device_lock);
> + list_for_each_entry(existing_device, &vfio.device_list, vfio_next) {
> + if (existing_device == device) {
> + ret = -EBUSY;
> + goto out_unlock;

This indicates a bug in the caller, doesn't it? Should it be a BUG or
WARN instead?

> + }
> + }
> +
> + minor = idr_alloc(&vfio.device_idr, device, 0, MINORMASK + 1, GFP_KERNEL);
> + pr_debug("%s - mnior: %d\n", __func__, minor);
> + if (minor < 0) {
> + ret = minor;
> + goto out_unlock;
> + }
> +
> + dev = device_create(vfio.device_class, NULL,
> + MKDEV(MAJOR(vfio.device_devt), minor),
> + device, "%s", dev_name(device->dev));
> + if (IS_ERR(dev)) {
> + idr_remove(&vfio.device_idr, minor);
> + ret = PTR_ERR(dev);
> + goto out_unlock;
> + }
> +
> + /*
> + * Refcounting can't start until the driver call register. Don’t
> + * start twice when the device is exposed in both group and nongroup
> + * interfaces.
> + */
> + if (!refcount_read(&device->refcount))
> + refcount_set(&device->refcount, 1);
> +
> + device->minor = minor;
> + list_add(&device->vfio_next, &vfio.device_list);
> + dev_info(device->dev, "Creates Device interface successfully!\n");
> +out_unlock:
> + mutex_unlock(&vfio.device_lock);
> + return ret;
> +}
> +
> +int vfio_register_device(struct vfio_device *device, u32 flags)
> +{
> + int ret = -EINVAL;
> +
> + device->minor = -1;
> + device->group = NULL;
> + atomic_set(&device->opened, 0);
> +
> + if (flags & ~(VFIO_DEVNODE_GROUP | VFIO_DEVNODE_NONGROUP))
> + return ret;
> +
> + if (flags & VFIO_DEVNODE_GROUP) {
> + ret = __vfio_register_group_dev(device);
> + if (ret)
> + return ret;
> + }
> +
> + if (flags & VFIO_DEVNODE_NONGROUP) {
> + ret = __vfio_register_nongroup_dev(device);
> + if (ret && device->group)
> + vfio_unregister_device(device);
> + }
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_register_device);
>
> /**
> * Get a reference to the vfio_device for a device. Even if the
> @@ -861,13 +938,14 @@ static struct vfio_device *vfio_device_get_from_name(struct vfio_group *group,
> /*
> * Decrement the device reference count and wait for the device to be
> * removed. Open file descriptors for the device... */
> -void vfio_unregister_group_dev(struct vfio_device *device)
> +void vfio_unregister_device(struct vfio_device *device)
> {
> struct vfio_group *group = device->group;
> struct vfio_unbound_dev *unbound;
> unsigned int i = 0;
> bool interrupted = false;
> long rc;
> + int minor = device->minor;
>
> /*
> * When the device is removed from the group, the group suddenly
> @@ -878,14 +956,20 @@ void vfio_unregister_group_dev(struct vfio_device *device)
> * solve this, we track such devices on the unbound_list to bridge
> * the gap until they're fully unbound.
> */
> - unbound = kzalloc(sizeof(*unbound), GFP_KERNEL);
> - if (unbound) {
> - unbound->dev = device->dev;
> - mutex_lock(&group->unbound_lock);
> - list_add(&unbound->unbound_next, &group->unbound_list);
> - mutex_unlock(&group->unbound_lock);
> + if (group) {
> + /*
> + * If caller hasn't called vfio_register_group_dev(), this
> + * branch is not necessary.
> + */
> + unbound = kzalloc(sizeof(*unbound), GFP_KERNEL);
> + if (unbound) {
> + unbound->dev = device->dev;
> + mutex_lock(&group->unbound_lock);
> + list_add(&unbound->unbound_next, &group->unbound_list);
> + mutex_unlock(&group->unbound_lock);
> + }
> + WARN_ON(!unbound);
> }
> - WARN_ON(!unbound);
>
> vfio_device_put(device);
> rc = try_wait_for_completion(&device->comp);
> @@ -910,6 +994,21 @@ void vfio_unregister_group_dev(struct vfio_device *device)
> }
> }
>
> + /* nongroup interface related cleanup */
> + if (minor >= 0) {
> + mutex_lock(&vfio.device_lock);
> + list_del(&device->vfio_next);
> + device->minor = -1;
> + device_destroy(vfio.device_class,
> + MKDEV(MAJOR(vfio.device_devt), minor));
> + idr_remove(&vfio.device_idr, minor);
> + mutex_unlock(&vfio.device_lock);
> + }
> +
> + /* No need go further if no group. */
> + if (!group)
> + return;
> +
> mutex_lock(&group->device_lock);
> list_del(&device->group_next);
> group->dev_counter--;
> @@ -935,6 +1034,18 @@ void vfio_unregister_group_dev(struct vfio_device *device)
> /* Matches the get in vfio_register_group_dev() */
> vfio_group_put(group);
> }
> +EXPORT_SYMBOL_GPL(vfio_unregister_device);
> +
> +int vfio_register_group_dev(struct vfio_device *device)
> +{
> + return vfio_register_device(device, VFIO_DEVNODE_GROUP);
> +}
> +EXPORT_SYMBOL_GPL(vfio_register_group_dev);
> +
> +void vfio_unregister_group_dev(struct vfio_device *device)
> +{
> + vfio_unregister_device(device);
> +}
> EXPORT_SYMBOL_GPL(vfio_unregister_group_dev);
>
> /**
> @@ -2447,6 +2558,7 @@ static int vfio_init_device_class(void)
>
> mutex_init(&vfio.device_lock);
> idr_init(&vfio.device_idr);
> + INIT_LIST_HEAD(&vfio.device_list);
>
> /* /dev/vfio/devices/$DEVICE */
> vfio.device_class = class_create(THIS_MODULE, "vfio-device");
> @@ -2542,6 +2654,7 @@ static int __init vfio_init(void)
> static void __exit vfio_cleanup(void)
> {
> WARN_ON(!list_empty(&vfio.group_list));
> + WARN_ON(!list_empty(&vfio.device_list));
>
> #ifdef CONFIG_VFIO_NOIOMMU
> vfio_unregister_iommu_driver(&vfio_noiommu_ops);
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 4a5f3f99eab2..9448b751b663 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -26,6 +26,7 @@ struct vfio_device {
> struct list_head group_next;
> int minor;
> atomic_t opened;
> + struct list_head vfio_next;
> };
>
> /**
> @@ -73,6 +74,14 @@ enum vfio_iommu_notify_type {
> VFIO_IOMMU_CONTAINER_CLOSE = 0,
> };
>
> +/* The device can be opened via VFIO_GROUP_GET_DEVICE_FD */
> +#define VFIO_DEVNODE_GROUP BIT(0)
> +/* The device can be opened via /dev/sys/devices/${DEVICE} */
> +#define VFIO_DEVNODE_NONGROUP BIT(1)
> +
> +extern int vfio_register_device(struct vfio_device *device, u32 flags);
> +extern void vfio_unregister_device(struct vfio_device *device);
> +
> /**
> * struct vfio_iommu_driver_ops - VFIO IOMMU driver callbacks
> */

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
On Tue, Sep 21, 2021 at 10:00:14PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 22, 2021 at 12:54:02AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Wednesday, September 22, 2021 12:01 AM
> > >
> > > > One open about how to organize the device nodes under
> > > /dev/vfio/devices/.
> > > > This RFC adopts a simple policy by keeping a flat layout with mixed
> > > devname
> > > > from all kinds of devices. The prerequisite of this model is that devnames
> > > > from different bus types are unique formats:
> > >
> > > This isn't reliable, the devname should just be vfio0, vfio1, etc
> > >
> > > The userspace can learn the correct major/minor by inspecting the
> > > sysfs.
> > >
> > > This whole concept should disappear into the prior patch that adds the
> > > struct device in the first place, and I think most of the code here
> > > can be deleted once the struct device is used properly.
> > >
> >
> > Can you help elaborate above flow? This is one area where we need
> > more guidance.
> >
> > When Qemu accepts an option "-device vfio-pci,host=DDDD:BB:DD.F",
> > how does Qemu identify which vifo0/1/... is associated with the specified
> > DDDD:BB:DD.F?
>
> When done properly in the kernel the file:
>
> /sys/bus/pci/devices/DDDD:BB:DD.F/vfio/vfioX/dev
>
> Will contain the major:minor of the VFIO device.
>
> Userspace then opens the /dev/vfio/devices/vfioX and checks with fstat
> that the major:minor matches.
>
> in the above pattern "pci" and "DDDD:BB:DD.FF" are the arguments passed
> to qemu.

I thought part of the appeal of the device centric model was less
grovelling around in sysfs for information. Using type/address
directly in /dev seems simpler than having to dig around matching
things here.

Note that this doesn't have to be done in kernel: you could have the
kernel just call them /dev/vfio/devices/vfio0, ... but add udev rules
that create symlinks from say /dev/vfio/pci/DDDD:BB:SS.F - >
../devices/vfioXX based on the sysfs information.

>
> You can look at this for some general over engineered code to handle
> opening from a sysfs handle like above:
>
> https://github.com/linux-rdma/rdma-core/blob/master/util/open_cdev.c
>
> Jason
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
RE: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
> From: David Gibson <david@gibson.dropbear.id.au>
> Sent: Wednesday, September 29, 2021 10:44 AM
>
> >
> > One open about how to organize the device nodes under
> /dev/vfio/devices/.
> > This RFC adopts a simple policy by keeping a flat layout with mixed
> devname
> > from all kinds of devices. The prerequisite of this model is that devnames
> > from different bus types are unique formats:
> >
> > /dev/vfio/devices/0000:00:14.2 (pci)
> > /dev/vfio/devices/PNP0103:00 (platform)
> > /dev/vfio/devices/83b8f4f2-509f-382f-3c1e-e6bfe0fa1001 (mdev)
>
> Oof. I really don't think this is a good idea. Ensuring that a
> format is "unique" in the sense that it can't collide with any of the
> other formats, for *every* value of the parameters on both sides is
> actually pretty complicated in general.
>
> I think per-type sub-directories would be helpful here, Jason's
> suggestion of just sequential numbers would work as well.

we'll follow Jason's suggestion in next version.

> > + /*
> > + * Refcounting can't start until the driver call register. Don’t
> > + * start twice when the device is exposed in both group and
> nongroup
> > + * interfaces.
> > + */
> > + if (!refcount_read(&device->refcount))
>
> Is there a possible race here with something getting in and
> incrementing the refcount between the read and set?

this will not be required in next version, which will always create
both group and nongroup interfaces for every device (then let
driver providing .bind_iommufd() callback for whether nongroup
interface is functional). It will be centrally processed within
existing vfio_[un]register_group_dev(), thus above race is not
a concern any more.

>
> > + refcount_set(&device->refcount, 1);
> >
> > mutex_lock(&group->device_lock);
> > list_add(&device->group_next, &group->device_list);
> > @@ -804,7 +810,78 @@ int vfio_register_group_dev(struct vfio_device
> *device)
> >
> > return 0;
> > }
> > -EXPORT_SYMBOL_GPL(vfio_register_group_dev);
> > +
> > +static int __vfio_register_nongroup_dev(struct vfio_device *device)
> > +{
> > + struct vfio_device *existing_device;
> > + struct device *dev;
> > + int ret = 0, minor;
> > +
> > + mutex_lock(&vfio.device_lock);
> > + list_for_each_entry(existing_device, &vfio.device_list, vfio_next) {
> > + if (existing_device == device) {
> > + ret = -EBUSY;
> > + goto out_unlock;
>
> This indicates a bug in the caller, doesn't it? Should it be a BUG or
> WARN instead?

this call is initiated by userspace. Per Jason's suggestion we don't
even need to check it then no lock is required.

Thanks
Kevin
RE: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
> From: David Gibson <david@gibson.dropbear.id.au>
> Sent: Wednesday, September 29, 2021 10:44 AM
>
> > One alternative option is to arrange device nodes in sub-directories based
> > on the device type. But doing so also adds one trouble to userspace. The
> > current vfio uAPI is designed to have the user query device type via
> > VFIO_DEVICE_GET_INFO after opening the device. With this option the user
> > instead needs to figure out the device type before opening the device, to
> > identify the sub-directory.
>
> Wouldn't this be up to the operator / configuration, rather than the
> actual software though? I would assume that typically the VFIO
> program would be pointed at a specific vfio device node file to use,
> e.g.
> my-vfio-prog -d /dev/vfio/pci/0000:0a:03.1
>
> Or more generally, if you're expecting userspace to know a name in a
> uniqu pattern, they can equally well know a "type/name" pair.
>

You are correct. Currently:

-device, vfio-pci,host=DDDD:BB:DD.F
-device, vfio-pci,sysfdev=/sys/bus/pci/devices/ DDDD:BB:DD.F
-device, vfio-platform,sysdev=/sys/bus/platform/devices/PNP0103:00

above is definitely type/name information to find the related node.

Actually even for Jason's proposal we still need such information to
identify the sysfs path.

Then I feel type-based sub-directory does work. Adding another link
to sysfs sounds unnecessary now. But I'm not sure whether we still
want to create /dev/vfio/devices/vfio0 thing and related udev rule
thing that you pointed out in another mail.

Jason?

Thanks
Kevin
RE: [RFC 03/20] vfio: Add vfio_[un]register_device() [ In reply to ]
On Wed, Sep 29 2021, "Tian, Kevin" <kevin.tian@intel.com> wrote:

>> From: David Gibson <david@gibson.dropbear.id.au>
>> Sent: Wednesday, September 29, 2021 10:44 AM
>>
>> > One alternative option is to arrange device nodes in sub-directories based
>> > on the device type. But doing so also adds one trouble to userspace. The
>> > current vfio uAPI is designed to have the user query device type via
>> > VFIO_DEVICE_GET_INFO after opening the device. With this option the user
>> > instead needs to figure out the device type before opening the device, to
>> > identify the sub-directory.
>>
>> Wouldn't this be up to the operator / configuration, rather than the
>> actual software though? I would assume that typically the VFIO
>> program would be pointed at a specific vfio device node file to use,
>> e.g.
>> my-vfio-prog -d /dev/vfio/pci/0000:0a:03.1
>>
>> Or more generally, if you're expecting userspace to know a name in a
>> uniqu pattern, they can equally well know a "type/name" pair.
>>
>
> You are correct. Currently:
>
> -device, vfio-pci,host=DDDD:BB:DD.F
> -device, vfio-pci,sysfdev=/sys/bus/pci/devices/ DDDD:BB:DD.F
> -device, vfio-platform,sysdev=/sys/bus/platform/devices/PNP0103:00
>
> above is definitely type/name information to find the related node.
>
> Actually even for Jason's proposal we still need such information to
> identify the sysfs path.
>
> Then I feel type-based sub-directory does work. Adding another link
> to sysfs sounds unnecessary now. But I'm not sure whether we still
> want to create /dev/vfio/devices/vfio0 thing and related udev rule
> thing that you pointed out in another mail.

Still reading through this whole thread, but type-based subdirectories
also make the most sense to me. I don't really see userspace wanting to
grab just any device and then figure out whether it is the device it was
looking for, but rather immediately go to a specific device or at least
a device of a specific type.

Sequentially-numbered devices tend to become really unwieldy in my
experience if you are working on a system with loads of devices.

1 2  View All