Mailing List Archive

[patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]()
Provide a new interface which allows expanding the MSI-X vector space if
the underlying irq domain implementation supports it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
drivers/pci/msi/msi.c | 41 +++++++++++++++++++++++++++++++++++++++++
include/linux/pci.h | 13 +++++++++++++
2 files changed, 54 insertions(+)

--- a/drivers/pci/msi/msi.c
+++ b/drivers/pci/msi/msi.c
@@ -1025,6 +1025,47 @@ int pci_alloc_irq_vectors_affinity(struc
EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity);

/**
+ * pci_msix_expand_vectors_at - Expand MSI-X interrupts for a device
+ *
+ * @dev: PCI device to operate on
+ * @at: Allocate at MSI-X index. If @at == PCI_MSIX_EXPAND_AUTO
+ * the function expands automatically after the last
+ * active index.
+ * @nvec: Number of vectors to allocate
+ *
+ * Expand the MSI-X vectors of a device after an initial enablement and
+ * allocation.
+ *
+ * Return: 0 if the allocation was successful, an error code otherwise.
+ */
+int pci_msix_expand_vectors_at(struct pci_dev *dev, unsigned int at, unsigned int nvec)
+{
+ struct msi_device_data *md = dev->dev.msi.data;
+ struct msi_range range = { .ndesc = nvec, };
+ unsigned int max_vecs;
+ int ret;
+
+ if (!pci_msi_enable || !dev || !dev->msix_enabled || !md)
+ return -ENOTSUPP;
+
+ if (!pci_msi_domain_supports_expand(dev))
+ return -ENOTSUPP;
+
+ max_vecs = pci_msix_vec_count(dev);
+ if (!nvec || nvec > max_vecs)
+ return -EINVAL;
+
+ range.first = at == PCI_MSIX_EXPAND_AUTO ? md->num_descs : at;
+
+ if (range.first >= max_vecs || nvec > max_vecs - range.first)
+ return -ENOSPC;
+
+ ret = msix_setup_interrupts(dev, dev->msix_base, &range, NULL, NULL, true);
+ return ret <= 0 ? ret : -ENOSPC;
+}
+EXPORT_SYMBOL_GPL(pci_msix_expand_vectors_at);
+
+/**
* pci_free_irq_vectors - free previously allocated IRQs for a device
* @dev: PCI device to operate on
*
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1534,6 +1534,7 @@ static inline int pci_enable_msix_exact(
int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
unsigned int max_vecs, unsigned int flags,
struct irq_affinity *affd);
+int pci_msix_expand_vectors_at(struct pci_dev *dev, unsigned int at, unsigned int nvec);

void pci_free_irq_vectors(struct pci_dev *dev);
int pci_irq_vector(struct pci_dev *dev, unsigned int nr);
@@ -1565,6 +1566,11 @@ pci_alloc_irq_vectors_affinity(struct pc
return -ENOSPC;
}

+static inline int pci_msix_expand_vectors_at(struct pci_dev *dev, unsigned int at, unsigned int nvec)
+{
+ return -ENOTSUPP;
+}
+
static inline void pci_free_irq_vectors(struct pci_dev *dev)
{
}
@@ -1582,6 +1588,13 @@ static inline const struct cpumask *pci_
}
#endif

+#define PCI_MSIX_EXPAND_AUTO (UINT_MAX)
+
+static inline int pci_msix_expand_vectors(struct pci_dev *dev, unsigned int nvec)
+{
+ return pci_msix_expand_vectors_at(dev, PCI_MSIX_EXPAND_AUTO, nvec);
+}
+
/**
* pci_irqd_intx_xlate() - Translate PCI INTx value to an IRQ domain hwirq
* @d: the INTx IRQ domain
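
For illustration, a minimal sketch of how a driver might use the proposed
interface if it is merged as posted above. The foo_* names, the chosen
indices and the vector counts are invented; only pci_alloc_irq_vectors(),
pci_irq_vector() and request_irq() are existing kernel interfaces.

#include <linux/interrupt.h>
#include <linux/pci.h>

/* Placeholder interrupt handler for the sketch. */
static irqreturn_t foo_queue_handler(int irq, void *data)
{
	return IRQ_HANDLED;
}

static int foo_setup_irqs(struct pci_dev *pdev)
{
	int ret, irq;

	/* Initial MSI-X enablement with a single vector (index 0) */
	ret = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_MSIX);
	if (ret < 0)
		return ret;

	/* Later: grow by one vector after the last active index ... */
	ret = pci_msix_expand_vectors(pdev, 1);
	if (ret)
		return ret;

	/* ... or allocate one vector at a fixed, device defined index */
	ret = pci_msix_expand_vectors_at(pdev, 4, 1);
	if (ret)
		return ret;

	/* The MSI-X index is what maps to the Linux interrupt number */
	irq = pci_irq_vector(pdev, 4);
	if (irq < 0)
		return irq;

	return request_irq(irq, foo_queue_handler, 0, "foo-queue", pdev);
}
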
Re: [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]()
Hi Thomas,
On 11/26/2021 5:25 PM, Thomas Gleixner wrote:
> Provide a new interface which allows to expand the MSI-X vector space if
> the underlying irq domain implementation supports it.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> drivers/pci/msi/msi.c | 41 +++++++++++++++++++++++++++++++++++++++++
> include/linux/pci.h | 13 +++++++++++++
> 2 files changed, 54 insertions(+)
>
> --- a/drivers/pci/msi/msi.c
> +++ b/drivers/pci/msi/msi.c
> @@ -1025,6 +1025,47 @@ int pci_alloc_irq_vectors_affinity(struc
> EXPORT_SYMBOL(pci_alloc_irq_vectors_affinity);
>
> /**
> + * pci_msix_expand_vectors_at - Expand MSI-X interrupts for a device
> + *
> + * @dev: PCI device to operate on
> + * @at: Allocate at MSI-X index. If @at == PCI_MSI_EXPAND_AUTO
> + * the function expands automatically after the last
Not sure why some of these changes related to PCI_MSIX_EXPAND_AUTO and
num_descs did not make it to the 'msi' branch.
Is this intentional?
> + * active index.
> + * @nvec: Number of vectors to allocate
> + *
> + * Expand the MSI-X vectors of a device after an initial enablement and
> + * allocation.
> + *
> + * Return: 0 if the allocation was successful, an error code otherwise.
> + */
> +int pci_msix_expand_vectors_at(struct pci_dev *dev, unsigned int at, unsigned int nvec)
> +{
> + struct msi_device_data *md = dev->dev.msi.data;
> + struct msi_range range = { .ndesc = nvec, };
> + unsigned int max_vecs;
> + int ret;
> +
> + if (!pci_msi_enable || !dev || !dev->msix_enabled || !md)
> + return -ENOTSUPP;
> +
> + if (!pci_msi_domain_supports_expand(dev))
> + return -ENOTSUPP;
> +
> + max_vecs = pci_msix_vec_count(dev);
> + if (!nvec || nvec > max_vecs)
> + return -EINVAL;
> +
> + range.first = at == PCI_MSIX_EXPAND_AUTO ? md->num_descs : at;
> +
> + if (range.first >= max_vecs || nvec > max_vecs - range.first)
> + return -ENOSPC;
> +
> + ret = msix_setup_interrupts(dev, dev->msix_base, &range, NULL, NULL, true);
> + return ret <= 0 ? ret : -ENOSPC;;
> +}
> +EXPORT_SYMBOL_GPL(pci_msix_expand_vectors_at);
> +
I am having trouble fully comprehending how this expansion scheme would
work..

For instance, say:
1. Driver requests for 5 vectors:
pci_enable_msix_range(dev, NULL, 5, 5)
=>num_descs = 5

2. Driver frees vectors at index 1,2:
range = {1, 2, 2};
pci_msi_teardown_msi_irqs(dev, range)
=>num_descs = 3; Current active vectors are at index: 0, 3, 4

3. Driver requests for 3 more vectors using the new API:
pci_msix_expand_vectors(dev, 3)
=>range.first = 3 => It will try to allocate index 3-5, but we already
have 3,4 active?
Ideally, we would want index 1,2 and 5 to be allocated for this request
right?

Could you please let me know what I am missing?

With the 'range' approach, the issue is that we are trying to allocate
contiguous indexes. Perhaps we also need to check whether all the indexes in
the requested range are available and, if not, find a contiguous range
large enough to accommodate the request.
But there will be fragmentation issues if we go this way...

I had a version of the dynamic MSI-X patch series (which never got sent
out). For the expansion, I had the following:
pci_add_msix_irq_vector(pdev): On each invocation, add 1 MSI-X vector to
the device and return the MSI-X index assigned by the kernel (using a
bitmap).
Correspondingly, pci_free_msix_irq_vector(pdev, irq) frees all the
allocated resources associated with the MSI-X interrupt with Linux IRQ
number 'irq'.
I had issues when trying to dynamically allocate more than 1 interrupt
because I didn't have a clean way to communicate to the driver which
indexes were assigned in the current allocation.
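
For illustration only, a rough sketch of the kind of bitmap-based index
allocator described above - this is not the code from that unsent series,
and the foo_* names and table size are made up.

#include <linux/bitmap.h>
#include <linux/bitops.h>
#include <linux/errno.h>

#define FOO_MSIX_MAX	64

struct foo_msix_state {
	DECLARE_BITMAP(used, FOO_MSIX_MAX);
};

/* Hand out the lowest free MSI-X index and report it to the caller. */
static int foo_msix_alloc_index(struct foo_msix_state *st)
{
	unsigned int idx = find_first_zero_bit(st->used, FOO_MSIX_MAX);

	if (idx >= FOO_MSIX_MAX)
		return -ENOSPC;
	set_bit(idx, st->used);
	return idx;
}

/* Release an index previously handed out by foo_msix_alloc_index(). */
static void foo_msix_free_index(struct foo_msix_state *st, unsigned int idx)
{
	clear_bit(idx, st->used);
}
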

-Megha
Re: [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]()
Megha,

On Wed, Dec 01 2021 at 17:08, Megha Dey wrote:
> On 11/26/2021 5:25 PM, Thomas Gleixner wrote:
>> /**
>> + * pci_msix_expand_vectors_at - Expand MSI-X interrupts for a device
>> + *
>> + * @dev: PCI device to operate on
>> + * @at: Allocate at MSI-X index. If @at == PCI_MSI_EXPAND_AUTO
>> + * the function expands automatically after the last
> Not sure why some of these changes related to PCI_MSIX_EXPAND_AUTO and
> num_descs did not make it to the 'msi' branch.
> Is this intentional?

Yes, because I'm not happy about that magic.

>
> For instance, say:
> 1. Driver requests for 5 vectors:
> pci_enable_msix_range(dev, NULL, 5, 5)
> =>num_descs = 5

Drivers should not use pci_enable_msix_range() in the first
place. But yes, it got 5 vectors now.

> 2. Driver frees vectors at index 1,2:
> range = {1, 2, 2};
> pci_msi_teardown_msi_irqs(dev, range)

That function is not accessible by drivers for a reason.

> =>num_descs = 3; Current active vectors are at index: 0, 3, 4

> 3. Driver requests for 3 more vectors using the new API:
> pci_msix_expand_vectors(dev, 3)
> =>range.first = 3 => It will try to allocate index 3-5, but we already
> have 3,4 active?
> Ideally, we would want index 1,2 and 5 to be allocated for this request
> right?
>
> Could you please let me know what I am missing?

You're missing the real world use case. The above is fiction.

If a driver were to release 1 and 2 then it should explicitly reallocate
1 and 2 and not let the core decide to magically allocate something.

If the driver wants three more after freeing 1, 2 then the core could
just allocate 5, 6, 7, and would still fulfil the caller's request to
allocate three more, right?

And even if it just allocates one, then the caller still has to know the
index upfront. Why? Because it needs to know it in order to get the
Linux interrupt number via pci_irq_vector().

> Correspondingly, pci_free_msix_irq_vector(pdev, irq) frees all the
> allocated resources associated with MSI-X interrupt with Linux IRQ
> number 'irq'.
> I had issues when trying to dynamically allocate more than 1 interrupt
> because I didn't have a clean way to communicate to the driver what
> indexes were assigned in the current allocation.

If the driver is able to free a particular vector then it should exactly
know what it is doing and which index it is freeing. If it needs that
particular vector again, then it knows the index, right?

Let's look how MSI-X works in reality:

Each vector is associated to a particular function in the device. How
that association works is device dependent.

Some devices have hardwired associations, some allow the driver to
program the association in the device configuration and there is also a
combination of both.

So if the driver would free the vector for a particular functionality,
or not allocate it in the first place, then it exactly knows what it
freed and what it needs to allocate when it needs that functionality
(again).

What you are trying to create is a solution in search of a problem. You
cannot declare via a random allocation API how devices work. Nor can you
fix the VFIO issue in a sensible way.

VFIO starts with vector #0 allocated. The guest then unmasks vector #50.

With your magic interface VFIO has to allocate 49 vectors and then free
48 of them again or just keep 48 around for nothing which defeats the
purpose of on demand allocation completely.
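
As a sketch of that difference (illustrative only; the function names
follow the patch at the top of this thread and are not an established
kernel API):

#include <linux/pci.h>

/*
 * The guest unmasks table entry 50 while only entry 0 has been
 * allocated so far. Not real VFIO code.
 */
static int vfio_unmask_entry50_sketch(struct pci_dev *pdev)
{
	/*
	 * Append-only expansion: all the intermediate indices have to be
	 * created just so that index 50 exists:
	 *
	 *	pci_msix_expand_vectors(pdev, 50);
	 */

	/* Targeted allocation: set up only the entry that was unmasked. */
	return pci_msix_expand_vectors_at(pdev, 50, 1);
}
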

Thanks,

tglx
Re: [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]()
Hi Thomas,

On Thu, Dec 02, 2021 at 11:16:39AM +0100, Thomas Gleixner wrote:
> Megha,
>
> On Wed, Dec 01 2021 at 17:08, Megha Dey wrote:
> > On 11/26/2021 5:25 PM, Thomas Gleixner wrote:
> >> /**
> >> + * pci_msix_expand_vectors_at - Expand MSI-X interrupts for a device
> >> + *
> >> + * @dev: PCI device to operate on
> >> + * @at: Allocate at MSI-X index. If @at == PCI_MSI_EXPAND_AUTO
> >> + * the function expands automatically after the last
> > Not sure why some of these changes related to PCI_MSIX_EXPAND_AUTO and
> > num_descs did not make it to the 'msi' branch.
> > Is this intentional?
>
> Yes, because I'm not happy about that magic.
>
> >
> > For instance, say:
> > 1. Driver requests for 5 vectors:
> > pci_enable_msix_range(dev, NULL, 5, 5)
> > =>num_descs = 5
>
> Driver should not use with pci_enable_msix_range() in the first
> place. But yes, it got 5 vectors now.

Bad start with a deprecated interface :-).

>
> > 2. Driver frees vectors at index 1,2:
> > range = {1, 2, 2};
> > pci_msi_teardown_msi_irqs(dev, range)
>
> That function is not accessible by drivers for a reason.
>
> > =>num_descs = 3; Current active vectors are at index: 0, 3, 4
>
> > 3. Driver requests for 3 more vectors using the new API:
> > pci_msix_expand_vectors(dev, 3)
> > =>range.first = 3 => It will try to allocate index 3-5, but we already
> > have 3,4 active?
> > Ideally, we would want index 1,2 and 5 to be allocated for this request
> > right?
> >
> > Could you please let me know what I am missing?
>
> You're missing the real world use case. The above is fiction.

I don't think there is a valid use case for freeing specific vectors. It's
true some are special; IDXD has vector #0 like that. But I expect drivers to
acquire these special vectors once and never free them until driver
teardown time.

But there is a need to free on demand: for a subdevice constructed for idxd
pass-through, when the guest is torn down, the host would need to free them.
Only growing on demand catches just one part of the dynamic picture.

IDXD also allocates interrupts only when the WQ is enabled, and frees them
when it is disabled.

>
> If a driver would release 1 and 2 then it should explicitely reallocate
> 1 and 2 and not let the core decide to magically allocate something.
>
> If the driver wants three more after freeing 1, 2 then the core could
> just allocate 5, 6, 7, and would still fulfil the callers request to
> allocate three more, right?

Since the core is already managing what's allocated and free, requiring
drivers to manage each allocated entry seems hard, while the core can
easily manage it. For the IDXD case, we don't really care which of the
IMS entries is being allocated and freed. The driver just wants one of the
available IMS entries. The assumption is that the driver would have acquired
any special ones upfront with alloc_irqs().


>
> And even if it just allocates one, then the caller still has to know the
> index upfront. Why? Because it needs to know it in order to get the
> Linux interrupt number via pci_irq_vector().

If we were to allocate one, the new API could simply return the index
directly to the caller, which would then call pci_irq_vector() to get the
IRQ number.
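
Something along these lines, purely to illustrate that calling
convention; pci_msix_alloc_one_vector() is a hypothetical name, only
pci_irq_vector() exists today:

/*
 * Hypothetical: the allocator picks a free MSI-X index and returns it,
 * the caller maps the index to a Linux IRQ with pci_irq_vector().
 */
static int foo_add_one_vector(struct pci_dev *pdev)
{
	int idx = pci_msix_alloc_one_vector(pdev);

	if (idx < 0)
		return idx;
	return pci_irq_vector(pdev, idx);
}
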

>
> > Correspondingly, pci_free_msix_irq_vector(pdev, irq) frees all the
> > allocated resources associated with MSI-X interrupt with Linux IRQ
> > number 'irq'.
> > I had issues when trying to dynamically allocate more than 1 interrupt
> > because I didn't have a clean way to communicate to the driver what
> > indexes were assigned in the current allocation.
>
> If the driver is able to free a particular vector then it should exactly
> know what it it doing and which index it is freeing. If it needs that
> particular vector again, then it knows the index, right?
>
> Let's look how MSI-X works in reality:
>
> Each vector is associated to a particular function in the device. How
> that association works is device dependent.
>
> Some devices have hardwired associations, some allow the driver to
> program the association in the device configuration and there is also a
> combination of both.
>
> So if the driver would free the vector for a particular functionality,
> or not allocate it in the first place, then it exactly knows what it
> freed and what it needs to allocate when it needs that functionality
> (again).

It doesn't *have* to be that all vectors are special. Some of them are
special and are all acquired during driver load time; these are allocated
once and never freed. The rest are for, say, completion interrupts and the
like that go with work queues. These can be dynamically allocated and
freed.

The driver doesn't really care which index it wants or what the next index
should be. But it has to remember the allocated ones so it can pass them
down for the free. Maybe something like the one we did a while back:

https://lore.kernel.org/lkml/1561162778-12669-1-git-send-email-megha.dey@linux.intel.com/

This has a group handle, and kept adding things to it.

>
> What you are trying to create is a solution in search of a problem. You
> cannot declare via a random allocation API how devices work. You neither
> can fix the VFIO issue in a sensible way.
>
> VFIO starts with vector #0 allocated. The guest then unmasks vector #50.
>
> With your magic interface VFIO has to allocate 49 vectors and then free
> 48 of them again or just keep 48 around for nothing which defeats the
> purpose of on demand allocation completely.

This use case is broken already; the VFIO case sort of assumes things are
growing in sequence. Today it doesn't have a hint on which entry is being
unmasked, I suppose. So VFIO simply releases everything and adds N more than
currently allocated.

If there is a real world need for allocating a
specific vector #50, maybe we should add an alloc_exact() type of interface
and the core can check if #50 is still available.

Maybe for MSIx we don't have a need to shrink based on current usage. IMS
requires both grow and shrink. But it might be odd to have 2 domains behave
quite differently.

Cheers,
Ashok
Re: [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]()
Ashok,

On Thu, Dec 02 2021 at 11:21, Ashok Raj wrote:
> On Thu, Dec 02, 2021 at 11:16:39AM +0100, Thomas Gleixner wrote:
>> On Wed, Dec 01 2021 at 17:08, Megha Dey wrote:
>> You're missing the real world use case. The above is fiction.
>
> I don't think there is a valid use case for freeing specific vectors. Its
> true some are special, IDXD has vector#0 like that. But I expect drivers to
> acquire these special vectors once and never free them until driver
> tear down time.
>
> But there is a need to free on demand, for a subdevice constructed for idxd
> pass-through, when the guest is torn down, host would need to free them.
> Only growing on demand seems to only catch one part of the dynamic part.
>
> IDXD also allocates interrupt only when the WQ is enabled, and frees when its
> disabled.

You're talking about IMS not MSI-X here, right? IMS cannot be allocated
via the PCI/MSI interfaces as we established long ago.

And if you are talking about the 8 MSI-X interrupts for IDXD then I
really do not see the point of ever releasing it.

>> If a driver would release 1 and 2 then it should explicitely reallocate
>> 1 and 2 and not let the core decide to magically allocate something.
>>
>> If the driver wants three more after freeing 1, 2 then the core could
>> just allocate 5, 6, 7, and would still fulfil the callers request to
>> allocate three more, right?
>
> Since the core is already managing what's allocated and free, requiring
> drivers to manage each allocated entries seem hard, while the core can
> easily manage it. For IDXD cases, we don't really care which ones of the
> IMS is being allocated and freed. It just wants one of the available IMS
> entries. The assumption is since the driver would have acquired any special
> ones upfront with the alloc_irqs().

For MSI-X the free vector use case does not exist today, and even if it
did exist the driver would have to know about the index.

If the index -> function association is hardwired, it obviously needs to
know it.

If it's not hardwired then it still needs to know the resulting index,
because it has to program that index into a device function register so
that the device knows which entry to use.

IMS is not any different. You need to know the index in order to
associate it to the queue, no? And you need the index in order to figure
out the Linux irq number.

But again, that's not a problem of this very API because this API is
about PCI/MSI and not about IMS.

>> And even if it just allocates one, then the caller still has to know the
>> index upfront. Why? Because it needs to know it in order to get the
>> Linux interrupt number via pci_irq_vector().
>
> If we were to allocate one, the new API can simply return the index
> directly to the caller, and they call pci_irq_vector() to get the IRQ
> number.

That can work, but then we need both variants:

pci_msix_alloc_vector_at() and pci_msix_alloc_vector_any()

Why?

Because pci_msix_alloc_vector_any() cannot solve the VFIO on demand
allocation problem and it cannot be used to replace the sparse
allocations which are done via pci_enable_msix_exact() today.

If there is an MSI-X use case to allocate any vector then we can
implement that. If there is none, then we don't need it, right?
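
For concreteness, the two variants might look something like this; the
signatures are only a guess for illustration, neither function exists at
this point:

/*
 * Allocate the MSI-X vector at a caller chosen index. This covers the
 * VFIO unmask case and the sparse allocations which are done today via
 * pci_enable_msix_exact() with a non-contiguous entries[] array.
 */
int pci_msix_alloc_vector_at(struct pci_dev *dev, unsigned int index);

/*
 * Allocate any free MSI-X vector and return the index the core picked.
 */
int pci_msix_alloc_vector_any(struct pci_dev *dev);
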

>> So if the driver would free the vector for a particular functionality,
>> or not allocate it in the first place, then it exactly knows what it
>> freed and what it needs to allocate when it needs that functionality
>> (again).
>
> It doesn't *have* to be that all vectors are special. Some of them are
> special that they acquired all during driver load time. These are allocated
> once and never freed. The rest are for say completion interrupts or such and
> such that go with work queues. These can dynamically be allocated and
> freed.
>
> The driver doesn't really care which index it wants or what the next index
> should be. But it has to remember the allocated ones so it can pass down
> for the free. Maybe the one we did a while back
>
> https://lore.kernel.org/lkml/1561162778-12669-1-git-send-email-megha.dey@linux.intel.com/
>
> This has a group handle, and kept adding things to it.

Was it really necessary to bring those memories back?

If we want groups, then surely not with these kinds of hacks. I still
need to see the use case for the groups. The discussion back then just
provided handwaving about internal requests which never materialized.

But talking about groups: that's very similar to the other discussion
about storing the IMS entries for these sliced devices, queues or
whatever. That's at least a use case.

>> What you are trying to create is a solution in search of a problem. You
>> cannot declare via a random allocation API how devices work. You neither
>> can fix the VFIO issue in a sensible way.
>>
>> VFIO starts with vector #0 allocated. The guest then unmasks vector #50.
>>
>> With your magic interface VFIO has to allocate 49 vectors and then free
>> 48 of them again or just keep 48 around for nothing which defeats the
>> purpose of on demand allocation completely.
>
> This use case is broken already, the VFIO case sort of assumes things are
> growing in sequence. Today it doesn't have a hint on which entry is being
> unmasked I suppose. So VFIO simply releases everything, adds N more than
> currently allocated.

VFIO exactly knows which entry is unmasked simply because the write into
the MSI-X table in the device config space is trapped so it knows
exactly which entry is unmasked, no? Guess how VFIO knows about $N more?

> Maybe for MSIx we don't have a need to shrink based on current usage. IMS
> requires both grow and shrink. But it might be odd to have 2 domains behave
> quite differently.

We are not implementing the full MSI[X] zoo for IMS either. So the
interfaces are different in the first place.

Making them artificially uniform is a horrible idea.

They are two different things, really. The only thing they have in common
is that at the end of the day the device sends a message over the bus
and they happen to share the underlying MSI code infrastructure.

Thanks,

tglx
Re: [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]()
Hi Thomas

On Thu, Dec 02, 2021 at 09:40:08PM +0100, Thomas Gleixner wrote:
> Ashok,
>
> On Thu, Dec 02 2021 at 11:21, Ashok Raj wrote:
> > On Thu, Dec 02, 2021 at 11:16:39AM +0100, Thomas Gleixner wrote:
> >> On Wed, Dec 01 2021 at 17:08, Megha Dey wrote:
> >> You're missing the real world use case. The above is fiction.
> >
> > I don't think there is a valid use case for freeing specific vectors. Its
> > true some are special, IDXD has vector#0 like that. But I expect drivers to
> > acquire these special vectors once and never free them until driver
> > tear down time.
> >
> > But there is a need to free on demand, for a subdevice constructed for idxd
> > pass-through, when the guest is torn down, host would need to free them.
> > Only growing on demand seems to only catch one part of the dynamic part.
> >
> > IDXD also allocates interrupt only when the WQ is enabled, and frees when its
> > disabled.
>
> You're talking about IMS not MSI-X here, right? IMS cannot be allocated
> via the PCI/MSI interfaces as we established long ago.
>
> And if you are talking about the 8 MSI-X interrupts for IDXD then I
> really do not see the point of ever releasing it.

Not worried about MSI-X for IDXD :). I assumed the purpose of this exercise
was about two things:

- Fix the VFIO mask/unmask weirdness ending up in a disable/re-enable with
more interrupts.
- We are only fixing that case by not calling the disable_msi, but just
growing on demand.

- Use this as a case to build IMS. But if we treat MSI-X and IMS
differently, IMS would be a bit different in how the dynamic parts work.

Although there is no real need for MSI-X being dynamic except to avoid host
vector exhaustion, do you think we could still allocate specific entries?
Since unmask is per-vector, is there a benefit to doing just that vs
allocating current+N?

>
> >> If a driver would release 1 and 2 then it should explicitely reallocate
> >> 1 and 2 and not let the core decide to magically allocate something.
> >>
> >> If the driver wants three more after freeing 1, 2 then the core could
> >> just allocate 5, 6, 7, and would still fulfil the callers request to
> >> allocate three more, right?
> >
> > Since the core is already managing what's allocated and free, requiring
> > drivers to manage each allocated entries seem hard, while the core can
> > easily manage it. For IDXD cases, we don't really care which ones of the
> > IMS is being allocated and freed. It just wants one of the available IMS
> > entries. The assumption is since the driver would have acquired any special
> > ones upfront with the alloc_irqs().
>
> For MSI-X the free vector use case does not exist today and even if it
> would exist the driver has to know about the index.
>
> If the index -> function accociation is hard wired, it needs to know it
> obviously.
>
> If it's not hardwired then it still needs to know the resulting index,
> because it has to program that index into a device function register so
> that the device knows which entry to use.
>
> IMS is not any different. You need to know the index in order to
> associate it to the queue, no? And you need the index in order to figure
> out the Linux irq number.
>
> But again, that's not a problem of this very API because this API is
> about PCI/MSI and not about IMS.

Fair enough. The thought was that even though MSI-X doesn't require that,
the implementations could be consistent as long as we aren't breaking MSI-X.

But as you said, they don't have to be the same and can differ in how they
are implemented.


>
> >> And even if it just allocates one, then the caller still has to know the
> >> index upfront. Why? Because it needs to know it in order to get the
> >> Linux interrupt number via pci_irq_vector().
> >
> > If we were to allocate one, the new API can simply return the index
> > directly to the caller, and they call pci_irq_vector() to get the IRQ
> > number.
>
> That can work, but then we need both variants:
>
> pci_msix_alloc_vector_at() and pci_msix_alloc_vector_any()
>
> Why?
>
> Because pci_msix_alloc_vector_any() cannot solve the VFIO on demand
> allocation problem and it cannot be used to replace the sparse
> allocations which are done via pci_enable_msix_exact() today.
>
> If there is an MSI-X use case to allocate any vector then we can
> implement that. If there is none, then we don't need it, right?

agreed.

>
> >> So if the driver would free the vector for a particular functionality,
> >> or not allocate it in the first place, then it exactly knows what it
> >> freed and what it needs to allocate when it needs that functionality
> >> (again).
> >
> > It doesn't *have* to be that all vectors are special. Some of them are
> > special that they acquired all during driver load time. These are allocated
> > once and never freed. The rest are for say completion interrupts or such and
> > such that go with work queues. These can dynamically be allocated and
> > freed.
> >
> > The driver doesn't really care which index it wants or what the next index
> > should be. But it has to remember the allocated ones so it can pass down
> > for the free. Maybe the one we did a while back
> >
> > https://lore.kernel.org/lkml/1561162778-12669-1-git-send-email-megha.dey@linux.intel.com/
> >
> > This has a group handle, and kept adding things to it.
>
> Was it really necessary to bring those memories back?

:-)

>
> If we want groups, then surely not with these kind of hacks. I still
> need to see the usecase for the groups. The discussion back then just
> provided handwaving about internal request which never materialized.

true, we didn't hear back from the groups that asked for them.
>
> But talking about groups. That's very similar to the other discussion
> vs. storing the IMS entries for these sliced devices, queues or
> whatever. That's at least a use case.

Correct.

>
> >> What you are trying to create is a solution in search of a problem. You
> >> cannot declare via a random allocation API how devices work. You neither
> >> can fix the VFIO issue in a sensible way.
> >>
> >> VFIO starts with vector #0 allocated. The guest then unmasks vector #50.
> >>
> >> With your magic interface VFIO has to allocate 49 vectors and then free
> >> 48 of them again or just keep 48 around for nothing which defeats the
> >> purpose of on demand allocation completely.
> >
> > This use case is broken already, the VFIO case sort of assumes things are
> > growing in sequence. Today it doesn't have a hint on which entry is being
> > unmasked I suppose. So VFIO simply releases everything, adds N more than
> > currently allocated.
>
> VFIO exactly knows which entry is unmasked simply because the write into
> the MSI-X table in the device config space is trapped so it knows
> exactly which entry is unmasked, no? Guess how VFIO knows about $N more?

Bah... I missed that little fact.

When VFIO knows exactly which entry is being unmasked, is it enough to just
allocate exactly one, or do we need to add all N? I didn't see why we need to
grow by N additional vectors instead of only allocating 1 for the entry
being unmasked.

>
> > Maybe for MSIx we don't have a need to shrink based on current usage. IMS
> > requires both grow and shrink. But it might be odd to have 2 domains behave
> > quite differently.
>
> We are not implementing the full MSI[X] zoo for IMS either. So the
> interfaces are different in the first place.

The ones that actually differ between MSI-X and IMS are:

- Discovery of where to find the message storage, unlike the PCIe standard
mechanism. (Although in the initial implementation we have forced a common
way to manage this, I think people already hinted this isn't going to work
for different vendors or even between gen1/gen2 devices.)
- Managing the location of the addr/data write, mask/unmask etc.
- Might have other attributes such as PASID etc. for the IDXD case exposed
to guest/user space.


Are there other differences between them?

>
> Making them artificially uniform is a horrible idea.

Totally!

>
> They are two different things, really. The only thing they have in common
> is that at the end of the day the device sends a message over the bus
> and they happen to share the underlying MSI code infrastructure.
Re: [patch 09/10] PCI/MSI: Provide pci_msix_expand_vectors[_at]()
Ashok,

On Thu, Dec 02 2021 at 16:45, Ashok Raj wrote:
> On Thu, Dec 02, 2021 at 09:40:08PM +0100, Thomas Gleixner wrote:
> Not worried about MSI-x for IDXD :), I assumed the purpose of this exercise
> was about 2 things.
>
> - Fix the VFIO mask/unmask weirdness ending up disable, reenable with more
> interrupts.
> - We are only fixing the case by not calling the disable_msi, but just
> growing on demand.
>
> - Use this as a case to build IMS. but if we treat MSIx and IMS
> differently, IMS would be bit different in how the dynamic parts
> work.

Conceptually they are not different, really. You are mixing concepts
and implementation details.

> Although there is no real need for MSIx being dynamic except to avoid host
> vector exhausion, do you think we could still allocate specific entries.
> Since unmask is per-vector, is there benefit to doing just that vs
> allocating current+N?

This is either a rhetorical question or a trick question, right?

>> VFIO exactly knows which entry is unmasked simply because the write into
>> the MSI-X table in the device config space is trapped so it knows
>> exactly which entry is unmasked, no? Guess how VFIO knows about $N more?
>
> bah.. i missed that little fact.
>
> When VFIO knows exactly which entry is being unmasked, is it enough to just
> allocate exact one, or do we need to all all N? I didn't see why we need to
> grown by N additional vectors instead of only allocating 1 for the entry
> being unmasked?

That's exactly the point. The current implementation does so, because
the PCI/MSI infrastructure does not provide a mechanism to allocate a
particular vector post init.

>> > Maybe for MSIx we don't have a need to shrink based on current usage. IMS
>> > requires both grow and shrink. But it might be odd to have 2 domains behave
>> > quite differently.
>>
>> We are not implementing the full MSI[X] zoo for IMS either. So the
>> interfaces are different in the first place.
>
> The ones that actually differ between the MSIx and IMS are:
>
> - Discovery on where to find that unlike the PCIe standard mechanism.
> (Although in initial implementation we have forced a common way to manage
> this, but I think people already hinted this isn't going to work
> for different vendors or even between gen1/gen2 devices.
> - Managing the location of the add/data write, mask/unmask etc.
> - Might have other attributes such as PASID etc for the IDXD case exposed
> to guest/user space.

You are looking at this from the wrong direction, i.e. top down. Why?

Because if you look at it from bottom up, then you'll see the following:

The base of all this is a function block which translates a write of
message data to the message address into an interrupt raised in a
CPU.

The interrupt remap unit is just another translator which converts the
write to the remap table into a write to the CPU function block, if
the encoded protections are valid.

From a device perspective the message address and the message data are
completely opaque and all the device knows about them is that it has to
write message.data to message.address in order to raise an interrupt in
some CPU.

Of course the device needs to have some sort of storage where the OS can
save the composed message data and the message address so that the
device itself can access it when it wants to raise an interrupt.

IOW. The message storage is device specific.

For IO/APIC the message is saved in the per pin routing entry.

HPET has its own routing entry.

PCI/MSI provides standardised storage for the messages.

IMS allows the device to define where the message is stored. That's a
fundamentally new concept, right?

No. It is not. All of the above are already IMS implementations. And
each implementation has its own quirks and oddities, which is why
we end up with different irqdomains and irqchips.

If you look at other architectures there are various other flavours of
devices which have their own device specific message store, IOW they
all are device specific IMS flavours.

They all have two things in common:

- They provide storage for messages
- To raise an interrupt they write message.data to message.address

So IMS is conceptually nothing new. It's just a marketing brand name for
something which has existed since the invention of message based interrupts,
for obvious reasons.
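
In code terms the common denominator is just struct msi_msg plus some
device specific place to put it. A sketch, with invented register
offsets and foo_* names:

#include <linux/io.h>
#include <linux/msi.h>

#define FOO_IMS_ADDR_LO(i)	(0x1000 + (i) * 16 + 0x0)
#define FOO_IMS_ADDR_HI(i)	(0x1000 + (i) * 16 + 0x4)
#define FOO_IMS_DATA(i)		(0x1000 + (i) * 16 + 0x8)

/*
 * Device specific message store: the irq chip's write path simply saves
 * the composed message where the device expects to find it.
 */
static void foo_ims_write_msg(void __iomem *base, unsigned int idx,
			      const struct msi_msg *msg)
{
	writel(msg->address_lo, base + FOO_IMS_ADDR_LO(idx));
	writel(msg->address_hi, base + FOO_IMS_ADDR_HI(idx));
	writel(msg->data, base + FOO_IMS_DATA(idx));
}
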

What's different for all of the above variants is the way it is
exposed to the devices. Wired interrupts which end up at the IO/APIC pin
obviously cannot expose the underlying message mechanism because the
device cannot make use of it. And a device cannot allocate a pin either
because it obviously cannot rewire the PCB.

For PCI/MSI we have a PCI/MSI specific interface which is aware of the
PCI/MSI way to store the messages and deals with the quirks and
limitations of PCI/MSI.

For IMS we surely will model an interface which handles all IMS variants
in a uniform way too, but that interface will be different from PCI/MSI
because it does not need any of the PCI/MSI quirks at all.

That interface will look more like the underlying msi irqdomain
interface, simply because pretending it is PCI specific is an artificial
exercise.

There is nothing PCI specific about it. The only connection which needs
to be made is through msi_desc::dev and perhaps some meta info so that
the IOMMU can figure out from which PCI device this message will
originate.

Adding a pci_foo() wrapper around it which relieves the programmer from
writing &pdev->dev and filling in some meta info is just an obvious
convenience add-on.
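
For example, something like the following, entirely hypothetical; the
underlying msi_device_domain_alloc_irq() call is a stand-in for whatever
the per device IMS domain infrastructure ends up providing:

/*
 * Hypothetical convenience wrapper: the PCI flavour only supplies
 * &pdev->dev (plus whatever meta info the IOMMU needs) and forwards to
 * the device level interface. Both names are stand-ins.
 */
static inline int pci_ims_alloc_irq(struct pci_dev *pdev, unsigned int index)
{
	return msi_device_domain_alloc_irq(&pdev->dev, index);
}
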

See?

That interface is really the least of all problems.

We need to settle the other, way more important question of how to
store/manage MSI descriptors and how to handle the per-device IMS
irqdomain first.

You can handwave about the interface as long as you want. It won't
materialize before the underlying infrastructure is sorted out.

Once that is done then the interface is mostly defined by that
infrastructure and writing it up is not going to be rocket science.

See?

Thanks,

tglx