Re: Virtio in Xen on Arm (based on IOREQ concept)
On 21/07/2020 15:52, Oleksandr wrote:
>
> On 21.07.20 17:32, André Przywara wrote:
>> On 21/07/2020 14:43, Julien Grall wrote:
>
> Hello Andre, Julien
>
>
>>> (+ Andre)
>>>
>>> Hi Oleksandr,
>>>
>>> On 21/07/2020 13:26, Oleksandr wrote:
>>>> On 20.07.20 23:38, Stefano Stabellini wrote:
>>>>> For instance, what's your take on notifications with virtio-mmio? How
>>>>> are they modelled today? Are they good enough or do we need MSIs?
>>>> Notifications are sent from device (backend) to the driver (frontend)
>>>> using interrupts. Additional DM function was introduced for that
>>>> purpose xendevicemodel_set_irq_level() which results in
>>>> vgic_inject_irq() call.
>>>>
>>>> Currently, if device wants to notify a driver it should trigger the
>>>> interrupt by calling that function twice (high level at first, then
>>>> low level).
>>> This doesn't look right to me. Assuming the interrupt is trigger when
>>> the line is high-level, the backend should only issue the hypercall once
>>> to set the level to high. Once the guest has finish to process all the
>>> notifications the backend would then call the hypercall to lower the
>>> interrupt line.
>>>
>>> This means the interrupts should keep firing as long as the interrupt
>>> line is high.
>>>
>>> It is quite possible that I took some shortcut when implementing the
>>> hypercall, so this should be corrected before anyone start to rely on
>>> it.
>> So I think the key question is: are virtio interrupts level or edge
>> triggered? Both QEMU and kvmtool advertise virtio-mmio interrupts as
>> edge-triggered.
>>  From skimming through the virtio spec I can't find any explicit
>> mentioning of the type of IRQ, but the usage of MSIs indeed hints at
>> using an edge property. Apparently reading the PCI ISR status register
>> clears it, which again sounds like edge. For virtio-mmio the driver
>> needs to explicitly clear the interrupt status register, which again
>> says: edge (as it's not the device clearing the status).
>>
>> So the device should just notify the driver once, which would cause one
>> vgic_inject_irq() call. It would be then up to the driver to clear up
>> that status, by reading PCI ISR status or writing to virtio-mmio's
>> interrupt-acknowledge register.
>>
>> Does that make sense?
> When implementing Xen backend, I didn't have an already working example
> so only guessed. I looked how kvmtool behaved when actually triggering
> the interrupt on Arm [1].
>
> Taking into the account that Xen PoC on Arm advertises [2] the same irq
> type (TYPE_EDGE_RISING) as kvmtool [3] I decided to follow the model of
> triggering an interrupt. Could you please explain, is this wrong?

Yes, kvmtool does a double call needlessly (on x86, ppc and arm; mips is
correct).
I just chased it down in the kernel: a KVM_IRQ_LINE ioctl with level=low
is ignored when the target IRQ is configured as edge-triggered (which it
is, because the DT says so); see vgic_validate_injection() in the kernel.

So you should only ever need one call to set the line "high" (actually:
trigger the edge pulse).
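
To illustrate, a minimal sketch of the resulting single-call notification
path. The signature of xendevicemodel_set_irq_level() is assumed here
(handle, domid, irq, level) from the description earlier in the thread
and may differ from the PoC code:

    #include <xendevicemodel.h>

    /* One-shot notification from the virtio backend: raise the line once to
     * trigger the edge pulse; no second call with level == 0 is needed. */
    static int notify_frontend(xendevicemodel_handle *dmod, domid_t domid,
                               uint32_t virtio_irq)
    {
        return xendevicemodel_set_irq_level(dmod, domid, virtio_irq, 1);
    }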

Cheers,
Andre.

>
>
> [1]
> https://git.kernel.org/pub/scm/linux/kernel/git/will/kvmtool.git/tree/arm/gic.c#n418
>
>
> [2]
> https://github.com/xen-troops/xen/blob/ioreq_4.14_ml/tools/libxl/libxl_arm.c#L727
>
>
> [3]
> https://git.kernel.org/pub/scm/linux/kernel/git/will/kvmtool.git/tree/virtio/mmio.c#n270
>
>
Re: Virtio in Xen on Arm (based on IOREQ concept)
On 21.07.20 17:58, André Przywara wrote:
> On 21/07/2020 15:52, Oleksandr wrote:
>> On 21.07.20 17:32, André Przywara wrote:
>>> On 21/07/2020 14:43, Julien Grall wrote:
>> Hello Andre, Julien
>>
>>
>>>> (+ Andre)
>>>>
>>>> Hi Oleksandr,
>>>>
>>>> On 21/07/2020 13:26, Oleksandr wrote:
>>>>> On 20.07.20 23:38, Stefano Stabellini wrote:
>>>>>> For instance, what's your take on notifications with virtio-mmio? How
>>>>>> are they modelled today? Are they good enough or do we need MSIs?
>>>>> Notifications are sent from device (backend) to the driver (frontend)
>>>>> using interrupts. Additional DM function was introduced for that
>>>>> purpose xendevicemodel_set_irq_level() which results in
>>>>> vgic_inject_irq() call.
>>>>>
>>>>> Currently, if device wants to notify a driver it should trigger the
>>>>> interrupt by calling that function twice (high level at first, then
>>>>> low level).
>>>> This doesn't look right to me. Assuming the interrupt is trigger when
>>>> the line is high-level, the backend should only issue the hypercall once
>>>> to set the level to high. Once the guest has finish to process all the
>>>> notifications the backend would then call the hypercall to lower the
>>>> interrupt line.
>>>>
>>>> This means the interrupts should keep firing as long as the interrupt
>>>> line is high.
>>>>
>>>> It is quite possible that I took some shortcut when implementing the
>>>> hypercall, so this should be corrected before anyone start to rely on
>>>> it.
>>> So I think the key question is: are virtio interrupts level or edge
>>> triggered? Both QEMU and kvmtool advertise virtio-mmio interrupts as
>>> edge-triggered.
>>>  From skimming through the virtio spec I can't find any explicit
>>> mentioning of the type of IRQ, but the usage of MSIs indeed hints at
>>> using an edge property. Apparently reading the PCI ISR status register
>>> clears it, which again sounds like edge. For virtio-mmio the driver
>>> needs to explicitly clear the interrupt status register, which again
>>> says: edge (as it's not the device clearing the status).
>>>
>>> So the device should just notify the driver once, which would cause one
>>> vgic_inject_irq() call. It would be then up to the driver to clear up
>>> that status, by reading PCI ISR status or writing to virtio-mmio's
>>> interrupt-acknowledge register.
>>>
>>> Does that make sense?
>> When implementing Xen backend, I didn't have an already working example
>> so only guessed. I looked how kvmtool behaved when actually triggering
>> the interrupt on Arm [1].
>>
>> Taking into the account that Xen PoC on Arm advertises [2] the same irq
>> type (TYPE_EDGE_RISING) as kvmtool [3] I decided to follow the model of
>> triggering an interrupt. Could you please explain, is this wrong?
> Yes, kvmtool does a double call needlessly (on x86, ppc and arm, mips is
> correct).
> I just chased it down in the kernel, a KVM_IRQ_LINE ioctl with level=low
> is ignored when the target IRQ is configured as edge (which it is,
> because the DT says so), check vgic_validate_injection() in the kernel.
>
> So you should only ever need one call to set the line "high" (actually:
> trigger the edge pulse).

Got it, thanks for the explanation. I have just removed the extra action
(setting the level low) and tested it.


--
Regards,

Oleksandr Tyshchenko
Re: Virtio in Xen on Arm (based on IOREQ concept)
On Tue, 21 Jul 2020, Alex Bennée wrote:
> Julien Grall <julien@xen.org> writes:
>
> > Hi Stefano,
> >
> > On 20/07/2020 21:37, Stefano Stabellini wrote:
> >> On Mon, 20 Jul 2020, Roger Pau Monné wrote:
> >>> On Mon, Jul 20, 2020 at 10:40:40AM +0100, Julien Grall wrote:
> >>>>
> >>>>
> >>>> On 20/07/2020 10:17, Roger Pau Monné wrote:
> >>>>> On Fri, Jul 17, 2020 at 09:34:14PM +0300, Oleksandr wrote:
> >>>>>> On 17.07.20 18:00, Roger Pau Monné wrote:
> >>>>>>> On Fri, Jul 17, 2020 at 05:11:02PM +0300, Oleksandr Tyshchenko wrote:
> >>>>>>> Do you have any plans to try to upstream a modification to the VirtIO
> >>>>>>> spec so that grants (ie: abstract references to memory addresses) can
> >>>>>>> be used on the VirtIO ring?
> >>>>>>
> >>>>>> But VirtIO spec hasn't been modified as well as VirtIO infrastructure in the
> >>>>>> guest. Nothing to upsteam)
> >>>>>
> >>>>> OK, so there's no intention to add grants (or a similar interface) to
> >>>>> the spec?
> >>>>>
> >>>>> I understand that you want to support unmodified VirtIO frontends, but
> >>>>> I also think that long term frontends could negotiate with backends on
> >>>>> the usage of grants in the shared ring, like any other VirtIO feature
> >>>>> negotiated between the frontend and the backend.
> >>>>>
> >>>>> This of course needs to be on the spec first before we can start
> >>>>> implementing it, and hence my question whether a modification to the
> >>>>> spec in order to add grants has been considered.
> >>>> The problem is not really the specification but the adoption in the
> >>>> ecosystem. A protocol based on grant-tables would mostly only be used by Xen
> >>>> therefore:
> >>>> - It may be difficult to convince a proprietary OS vendor to invest
> >>>> resource on implementing the protocol
> >>>> - It would be more difficult to move in/out of Xen ecosystem.
> >>>>
> >>>> Both, may slow the adoption of Xen in some areas.
> >>>
> >>> Right, just to be clear my suggestion wasn't to force the usage of
> >>> grants, but whether adding something along this lines was in the
> >>> roadmap, see below.
> >>>
> >>>> If one is interested in security, then it would be better to work with the
> >>>> other interested parties. I think it would be possible to use a virtual
> >>>> IOMMU for this purpose.
> >>>
> >>> Yes, I've also heard rumors about using the (I assume VirtIO) IOMMU in
> >>> order to protect what backends can map. This seems like a fine idea,
> >>> and would allow us to gain the lost security without having to do the
> >>> whole work ourselves.
> >>>
> >>> Do you know if there's anything published about this? I'm curious
> >>> about how and where in the system the VirtIO IOMMU is/should be
> >>> implemented.
> >>
> >> Not yet (as far as I know), but we have just started some discussons on
> >> this topic within Linaro.
> >>
> >>
> >> You should also be aware that there is another proposal based on
> >> pre-shared-memory and memcpys to solve the virtio security issue:
> >>
> >> https://marc.info/?l=linux-kernel&m=158807398403549
> >>
> >> It would be certainly slower than the "virtio IOMMU" solution but it
> >> would take far less time to develop and could work as a short-term
> >> stop-gap.
> >
> > I don't think I agree with this blank statement. In the case of "virtio
> > IOMMU", you would need to potentially map/unmap pages every request
> > which would result to a lot of back and forth to the hypervisor.

Yes, that's true.


> Can a virtio-iommu just set bounds when a device is initialised as to
> where memory will be in the kernel address space?

First, to avoid possible miscommunication, let me clarify that what Julien
and I are calling "virtio IOMMU" is not an existing virtio-iommu driver
of some sort, but the idea of a cross-domain virtual IOMMU that lets the
frontends explicitly permit memory to be accessed by the backends.
Hopefully that was clear already, but better to be sure :-)


If you are asking whether it would be possible to use the virtual IOMMU
just to set up memory at startup time, then it certainly could be done,
but effectively we would end up with one of the following scenarios:

1) one pre-shared bounce buffer
Effectively the same as https://marc.info/?l=linux-kernel&m=158807398403549:
it still requires memcpys, but could still be nicer than Qualcomm's
proposal because it is easier to configure?

2) all domU memory made accessible to the backend
Not actually any more secure than placing the backends in dom0

Otherwise we need the dynamic maps/unmaps.

For completeness, if we could write the whole software stack from
scratch, it would also be possible to architect a protocol (like
virtio-net) and the software stack above it to always allocate memory
from a given buffer (the pre-shared buffer), hence greatly reducing the
amount of required memcpys, maybe even down to zero. In reality, most
interfaces in Linux and POSIX userspace expect the application to be the
one providing the buffer, hence they would require memcpys in the kernel
to move data between the user-provided buffers and the pre-shared buffers.
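
As an illustration of that last point, a minimal sketch of the bounce
step on the frontend side. All names here are hypothetical, not an
existing interface:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical frontend-side bounce: copy the caller-provided buffer into
     * the region pre-shared with the backend, then hand out only (offset, len).
     * A trivial bump allocator is used; reclaim/wrap-around is not shown. */
    struct preshared_region {
        uint8_t *base;   /* mapped and shared with the backend at setup time */
        size_t   size;
        size_t   next;
    };

    /* Returns the offset inside the pre-shared region, or (size_t)-1 on error. */
    static size_t bounce_out(struct preshared_region *shm,
                             const void *user_buf, size_t len)
    {
        size_t off;

        if (len > shm->size - shm->next)
            return (size_t)-1;            /* region exhausted */

        memcpy(shm->base + shm->next, user_buf, len);   /* the extra memcpy */
        off = shm->next;
        shm->next += len;
        return off;                       /* post (off, len) to the backend */
    }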
Re: Virtio in Xen on Arm (based on IOREQ concept)
On Tue, 21 Jul 2020, Julien Grall wrote:
> Hi Alex,
>
> Thank you for your feedback!
>
> On 21/07/2020 15:15, Alex Bennée wrote:
> > Julien Grall <julien@xen.org> writes:
> >
> > > (+ Andree for the vGIC).
> > >
> > > Hi Stefano,
> > >
> > > On 20/07/2020 21:38, Stefano Stabellini wrote:
> > > > On Fri, 17 Jul 2020, Oleksandr wrote:
> > > > > > > *A few word about solution:*
> > > > > > > As it was mentioned at [1], in order to implement virtio-mmio Xen
> > > > > > > on Arm
> > > > > > Any plans for virtio-pci? Arm seems to be moving to the PCI bus, and
> > > > > > it would be very interesting from a x86 PoV, as I don't think
> > > > > > virtio-mmio is something that you can easily use on x86 (or even use
> > > > > > at all).
> > > > >
> > > > > Being honest I didn't consider virtio-pci so far. Julien's PoC (we are
> > > > > based
> > > > > on) provides support for the virtio-mmio transport
> > > > >
> > > > > which is enough to start working around VirtIO and is not as complex
> > > > > as
> > > > > virtio-pci. But it doesn't mean there is no way for virtio-pci in Xen.
> > > > >
> > > > > I think, this could be added in next steps. But the nearest target is
> > > > > virtio-mmio approach (of course if the community agrees on that).
> > >
> > > > Aside from complexity and easy-of-development, are there any other
> > > > architectural reasons for using virtio-mmio?
> > >
> > <snip>
> > > >
> > > > For instance, what's your take on notifications with virtio-mmio? How
> > > > are they modelled today?
> > >
> > > The backend will notify the frontend using an SPI. The other way around
> > > (frontend -> backend) is based on an MMIO write.
> > >
> > > We have an interface to allow the backend to control whether the
> > > interrupt level (i.e. low, high). However, the "old" vGIC doesn't handle
> > > properly level interrupts. So we would end up to treat level interrupts
> > > as edge.
> > >
> > > Technically, the problem is already existing with HW interrupts, but the
> > > HW should fire it again if the interrupt line is still asserted. Another
> > > issue is the interrupt may fire even if the interrupt line was
> > > deasserted (IIRC this caused some interesting problem with the Arch
> > > timer).
> > >
> > > I am a bit concerned that the issue will be more proeminent for virtual
> > > interrupts. I know that we have some gross hack in the vpl011 to handle
> > > a level interrupts. So maybe it is time to switch to the new vGIC?
> > >
> > > > Are they good enough or do we need MSIs?
> > >
> > > I am not sure whether virtio-mmio supports MSIs. However for virtio-pci,
> > > MSIs is going to be useful to improve performance. This may mean to
> > > expose an ITS, so we would need to add support for guest.
> >
> > virtio-mmio doesn't support MSI's at the moment although there have been
> > proposals to update the spec to allow them. At the moment the cost of
> > reading the ISR value and then writing an ack in vm_interrupt:
> >
> > /* Read and acknowledge interrupts */
> > status = readl(vm_dev->base + VIRTIO_MMIO_INTERRUPT_STATUS);
> > writel(status, vm_dev->base + VIRTIO_MMIO_INTERRUPT_ACK);
> >
>
> Hmmmm, the current way to handle MMIO is the following:
> * pause the vCPU
> * Forward the access to the backend domain
> * Schedule the backend domain
> * Wait for the access to be handled
> * unpause the vCPU
>
> So the sequence is going to be fairly expensive on Xen.
>
> > puts an extra vmexit cost to trap an emulate each exit. Getting an MSI
> > via an exitless access to the GIC would be better I think.
> > I'm not quite
> > sure what the path to IRQs from Xen is.
>
> vmexit on Xen on Arm is pretty cheap compare to KVM as we don't save a lot of
> things. In this situation, they handling an extra trap for the interrupt is
> likely to be meaningless compare to the sequence above.

+1


> I am assuming the sequence is also going to be used by the MSIs, right?
>
> It feels to me that it would be worth spending time to investigate the cost of
> that sequence. It might be possible to optimize the ACK and avoid to wait for
> the backend to handle the access.

+1
Re: Virtio in Xen on Arm (based on IOREQ concept)
On 21/07/2020 17:09, Oleksandr wrote:
>
> On 21.07.20 17:58, André Przywara wrote:
>> On 21/07/2020 15:52, Oleksandr wrote:
>>> On 21.07.20 17:32, André Przywara wrote:
>>>> On 21/07/2020 14:43, Julien Grall wrote:
>>> Hello Andre, Julien
>>>
>>>
>>>>> (+ Andre)
>>>>>
>>>>> Hi Oleksandr,
>>>>>
>>>>> On 21/07/2020 13:26, Oleksandr wrote:
>>>>>> On 20.07.20 23:38, Stefano Stabellini wrote:
>>>>>>> For instance, what's your take on notifications with virtio-mmio?
>>>>>>> How
>>>>>>> are they modelled today? Are they good enough or do we need MSIs?
>>>>>> Notifications are sent from device (backend) to the driver (frontend)
>>>>>> using interrupts. Additional DM function was introduced for that
>>>>>> purpose xendevicemodel_set_irq_level() which results in
>>>>>> vgic_inject_irq() call.
>>>>>>
>>>>>> Currently, if device wants to notify a driver it should trigger the
>>>>>> interrupt by calling that function twice (high level at first, then
>>>>>> low level).
>>>>> This doesn't look right to me. Assuming the interrupt is trigger when
>>>>> the line is high-level, the backend should only issue the hypercall
>>>>> once
>>>>> to set the level to high. Once the guest has finish to process all the
>>>>> notifications the backend would then call the hypercall to lower the
>>>>> interrupt line.
>>>>>
>>>>> This means the interrupts should keep firing as long as the interrupt
>>>>> line is high.
>>>>>
>>>>> It is quite possible that I took some shortcut when implementing the
>>>>> hypercall, so this should be corrected before anyone start to rely on
>>>>> it.
>>>> So I think the key question is: are virtio interrupts level or edge
>>>> triggered? Both QEMU and kvmtool advertise virtio-mmio interrupts as
>>>> edge-triggered.
>>>>   From skimming through the virtio spec I can't find any explicit
>>>> mentioning of the type of IRQ, but the usage of MSIs indeed hints at
>>>> using an edge property. Apparently reading the PCI ISR status register
>>>> clears it, which again sounds like edge. For virtio-mmio the driver
>>>> needs to explicitly clear the interrupt status register, which again
>>>> says: edge (as it's not the device clearing the status).
>>>>
>>>> So the device should just notify the driver once, which would cause one
>>>> vgic_inject_irq() call. It would be then up to the driver to clear up
>>>> that status, by reading PCI ISR status or writing to virtio-mmio's
>>>> interrupt-acknowledge register.
>>>>
>>>> Does that make sense?
>>> When implementing Xen backend, I didn't have an already working example
>>> so only guessed. I looked how kvmtool behaved when actually triggering
>>> the interrupt on Arm [1].
>>>
>>> Taking into the account that Xen PoC on Arm advertises [2] the same irq
>>> type (TYPE_EDGE_RISING) as kvmtool [3] I decided to follow the model of
>>> triggering an interrupt. Could you please explain, is this wrong?
>> Yes, kvmtool does a double call needlessly (on x86, ppc and arm, mips is
>> correct).
>> I just chased it down in the kernel, a KVM_IRQ_LINE ioctl with level=low
>> is ignored when the target IRQ is configured as edge (which it is,
>> because the DT says so), check vgic_validate_injection() in the kernel.
>>
>> So you should only ever need one call to set the line "high" (actually:
>> trigger the edge pulse).
>
> Got it, thanks for the explanation. Have just removed an extra action
> (setting low level) and checked.
>

Just for the record: the KVM API documentation explicitly mentions:
"Note that edge-triggered interrupts require the level to be set to 1
and then back to 0." So kvmtool is just following the book.

Setting it to 0 still does nothing *on ARM*, and the x86 IRQ code is far
too convoluted to easily judge what's really happening here. For MSIs at
least it's equally ignored.

So I guess a clean implementation in Xen does not need two calls, but
some folks with an understanding of x86 IRQ handling in Xen should confirm.
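
For reference, the "by the book" sequence looks roughly like the sketch
below (not kvmtool's actual code); on Arm the second ioctl is a no-op for
edge-configured IRQs, as discussed:

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /* "Set to 1 and then back to 0", as the KVM API documentation asks for. */
    static int kvm_pulse_irq(int vm_fd, __u32 irq)
    {
        struct kvm_irq_level line = { .irq = irq, .level = 1 };

        if (ioctl(vm_fd, KVM_IRQ_LINE, &line) < 0)
            return -1;

        line.level = 0;                  /* ignored for edge IRQs on Arm */
        return ioctl(vm_fd, KVM_IRQ_LINE, &line) < 0 ? -1 : 0;
    }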

Cheers,
Andre.
Re: Virtio in Xen on Arm (based on IOREQ concept)
On 21.07.20 17:27, Julien Grall wrote:
> Hi,

Hello Julien


>
> On 17/07/2020 19:34, Oleksandr wrote:
>>
>> On 17.07.20 18:00, Roger Pau Monné wrote:
>>>> requires
>>>> some implementation to forward guest MMIO access to a device model.
>>>> And as
>>>> it
>>>> turned out the Xen on x86 contains most of the pieces to be able to
>>>> use that
>>>> transport (via existing IOREQ concept). Julien has already done a
>>>> big amount
>>>> of work in his PoC (xen/arm: Add support for Guest IO forwarding to a
>>>> device emulator).
>>>> Using that code as a base we managed to create a completely
>>>> functional PoC
>>>> with DomU
>>>> running on virtio block device instead of a traditional Xen PV driver
>>>> without
>>>> modifications to DomU Linux. Our work is mostly about rebasing
>>>> Julien's
>>>> code on the actual
>>>> codebase (Xen 4.14-rc4), various tweeks to be able to run emulator
>>>> (virtio-disk backend)
>>>> in other than Dom0 domain (in our system we have thin Dom0 and keep
>>>> all
>>>> backends
>>>> in driver domain),
>>> How do you handle this use-case? Are you using grants in the VirtIO
>>> ring, or rather allowing the driver domain to map all the guest memory
>>> and then placing gfn on the ring like it's commonly done with VirtIO?
>>
>> Second option. Xen grants are not used at all as well as event
>> channel and Xenbus. That allows us to have guest
>>
>> *unmodified* which one of the main goals. Yes, this may sound (or
>> even sounds) non-secure, but backend which runs in driver domain is
>> allowed to map all guest memory.
>>
>> In current backend implementation a part of guest memory is mapped
>> just to process guest request then unmapped back, there is no
>> mappings in advance. The xenforeignmemory_map
>>
>> call is used for that purpose. For experiment I tried to map all
>> guest memory in advance and just calculated pointer at runtime. Of
>> course that logic performed better.
>
> That works well for a PoC, however I am not sure you can rely on it
> long term as a guest is free to modify its memory layout. For
> instance, Linux may balloon in/out memory. You probably want to
> consider something similar to mapcache in QEMU.
Yes, that was considered and even tried.
The current backend implementation maps/unmaps only the needed part of
guest memory for each request, with some kind of mapcache. I borrowed the
x86 logic on Arm to invalidate the mapcache on a XENMEM_decrease_reservation
call, so if the mapcache is in use it will be cleared. Hopefully a DomU
that doesn't run backends is not going to balloon memory in/out often.
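
For reference, the per-request pattern described here boils down to
something like the sketch below, using the stable libxenforeignmemory
calls (the mapcache itself is not shown):

    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <xenforeignmemory.h>

    /* Map a single guest page, copy the request data out of it, unmap it again.
     * Assumes off + len does not cross the page boundary. */
    static int copy_from_guest_page(xenforeignmemory_handle *fmem, uint32_t domid,
                                    xen_pfn_t gfn, size_t off, void *dst, size_t len)
    {
        int err = 0;
        void *page = xenforeignmemory_map(fmem, domid, PROT_READ, 1, &gfn, &err);

        if (!page || err)
            return -1;

        memcpy(dst, (const char *)page + off, len);

        xenforeignmemory_unmap(fmem, page, 1);
        return 0;
    }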


>
> On a similar topic, I am a bit surprised you didn't encounter memory
> exhaustion when trying to use virtio. Because on how Linux currently
> works (see XSA-300), the backend domain as to have a least as much RAM
> as the domain it serves. For instance, you have serve two domains with
> 1GB of RAM each, then your backend would need at least 2GB + some for
> its own purpose.
I understand these bits. You have already warned me about that. When
playing with mapping the whole guest memory in advance, I gave the DomU
only 512MB, which was enough to avoid memory exhaustion in my environment.
Then I switched to the "map/unmap at runtime" model.


>>>>
>>>> *A few word about the Xen code:*
>>>> You can find the whole Xen series at [5]. The patches are in RFC state
>>>> because
>>>> some actions in the series should be reconsidered and implemented
>>>> properly.
>>>> Before submitting the final code for the review the first IOREQ patch
>>>> (which is quite
>>>> big) will be split into x86, Arm and common parts. Please note, x86
>>>> part
>>>> wasn’t
>>>> even build-tested so far and could be broken with that series. Also
>>>> the
>>>> series probably
>>>> wants splitting into adding IOREQ on Arm (should be focused first) and
>>>> tools support
>>>> for the virtio-disk (which is going to be the first Virtio driver)
>>>> configuration before going
>>>> into the mailing list.
>>> Sending first a patch series to enable IOREQs on Arm seems perfectly
>>> fine, and it doesn't have to come with the VirtIO backend. In fact I
>>> would recommend that you send that ASAP, so that you don't spend time
>>> working on the backend that would likely need to be modified
>>> according to the review received on the IOREQ series.
>>
>> Completely agree with you, I will send it after splitting IOREQ patch
>> and performing some cleanup.
>>
>> However, it is going to take some time to make it properly taking
>> into the account
>>
>> that personally I won't be able to test on x86.
> I think other member of the community should be able to help here.
> However, nowadays testing Xen on x86 is pretty easy with QEMU :).

That's good.


--
Regards,

Oleksandr Tyshchenko
Re: Virtio in Xen on Arm (based on IOREQ concept)
On 21.07.20 17:27, Julien Grall wrote:
> Hi,

Hello


>
> On a similar topic, I am a bit surprised you didn't encounter memory
> exhaustion when trying to use virtio. Because on how Linux currently
> works (see XSA-300), the backend domain as to have a least as much RAM
> as the domain it serves. For instance, you have serve two domains with
> 1GB of RAM each, then your backend would need at least 2GB + some for
> its own purpose.
>
> This probably wants to be resolved by allowing foreign mapping to be
> "paging" out as you would for memory assigned to a userspace.

I didn't notice the last sentence initially. Could you please explain your
idea in detail if possible? Does it mean that, if implemented, it would be
feasible to map all guest memory regardless of how much memory the guest
has? Avoiding mapping/unmapping memory for each guest request would give
us better performance (of course while taking care of the fact that the
guest memory layout could change)... Actually, what I understand from
looking at kvmtool is that it does not map/unmap memory dynamically, it
just calculates virtual addresses from the gfn provided.
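
That model boils down to something like the sketch below (a simplified
illustration of the idea, not kvmtool's actual code):

    #include <stddef.h>
    #include <stdint.h>

    /* Guest RAM is mapped once at startup; later a guest physical address is
     * turned into a host virtual address with plain arithmetic, no hypercall. */
    struct ram_bank {
        uint64_t guest_base;   /* guest physical start of the bank */
        size_t   size;
        void    *host_base;    /* host virtual address of the mapping */
    };

    static void *gpa_to_hva(const struct ram_bank *bank, uint64_t gpa)
    {
        if (gpa < bank->guest_base || gpa - bank->guest_base >= bank->size)
            return NULL;                       /* not backed by this bank */

        return (uint8_t *)bank->host_base + (gpa - bank->guest_base);
    }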


--
Regards,

Oleksandr Tyshchenko
Re: Virtio in Xen on Arm (based on IOREQ concept)
Hi Oleksandr,

On 21/07/2020 19:16, Oleksandr wrote:
>
> On 21.07.20 17:27, Julien Grall wrote:
>> On a similar topic, I am a bit surprised you didn't encounter memory
>> exhaustion when trying to use virtio. Because on how Linux currently
>> works (see XSA-300), the backend domain as to have a least as much RAM
>> as the domain it serves. For instance, you have serve two domains with
>> 1GB of RAM each, then your backend would need at least 2GB + some for
>> its own purpose.
>>
>> This probably wants to be resolved by allowing foreign mapping to be
>> "paging" out as you would for memory assigned to a userspace.
>
> Didn't notice the last sentence initially. Could you please explain your
> idea in detail if possible. Does it mean if implemented it would be
> feasible to map all guest memory regardless of how much memory the guest
> has?
>
> Avoiding map/unmap memory each guest request would allow us to have
> better performance (of course with taking care of the fact that guest
> memory layout could be changed)...

I will explain that below. But first, let me comment on KVM.
> Actually what I understand looking at
> kvmtool is the fact it does not map/unmap memory dynamically, just
> calculate virt addresses according to the gfn provided.

The memory management between KVM and Xen is quite different. In the
case of KVM, the guest RAM is effectively memory from the userspace
(allocated via mmap) and then shared with the guest.

From the userspace PoV, the guest memory will always be accessible from
the same virtual region. However, behind the scenes, the pages may not
always reside in memory. They are basically managed the same way as
"normal" userspace memory.

In the case of Xen, we are basically stealing a guest physical page
allocated via kmalloc() and provide no facility for Linux to reclaim
the page if it needs to do so before userspace decides to unmap the
foreign mapping.

I think it would be good to handle the foreign mapping the same way as
userspace memory. By that I mean that Linux could reclaim the physical
page used by the foreign mapping if it needs to.

The process for reclaiming the page would look like:
1) Unmap the foreign page
2) Balloon in the backend domain physical address used by the
foreign mapping (i.e. allocate the page in the physmap)

The next time userspace tries to access the foreign page, Linux
will receive a data abort that would result in:
1) Allocate a backend domain physical page
2) Balloon out the physical address (remove the page from the physmap)
3) Map the foreign page at the new guest physical address
4) Map the guest physical page in the userspace address space

With this approach, we should be able to have a backend domain that can
handle frontend domains without requiring a lot of memory.
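
Put as pseudo-C, the two flows above would look roughly like the sketch
below. Every helper is a hypothetical placeholder for the step it names;
none of them is an existing Xen or Linux interface:

    /* Hypothetical helpers, one per step of the lists above. */
    extern void unmap_foreign_page(unsigned long backend_gfn);
    extern void balloon_in(unsigned long backend_gfn);    /* re-populate physmap */
    extern unsigned long alloc_backend_page(void);
    extern void balloon_out(unsigned long backend_gfn);   /* free the physmap slot */
    extern void map_foreign_at(unsigned long backend_gfn, unsigned long frontend_gfn);
    extern void map_into_userspace(unsigned long backend_gfn, void *uaddr);

    /* Reclaim: give the physical page back to the backend domain. */
    static void reclaim_foreign(unsigned long backend_gfn)
    {
        unmap_foreign_page(backend_gfn);
        balloon_in(backend_gfn);
    }

    /* Data-abort path: re-establish the foreign mapping on demand. */
    static void refault_foreign(unsigned long frontend_gfn, void *uaddr)
    {
        unsigned long backend_gfn = alloc_backend_page();

        balloon_out(backend_gfn);
        map_foreign_at(backend_gfn, frontend_gfn);
        map_into_userspace(backend_gfn, uaddr);
    }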

Note that I haven't looked at the Linux code yet, so I don't know the
complexity of implementing it or all the pitfalls.

One pitfall I can think of right now is that the frontend guest may have
removed the page from its physmap. Therefore the backend domain wouldn't
be able to re-map the page. We definitely don't want to crash the
backend app in this case. However, I am not entirely sure what the
correct action would be.

Long term, we may want to consider using a separate region in the backend
domain physical address space. This would relieve the pressure on the
backend domain's RAM and reduce the number of pages that may be
"swapped out".

Cheers,

--
Julien Grall
Re: Virtio in Xen on Arm (based on IOREQ concept)
On Tue, Jul 21, 2020 at 10:12:40PM +0100, Julien Grall wrote:
> Hi Oleksandr,
>
> On 21/07/2020 19:16, Oleksandr wrote:
> >
> > On 21.07.20 17:27, Julien Grall wrote:
> > > On a similar topic, I am a bit surprised you didn't encounter memory
> > > exhaustion when trying to use virtio. Because on how Linux currently
> > > works (see XSA-300), the backend domain as to have a least as much
> > > RAM as the domain it serves. For instance, you have serve two
> > > domains with 1GB of RAM each, then your backend would need at least
> > > 2GB + some for its own purpose.
> > >
> > > This probably wants to be resolved by allowing foreign mapping to be
> > > "paging" out as you would for memory assigned to a userspace.
> >
> > Didn't notice the last sentence initially. Could you please explain your
> > idea in detail if possible. Does it mean if implemented it would be
> > feasible to map all guest memory regardless of how much memory the guest
> > has?
> >
> > Avoiding map/unmap memory each guest request would allow us to have
> > better performance (of course with taking care of the fact that guest
> > memory layout could be changed)...
>
> I will explain that below. Before let me comment on KVM first.
>
> > Actually what I understand looking at kvmtool is the fact it does not
> > map/unmap memory dynamically, just calculate virt addresses according to
> > the gfn provided.
>
> The memory management between KVM and Xen is quite different. In the case of
> KVM, the guest RAM is effectively memory from the userspace (allocated via
> mmap) and then shared with the guest.
>
> From the userspace PoV, the guest memory will always be accessible from the
> same virtual region. However, behind the scene, the pages may not always
> reside in memory. They are basically managed the same way as "normal"
> userspace memory.
>
> In the case of Xen, we are basically stealing a guest physical page
> allocated via kmalloc() and provide no facilities for Linux to reclaim the
> page if it needs to do it before the userspace decide to unmap the foreign
> mapping.
>
> I think it would be good to handle the foreing mapping the same way as
> userspace memory. By that I mean, that Linux could reclaim the physical page
> used by the foreing mapping if it needs to.
>
> The process for reclaiming the page would look like:
> 1) Unmap the foreign page
> 2) Ballon in the backend domain physical address used by the foreing
> mapping (allocate the page in the physmap)
>
> The next time the userspace is trying to access the foreign page, Linux will
> receive a data abort that would result to:
> 1) Allocate a backend domain physical page
> 2) Balloon out the physical address (remove the page from the physmap)
> 3) Map the foreing mapping at the new guest physical address
> 4) Map the guest physical page in the userspace address space

This is going to shatter all the super pages in the stage-2
translation.

> With this approach, we should be able to have backend domain that can handle
> frontend domain without require a lot of memory.

Linux on x86 has the option to use empty hotplug memory ranges to map
foreign memory: the balloon driver hotplugs an unpopulated physical
memory range that's not made available to the OS free memory allocator
and it's just used as scratch space to map foreign memory. Not sure
whether Arm has something similar, or if it could be implemented.

You can still use the map-on-fault behaviour as above, but I would
recommend that you try to limit the number of hypercalls issued.
Having to issue a single hypercall for each page fault is going to
be slow, so I would instead use mmap batch to map the whole range in
unpopulated physical memory, and then the OS fault handler just needs to
fill the page tables with the corresponding address.
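
At the toolstack/backend level, the batching idea is roughly what
libxenforeignmemory already exposes: a whole array of gfns mapped with a
single library call and a per-page error array, rather than one call per
page. A sketch; the in-kernel privcmd fault-handler details discussed
above are not shown:

    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <xenforeignmemory.h>

    /* Map nr_pages contiguous guest frames with a single library call. */
    static void *map_guest_range(xenforeignmemory_handle *fmem, uint32_t domid,
                                 xen_pfn_t first_gfn, size_t nr_pages)
    {
        xen_pfn_t *gfns = malloc(nr_pages * sizeof(*gfns));
        int *err = calloc(nr_pages, sizeof(*err));
        void *va = NULL;
        size_t i;

        if (!gfns || !err)
            goto out;

        for (i = 0; i < nr_pages; i++)
            gfns[i] = first_gfn + i;

        /* One batched mapping request for the whole range. */
        va = xenforeignmemory_map(fmem, domid, PROT_READ | PROT_WRITE,
                                  nr_pages, gfns, err);
    out:
        free(gfns);
        free(err);
        return va;   /* release later with xenforeignmemory_unmap(fmem, va, nr_pages) */
    }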

Roger.
Re: Virtio in Xen on Arm (based on IOREQ concept)
Hi Roger,

On 22/07/2020 09:21, Roger Pau Monné wrote:
> On Tue, Jul 21, 2020 at 10:12:40PM +0100, Julien Grall wrote:
>> Hi Oleksandr,
>>
>> On 21/07/2020 19:16, Oleksandr wrote:
>>>
>>> On 21.07.20 17:27, Julien Grall wrote:
>>>> On a similar topic, I am a bit surprised you didn't encounter memory
>>>> exhaustion when trying to use virtio. Because on how Linux currently
>>>> works (see XSA-300), the backend domain as to have a least as much
>>>> RAM as the domain it serves. For instance, you have serve two
>>>> domains with 1GB of RAM each, then your backend would need at least
>>>> 2GB + some for its own purpose.
>>>>
>>>> This probably wants to be resolved by allowing foreign mapping to be
>>>> "paging" out as you would for memory assigned to a userspace.
>>>
>>> Didn't notice the last sentence initially. Could you please explain your
>>> idea in detail if possible. Does it mean if implemented it would be
>>> feasible to map all guest memory regardless of how much memory the guest
>>> has?
>>>
>>> Avoiding map/unmap memory each guest request would allow us to have
>>> better performance (of course with taking care of the fact that guest
>>> memory layout could be changed)...
>>
>> I will explain that below. Before let me comment on KVM first.
>>
>>> Actually what I understand looking at kvmtool is the fact it does not
>>> map/unmap memory dynamically, just calculate virt addresses according to
>>> the gfn provided.
>>
>> The memory management between KVM and Xen is quite different. In the case of
>> KVM, the guest RAM is effectively memory from the userspace (allocated via
>> mmap) and then shared with the guest.
>>
>> From the userspace PoV, the guest memory will always be accessible from the
>> same virtual region. However, behind the scene, the pages may not always
>> reside in memory. They are basically managed the same way as "normal"
>> userspace memory.
>>
>> In the case of Xen, we are basically stealing a guest physical page
>> allocated via kmalloc() and provide no facilities for Linux to reclaim the
>> page if it needs to do it before the userspace decide to unmap the foreign
>> mapping.
>>
>> I think it would be good to handle the foreing mapping the same way as
>> userspace memory. By that I mean, that Linux could reclaim the physical page
>> used by the foreing mapping if it needs to.
>>
>> The process for reclaiming the page would look like:
>> 1) Unmap the foreign page
>> 2) Ballon in the backend domain physical address used by the foreing
>> mapping (allocate the page in the physmap)
>>
>> The next time the userspace is trying to access the foreign page, Linux will
>> receive a data abort that would result to:
>> 1) Allocate a backend domain physical page
>> 2) Balloon out the physical address (remove the page from the physmap)
>> 3) Map the foreing mapping at the new guest physical address
>> 4) Map the guest physical page in the userspace address space
>
> This is going to shatter all the super pages in the stage-2
> translation.

Yes, but this is nothing really new, as ballooning would (AFAICT) result
in the same behavior on Linux.

>
>> With this approach, we should be able to have backend domain that can handle
>> frontend domain without require a lot of memory.
>
> Linux on x86 has the option to use empty hotplug memory ranges to map
> foreign memory: the balloon driver hotplugs an unpopulated physical
> memory range that's not made available to the OS free memory allocator
> and it's just used as scratch space to map foreign memory. Not sure
> whether Arm has something similar, or if it could be implemented.

We already discussed that last year :). This was attempted in the past
(I was still at Citrix) and indefinitely paused for Arm.

/proc/iomem can be incomplete on Linux if we didn't load a driver for
all the devices. This means that Linux doesn't have a full view of which
physical ranges are free.

Additionally, in the case of Dom0, all the regions corresponding to the
host RAM are unusable when using the SMMU. This is because we would do
1:1 mapping for the foreign mapping as well.

It might be possible to take advantage of the direct mapping property if
Linux does some bookkeeping. However, this wouldn't work for a 32-bit
Dom0 using short page tables (as some versions of Debian do), as it may
not be able to access all the host RAM. Whether we still care about that
is a different question :).

For all the other domains, I think we would want the toolstack to
provide a region that can be safely used for foreign mapping (similar to
what we already do for the grant-table).

>
> You can still use the map-on-fault behaviour as above, but I would
> recommend that you try to limit the number of hypercalls issued.
> Having to issue a single hypercall for each page fault it's going to
> be slow, so I would instead use mmap batch to map the hole range in
> unpopulated physical memory and then the OS fault handler just needs to
> fill the page tables with the corresponding address.
IIUC your proposal, you are assuming that there will be enough free
space in the physical address space to map the foreign mappings.

However, that amount of free space is not unlimited and may be quite
small (see above). It would be fairly easy to exhaust it, given that a
userspace application can map the same guest physical address many times.

So I still think we need to allow Linux to swap a foreign page with
another page.

Cheers,

--
Julien Grall
Re: Virtio in Xen on Arm (based on IOREQ concept)
On Wed, Jul 22, 2020 at 11:47:18AM +0100, Julien Grall wrote:
> Hi Roger,
>
> On 22/07/2020 09:21, Roger Pau Monné wrote:
> > On Tue, Jul 21, 2020 at 10:12:40PM +0100, Julien Grall wrote:
> > > Hi Oleksandr,
> > >
> > > On 21/07/2020 19:16, Oleksandr wrote:
> > > >
> > > > On 21.07.20 17:27, Julien Grall wrote:
> > > > > On a similar topic, I am a bit surprised you didn't encounter memory
> > > > > exhaustion when trying to use virtio. Because on how Linux currently
> > > > > works (see XSA-300), the backend domain as to have a least as much
> > > > > RAM as the domain it serves. For instance, you have serve two
> > > > > domains with 1GB of RAM each, then your backend would need at least
> > > > > 2GB + some for its own purpose.
> > > > >
> > > > > This probably wants to be resolved by allowing foreign mapping to be
> > > > > "paging" out as you would for memory assigned to a userspace.
> > > >
> > > > Didn't notice the last sentence initially. Could you please explain your
> > > > idea in detail if possible. Does it mean if implemented it would be
> > > > feasible to map all guest memory regardless of how much memory the guest
> > > > has?
> > > >
> > > > Avoiding map/unmap memory each guest request would allow us to have
> > > > better performance (of course with taking care of the fact that guest
> > > > memory layout could be changed)...
> > >
> > > I will explain that below. Before let me comment on KVM first.
> > >
> > > > Actually what I understand looking at kvmtool is the fact it does not
> > > > map/unmap memory dynamically, just calculate virt addresses according to
> > > > the gfn provided.
> > >
> > > The memory management between KVM and Xen is quite different. In the case of
> > > KVM, the guest RAM is effectively memory from the userspace (allocated via
> > > mmap) and then shared with the guest.
> > >
> > > From the userspace PoV, the guest memory will always be accessible from the
> > > same virtual region. However, behind the scene, the pages may not always
> > > reside in memory. They are basically managed the same way as "normal"
> > > userspace memory.
> > >
> > > In the case of Xen, we are basically stealing a guest physical page
> > > allocated via kmalloc() and provide no facilities for Linux to reclaim the
> > > page if it needs to do it before the userspace decide to unmap the foreign
> > > mapping.
> > >
> > > I think it would be good to handle the foreing mapping the same way as
> > > userspace memory. By that I mean, that Linux could reclaim the physical page
> > > used by the foreing mapping if it needs to.
> > >
> > > The process for reclaiming the page would look like:
> > > 1) Unmap the foreign page
> > > 2) Ballon in the backend domain physical address used by the foreing
> > > mapping (allocate the page in the physmap)
> > >
> > > The next time the userspace is trying to access the foreign page, Linux will
> > > receive a data abort that would result to:
> > > 1) Allocate a backend domain physical page
> > > 2) Balloon out the physical address (remove the page from the physmap)
> > > 3) Map the foreing mapping at the new guest physical address
> > > 4) Map the guest physical page in the userspace address space
> >
> > This is going to shatter all the super pages in the stage-2
> > translation.
>
> Yes, but this is nothing really new as ballooning would result to (AFAICT)
> the same behavior on Linux.
>
> >
> > > With this approach, we should be able to have backend domain that can handle
> > > frontend domain without require a lot of memory.
> >
> > Linux on x86 has the option to use empty hotplug memory ranges to map
> > foreign memory: the balloon driver hotplugs an unpopulated physical
> > memory range that's not made available to the OS free memory allocator
> > and it's just used as scratch space to map foreign memory. Not sure
> > whether Arm has something similar, or if it could be implemented.
>
> We already discussed that last year :). This was attempted in the past (I
> was still at Citrix) and indefinitely paused for Arm.
>
> /proc/iomem can be incomplete on Linux if we didn't load a driver for all
> the devices. This means that Linux doesn't have the full view of what is
> physical range is freed.
>
> Additionally, in the case of Dom0, all the regions corresponding to the host
> RAM are unusable when using the SMMU. This is because we would do 1:1
> mapping for the foreign mapping as well.

Right, that's a PITA, because on x86 PVH dom0 I was planning to use
those RAM regions as scratch space for foreign mappings, lacking a
better alternative ATM.

> It might be possible to take advantage of the direct mapping property if
> Linux do some bookeeping. Although, this wouldn't work for 32-bit Dom0 using
> short page tables (e.g some version of Debian does) as it may not be able to
> access all the host RAM. Whether we still care about is a different
> situation :).
>
> For all the other domains, I think we would want the toolstack to provide a
> region that can be safely used for foreign mapping (similar to what we
> already do for the grant-table).

Yes, that would be the plan on x86 also - have some way for the
hypervisor to report safe ranges where a domU can create foreign
mappings.

> >
> > You can still use the map-on-fault behaviour as above, but I would
> > recommend that you try to limit the number of hypercalls issued.
> > Having to issue a single hypercall for each page fault it's going to
> > be slow, so I would instead use mmap batch to map the hole range in
> > unpopulated physical memory and then the OS fault handler just needs to
> > fill the page tables with the corresponding address.
> IIUC your proposal, you are assuming that you will have enough free space in
> the physical address space to map the foreign mapping.
>
> However that amount of free space is not unlimited and may be quite small
> (see above). It would be fairly easy to exhaust it given that a userspace
> application can map many times the same guest physical address.
>
> So I still think we need to be able to allow Linux to swap a foreign page
> with another page.

Right, but you will have to be careful to make sure physical addresses
are not swapped while being used for IO with devices, as in that case
you won't get a recoverable fault. This is safe now because physical
mappings created by privcmd are never swapped out, but if you go the
route you propose you will have to figure out a way to correctly populate
physical ranges used for IO with devices, even when the CPU hasn't
accessed them.

Relying solely on CPU page faults to populate them will not be enough,
as the CPU won't necessarily access all the pages that would be sent
to devices for IO.

Roger.
Re: Virtio in Xen on Arm (based on IOREQ concept)
On 22/07/2020 12:10, Roger Pau Monné wrote:
> On Wed, Jul 22, 2020 at 11:47:18AM +0100, Julien Grall wrote:
>>>
>>> You can still use the map-on-fault behaviour as above, but I would
>>> recommend that you try to limit the number of hypercalls issued.
>>> Having to issue a single hypercall for each page fault it's going to
>>> be slow, so I would instead use mmap batch to map the hole range in
>>> unpopulated physical memory and then the OS fault handler just needs to
>>> fill the page tables with the corresponding address.
>> IIUC your proposal, you are assuming that you will have enough free space in
>> the physical address space to map the foreign mapping.
>>
>> However that amount of free space is not unlimited and may be quite small
>> (see above). It would be fairly easy to exhaust it given that a userspace
>> application can map many times the same guest physical address.
>>
>> So I still think we need to be able to allow Linux to swap a foreign page
>> with another page.
>
> Right, but you will have to be careful to make sure physical addresses
> are not swapped while being used for IO with devices, as in that case
> you won't get a recoverable fault. This is safe now because physical
> mappings created by privcmd are never swapped out, but if you go the
> route you propose you will have to figure a way to correctly populate
> physical ranges used for IO with devices, even when the CPU hasn't
> accessed them.
>
> Relying solely on CPU page faults to populate them will not be enough,
> as the CPU won't necessarily access all the pages that would be send
> to devices for IO.

The problem you described here doesn't seem to be specific to foreign
mappings, so I would really be surprised if Linux doesn't already have
a generic mechanism to deal with this.

Hence why I suggested earlier to deal with foreign mappings the same way
as Linux deals with user memory.

Cheers,

--
Julien Grall
Re: Virtio in Xen on Arm (based on IOREQ concept)
On Wed, Jul 22, 2020 at 12:17:26PM +0100, Julien Grall wrote:
>
>
> On 22/07/2020 12:10, Roger Pau Monné wrote:
> > On Wed, Jul 22, 2020 at 11:47:18AM +0100, Julien Grall wrote:
> > > >
> > > > You can still use the map-on-fault behaviour as above, but I would
> > > > recommend that you try to limit the number of hypercalls issued.
> > > > Having to issue a single hypercall for each page fault it's going to
> > > > be slow, so I would instead use mmap batch to map the hole range in
> > > > unpopulated physical memory and then the OS fault handler just needs to
> > > > fill the page tables with the corresponding address.
> > > IIUC your proposal, you are assuming that you will have enough free space in
> > > the physical address space to map the foreign mapping.
> > >
> > > However that amount of free space is not unlimited and may be quite small
> > > (see above). It would be fairly easy to exhaust it given that a userspace
> > > application can map many times the same guest physical address.
> > >
> > > So I still think we need to be able to allow Linux to swap a foreign page
> > > with another page.
> >
> > Right, but you will have to be careful to make sure physical addresses
> > are not swapped while being used for IO with devices, as in that case
> > you won't get a recoverable fault. This is safe now because physical
> > mappings created by privcmd are never swapped out, but if you go the
> > route you propose you will have to figure a way to correctly populate
> > physical ranges used for IO with devices, even when the CPU hasn't
> > accessed them.
> >
> > Relying solely on CPU page faults to populate them will not be enough,
> > as the CPU won't necessarily access all the pages that would be send
> > to devices for IO.
>
> The problem you described here doesn't seem to be specific to foreign
> mapping. So I would really be surprised if Linux doesn't already have
> generic mechanism to deal with this.

Right, Linux will pre-fault and lock the pages before using them for
IO, and unlock them afterwards, in which case it should be safe.
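
For reference, that pattern is roughly the stock
get_user_pages_fast()/put_page() dance in the kernel (an illustrative
sketch, not code from privcmd or any particular driver):

    #include <linux/errno.h>
    #include <linux/mm.h>

    /* Fault in (if needed) and take a reference on each page of a user buffer,
     * so the pages stay resident for the duration of the IO. */
    static int lock_user_buffer(unsigned long uaddr, int nr_pages,
                                struct page **pages)
    {
        int got = get_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);

        if (got != nr_pages) {
            while (got-- > 0)
                put_page(pages[got]);    /* drop the partial set */
            return -EFAULT;
        }

        return 0;   /* put_page() each page again once the IO has completed */
    }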

> Hence why I suggested before to deal with foreign mapping the same way as
> Linux would do with user memory.

That should work. On FreeBSD privcmd I also populate the pages in the page
fault handler, but the hypercall to create the foreign mappings is
executed only once, when the ioctl is issued.

Roger.
