Mailing List Archive

[Xen-merge] FW: vmware's virtual machine interface
Folks, there's been some discussion about the VMI interface proposal
between myself and Linus/Andrew. I've appended my latest reply.

As regards the VMI proposal itself, I don't think I can forward it, so
if you don't have it you'd better ask Pratap Subrahmanyam
[pratap@vmware.com] for it directly.

Cheers,
Ian

-----Original Message-----
From: Ian Pratt [mailto:m+Ian.Pratt@cl.cam.ac.uk]
Sent: 08 August 2005 20:59
To: Andrew Morton; torvalds@osdl.org
Cc: ian.pratt@cl.cam.ac.uk
Subject: RE: vmware's virtual machine interface

> Ian, the vmware guys were sounding a little anxious that they hadn't
> heard anything back on the VMI proposal and spec?

The first few of their patches are fine -- just cleanups to existing
arch code that we have similar patches for in our own tree. However, our
views on the actual VMI interface haven't changed since the discussion
at OLS, and we have serious reservations about the proposal.

I believe being able to override bits of kernel code and divert
execution through a "ROM" image supplied by the hypervisor is going to
lead to a maintenance nightmare.

People making changes to the kernel won't be able to see what the ROM
code is doing, and hence won't know how their changes affect it.
There'll be pressure to freeze internal APIs, otherwise it will be a
struggle to keep the 'ROM' up to date. I suspect we'll also end up with
a proliferation of hook points where no-one knows whether they're
actually being used (there are currently 86). There'll also be pressure
to allocate opaque VMI private data areas in various structures such as
struct mm and struct page.

Looking at the VMI hooks themselves, I don't think they've really
thought through the design, at least not for a high-performance
implementation. For example, they have an API for doing batched updates
to PTEs. The problem with this approach is that it's difficult to avoid
read-after-write hazards on queued PTE updates -- you need to sprinkle
flushes liberally throughout arch independent code. Working out where to
put the flushes is tough: Xen 1.0 used this approach and we were never
quite sure we had flushes in all the necessary places in Linux 2.4 --
that's why we abandoned the approach with Xen 2.0 and provided a new
interface that avoids the problem entirely (and is also required for
doing fast atomic updates which are essential to make SMP guests get
good performance).

The current VMI design is mostly looking at things at an instruction
level, providing hooks for all the privileged instructions plus some for
PTE handling. Xen's ABI is a bit different. We discovered that it wasn't
worth creating hooks for many of the privileged instructions since
they're so infrequently executed that you might as well take the trap
and decode and emulate the instruction. The only ones that matter are on
critical paths (such as the context switch path, demand fault, IPI,
interrupt, fork, exec, exit etc), and we've concentrated our efforts on
making these paths go fast, driven by performance data.

As it stands, the VMI design wouldn't support several of the
optimizations that we've found to be very important for getting
near-native performance. The VMI design assumes you're using shadow page
tables, but a substantial part of Xen's performance comes from avoiding
their use. There's also no mention of SMP. This has been one of the
trickiest bits to get right on Xen -- it's essential to be able to
support SMP guests with very low overhead, and this required a few small
but carefully placed changes to the way IPIs and memory management are
handled (some of which have benefits on native too). The API doesn't
address IO virtualization at all.

We tend to think of the hypervisor API like a hardware architecture.
It's fairly fixed but can be extended from time to time in a backward
compatible fashion (after considerable thought and examination of
benchmark data, just as happens for h/w CPUs). The core parts of the Xen
CPU API have been fixed for quite a while (there have been some changes
to the para-virtualized IO driver APIs, but these are not addressed by
VMI at all).

One attractive aspect of the VMI approach is that it's possible to have
one kernel that works on native (at reduced performance) or on
potentially multiple hypervisors. However, the real cost to linux
distros and ISVs of having multiple linux kernels is the fact that they
need to do all the s/w qualification on each one. The VMI approach
doesn't change this at all: they will still have to do qualification
tests on native, Xen, VMware etc just as they do today[*]. Although it
would be nice to be able to move a running kernel between different
hypervisors at run time I really can't see how VMI would make this
feasible. There's far too much hidden state in the ROM and hypervisor
itself.

At an implementation level their design could be improved. Using
function pointers to provide hook points causes unnecessary overhead --
it's better to insert 5-byte NOPs that can be easily patched.

In summary: the cleanup part of their patch is useful, but I think the
VMI "ROM" approach is going to be messy and very troublesome to get right.

Chris Wright, Martin Bligh et al are currently making good progress
refactoring the xen patch to get it into a form that should be more
palatable.
[See http://lists.xensource.com/archives/html/xen-merge/ ] It wouldn't
be a big deal to add VMI-like hooks to the Xen sub arch if VMware want
to go down that route (though we'd prefer to do it with NOP padding
rather than by adding an unnecessary indirection).

Cheers,
Ian

[*]Having a single kernel image that works native and on a hypervisor is
quite convenient from a user POV. We've looked into addressing this
problem in a different way, by building multiple kernels and then using
a tool that does a function-by-function 'union' operation, merging the
duplicates and creating a re-write table that can be used to patch the
kernel from native to Xen. This approach has no run time overhead, and
is entirely 'mechanical' rather than having to do it at source
level, which can be both tricky and messy.

_______________________________________________
Xen-merge mailing list
Xen-merge@lists.xensource.com
http://lists.xensource.com/xen-merge
Re: [Xen-merge] FW: vmware's virtual machine interface
On Mon, Aug 08, 2005 at 09:03:21PM +0100, Ian Pratt wrote:
>
> Folks, there's been some discussion about the VMI interface proposal
> between myself and Linus/Andrew. I've appended my latest reply.
>
> As regards the VMI proposal itself, I don't think I can forward it, so
> if you don't have it you'd better ask Pratap Subrahmanyam
> [pratap@vmware.com] for it directly.

FWIW i agree with most of your points.

> [*]Having a single kernel image that works native and on a hypervisor is
> quite convenient from a user POV. We've looked into addressing this
> problem in a different way, by building multiple kernels and then using
> a tool that does a function-by-function 'union' operation, merging the
> duplicates and creating a re-write table that can be used to patch the
> kernel from native to Xen. This approach has no run time overhead, and
> is entirely 'mechanical' rather than having to do it at source
> level, which can be both tricky and messy.

That sounds incredibly ugly. In particular it will make building
kernels very messy, which is a bad thing.

-Andi

Re: [Xen-merge] FW: vmware's virtual machine interface
--On Monday, August 08, 2005 22:37:57 +0200 Andi Kleen <ak@suse.de> wrote:

> On Mon, Aug 08, 2005 at 09:03:21PM +0100, Ian Pratt wrote:
>>
>> Folks, there's been some discussion about the VMI interface proposal
>> between myself and Linus/Andrew. I've appended my latest reply.
>>
>> As regards the VMI proposal itself, I don't think I can forward it, so
>> if you don't have it you'd better ask Pratap Subrahmanyam
>> [pratap@vmware.com] for it directly.
>
> FWIW i agree with most of your points.
>
>> [*]Having a single kernel image that works native and on a hypervisor is
>> quite convenient from a user POV. We've looked into addressing this
>> problem in a different way, by building multiple kernels and then using
>> a tool that does a function-by-function 'union' operation, merging the
>> duplicates and creating a re-write table that can be used to patch the
>> kernel from native to Xen. This approach has no run time overhead, and
>> is entirely 'mechanical' rather than having to do it at source
>> level, which can be both tricky and messy.
>
> That sounds incredibly ugly. In particular it will make building
> kernels very messy, which is a bad thing.

Ian, did you look at the generic subarch, and see how that works? Not
sure if that's what you mean or not - may arrive at the same end, but
by an easier path?

If we use function pointers, and do so at a high enough abstraction
level, I don't think the perf impact is too bad. There's always the
possibility to rewrite some of the code on the fly like the cpu code,
just to shortcut those branches (though with branch prediction on
modern chips, it may not do much). I *think* that's equivalent to
what you're saying above ... but takes away the scary bit about
"multiple kernels" ;-)

M.


RE: [Xen-merge] FW: vmware's virtual machine interface
> >> [*]Having a single kernel image that works native and on a
> >> hypervisor is quite convenient from a user POV. We've looked into
> >> addressing this problem in a different way, by building multiple
> >> kernels and then using a tool that does a function-by-function
> >> 'union' operation, merging the duplicates and creating a re-write
> >> table that can be used to patch the kernel from native to Xen.
> >> This approach has no run time overhead, and is entirely
> >> 'mechanical' rather than having to do it at source level, which
> >> can be both tricky and messy.
> >
> > That sounds incredibly ugly. In particular it will make building
> > kernels very messy, which is a bad thing.

[I wish I hadn't mentioned this at all now -- it certainly wasn't
central to the argument I was making]

As it is, it wouldn't actually make the build process too ugly. You can
build any number of vmlinux files with whatever config options you like,
and then just pass them to a magic program that smashes them together by
doing function-by-function comparisons. For example, you could do this
with a PAE and non-PAE kernel...

> Ian, did you look at the generic subarch, and see how that
> works? Not sure if that's what you mean or not - may arrive
> at the same end, but by an easier path?

The approach in my footnote is hypothetical, but from the experimenting
I did last year I came to the conclusion that it would work.

Sure, it could be done at source level with a high-enough abstraction,
but it's not immediately obvious to me that such boot-time nastiness
couldn't just be hidden in a tool operating on the binary.

Just looking at what changes would be required to auto-switch between
PAE and non PAE makes me think that the idea shouldn't be immediately
discounted.

Best,
Ian

> If we use function pointers, and do so at a high enough
> abstraction level, I don't think the perf impact is too bad.
> There's always the possibility to rewrite some of the code on
> the fly like the cpu code, just to shortcut those branches
> (though with branch prediction on modern chips, it may not do
> much). I *think* that's equivalent to what you're saying
> above ... but takes away the scary bit about "multiple kernels" ;-)
>
> M.
>
>

RE: [Xen-merge] FW: vmware's virtual machine interface
> > As regards the VMI proposal itself, I don't think I can forward it,
> > so if you don't have it you'd better ask Pratap Subrahmanyam
> > [pratap@vmware.com] for it directly.
>
> FWIW i agree with most of your points.

That's reassuring -- thanks.

I think the VMI approach looks quite seductive from Andrew/Linus's point
of view, so there's a real chance we could be stuck with it unless we
push back with our own patches soon...

Cheers,
Ian

RE: [Xen-merge] FW: vmware's virtual machine interface
--Ian Pratt <m+Ian.Pratt@cl.cam.ac.uk> wrote (on Tuesday, August 09, 2005 00:40:40 +0100):

>
>> > As regards the VMI proposal itself, I don't think I can forward
>> > it, so if you don't have it you'd better ask Pratap Subrahmanyam
>> > [pratap@vmware.com] for it directly.
>>
>> FWIW i agree with most of your points.
>
> That's reassuring -- thanks.
>
> I think the VMI approach looks quite seductive from Andrew/Linus's point
> of view, so there's a real chance we could be stuck with it unless we
> push back with our own patches soon...

The good thing about starting this sort of debate (from the point of
view of the above concern) is that what tends to happen is neither
gets merged until we come to a consensus. On the downside, that's a
total PITA from the point of view of maintaining the stack out of
tree.

It's probably best if we play nice, and get each other to agree on the
cleanup type stuff from both sides. Zach does seem to understand
the need for higher level stuff, even if it's not implemented yet.

My main concern is that we spend weeks refactoring this stuff, and
then it all gets rejected ... that's not helping anything. If we
do it a piece at a time, we should be able to have a rational discussion.

M.


Re: [Xen-merge] FW: vmware's virtual machine interface
>[*]Having a single kernel image that works native and on a hypervisor
>is quite convenient from a user POV. We've looked into addressing this
>problem in a different way, by building multiple kernels and then
>using a tool that does a function-by-function 'union' operation,
>merging the duplicates and creating a re-write table that can be used
>to patch the kernel from native to Xen. This approach has no run time
>overhead, and is entirely 'mechanical' rather than having to do it at
>source level, which can be both tricky and messy.

If you are going to hide all the Xen logic behind

static inline void wibble_with_foo (int bar, int blat) {
#ifdef CONFIG_XEN
	do A
#else
	do B
#endif
}

anyway, why is it so tricky and messy to instead do

static inline void __wibble_with_foo (int bar, int blat) {
	do B
}

static inline void wibble_with_foo (int bar, int blat) {
#ifdef CONFIG_XEN
	if (running_on_xen) {
		do A
		return;
	}
#endif
	__wibble_with_foo(bar, blat);
}

This is essentially what Xen/ia64 does, it works today,
and the performance impact is negligible ("running_on_xen"
is a set-once-at-boot global variable; if it's read
frequently, it's fast because it's always in cache;
if it's not read frequently, by definition it's not a
performance issue). Granted there will be more of
these if-xen functions on x86 than on ia64 because
of the memory management paravirtualization, but the
model is still the same.

Structured properly, the Xen-specific code can even
be hidden away in Xen-specific header files (as it is
in Xen/ia64).

I can guarantee that VMware's solution will be
transparently paravirtualized, *won't* require some funky
complicated linktime tool which massages the kernel binary
in unusual (and possibly error prone) ways, and, as such,
will look even more seductive to the linux developers.

Oh, and "function-by-function 'union' operation" and
"creating a re-write table that can be used to patch"
sound very close to the VMware solution to me. I suspect
it will be even harder to sell Xen over VMI to the
Linux developers if they have to deal with an indirection
table anyway, even if it is handled statically instead
of dynamically.

Just my two cents... (OK, maybe three :-)

Dan "transparent paravirtualization R us" Magenheimer

P.S. Last, the "running_on_xen" solution has at least the
hint that it might be devirtualizable... allowing a virtual
machine to be dynamically v-to-p'ed and vice-versa.
(See Dave Lowell's paper in ASPLOS last year.)
